磨刀不误砍柴工: July 2010

Sunday, July 18, 2010

I/O Notes 3

read(2) and write(2) share one trait: the amount of data you ask them to process and the amount they actually process may differ.
For example:
char buffer[5000];
ssize_t count = read(0, buffer, 4096);

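The call asks read for 4096 bytes, but nothing requires that 4096 bytes were actually read; count holds the real number.

Socket I/O is the most obvious case.

When the two differ, the programmer has to handle it. That is routine, though, and there are functions written for exactly this need.

It does mean, however, that the programmer must be able to tell which functions behave in which of these two ways.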
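
A minimal sketch of a function that hides the mismatch, assuming Python 3. The name read_exact is made up for this note and is not a library function. It keeps calling os.read until the requested number of bytes has arrived or EOF is reached:

```python
import os


def read_exact(fd, count):
    """Call os.read repeatedly until `count` bytes arrive or EOF is hit.

    read_exact is a hypothetical helper name, not a library function.
    """
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = os.read(fd, remaining)  # may return fewer bytes than asked
        if not chunk:                   # b"" means EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"12345")
    print(os.read(r, 4096))      # one read returns only what is available
    os.write(w, b"abc")
    os.write(w, b"def")
    os.close(w)
    print(read_exact(r, 4096))   # loops until EOF
    os.close(r)
```

A bare os.read corresponds to the first kind of function (it may return short), while read_exact behaves like the second kind (it returns short only at EOF).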

Do not assume that sys.stdin.read(4096) and os.read(sys.stdin.fileno(), 4096) are the same.

$ (echo -e "12345\n"; sleep 10) | python3 -c '
import sys
print(sys.stdin.read(4096))'

The Python program above blocks for 10 seconds before print runs. The one below receives the data immediately:
$ (echo -e "12345\n"; sleep 10) | python3 -c '
import sys
import os
print(os.read(sys.stdin.fileno(), 4096))'

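C's standard I/O is a set of functions that guarantee to return the requested amount of data

Standard I/O adds a buffer to cut down the number of read and write calls. The rule is that every call to read pulls up to BUF_SIZE bytes into the buffer. Here is the prototype of read: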

ssize_t read(int fd, void *buf, size_t count);

count varies with the file descriptor, for example:
1. 1024 for a terminal device
2. the page size when it is not a terminal device, or whatever size the programmer sets with setvbuf

Standard I/O has three kinds of unformatted functions:
1. character-at-a-time I/O
2. line-at-a-time I/O
3. direct I/O

Whichever kind you use, it works like this:

read(fd, buffer_ptr, BUF_SIZE);

Even if all you want is a single character from fgetc, the call above is what gets issued. BUF_SIZE is normally >= 1024.

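Besides buffering, standard I/O also guarantees that what it returns is what was asked for:
1. character-at-a-time I/O returns one character
2. line-at-a-time I/O returns one line, terminated by a null byte ('\0').
3. direct I/O returns the requested data structures.

If one read is not enough, it calls the read above repeatedly until the data satisfies the request. Along the way any of the following can happen:
1. the call blocks;
2. EOF;
3. an error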
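
To make the idea concrete, here is a rough sketch in Python 3 of a reader layered on read in the way described above. TinyReader, getc, getline and BUF_SIZE are names invented for this note; this is not how any real libc is written, only the same idea in miniature. Every refill asks for BUF_SIZE bytes, and getline keeps refilling until it has a whole line or hits EOF:

```python
import os

BUF_SIZE = 4096  # assumed buffer size for this sketch


class TinyReader:
    """Hypothetical stdio-like reader layered on os.read."""

    def __init__(self, fd):
        self.fd = fd
        self.buf = b""
        self.read_calls = 0  # how many times the underlying read was issued

    def _fill(self):
        """Issue one read of BUF_SIZE bytes; return False on EOF."""
        self.read_calls += 1
        chunk = os.read(self.fd, BUF_SIZE)
        self.buf += chunk
        return bool(chunk)

    def getc(self):
        """Character-at-a-time: return one byte, or b"" at EOF."""
        if not self.buf and not self._fill():
            return b""
        c, self.buf = self.buf[:1], self.buf[1:]
        return c

    def getline(self):
        """Line-at-a-time: keep reading until a newline or EOF is seen."""
        while b"\n" not in self.buf:
            if not self._fill():
                break
        line, sep, rest = self.buf.partition(b"\n")
        self.buf = rest
        return line + sep
```

Calling getc twice on fresh input issues only one read; the second character comes out of the buffer. The read_calls counter is there only so this can be checked without strace.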

Friday, July 16, 2010

I/O Notes 2

Two layers of buffering

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]){
    ssize_t retval;
    while(1){
        char c[BUFSIZ];
        /* ask the kernel for a single byte each time */
        retval = read(0, c, 1);
        if(retval <= 0)   /* EOF or error: stop */
            break;
        printf("%c\n", *c);
    }
    return 0;
}

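With the code above, one line of user input makes read get called N times (N being the length of the data). In other words, the user's input had already been buffered before read saw it.
Another example: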

#include <stdio.h>

int main(int argc, char *argv[]){
    int i;
    int c;
    printf("%d\n", BUFSIZ);
    /* many fgetc calls, very few read calls underneath */
    for(i = 0; i < BUFSIZ / 2; i++){
        c = fgetc(stdin);
        if(c == EOF)
            break;
    }
    return 0;
}

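In the code above fgetc is called N times (N being the length of the data), but read is called as follows:
1. When stdin is not a terminal device, it is called once, with a count argument of 4096
2. When stdin is a terminal device, it is called twice, with a count argument of 1024
* As for how to obtain these numbers, the answer is that deadly weapon, strace.

libc configures the buffer of the FILE object according to what stdin is, as setvbuf would.

The two examples above show that:
1. Reading from a terminal device through standard I/O actually passes through two buffering mechanisms.
2. The buffer size standard I/O uses changes with the file descriptor being read. Adding what APUE says, the picture is this:
* When stdin and stdout are terminal devices, standard I/O uses line buffering, with a buffer size of 1024.
* When stdin and stdout are not terminal devices, standard I/O uses full buffering, with a buffer size of 4096.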

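Now that the two buffering mechanisms and their buffer sizes are clear, this article is well worth reading next:
* http://www.pixelbeat.org/programming/stdio_buffering/

The article opens with an example:
# tail -f access.log | cut -d' ' -f1 | uniq
Why does the command above produce no output? The answer lies in how I/O buffering works.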

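Reading ahead to reduce read and write calls

Standard I/O always reads ahead a certain amount of data in order to reduce the number of read and write calls.
The benefit of standard I/O is that the programmer need not worry about how much is read ahead, only know that it happens.
Without knowing about the read-ahead, the example at the URL above cannot be understood.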
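
Python's buffered file objects read ahead in much the same way, so the effect can be observed without strace. The sketch below assumes Python 3 on a Unix-like system; pipe_has_data and demo_read_ahead are names invented for this note. It asks a buffered file object for one byte and then checks whether the kernel still holds any of the remaining bytes:

```python
import os
import select


def pipe_has_data(fd):
    """Return True if the kernel still holds unread bytes for fd."""
    readable, _, _ = select.select([fd], [], [], 0)
    return bool(readable)


def demo_read_ahead():
    r, w = os.pipe()
    os.write(w, b"abcdef")
    before = pipe_has_data(r)   # six bytes are waiting in the pipe
    f = os.fdopen(r, "rb")      # buffered file object, comparable to a FILE *
    first = f.read(1)           # ask for one byte only
    after = pipe_has_data(r)    # nothing left: the rest was read ahead
    rest = f.read(5)            # served from the object's buffer
    os.close(w)
    f.close()
    return before, first, after, rest


if __name__ == "__main__":
    print(demo_read_ahead())
```

Although only one byte was requested, the other five have already left the pipe and sit in the file object's buffer. A downstream program that buffers its output the same way is what makes the pipeline in the article look silent.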

Sunday, July 11, 2010

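I/O Notes

I/O always feels messy. Maybe it is the number of APIs, maybe something else:
0. blocking exists
1. the glibc manual (section 3) and the Linux manual (section 2) have functions with the same name.
2. both POSIX and ISO impose requirements
3. terminal I/O behaves differently from other I/O
4. the I/O buffering mechanisms
5. and plenty of ancient terminology....

EOF
What is EOF? When is it produced? Or is it a signal, like ctrl_c?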

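The following code prints -1
#include <stdio.h>

int
main(int argc, char *argv[])
{
    printf("%d\n", EOF);
    return 0;
}

Does that mean EOF is -1? Only as a return value: it is what the standard I/O functions hand back at end of file, not a byte stored in the data.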

When ssize_t read(int fd, void *buf, size_t count); is used to read from a file descriptor (with count > 0) and gets 0 bytes, the end of the file has been reached. In that case read returns 0.
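
In Python the same rule shows up as os.read returning an empty bytes object. A small sketch, assuming Python 3; read_until_eof and demo_eof are names invented for this note. It reads a 10-byte temporary file in 4-byte chunks and records the size of every read:

```python
import os
import tempfile


def read_until_eof(fd, chunk_size):
    """Collect the size of every read until os.read returns b"" (EOF)."""
    sizes = []
    while True:
        chunk = os.read(fd, chunk_size)
        sizes.append(len(chunk))
        if not chunk:      # zero bytes with chunk_size > 0 means EOF
            break
    return sizes


def demo_eof():
    with tempfile.TemporaryFile() as f:
        f.write(b"0123456789")
        f.flush()
        os.lseek(f.fileno(), 0, os.SEEK_SET)  # rewind the descriptor
        return read_until_eof(f.fileno(), 4)


if __name__ == "__main__":
    print(demo_eof())
```

The last entry in the returned list is the zero-length read that signals EOF. Nothing special is stored in the file to mark its end.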

Once the write end of a pipe is closed, the read end gets EOF after the remaining data has been read:
import os
import sys

read_end, write_end = os.pipe()

try:
    pid = os.fork()
except OSError:
    print("Error")
    sys.exit(1)
if pid == 0:
    # child: the reader
    os.close(write_end)
    try_time = 1
    # there is no way to know how much the writer will send, so loop until EOF
    while True:
        print("%s time(s) call read" % try_time)
        content = os.read(read_end, 1)
        if not content:
            break
        os.write(1, content)
        try_time += 1
    sys.exit(0)

else:
    # parent: the writer
    os.close(read_end)
    os.write(write_end, b"#" * 100)
    # once the write end is closed, the reader can see EOF
    os.close(write_end)
    sys.exit(0)

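A file in the file system has a fixed size (at any given moment), but terminal I/O has no fixed end, so whoever is typing has to say where end-of-file is. That is why ctrl_D exists. ctrl_D is a special control character handled by the tty driver (by default the user program never gets it from read, unless the tty is told not to process it).

Before looking at ctrl_D, the I/O buffering mechanisms need to be understood first.

Standard I/O buffering and low-level I/O buffering
Until now the getchar function puzzled me, or more precisely the following code did: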
#include <stdio.h>

int main(int argc, char *argv[]){
    int c;   /* int, so that EOF can be told apart from real characters */
    while((c = getchar()) != EOF){
       printf("%c\n", c);
    }
    return 0;
}

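Do not assume that simply calling getchar gives you vim-style interaction.
When read is called on a terminal stdin, the input is line buffered. In the code above the first getchar call reads a whole line into the buffer (it returns once Enter is pressed), and later getchar calls take from that buffer. When the buffer is empty it waits for user input again.

Standard I/O and the low-level system I/O are two separate sets and must be kept apart, otherwise things will not turn out as intended. For example:
setbuf is applied to a FILE object and then printf is called, and the setting has no visible effect: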

#include <stdio.h>
#include <unistd.h>


int main(int argc, char *argv[]){
    /* two new FILE objects, both set to unbuffered */
    FILE *input = fdopen(0, "r");
    FILE *output = fdopen(1, "w");
    setbuf(input, NULL);
    setbuf(output, NULL);

    /* printf writes through stdout, not through output */
    printf("Hello World");
    sleep(10);
    return 0;
}

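The printf output above only appears after 10 seconds, so setbuf clearly had no effect. Change the code as follows and the result is different:
#include <stdio.h>
#include <unistd.h>


int main(int argc, char *argv[]){
    /* two new FILE objects, both set to unbuffered */
    FILE *input = fdopen(0, "r");
    FILE *output = fdopen(1, "w");
    setbuf(input, NULL);
    setbuf(output, NULL);

    /* write through the FILE object that setbuf was applied to */
    fprintf(output, "Hello World");
    sleep(10);
    return 0;
}
The difference between the two programs: printf uses the stdout FILE object (line buffered by default on a terminal), while fprintf uses the FILE object it is given. Standard I/O buffering is configured per FILE object.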
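
The same per-object behaviour can be reproduced with Python file objects, which also carry their own buffers. A sketch assuming Python 3 on a Unix-like system; pipe_has_data and demo_separate_buffers are names invented for this note. Two file objects write to the same pipe, one with a buffer and one without (roughly what setbuf(fp, NULL) asks for):

```python
import os
import select


def pipe_has_data(fd):
    """Return True if the kernel holds unread bytes for fd."""
    readable, _, _ = select.select([fd], [], [], 0)
    return bool(readable)


def demo_separate_buffers():
    r, w = os.pipe()
    buffered = os.fdopen(w, "wb")                         # has its own buffer
    unbuffered = os.fdopen(os.dup(w), "wb", buffering=0)  # no buffer at all

    buffered.write(b"Hello")
    seen_after_buffered = pipe_has_data(r)    # still inside the object's buffer

    unbuffered.write(b"World")
    seen_after_unbuffered = pipe_has_data(r)  # went straight to the pipe

    buffered.flush()                          # now "Hello" reaches the pipe
    data = os.read(r, 100)
    buffered.close()
    unbuffered.close()
    os.close(r)
    return seen_after_buffered, seen_after_unbuffered, data


if __name__ == "__main__":
    print(demo_separate_buffers())
```

Turning off buffering on one object does nothing for the other, even though both end up at the same file descriptor. This is the situation of the first C program, where setbuf was applied to output but printf went through stdout.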

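Do read and write buffer, and if so how? That depends heavily on what they operate on. For example:

#include <unistd.h>

int main(int argc, char *argv[]){

    while(1){
        char c;
        if(read(0, &c, 1) <= 0)   /* EOF or error: stop */
            break;
        (void)write(1, &c, 1);
    }
    return 0;
}
Although the code above uses read and write, user input is still line buffered, so it cannot interact the way vim does. To get rid of the line buffering, the tty's own buffering mode has to be changed. The following Python code sets the tty's buffering mode: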

#coding:utf-8
import os
import sys
import termios

STDIN_FILENO = sys.stdin.fileno()

old_attr = termios.tcgetattr(STDIN_FILENO)
new_attr = termios.tcgetattr(STDIN_FILENO)

# index 3 is lflag: turn off canonical (line) mode and echo
new_attr[3] &= ~ (termios.ICANON | termios.ECHO)
termios.tcsetattr(STDIN_FILENO, termios.TCSADRAIN, new_attr)
try:
    print('Use the vim movement keys')
    while True:
        c = os.read(STDIN_FILENO, 1)
        if not c:
            break
        if c == b'j':
            print("down")
        elif c == b'k':
            print("up")
        elif c == b'h':
            print('left')
        elif c == b'l':
            print('right')
except KeyboardInterrupt:
    pass
finally:
    # always restore the original tty settings
    termios.tcsetattr(STDIN_FILENO, termios.TCSADRAIN, old_attr)

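Back to ctrl_D
Because terminal I/O is buffered by default, ctrl_D is given two jobs:
1. It acts like a flush: the data typed so far is handed to read immediately, without waiting for '\n'.
2. It makes the tty return 0 bytes to read, which means EOF. This happens when ctrl_D is typed on its own, with no pending input.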
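
Both jobs can be watched on a pseudo-terminal, with no keyboard involved. This is a sketch assuming Python 3 on Linux (pty behaviour on other systems may differ); demo_ctrl_d is a name invented for this note. Bytes written to the master side play the role of keystrokes, and the slave side is the tty a program would read from:

```python
import os
import pty
import termios


def demo_ctrl_d():
    master, slave = pty.openpty()
    try:
        attr = termios.tcgetattr(slave)
        attr[3] |= termios.ICANON         # canonical (line) mode
        attr[3] &= ~termios.ECHO          # keep the echo out of the way
        attr[6][termios.VEOF] = b"\x04"   # Ctrl-D is the EOF character
        termios.tcsetattr(slave, termios.TCSANOW, attr)

        os.write(master, b"abc\x04")      # "abc" then Ctrl-D, no newline
        first = os.read(slave, 100)       # job 1: pending input is flushed

        os.write(master, b"\x04")         # Ctrl-D with nothing pending
        second = os.read(slave, 100)      # job 2: zero bytes, i.e. EOF
        return first, second
    finally:
        os.close(master)
        os.close(slave)


if __name__ == "__main__":
    print(demo_ctrl_d())
```

The first read returns the three typed bytes without the control character, and the second returns an empty result. In neither case does the \x04 byte itself reach the reader.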

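Control characters
Control characters are a convenience provided by the tty driver. A few of them generate signals, for example:
^Z and ^C
And a few cannot be changed:
\r and \n

The following code makes "a" act as backspace: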

#coding:utf-8
import os
import sys
import termios

STDIN_FILENO = sys.stdin.fileno()

old_attr = termios.tcgetattr(STDIN_FILENO)
new_attr = termios.tcgetattr(STDIN_FILENO)

# index 6 is the control character list: make "a" the erase character
new_attr[6][termios.VERASE] = b'a'
termios.tcsetattr(STDIN_FILENO, termios.TCSADRAIN, new_attr)
try:
    print("Use the a key to delete input")
    os.read(0, 100)
except KeyboardInterrupt:
    pass
finally:
    # always restore the original tty settings
    termios.tcsetattr(STDIN_FILENO, termios.TCSADRAIN, old_attr)
