磨刀不误砍柴工 (Sharpening the axe does not delay the woodcutting)

Thursday, October 14, 2010

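devfs and udev

Is the /dev directory a disk-based or a kernel-based filesystem?
I had never been clear on this question.
Judging from my own experience and from the book <<Linux操作系统之奥秘>>, /dev is clearly disk-based. I have never used devfs myself.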

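The Linux 2.4 kernel era used the devfs filesystem. The devfs code has since been removed from the Linux 2.6 kernel.

Even finding devfs documentation is not easy; the articles on its author's old blog can no longer be found.

Google turned up: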
* http://www.linuxjournal.com/article/6035
* http://www.ibm.com/developerworks/linux/library/l-devfs.html
Judging by their dates these are very old. To confirm the era, I checked the kernel timeline, which confirms that they are products of the same period:
* http://en.wikipedia.org/wiki/Linux_kernel#Timeline

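I have never used devfs. To find out the truth, I borrowed a Red Hat 8 Linux environment from a friend, only to find that devfs is not compiled into the kernel by default: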
# cat /boot/config-2.4.18-14 |grep DEVFS
# CONFIG_DEVFS_FS is not set

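Yet /dev under Red Hat 8 really does hold about 18 thousand entries (quite striking).
Is /dev a disk-based or a kernel-based filesystem? Without a devfs environment there is no way to see this firsthand, so I could only search the documentation I could find. Some documents refer to devfs as a pseudo filesystem (for example: http://www.linux.org/docs/ldp/howto/SCSI-2.4-HOWTO/devfs.html)

The article http://www.linuxjournal.com/article/6035 describes the benefits of using devfs:
1. The system manages the files under /dev automatically
2. It works with a system that is mounted read-only, and with /dev created on a non-unix file system

The document makes a special point of mentioning non-unix file systems, because a dev entry carries some extra information (the device type and the major/minor numbers) that such a filesystem cannot store. Below, I try to create a dev entry on a FAT32 filesystem: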

[jessinio@niowork NO_NAME]$ sudo mknod dev_entry c 240 1
mknod: `dev_entry': Operation not permitted

Since a dev entry cannot be stored on such a filesystem, devfs must be a filesystem that lives in memory.
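One way to check the "disk-based or kernel-based" question on a given machine is to look at what is mounted on /dev. The following is a minimal Python 3 sketch, assuming a mount table in the usual /proc/mounts layout. The function name fs_type_of and the sample table are made up for illustration and do not come from a real machine.

```python
def fs_type_of(mount_point, mounts_text):
    """Return the filesystem type mounted at mount_point, or None.

    mounts_text is expected in /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fs_type = fields[2]  # a later mount shadows an earlier one
    return fs_type


# Hypothetical sample data, not captured from a real system.
SAMPLE_MOUNTS = """\
/dev/hda1 / ext3 rw 0 0
none /proc proc rw 0 0
devfs /dev devfs rw 0 0
"""

print(fs_type_of("/dev", SAMPLE_MOUNTS))
```

If nothing is mounted on /dev the function returns None, which would mean that /dev is just a directory on the root filesystem, that is, disk-based. On a real system the text would come from reading /proc/mounts.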

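devfs has been retired and replaced by udev. What advantages does udev bring over devfs?
For a detailed account of the advantages, read the article written by the udev author: http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev_vs_devfs

Two points left the deepest impression:
1. The entry for a device file under /dev can be named freely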
2. allow everyone to not care about major/minor numbers
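To make the "major/minor numbers" concrete, here is a small Python 3 sketch that reads them back from a device node. The helper name device_numbers is an assumption for illustration; os.stat, os.major, os.minor and os.makedev are standard library calls available on Unix.

```python
import os
import stat


def device_numbers(path):
    """Return (kind, major, minor) for a device node, or None otherwise."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "c"  # character device
    elif stat.S_ISBLK(st.st_mode):
        kind = "b"  # block device
    else:
        return None  # regular file, directory, ...
    # st_rdev packs the major and minor numbers into one value.
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)


# The pair used in the mknod example above (c 240 1), packed and unpacked.
dev = os.makedev(240, 1)
print(os.major(dev), os.minor(dev))
```

The function can be pointed at any entry under /dev. This pair of numbers is the extra information that a dev entry has to store, and that udev lets users stop caring about.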

At this point, attention has to shift to sysfs.

Monday, October 11, 2010

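Problems caused by a large number of getdents calls and a large number of files

The website was very slow and I was asked for an explanation, so I logged in to the machine and ran top. The output: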

PID USER      PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
29131 liangqin 16   0 15068 3444 816 R 11.7 0.0   0:00.72 top
16371 kmmaster 16   0 181m 8484 3128 D 5.2 0.1   0:01.20 httpd
5726 kmmaster 15   0 182m 9352 3564 S 3.9 0.1   0:02.24 httpd
32548 kmmaster 16   0 183m 9.8m 3348 D 3.9 0.1   0:06.16 httpd
6199 kmmaster 16   0 183m 9444 3452 D 2.6 0.1   0:01.89 httpd
7697 kmmaster 16   0 182m 8848 3184 D 2.6 0.1   0:01.94 httpd
10536 kmmaster 16   0 181m 8692 3332 D 2.6 0.1   0:01.93 httpd
15102 kmmaster 16   0 181m 8420 3060 D 2.6 0.1   0:01.37 httpd
17993 kmmaster 16   0 181m 8708 3292 D 2.6 0.1   0:04.16 httpd
23185 kmmaster 16   0 182m 9420 3428 D 2.6 0.1   0:00.61 httpd
30189 kmmaster 16   0 181m 8724 3308 D 2.6 0.1   0:05.91 httpd
30337 kmmaster 16   0 183m 9.9m 3448 D 2.6 0.1   0:02.71 httpd

Running strace against the httpd processes in D state showed that they were all calling I/O functions such as getdents, stat and unlink, for example:
stat("/tmp/sess_bc45d5d1dd8739acceff8a3e0fec0585", {st_mode=S_IFREG|0600, st_size=0, ...}) = 0

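Tools such as ls and find could not list this directory (/tmp) either. They froze and entered D state as well.

I first thought of using Python's os.listdir, but that function returns its list only after reading every entry in the directory, so it would also take a very long time.

So I used the following C code: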
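#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
        /* Stream the entries of /tmp one at a time instead of building a list. */
        DIR *dirp = opendir("/tmp");
        struct dirent *retval;
        long long int t = 0;    /* entry counter, must start at zero */

        if (dirp == NULL) {
                perror("opendir");
                return 1;
        }
        for (;;) {
                retval = readdir(dirp);
                if (retval == NULL) {break;}
                else {printf("%s\n", retval->d_name); t++;}
        }
        printf("%lld\n", t);
        closedir(dirp);
        return 0;
}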
This shows that the directory holds 221346 entries in total.
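For comparison, later Python versions (3.5 and newer) offer os.scandir, which iterates over a directory lazily instead of building the whole list first as os.listdir does. The following is a minimal sketch under that assumption; the function name count_entries and the report_every parameter are made up for illustration.

```python
import os


def count_entries(path, report_every=0):
    """Count directory entries one at a time without building a full list."""
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            count += 1
            # Optional progress output for very large directories.
            if report_every and count % report_every == 0:
                print(count, entry.name)
    return count
```

Unlike the readdir loop in the C program, os.scandir does not report the "." and ".." entries, so the two counts differ by two on the same directory.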

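There were 3K httpd processes! Every process that needs a session has to read the entries under /tmp. Would this behaviour have a large impact on 3K httpd processes?

So I wrote some test code to check whether a large number of concurrent readdir calls affects the processes: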
#coding:utf-8
import time
import os

for i in range(300):
    pid = os.fork()
    if pid > 0:
        break

for r in range(30):
    time.sleep(0.1)
    os.listdir("/tmp")

The code is simple, but against a directory with 200K entries it is very effective!! A large number of processes in D state appear:
(........truncated......)
905       7072 0.1 0.0 84244 11628 pts/2    D+   14:45   0:00 python listdir.py
905       7073 0.1 0.1 85136 12448 pts/2    D+   14:45   0:00 python listdir.py
905       7074 0.1 0.1 85136 12628 pts/2    D+   14:45   0:00 python listdir.py
905       7075 0.1 0.1 86304 13588 pts/2    D+   14:45   0:00 python listdir.py
905       7076 0.1 0.0 84764 12264 pts/2    D+   14:45   0:00 python listdir.py
905       7077 0.1 0.0 84504 12036 pts/2    D+   14:45   0:00 python listdir.py
905       7078 0.1 0.1 85136 12672 pts/2    D+   14:45   0:00 python listdir.py
905       7079 0.1 0.1 85916 13360 pts/2    D+   14:45   0:00 python listdir.py
905       7080 0.1 0.1 85916 13448 pts/2    D+   14:45   0:00 python listdir.py
905       7081 0.2 0.1 89728 16964 pts/2    D+   14:45   0:00 python listdir.py
905       7082 0.2 0.1 88268 15776 pts/2    D+   14:45   0:00 python listdir.py
905       7083 0.2 0.1 88268 15596 pts/2    D+   14:45   0:00 python listdir.py
905       7084 0.2 0.1 87344 14820 pts/2    D+   14:45   0:00 python listdir.py
905       7085 0.1 0.1 85136 12628 pts/2    D+   14:45   0:00 python listdir.py
905       7086 0.2 0.1 86304 13812 pts/2    D+   14:45   0:00 python listdir.py
(......truncated.......)

The machine's memory usage rose rapidly.

Wednesday, October 6, 2010

broadcast

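I rarely use UDP or raw packets, so I have not dealt much with special addresses such as broadcast.
Knowledge is always interconnected. Today, while reading about configuring LVS-DR mode, I found that I did not quite understand the following configuration: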

# ifconfig lo:0 IP_Adress broadcast IP_Adress netmask 255.255.255.255 up
# route add -host IP_Adress dev lo:0
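* IP_Adress is the IP address.

If the goal were only to configure LVS, there would be no need to care about the principle behind these statements, but as a tech enthusiast I really wanted to know what lies behind them.

It turned out to be a deep hole. The most classic issue is the LVS ARP problem:
* http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.arp_problem.html
All of it is related to broadcast, so let's start with broadcast:

Why is broadcast set to the same value as IP_Adress, rather than the usual kind of special IP such as 172.16.2.255?
First, the kinds of addressing to tell apart are: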
1. layer 2 broadcast
2. layer 3 broadcast
3. unicast
4. multicast

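The purpose of broadcast is "one to many": data sent by one machine is of interest to many receiving machines. TCP does not have this property.
Below is a layer 3 example using UDP:
Receiver (calls bind). There can be several machines on this side: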
import socket
import sys
x = ('<broadcast>', 51423)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
s.bind(x)
(buf, address) = s.recvfrom(2048)
s.sendto("Hi", address)

Sender (calls sendto):
import socket
import sys
x = ('<broadcast>', 51423)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
s.sendto("Hi", x)
(buf, address) = s.recvfrom(2048)
print "Received from %s: %s" % (address, buf)

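The '<broadcast>' used on the sending side is unusual: it is not a concrete IP address but a placeholder name.
Going back to the ifconfig example above, this '<broadcast>' corresponds to the broadcast parameter of the NIC. (Strictly speaking, Python resolves '<broadcast>' itself to the limited broadcast address 255.255.255.255, while the NIC's broadcast parameter sets the directed broadcast address such as 192.168.0.255.)
If the NIC parameter differs, the broadcast address that applies differs too.
From the sender's point of view:
the packet sent by sendto carries the broadcast address as its destination.
From the receiver's point of view:
recvfrom only receives broadcast packets whose destination matches the NIC's broadcast parameter.

So the ifconfig setting above is clearly meant to keep the server from receiving layer 3 broadcast traffic (for example, packets whose destination is 192.168.0.255).

From the netmask point of view, one can think of it this way:
ifconfig lo:0 192.168.0.10 broadcast 192.168.0.10 netmask 255.255.255.255 can be reduced to ifconfig lo:0 192.168.0.10 netmask 255.255.255.255
These and
ifconfig lo:0 192.168.0.10 netmask 255.255.255.0
belong to different subnets, so traffic for 192.168.0.0/24 is not received by the network 192.168.0.10/32.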
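The netmask reasoning can be checked with Python 3's standard ipaddress module. This is only a sketch of the arithmetic, not of what the kernel does; the helper names broadcast_of and covers are made up for illustration.

```python
import ipaddress


def broadcast_of(address, netmask):
    """Broadcast address implied by an address/netmask pair."""
    iface = ipaddress.ip_interface("%s/%s" % (address, netmask))
    return str(iface.network.broadcast_address)


def covers(address, netmask, destination):
    """True if destination falls inside the subnet of address/netmask."""
    iface = ipaddress.ip_interface("%s/%s" % (address, netmask))
    return ipaddress.ip_address(destination) in iface.network


# With a /32 mask the "broadcast" address is the host address itself.
print(broadcast_of("192.168.0.10", "255.255.255.255"))
# With a /24 mask it is the usual x.x.x.255 address.
print(broadcast_of("192.168.0.10", "255.255.255.0"))
# The /24 broadcast address is outside the /32 network.
print(covers("192.168.0.10", "255.255.255.255", "192.168.0.255"))
```

With netmask 255.255.255.255 the subnet contains exactly one address, which is why giving broadcast the same value as the IP itself is consistent.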

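A broadcast address that covers only a single host like this amounts to unicast.
There is also layer 2 broadcast. The typical example is the ARP protocol, which uses the Ethernet broadcast address FF:FF:FF:FF:FF:FF as the destination.

On Linux the standard arp command does not send ARP request packets itself, because the ARP function lives in the kernel (see man 7 arp): when layer 3 data is passed down to layer 2, the kernel sends an ARP request automatically (if one is needed).
It is still possible to send such a request by hand, for example with this code: http://svn.pythonfr.org/public/pythonfr/utils/network/arp-flood.py

When you ping an IP and the system's ARP table has no entry for it, the kernel sends an ARP request. So for testing, clear the ARP entries and then ping between the two machines.
The code is as follows: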
import socket
soc = socket.socket(socket.PF_PACKET, socket.SOCK_RAW) #create the raw-socket
soc.bind(("wlan0",0x0806)) # ether type for ARP
data = soc.recv(1024)

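This program is the receiving side. Once started it blocks until it receives an ARP packet. An ARP request asks every machine on the subnet for the MAC address that belongs to an IP, so it is "one to many", and that is where a broadcast address is needed. The layer 2 broadcast address is FF:FF:FF:FF:FF:FF.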
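To show where the layer 2 broadcast address sits in such a packet, here is a minimal Python 3 sketch that only builds the bytes of an ARP request frame and does not send anything. The function name build_arp_request and the example MAC and IP values are assumptions for illustration. The field layout follows the standard Ethernet and ARP headers.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6   # FF:FF:FF:FF:FF:FF
ETHERTYPE_ARP = 0x0806        # same value as in the bind() call above


def build_arp_request(src_mac, src_ip, target_ip):
    """Build an Ethernet frame carrying an ARP "who has target_ip" request."""
    # Ethernet header: destination MAC, source MAC, ether type.
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, src_mac, ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # opcode: request
        src_mac,
        socket.inet_aton(src_ip),
        b"\x00" * 6,                  # target MAC is unknown, that is the question
        socket.inet_aton(target_ip),
    )
    return eth_header + arp_payload


# Hypothetical sender MAC and addresses, for illustration only.
frame = build_arp_request(bytes.fromhex("021122334455"),
                          "192.168.0.10", "192.168.0.1")
print(len(frame), frame[:6].hex())
```

The first six bytes of the frame are the destination, so every NIC on the segment accepts it regardless of how its layer 3 broadcast parameter is configured.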

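Unlike layer 3, the layer 2 broadcast address is not configured on the NIC, yet LVS-DR mode wants the real server neither to answer nor to send ARP requests. That is how the LVS-DR ARP problem arises, and it is also the rationale behind the route add command at the beginning of this post. I will not write out the details here. Reading the article below carefully is enough; it covers several Linux kernel versions, such as 2.0.x, 2.2.x and 2.6.x.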
http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.arp_problem.html

Monday, October 4, 2010

route table

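When inspecting or configuring the system's route table, I have always preferred the route command. However, I felt that the output of this command did not match some of the material that explains the system's networking, for example:

At the Routing Decision step, it does not line up with the output of the route command: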
jessinio@jessinio-laptop:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.0.0     0.0.0.0         255.255.255.0   U     2      0        0 wlan0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlan0
0.0.0.0         192.168.0.1     0.0.0.0         UG    0      0        0 wlan0

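The result is clear: route only shows the routing of packets that the system sends out. It says nothing about the routing of packets entering the system.

This feeling persisted for quite a while. Today I came across this passage: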

Linux has a different approach for routing than other UNIX. The way things are implemented on Linux is more flexible and powerful than traditional ways. Legacy utilities such as ifconfig and route are still valid, but incomplete. This is because they do not give access to the advanced routing layer present on Linux. The utility ip (part of iproute2) is the current tool for networking related stuff under Linux. This tool will be the focus of this section.

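The passage points out that what the route command returns is actually incomplete data. On Linux, the most native tool is the ip command.
Googling turned up a very old document: http://linux-ip.net/html/routing-tables.html which explains this clearly, as in this short excerpt: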

The routing table manipulated by the conventional route command is the main routing table. Additionally, the use of both ip address and ifconfig will cause the kernel to alter the local routing table (and usually the main routing table). For further documentation on how to manipulate the other routing tables, see the command description of ip route.

What the route command reads and sets is only the tip of the iceberg.
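For reference, the decision that the main table alone encodes can be sketched as a longest-prefix match over the rows of the route -n output shown earlier in this post. This is a simplified Python 3 model and not how the kernel implements it: the local table and policy rules that ip exposes are not modelled, and the names MAIN_TABLE and lookup are made up for illustration.

```python
import ipaddress

# (Destination, Genmask, Gateway, Iface) rows from the route -n output.
MAIN_TABLE = [
    ("192.168.0.0", "255.255.255.0", "0.0.0.0", "wlan0"),
    ("169.254.0.0", "255.255.0.0", "0.0.0.0", "wlan0"),
    ("0.0.0.0", "0.0.0.0", "192.168.0.1", "wlan0"),
]


def prefix_length(genmask):
    """Number of leading one bits in a dotted-quad netmask."""
    return bin(int(ipaddress.ip_address(genmask))).count("1")


def lookup(destination, table=MAIN_TABLE):
    """Return (gateway, iface) of the most specific matching route, or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for network, genmask, gateway, iface in table:
        plen = prefix_length(genmask)
        net = ipaddress.ip_network("%s/%d" % (network, plen))
        if addr in net and (best is None or plen > best[0]):
            best = (plen, gateway, iface)
    return None if best is None else (best[1], best[2])


print(lookup("192.168.0.5"))   # on-link, gateway 0.0.0.0
print(lookup("8.8.8.8"))       # falls through to the default route
```

A gateway of 0.0.0.0 means the destination is reached directly on the interface. Packets addressed to the machine itself never show up in this table, which is the part that the local table covers.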

iproute2 tool suite manual: http://www.policyrouting.org/iproute2.doc.html

Saturday, September 11, 2010

block size

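Block size shows up in many places. The concept shares one name but means different things, which is quite confusing.
This fellow is one of the confused: http://www.linuxforums.org/forum/misc/5654-linux-disk-block-size-help-please.html

The URL above lists the following kinds of block: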
1. Hardware block size, "sector size"
2. Filesystem block size, "block size"
3. Kernel buffer cache block size, "block size"
4. Partition table block size, "cylinder size"

I was puzzled by the Blocks column that fdisk prints, so it needed some digging.
First, look at the blocks printed by fdisk:

jessinio@jessinio-laptop:/ $ sudo fdisk -l
Disk /dev/sda: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00038329

   Device Boot      Start         End      Blocks   Id System
/dev/sda1   *           1         100      803218+ 83 Linux
/dev/sda2             101       30401   243392782+ 8e Linux LVM

The output above is identical to the following:
jessinio@jessinio-laptop:/media/82d236f2-3592-4040-801c-3c2049ddfb95$ sudo fdisk -b 512 -l
Warning: the -b (set sector size) option should be used with one specified device
Disk /dev/sda: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00038329

   Device Boot      Start         End      Blocks   Id System
/dev/sda1   *           1         100      803218+ 83 Linux
/dev/sda2             101       30401   243392782+ 8e Linux LVM

But the following is rather odd:
jessinio@jessinio-laptop:/media/82d236f2-3592-4040-801c-3c2049ddfb95$ sudo fdisk -b 1024 -l
Warning: the -b (set sector size) option should be used with one specified device
Disk /dev/sda: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 15200 cylinders
Units = cylinders of 16065 * 1024 = 16450560 bytes
Disk identifier: 0x00038329

   Device Boot      Start         End      Blocks   Id System
/dev/sda1   *           1         100     1606437   83 Linux
/dev/sda2             101       30401   486785565   8e Linux LVM

Specifying a larger hardware sector size makes the Blocks value go up. Why? Below is the relevant fdisk code.
Where the sector_size variable comes from:
static void
get_sectorsize(int fd) {
#if defined(BLKSSZGET)
    if (!user_set_sector_size &&
        linux_version_code() >= MAKE_VERSION(2,3,3)) {
        int arg;
        if (ioctl(fd, BLKSSZGET, &arg) == 0)
            sector_size = arg;
        if (sector_size != DEFAULT_SECTOR_SIZE)
            printf(_("Note: sector size is %d (not %d)\n"),
                   sector_size, DEFAULT_SECTOR_SIZE);
    }
#else
    /* maybe the user specified it; and otherwise we still
       have the DEFAULT_SECTOR_SIZE default */
#endif
}
DEFAULT_SECTOR_SIZE is defined in fdisk.h as
#define DEFAULT_SECTOR_SIZE     512

Alternatively it is the value given by the user (user_set_sector_size): sector_size = atoi(optarg);

The code used for printing is:
            unsigned int psects = get_nr_sects(p);
            unsigned int pblocks = psects;
            unsigned int podd = 0;

            if (sector_size < 1024) {
                pblocks /= (1024 / sector_size);
                podd = psects % (1024 / sector_size);
            }
            if (sector_size > 1024)
                pblocks *= (sector_size / 1024);
            printf(
                "%s %c %11lu %11lu %11lu%c %2x %s\n",
            partname(disk_device, i+1, w+2),
/* boot flag */     !p->boot_ind ? ' ' : p->boot_ind == ACTIVE_FLAG
            ? '*' : '?',
/* start */     (unsigned long) cround(get_partition_start(pe)),
/* end */       (unsigned long) cround(get_partition_start(pe) + psects
                - (psects ? 1 : 0)),
/* odd flag on end */   (unsigned long) pblocks, podd ? '+' : ' ',
/* type id */       p->sys_ind,
/* type name */     (type = partition_type(p->sys_ind)) ?
            type : _("Unknown"));
            check_consistency(p, i);

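1. When sector_size is exactly 1024, the printed value is simply the number of sectors. If the sectors really are 1024 bytes, that is also the partition size in KB.
2. When sector_size is not 1024, the sector count and sector_size are combined and converted into a size in KB.

So the Blocks column printed by fdisk is really the size of the partition. Let's test this: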

$ sudo mount /dev/sda1 /media/disk/
$ df /media/disk/
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1               790556     48176    702220   7% /media/disk
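790556 is the total size of the filesystem.
The 1606437 obtained from the fdisk -b 1024 -l command above is the number of sectors. The calculation:

1606437 * 512 / 1024 = 803218.5, which fdisk prints as 803218+, and this is roughly equal to 790556. The partition is larger than the filesystem because the filesystem needs space to store its own metadata.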
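The Blocks arithmetic from the fdisk printing code can be replayed in a few lines of Python 3. This is a direct transcription of the pblocks and podd logic quoted above, not fdisk itself; the function name fdisk_blocks is made up for illustration.

```python
def fdisk_blocks(psects, sector_size):
    """Reproduce the Blocks column: 1K blocks, '+' if a half block is left over."""
    pblocks = psects
    podd = 0
    if sector_size < 1024:
        pblocks //= (1024 // sector_size)
        podd = psects % (1024 // sector_size)
    if sector_size > 1024:
        pblocks *= (sector_size // 1024)
    return "%d%s" % (pblocks, "+" if podd else "")


# /dev/sda1 has 1606437 sectors according to the partition table.
print(fdisk_blocks(1606437, 512))    # default sector size
print(fdisk_blocks(1606437, 1024))   # what -b 1024 makes fdisk assume
```

With sector_size equal to 1024 neither branch runs, so the raw sector count is printed unchanged. That is why the -b 1024 listing shows a larger number.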

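Summary

1. To find out how many sectors a partition occupies, fdisk -b 1024 can be used
2. The Blocks column printed by fdisk without the -b option actually gives the partition size (in KB)
3. I have not yet managed to track down the actual code behind the item Kernel buffer cache block size, "block size"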

Wednesday, September 8, 2010

pkg-config

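The fuse version on CentOS felt too old, so I installed it from source instead.

After a simple build:
./configure --prefix=/usr/local/fuse

fuse-python-binding could not be installed. The problem is that the setup.py of fuse-python-binding uses pkg-config to obtain the build flags.

pkg-config --list-all |grep fuse
finds nothing, because pkg-config cannot locate the *.pc metainformation file it needs.

The path can be added by hand: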

$ PKG_CONFIG_PATH=/usr/local/fuse/lib/pkgconfig/ pkg-config --list-all |grep fuse
fuse fuse - Filesystem in Userspace

So the following installs it:
$ sudo PKG_CONFIG_PATH=/usr/local/fuse/lib/pkgconfig/ python setup.py install
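The same environment trick can be applied from Python when a build script shells out to pkg-config. The following Python 3 sketch shows one way this might look; it is not the actual code in fuse-python-binding's setup.py, and the helper names env_with_pkg_config_path and pkg_config_flags are made up for illustration.

```python
import os
import subprocess


def env_with_pkg_config_path(extra_dir, base_env=None):
    """Copy an environment and put extra_dir first in PKG_CONFIG_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PKG_CONFIG_PATH")
    if existing:
        env["PKG_CONFIG_PATH"] = extra_dir + os.pathsep + existing
    else:
        env["PKG_CONFIG_PATH"] = extra_dir
    return env


def pkg_config_flags(package, extra_dir):
    """Ask pkg-config for compile and link flags (requires pkg-config installed)."""
    output = subprocess.check_output(
        ["pkg-config", "--cflags", "--libs", package],
        env=env_with_pkg_config_path(extra_dir),
    )
    return output.decode().split()
```

Under these assumptions, pkg_config_flags("fuse", "/usr/local/fuse/lib/pkgconfig") would return the flags that a build needs for a fuse installed under that prefix.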

[jessinio@niowork site-packages]$ /usr/local/python2.6/bin/python -c "import fuse"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "fuse.py", line 27, in <module>
from fuseparts._fuse import main, FuseGetContext, FuseInvalidate
ImportError: libfuse.so.2: cannot open shared object file: No such file or directory

The library path needs to be added:
$ sudo sh -c "echo /usr/local/fuse/lib >> /etc/ld.so.conf.d/fuse.conf "
$ sudo ldconfig

Saturday, September 4, 2010

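The prefork server model

Web servers generally handle dynamic requests in one of two modes:
1. prefork
2. thread

One uses multiple processes, the other multiple threads. I was not actually clear about how they are implemented.
Recently the company's Python web service has been using a lot of memory. To get to the bottom of it I needed to study the flup code (Django needs this library). (The code is really nicely written.)

Many programs offer a prefork mode. The prefork model looks like this: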

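Each nginx child process calls accept on its own to pick up the requests that users send to port 80.
Every child can call accept and receive requests from the same socket because the file descriptors in the forked children all refer to the same underlying object.

In this way multiple processes compete for the requests arriving on the socket.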
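The shared-descriptor idea can be sketched in a few lines of Python 3. This is a toy model, not how nginx or flup is written: the names make_listener and prefork are made up for illustration, each child answers with its own PID, and children exit after a fixed number of requests so the example terminates.

```python
import os
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create the listening socket once, in the parent, before forking."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))   # port 0 lets the OS pick a free port
    sock.listen(16)
    return sock


def prefork(sock, workers, requests_per_worker):
    """Fork workers that all call accept() on the same inherited socket."""
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: compete with the siblings for incoming connections.
            try:
                for _ in range(requests_per_worker):
                    conn, _addr = sock.accept()
                    with conn:
                        conn.sendall(str(os.getpid()).encode())
            finally:
                os._exit(0)   # never fall back into the parent's code
        pids.append(pid)
    return pids
```

A parent would call make_listener, pass the socket to prefork, and later reap the children with os.waitpid. Which child answers a given connection is decided by the kernel, which is the "competition" described above.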

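Each child can in turn use epoll, threads and so on to handle many requests from port 80 concurrently. A FastCGI server works in a similar way.
nginx and the FastCGI server communicate over a socket, using the FastCGI protocol.

Each flup worker serves only one page request at any given moment. Once the request is finished it can accept another.

flup uses the prefork model, so the relevant code is in the PreforkServer class.

The body of the parent process is a loop: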

        # Main loop.
        while self._keepGoing:
            # Maintain minimum number of children.
            while len(self._children) < self._maxSpare:
                if not self._spawnChild(sock): break

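As can be seen, the parent always wants at least a certain number of children to exist. That number is maxSpare.
At first glance this looks a little wild, but the parent does have a policy for reclaiming children, as follows:

            # See who and how many children are available.
            availList = filter(lambda x: x[1]['avail'],
                               self._children.items())
            avail = len(availList)

From the code above, the parent keeps collecting the available children. "Available" means a child that is not currently working, as can be seen from the child's code:

            # Notify parent we're no longer available.
            self._notifyParent(parent, '\x00')

            # Do the job.
            self._jobClass(clientSock, addr, *self._jobArgs).run()

In the code above, the child notifies the parent that it is no longer available before it calls jobClass.run.

The parent tracks the number of "available children" in order to gauge the load and decide whether more children need to be spawned, as follows:

            if avail < self._minSpare:
                # Need to spawn more children.
                while avail < self._minSpare and \
                      len(self._children) < self._maxChildren:
                    if not self._spawnChild(sock): break
                    avail += 1

The code above uses two values: minSpare and maxChildren.

minSpare is the minimum number of "available children"
maxChildren is the maximum number of children, counting both idle and working ones

If the number of "available children" exceeds maxSpare, the surplus is reclaimed, as follows: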

            elif avail > self._maxSpare:
                # Too many spares, kill off the extras.
                pids = [x[0] for x in availList]
                pids.sort()
                pids = pids[self._maxSpare:]
                for pid in pids:
                    d = self._children[pid]
                    d['file'].close()
                    d['file'] = None
                    d['avail'] = False

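From the code above, the three parameters used when the FastCGI server starts should work as follows:
minspare: the minimum number of idle processes
maxspare: the regular number of processes, i.e. once the FastCGI server has started, the number of FastCGI processes is greater than or equal to this value
maxChildren: the maximum number of processes. This value mainly serves to keep memory from being exhausted.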
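The interplay of the three parameters can be condensed into one pure function. This Python 3 sketch paraphrases the flup excerpts quoted above for a single pass of the parent loop; it is not flup's code, and the function name pool_adjustment and its return convention are made up for illustration.

```python
def pool_adjustment(total, avail, min_spare, max_spare, max_children):
    """Return (spawn, kill) for one pass of the parent loop.

    total is the current number of children, avail the idle ones.
    """
    spawn = 0
    kill = 0
    # Top up: the parent keeps at least max_spare children alive.
    if total < max_spare:
        spawn = max_spare - total
        total += spawn
        avail += spawn          # fresh children start out idle
    if avail < min_spare:
        # Under load: add idle children, but never exceed max_children.
        room = max(0, max_children - total)
        spawn += min(min_spare - avail, room)
    elif avail > max_spare:
        # Too many idle children: the surplus is reclaimed.
        kill = avail - max_spare
    return spawn, kill


print(pool_adjustment(0, 0, 1, 5, 50))    # startup
print(pool_adjustment(10, 0, 2, 5, 50))   # all busy
print(pool_adjustment(10, 8, 2, 5, 50))   # mostly idle
```

The last case in the test below shows the role of maxChildren: when every child is busy and the limit is reached, nothing more is spawned, which is what protects memory.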