磨刀不误砍柴工: 2010

Tuesday, December 14, 2010

ip route

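Notes
What does "route" mean?
I never really knew the literal meaning of the Chinese term 路由. It is not a native Chinese word; the only related native term is 路由单, a noun. Route has two meanings:
1. To choose a path
2. The same sense as 路由单: the list of places a journey passes through

The path is chosen based on the destination.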

In the common case, route selection is based completely on the destination address. Conventional (as opposed to policy-based) IP networking relies on only the destination address to select a route for a packet.

But as networks evolved, routing by destination alone could no longer meet every need:

With the prevalence of low cost bandwidth, easily configured VPN tunnels, and increasing reliance on networks, the technique of selecting a route based solely on the destination IP address range no longer suffices for all situations.

How Linux put this development into practice:

Since kernel 2.2, linux has supported policy based routing through the use of multiple routing tables and the routing policy database (RPDB). Together, they allow a network administrator to configure a machine select different routing tables and routes based on a number of criteria.

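Roughly, this means two things:
1. Linux supports multiple routing tables, selected through the routing policy database (RPDB)
2. Each table has its own rules: policy based routing

The routes I normally deal with use the destination as the only criterion (for example, what the route command prints). So why does policy based routing matter?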

In fact, advanced routing could more accurately be called policy-based networking.

The following passage describes the selectors that policy based routing can use when Linux routes a packet:

Selectors available for use in policy-based routing are attributes of a packet passing through the linux routing code. The source address of a packet, the ToS flags, an fwmark (a mark carried through the kernel in the data structure representing the packet), and the interface name on which the packet was received are attributes which can be used as selectors. By selecting a routing table based on packet attributes, an administrator can have granular control over the network path of any packet.

The selector determines which routing table is used.

A description in words of how Linux selects a route is not easy to follow; the pseudocode below is clearer:
if packet.routeCacheLookupKey in routeCache :
   route = routeCache[ packet.routeCacheLookupKey ]
else :
   for rule in rpdb :
       if packet.rpdbLookupKey in rule : (rule is an RPDB object from the table below)
           routeTable = rule[ lookupTable ] (routeTable is a route table object from the table below)
           if packet.routeLookupKey in routeTable :
                route = routeTable[ packet.routeLookupKey ]

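Think of the RPDB as a table of rules: all the rules live in the database, and each rule has its own attributes (including the packet attributes mentioned above).

Each LookupKey in the pseudocode stands for one specific attribute in the table below, so the pseudocode above really expands into a great many if statements.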
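To make the pseudocode above concrete, here is a minimal runnable sketch in Python. It is only a toy model under my own assumptions: the rule list, the table names and contents, and the helper names (rule_matches, lookup_table, select_route) are all hypothetical, the route cache step is left out, and only the source address and fwmark selectors are modelled.

```python
import ipaddress

# Hypothetical, simplified RPDB: (priority, selector, table name).
RULES = [
    (0, {}, "local"),
    (100, {"src": "10.0.1.0/24"}, "vpn"),
    (32766, {}, "main"),
]

# Hypothetical routing tables: destination prefix -> route description.
TABLES = {
    "local": {"10.0.0.5/32": "local delivery"},
    "vpn": {"0.0.0.0/0": "via 10.8.0.1 dev tun0"},
    "main": {"10.0.0.0/16": "dev eth0", "0.0.0.0/0": "via 10.0.0.1 dev eth0"},
}


def rule_matches(selector, packet):
    """A rule matches when every selector it defines agrees with the packet."""
    if "src" in selector:
        src = ipaddress.ip_address(packet["src"])
        if src not in ipaddress.ip_network(selector["src"]):
            return False
    if "fwmark" in selector and selector["fwmark"] != packet.get("fwmark"):
        return False
    return True


def lookup_table(table, dst):
    """Longest-prefix match of dst inside one routing table."""
    best = None
    for prefix, route in table.items():
        net = ipaddress.ip_network(prefix)
        if ipaddress.ip_address(dst) in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, route)
    return best[1] if best else None


def select_route(packet):
    """Walk the rules by priority; the first table that has a route wins."""
    for _prio, selector, table_name in sorted(RULES, key=lambda r: r[0]):
        if rule_matches(selector, packet):
            route = lookup_table(TABLES[table_name], packet["dst"])
            if route is not None:
                return table_name, route
    return None


print(select_route({"src": "10.0.1.7", "dst": "8.8.8.8"}))
print(select_route({"src": "10.0.2.7", "dst": "8.8.8.8"}))
```

The two packets have the same destination but end up in different tables because of their source address, which is the whole point of policy based routing.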

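* Attributes in italics are optional. They are checked if present and skipped if absent.

From the above, a route table serves to:
1. Organize rules
2. Give rules of the same kind a common set of attributes.

From the table above, every packet's destination and source are always used for routing, but they are not the only deciding criteria.

How a Linux system administrator can view the three kinds of data above:
1. The route cache: ip route show cache
2. The RPDB rules: ip rule show
3. The contents of one routing table: ip route show table TABLE_NAME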

Sunday, December 12, 2010

ethernet

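Start

I rarely pay attention to the Ethernet layer. The last time was when I was trying to understand LVS. Lately I have been reading <<Guide to IP Layer Network Administration with Linux>>, so I am taking notes and trying things by hand to make them stick.

The machine I am working on has only the gateway's hardware address in its ARP table: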
$ arp -n
Address                  HWtype HWaddress           Flags Mask            Iface
10.20.129.1              ether   00:0F:E2:D3:BE:B8   C                     eth0

Then run:
$ ping 10.20.129.32

and capture the packets the ping produces:

$ sudo tcpdump -ent -i eth0 arp or icmp

....(truncated).....
00:23:ae:93:d9:26 > Broadcast, ethertype ARP (0x0806), length 42: arp who-has 10.20.129.32 tell 10.20.129.19
00:1e:4f:ad:41:58 > 00:23:ae:93:d9:26, ethertype ARP (0x0806), length 60: arp reply 10.20.129.32 is-at 00:1e:4f:ad:41:58
00:23:ae:93:d9:26 > 00:1e:4f:ad:41:58, ethertype IPv4 (0x0800), length 98: 10.20.129.19 > 10.20.129.32: ICMP echo request, id 26119, seq 1, length 64
00:1e:4f:ad:41:58 > 00:23:ae:93:d9:26, ethertype IPv4 (0x0800), length 98: 10.20.129.32 > 10.20.129.19: ICMP echo reply, id 26119, seq 1, length 64
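....(truncated).....

ICMP packets sit on top of the Ethernet layer. Sending data over Ethernet requires a hardware address, and ARP is the protocol used to obtain it.

The ARP exchange is the same as what this command does:
$ sudo arping -I eth0 10.20.129.32
It asks the local segment for the MAC address that corresponds to an IP.
Check the ARP table: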
$ arp -n
Address                  HWtype HWaddress           Flags Mask            Iface
10.20.129.1              ether   00:0F:E2:D3:BE:B8   C                     eth0
10.20.129.32             ether   00:1E:4F:AD:41:58   C                     eth0
A new entry has been added.
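To see why tcpdump reports "length 42" for the who-has request, here is a small sketch that builds such a frame by hand in Python. The helper names (mac_to_bytes, build_arp_request) are my own, and the code only constructs the bytes; it does not send anything, so no root privileges are needed.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Turn 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_arp_request(src_mac, src_ip, target_ip):
    """Build an Ethernet frame carrying an ARP who-has request."""
    # Ethernet header: broadcast destination, our MAC, ethertype ARP (0x0806)
    eth_header = b"\xff" * 6 + mac_to_bytes(src_mac) + struct.pack("!H", 0x0806)
    arp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # opcode 1: request
        mac_to_bytes(src_mac),        # sender hardware address
        socket.inet_aton(src_ip),     # sender protocol address
        b"\x00" * 6,                  # target hardware address: unknown
        socket.inet_aton(target_ip),  # target protocol address
    )
    return eth_header + arp_body


frame = build_arp_request("00:23:ae:93:d9:26", "10.20.129.19", "10.20.129.32")
print(len(frame))  # 14 bytes of Ethernet header + 28 bytes of ARP
```

The reply in the capture is 60 bytes only because the other side pads the frame to the Ethernet minimum; the ARP payload has the same 28-byte layout.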

arping -A option: ARP announcement, also known as gratuitous ARP

$ sudo arping -A -c 3 -I eth0 10.20.129.19
tcpdump capture:
00:23:ae:93:d9:26 > Broadcast, ethertype ARP (0x0806), length 42: arp reply 10.20.129.19 is-at 00:23:ae:93:d9:26
00:23:ae:93:d9:26 > Broadcast, ethertype ARP (0x0806), length 42: arp reply 10.20.129.19 is-at 00:23:ae:93:d9:26
00:23:ae:93:d9:26 > Broadcast, ethertype ARP (0x0806), length 42: arp reply 10.20.129.19 is-at 00:23:ae:93:d9:26

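As the capture shows, -A announces the machine's own IP to the whole segment. By default, Linux does not create new ARP table entries from such packets.
This is controlled by the arp_accept option, documented as follows: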

arp_accept - BOOLEAN
    Define behavior for gratuitous ARP frames who's IP is not
    already present in the ARP table:
    0 - don't create new entries in the ARP table
    1 - create new entries in the ARP table

If you want to know what gratuitous ARP packets are actually used for, see: http://wiki.wireshark.org/Gratuitous_ARP

arping -D option: Duplicate address detection mode (DAD)

This option is very useful: it detects IP conflicts on the segment. A real example:

root@jessinio-laptop:~# ifconfig wlan0 |head -n 2
wlan0 Link encap:Ethernet HWaddr 00:16:cf:68:5b:a7
inet addr:192.168.0.106 Bcast:192.168.0.255 Mask:255.255.255.0

root@jessinio-laptop:~# arping -D -I wlan0 192.168.0.106
ARPING 192.168.0.106 from 0.0.0.0 wlan0
Unicast reply from 192.168.0.106 [00:18:41:FE:26:5F] 90.390ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

As shown, 192.168.0.106 is used by two machines: this one, 00:16:cf:68:5b:a7, and another one, 00:18:41:FE:26:5F.

Capture:
00:16:cf:68:5b:a7 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.0.106 (ff:ff:ff:ff:ff:ff) tell 0.0.0.0, length 28
00:18:41:fe:26:5f > 00:16:cf:68:5b:a7, ethertype ARP (0x0806), length 42: Reply 192.168.0.106 is-at 00:18:41:fe:26:5f, length 28

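Conclusion

I will end with a question: can ICMP tell you whether another machine on the segment is using your IP? For example, by pinging your own IP.

The answer is no, because the ICMP packet is essentially never sent out; it loops back. For example:

$ ping 10.20.129.19
The packets it produces never pass through the Ethernet card, as the routing table shows: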

$ ip route list table local
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 10.20.129.0 dev eth0 proto kernel scope link src 10.20.129.19
local 10.20.129.19 dev eth0 proto kernel scope host src 10.20.129.19
broadcast 10.20.129.127 dev eth0 proto kernel scope link src 10.20.129.19
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1

Friday, November 26, 2010

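iptables and traffic accounting

I needed to temporarily measure internal and external traffic separately on one machine in the data center. This is best not done at the application layer, because that is quite inefficient. My first thought was tools that work at network layers 2 and 3, which barely affect performance, such as ntop and other tools built on libpcap.
In fact iptables keeps packet counters too, since every packet passes through it, and nothing extra has to be installed.

Add two rules: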
jessinio@jessinio-laptop:~$ sudo iptables -t filter -A INPUT -p all -s 174.121.79.132 -j ACCEPT
jessinio@jessinio-laptop:~$ sudo iptables -t filter -A OUTPUT -p all -d 174.121.79.132 -j ACCEPT

The rules in place:
jessinio@jessinio-laptop:~$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all -- web124.webfaction.com anywhere

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all -- anywhere             web124.webfaction.com

Result:
jessinio@jessinio-laptop:~$ sudo iptables -L -n -v
Chain INPUT (policy ACCEPT 11M packets, 5033M bytes)
pkts bytes target     prot opt in     out     source               destination
   10 2088 ACCEPT     all -- *      *       174.121.79.132       0.0.0.0/0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 14M packets, 14G bytes)
pkts bytes target     prot opt in     out     source               destination
   48 25152 ACCEPT     all -- *      *       0.0.0.0/0            174.121.79.132
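To collect these counters periodically, the listing can be parsed with a few lines of Python. This is only a sketch: parse_counters and SAMPLE are names I made up, the sample text is copied from the output above, and rows whose counters are abbreviated (such as 11M) are simply skipped, so in practice one would probably add the -x option to get exact numbers.

```python
SAMPLE = """\
Chain INPUT (policy ACCEPT 11M packets, 5033M bytes)
pkts bytes target     prot opt in     out     source               destination
   10 2088 ACCEPT     all -- *      *       174.121.79.132       0.0.0.0/0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 14M packets, 14G bytes)
pkts bytes target     prot opt in     out     source               destination
   48 25152 ACCEPT     all -- *      *       0.0.0.0/0            174.121.79.132
"""


def parse_counters(listing):
    """Return {chain: [(pkts, bytes, source, destination), ...]}."""
    counters = {}
    chain = None
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Chain":
            # "Chain INPUT (policy ...)" starts a new chain
            chain = fields[1]
            counters[chain] = []
        elif chain is not None and fields[0].isdigit() and fields[1].isdigit():
            # A rule row: the first two columns are the packet and byte counters
            counters[chain].append(
                (int(fields[0]), int(fields[1]), fields[-2], fields[-1])
            )
    return counters


for chain_name, rows in parse_counters(SAMPLE).items():
    print(chain_name, rows)
```

Taking two snapshots and subtracting the byte counters gives the traffic to and from 174.121.79.132 over that interval.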

Thursday, November 18, 2010

seteuid

# ps axjf|grep -v grep|grep ftp
    1 13871 13871 13871 ?           -1 Ss       0   0:00 /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
13871 14146 14146 14146 ?           -1 Ss      99   0:00 \_ /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
14146 14148 14146 14146 ?           -1 S      509   0:00      \_ /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf

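A process running as uid 99 can spawn a process running as uid 509. It turns out that after fork the child can still call seteuid to switch to another uid, which I had not known before. In the test below this works because seteuid changes only the effective uid: the real and saved uids remain 0 and are inherited across fork. Test code:

#!/usr/bin/python
#coding:utf-8

import os
import time

# Within one process, the effective uid can go from 0 to another uid and back
os.seteuid(99)
os.seteuid(0)

os.seteuid(99)
pid = os.fork()
# child
if pid == 0:
    # The child can still use seteuid to get back to 0
    os.seteuid(0)
    time.sleep(10)
else:
    print(pid)
    time.sleep(10)

Output: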

[jessinio@niowork tmp]$ ps axuf|grep root.py
nobody   28559 0.0 0.0 74192 2912 pts/11   T    13:56   0:00 |   \_ python root.py
root     28560 0.0 0.0 74188 1756 pts/11   T    13:56   0:00 |   |        \_ python root.py
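A quick way to see what seteuid is allowed to do is to print the three uids of a process. The sketch below is written for Python 3 on Linux, where os.getresuid is available; describe_ids and can_switch_to are hypothetical helper names, and treating "effective uid 0" as "privileged" is a simplification of the real capability check.

```python
import os


def describe_ids():
    """Return the real, effective and saved uid of the current process."""
    ruid, euid, suid = os.getresuid()
    return {"real": ruid, "effective": euid, "saved": suid}


def can_switch_to(target_uid):
    """Rough check of whether seteuid(target_uid) would be permitted.

    An unprivileged process may set its effective uid only to its
    real, effective or saved uid; euid 0 is treated as privileged here.
    """
    ruid, euid, suid = os.getresuid()
    return euid == 0 or target_uid in (ruid, euid, suid)


print(describe_ids())
print(can_switch_to(0))
```

Run as root after os.seteuid(99), describe_ids() would show effective 99 but real and saved 0, which is why the switch back to 0 succeeds in the child as well.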

Thursday, November 4, 2010

Using strace to find the root cause

Today I wanted to back up an svn repository, but ran into something spooky:
$ sudo -u daemon HOME=/tmp /usr/local/subversion/bin/svnsync sync file:///data/repos/xxoo
svnsync: Revprop change blocked by pre-revprop-change hook (exit code 255) with no output.

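It looks like a problem with svn's pre-revprop-change hook. But no matter how I fiddled with the pre-revprop-change hook code, nothing helped.

I then suspected environment variables; even sudo's -E made no difference at all.

With no other option left, I turned to strace: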
$ sudo -u daemon HOME=/tmp strace -f /usr/local/subversion/bin/svnsync sync file:///data/repos/xxoo 2>&1 |less

-f is essential: svnsync spawns child processes, and with -f the children are traced too.

One trace entry looked very suspicious:
[pid 6548] chdir(".") = -1 EACCES (Permission denied)

Ah..... a detail. The working directory was ~, and svnsync running as daemon had no permission to access it.....

This way it works:
$ cd / && sudo -u daemon HOME=/tmp /usr/local/subversion/bin/svnsync sync file:///data/repos/xxoo

It really is all in the details.
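If the backup is driven from a script, the same fix can be applied by setting the working directory explicitly instead of inheriting whatever directory the caller happens to be in. A minimal Python sketch, where run_from_root is a hypothetical helper name and the child command is just a stand-in that prints its working directory:

```python
import subprocess
import sys


def run_from_root(cmd):
    """Run cmd with / as its working directory and return (exit code, stdout)."""
    proc = subprocess.Popen(
        cmd,
        cwd="/",                      # a directory every user can enter
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    out, _ = proc.communicate()
    return proc.returncode, out


code, out = run_from_root([sys.executable, "-c", "import os; print(os.getcwd())"])
print(code, out.strip())
```

The real svnsync command line would take the place of the stand-in command.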

Thursday, October 14, 2010

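devfs and udev

Is /dev a disk-based or a kernel-based file system?
I had never sorted this question out.
From my own experience and the book <<Linux操作系统之奥秘>>, /dev is clearly disk-based. I have never used devfs.

devfs was used in the Linux 2.4 kernel era. The Linux 2.6 kernel has since dropped the devfs code.

Even finding devfs documentation is not easy; the articles on its author's old blog can no longer be found.

Google turned up:
* http://www.linuxjournal.com/article/6035
* http://www.ibm.com/developerworks/linux/library/l-devfs.html
Both look old. To confirm the era, I checked the kernel timeline, which confirms they are products of the same period:
* http://en.wikipedia.org/wiki/Linux_kernel#Timeline

I have never used devfs. To find out the truth, I borrowed a Red Hat 8 Linux environment from a friend, only to find that devfs was not compiled into the kernel by default: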
# cat /boot/config-2.4.18-14 |grep DEVFS
# CONFIG_DEVFS_FS is not set

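But /dev on Red Hat 8 really does have 18 thousand entries (rather astonishing).
Is /dev a disk-based or a kernel-based file system? Without an environment I cannot see it with my own eyes, so I can only go by the documentation I can find. Some documents refer to devfs as a pseudo filesystem (for example: http://www.linux.org/docs/ldp/howto/SCSI-2.4-HOWTO/devfs.html)

The article http://www.linuxjournal.com/article/6035 describes the benefits of using devfs:
1. The system manages the files under /dev automatically
2. The system can be mounted read only, and /dev can be created on a non-unix file system

The article makes a point of mentioning non-unix file systems, because a dev entry carries some extra information. Below, I try to create a dev entry on a fat32 file system: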

[jessinio@niowork NO_NAME]$ sudo mknod dev_entry c 240 1
mknod: `dev_entry': Operation not permitted

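So devfs can be taken to be a file system that lives in memory.

devfs has been retired and replaced by udev. What does udev offer over devfs?
For a detailed look at its advantages, read the article written by the udev author: http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev_vs_devfs

Two points left the deepest impression:
1. Device files can be given any entry name under /dev
2. allow everyone to not care about major/minor numbers

At this point, attention has to shift to sysfs.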

Monday, October 11, 2010

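Problems caused by many getdents calls on a directory with many files

The website was slow and I was asked for an explanation, so I logged in to the machine and ran top: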

PID USER      PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
29131 liangqin 16   0 15068 3444 816 R 11.7 0.0   0:00.72 top
16371 kmmaster 16   0 181m 8484 3128 D 5.2 0.1   0:01.20 httpd
5726 kmmaster 15   0 182m 9352 3564 S 3.9 0.1   0:02.24 httpd
32548 kmmaster 16   0 183m 9.8m 3348 D 3.9 0.1   0:06.16 httpd
6199 kmmaster 16   0 183m 9444 3452 D 2.6 0.1   0:01.89 httpd
7697 kmmaster 16   0 182m 8848 3184 D 2.6 0.1   0:01.94 httpd
10536 kmmaster 16   0 181m 8692 3332 D 2.6 0.1   0:01.93 httpd
15102 kmmaster 16   0 181m 8420 3060 D 2.6 0.1   0:01.37 httpd
17993 kmmaster 16   0 181m 8708 3292 D 2.6 0.1   0:04.16 httpd
23185 kmmaster 16   0 182m 9420 3428 D 2.6 0.1   0:00.61 httpd
30189 kmmaster 16   0 181m 8724 3308 D 2.6 0.1   0:05.91 httpd
30337 kmmaster 16   0 183m 9.9m 3448 D 2.6 0.1   0:02.71 httpd

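Running strace on the httpd processes in D state shows them all calling I/O functions such as getdents, stat and unlink, for example:
stat("/tmp/sess_bc45d5d1dd8739acceff8a3e0fec0585", {st_mode=S_IFREG|0600, st_size=0, ...}) = 0

Tools such as ls and find cannot list this directory (/tmp) either. They freeze and also end up in D state.

I first thought of Python's os.listdir, but that function returns its list only after reading every entry in the directory, so it would also take a very long time.

So I used the following C code: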
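#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>

int main(int argc, char * argv[])
{
        DIR *dirp = opendir("/tmp");
        struct dirent *retval;
        long long int t = 0;    /* entry counter; must start at 0 */

        if (dirp == NULL) {
                perror("opendir");
                return 1;
        }
        /* readdir hands back one entry at a time, so output starts at once */
        for(; ; ){
                retval = readdir(dirp);
                if (retval == NULL) {break;}
                else {printf("%s\n", retval->d_name); t++;}
        }
        printf("%lld\n", t);
        closedir(dirp);
        return 0;
}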
This shows that the directory holds 221346 entries in total.
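For reference, newer Python versions can stream a directory in the same way as the C program. The sketch below assumes Python 3.6 or later, where os.scandir returns a lazy iterator that can be used as a context manager; count_entries and its report_every parameter are names I made up. Unlike readdir, os.scandir does not return the "." and ".." entries, so its count comes out two lower than the C program's.

```python
import os


def count_entries(path, report_every=0):
    """Count directory entries without building the whole list in memory."""
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            count += 1
            # Optional progress output for very large directories
            if report_every and count % report_every == 0:
                print(count, entry.name)
    return count


if __name__ == "__main__":
    print(count_entries("/tmp", report_every=10000))
```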

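There are 3K httpd processes! Every process that needs a session has to read the entries under /tmp. Does that have a big impact when there are 3K httpd processes?

So I wrote some test code to see whether a large number of readdir calls affects the processes:
#coding:utf-8
import time
import os

# Build a chain of processes: each parent stops forking, its child carries on
for i in range(300):
    pid = os.fork()
    if pid > 0:
        break

# Every process in the chain then lists /tmp repeatedly
for r in range(30):
    time.sleep(0.1)
    os.listdir("/tmp")

The code is simple, but against a directory with 200K entries it is very effective!! A large number of D-state processes appear:
(........truncated......)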
905       7072 0.1 0.0 84244 11628 pts/2    D+   14:45   0:00 python listdir.py
905       7073 0.1 0.1 85136 12448 pts/2    D+   14:45   0:00 python listdir.py
905       7074 0.1 0.1 85136 12628 pts/2    D+   14:45   0:00 python listdir.py
905       7075 0.1 0.1 86304 13588 pts/2    D+   14:45   0:00 python listdir.py
905       7076 0.1 0.0 84764 12264 pts/2    D+   14:45   0:00 python listdir.py
905       7077 0.1 0.0 84504 12036 pts/2    D+   14:45   0:00 python listdir.py
905       7078 0.1 0.1 85136 12672 pts/2    D+   14:45   0:00 python listdir.py
905       7079 0.1 0.1 85916 13360 pts/2    D+   14:45   0:00 python listdir.py
905       7080 0.1 0.1 85916 13448 pts/2    D+   14:45   0:00 python listdir.py
905       7081 0.2 0.1 89728 16964 pts/2    D+   14:45   0:00 python listdir.py
905       7082 0.2 0.1 88268 15776 pts/2    D+   14:45   0:00 python listdir.py
905       7083 0.2 0.1 88268 15596 pts/2    D+   14:45   0:00 python listdir.py
905       7084 0.2 0.1 87344 14820 pts/2    D+   14:45   0:00 python listdir.py
905       7085 0.1 0.1 85136 12628 pts/2    D+   14:45   0:00 python listdir.py
905       7086 0.2 0.1 86304 13812 pts/2    D+   14:45   0:00 python listdir.py
(......truncated.......)

The machine's memory usage rose quickly.

Wednesday, October 6, 2010

broadcast

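I rarely use UDP or raw packets, so I have not had much to do with the special broadcast address.
Knowledge is always interconnected. Today, while reading about the configuration of LVS-DR mode, I found that I did not quite understand the following: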

# ifconfig lo:0 IP_Adress broadcast IP_Adress netmask 255.255.255.255 up
# route add -host IP_Adress dev lo:0
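* IP_Adress is the IP address.

If the goal were only to configure LVS, there would be no need to care about the principle behind these statements, but as a tech geek I really want to know how it works.

It turned out to be a deep rabbit hole. The classic issue is the LVS ARP problem:
* http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.arp_problem.html
It all relates to broadcast, so I will start with broadcast:

Why is broadcast the same as IP_Adress, rather than the usual special IP such as 172.16.2.255?
First, the delivery modes involved are:
1. layer 2 broadcast
2. layer 3 broadcast
3. unicast
4. multicast

The point of broadcast is "one to many": data sent by one machine is of interest to many machines. TCP has no such feature.
Here is a layer 3 example using UDP:
Receiver (calls bind); there can be several machines on this side: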
import socket
import sys
x = ('<broadcast>', 51423)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
s.bind(x)
(buf, address) = s.recvfrom(2048)
s.sendto("Hi", address)

Sender (calls sendto):
import socket
import sys
x = ('<broadcast>', 51423)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
s.sendto("Hi", x)
(buf, address) = s.recvfrom(2048)
print "Received from %s: %s" % (address, buf)

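The '<broadcast>' used on the sending side is unusual: it is a symbolic name rather than a concrete IP address.
Back to the ifconfig example above: I read this '<broadcast>' as the broadcast parameter of the NIC. (Strictly speaking, Python maps '<broadcast>' to INADDR_BROADCAST, that is 255.255.255.255, while the NIC's broadcast parameter is the subnet broadcast address, such as 192.168.0.255.)
If the NIC parameter differs, the broadcast address the machine answers to differs.
From the sender's side:
the packets sent by sendto carry the broadcast address as their destination
From the receiver's side:
recvfrom only receives broadcast packets whose destination is the broadcast address configured on the NIC.

So the ifconfig setting above is clearly meant to keep the server from receiving layer 3 broadcasts (for example, packets whose destination is 192.168.0.255)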

From the netmask point of view it can be seen like this:
ifconfig lo:0 192.168.0.10 broadcast 192.168.0.10 netmask 255.255.255.255 can be reduced to ifconfig lo:0 192.168.0.10 netmask 255.255.255.255
These and
ifconfig lo:0 192.168.0.10 netmask 255.255.255.0
belong to different subnets, so traffic for 192.168.0.0/24 is not received by the network 192.168.0.10/32
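The relation between netmask and broadcast address can be checked with Python's standard ipaddress module (Python 3). In this small sketch broadcast_of is a hypothetical helper name; the addresses are the ones from the ifconfig lines above.

```python
import ipaddress


def broadcast_of(address, netmask):
    """Return (network, broadcast address) for an interface configuration."""
    iface = ipaddress.ip_interface("%s/%s" % (address, netmask))
    return iface.network, iface.network.broadcast_address


# With a /24 netmask the broadcast address is the usual x.x.x.255
print(broadcast_of("192.168.0.10", "255.255.255.0"))
# With a /32 netmask the network contains one host, and the
# "broadcast" address collapses to that host's own address
print(broadcast_of("192.168.0.10", "255.255.255.255"))
```

So with netmask 255.255.255.255 the broadcast argument in the ifconfig line just restates the value that the netmask already implies.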

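A "broadcast" address that covers a single host like this is in effect unicast
Broadcast also exists at layer 2. The typical example is the arp protocol, which uses the Ethernet broadcast address FF:FF:FF:FF:FF:FF as the destination

Apart from dedicated tools such as arping, ordinary Linux commands do not send arp requests directly, because arp is handled in the kernel (see man 7 arp): when layer 3 data is pushed down to layer 2, the kernel sends the arp request automatically (if one is needed).
Sending such a request by hand is also possible, for example with this code: http://svn.pythonfr.org/public/pythonfr/utils/network/arp-flood.py

When you ping an IP and the system's arp table has no entry for it, the kernel sends an arp request. So for a test, clear the arp entries and have the two machines ping each other.
The code: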
import socket
soc = socket.socket(socket.PF_PACKET, socket.SOCK_RAW) #create the raw-socket
soc.bind(("wlan0",0x0806)) # ether type for ARP
data = soc.recv(1024)

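This program is the receiving side. Once started it blocks until an arp request arrives. The request asks every machine on the subnet for a MAC address, so it is "one to many", which is where a broadcast address is needed: the layer 2 broadcast address is FF:FF:FF:FF:FF:FF.

Unlike layer 3, the layer 2 broadcast address is not configured on the NIC, yet LVS-DR mode wants the real server neither to answer nor to send arp requests, and that is how the LVS-DR arp problem arises. It is also the rationale behind the route add command at the start of this post. I will not write up the details here; a careful read of the article below is enough. It covers several versions of the linux kernel, such as 2.0.x, 2.2.x and 2.6.x.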
http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.arp_problem.html

Monday, October 4, 2010

route table

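When inspecting or configuring the system's route table, I usually prefer the route command. But I have felt that its output does not line up with some of the material that explains the system's networking, for example:

At the Routing Decision step, things do not match the output of the route command: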
jessinio@jessinio-laptop:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.0.0     0.0.0.0         255.255.255.0   U     2      0        0 wlan0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlan0
0.0.0.0         192.168.0.1     0.0.0.0         UG    0      0        0 wlan0

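The result is clear: route shows only the routing for packets leaving the system, and says nothing about the routing for packets entering it.

This feeling stayed with me for a long time. Today I came across this passage: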

Linux has a different approach for routing than other UNIX. The way things are implemented on Linux is more flexible and powerful than traditional ways. Legacy utilities such as ifconfig and route are still valid, but incomplete. This is because they do not give access to the advanced routing layer present on Linux. The utility ip (part of iproute2) is the current tool for networking related stuff under Linux. This tool will be the focus of this section.

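The passage says that what the route command returns is actually incomplete. On Linux, the most native tool is the ip command.
Googling turned up a very old document: http://linux-ip.net/html/routing-tables.html which explains it clearly, as in this short excerpt: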

The routing table manipulated by the conventional route command is the main routing table. Additionally, the use of both ip address and ifconfig will cause the kernel to alter the local routing table (and usually the main routing table). For further documentation on how to manipulate the other routing tables, see the command description of ip route.

What the route command reads and sets is only the tip of the iceberg.

iproute2 tool set manual: http://www.policyrouting.org/iproute2.doc.html

Saturday, September 11, 2010

block size

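Block size shows up in many places; the name is the same but the meaning differs, which is quite confusing
This fellow is one of the confused: http://www.linuxforums.org/forum/misc/5654-linux-disk-block-size-help-please.html

The URL above lists the following kinds of block:
1. Hardware block size, "sector size"
2. Filesystem block size, "block size"
3. Kernel buffer cache block size, "block size"
4. Partition table block size, "cylinder size"

I was puzzled by the Blocks column that fdisk prints and needed to dig into it.
First, the blocks as printed by fdisk: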

jessinio@jessinio-laptop:/ $ sudo fdisk -l
Disk /dev/sda: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00038329

   Device Boot      Start         End      Blocks   Id System
/dev/sda1   *           1         100      803218+ 83 Linux
/dev/sda2             101       30401   243392782+ 8e Linux LVM

The output above is identical to the following:
jessinio@jessinio-laptop:/media/82d236f2-3592-4040-801c-3c2049ddfb95$ sudo fdisk -b 512 -l
Warning: the -b (set sector size) option should be used with one specified device
Disk /dev/sda: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00038329

   Device Boot      Start         End      Blocks   Id System
/dev/sda1   *           1         100      803218+ 83 Linux
/dev/sda2             101       30401   243392782+ 8e Linux LVM

But the following is rather odd:
jessinio@jessinio-laptop:/media/82d236f2-3592-4040-801c-3c2049ddfb95$ sudo fdisk -b 1024 -l
Warning: the -b (set sector size) option should be used with one specified device
Disk /dev/sda: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 15200 cylinders
Units = cylinders of 16065 * 1024 = 16450560 bytes
Disk identifier: 0x00038329

   Device Boot      Start         End      Blocks   Id System
/dev/sda1   *           1         100     1606437   83 Linux
/dev/sda2             101       30401   486785565   8e Linux LVM

Specifying a larger hardware sector size makes the Blocks number go up. Why? Here is the relevant fdisk code:
Where the sector_size variable comes from:
static void
get_sectorsize(int fd) {
#if defined(BLKSSZGET)
    if (!user_set_sector_size &&
        linux_version_code() >= MAKE_VERSION(2,3,3)) {
        int arg;
        if (ioctl(fd, BLKSSZGET, &arg) == 0)
            sector_size = arg;
        if (sector_size != DEFAULT_SECTOR_SIZE)
            printf(_("Note: sector size is %d (not %d)\n"),
                   sector_size, DEFAULT_SECTOR_SIZE);
    }
#else
    /* maybe the user specified it; and otherwise we still
       have the DEFAULT_SECTOR_SIZE default */
#endif
}
DEFAULT_SECTOR_SIZE is defined in fdisk.h as
#define DEFAULT_SECTOR_SIZE     512

Or it is the value specified by the user (user_set_sector_size): sector_size = atoi(optarg);

The code used for printing is:
            unsigned int psects = get_nr_sects(p);
            unsigned int pblocks = psects;
            unsigned int podd = 0;

            if (sector_size < 1024) {
                pblocks /= (1024 / sector_size);
                podd = psects % (1024 / sector_size);
            }
            if (sector_size > 1024)
                pblocks *= (sector_size / 1024);
            printf(
                "%s %c %11lu %11lu %11lu%c %2x %s\n",
            partname(disk_device, i+1, w+2),
/* boot flag */     !p->boot_ind ? ' ' : p->boot_ind == ACTIVE_FLAG
            ? '*' : '?',
/* start */     (unsigned long) cround(get_partition_start(pe)),
/* end */       (unsigned long) cround(get_partition_start(pe) + psects
                - (psects ? 1 : 0)),
/* odd flag on end */   (unsigned long) pblocks, podd ? '+' : ' ',
/* type id */       p->sys_ind,
/* type name */     (type = partition_type(p->sys_ind)) ?
            type : _("Unknown"));
            check_consistency(p, i);

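1. When sector_size is exactly 1024, the number printed is simply the sector count, which fdisk treats as the partition size in 1K units
2. When sector_size is not 1024, the sector count and sector_size are combined and converted into a size in 1K units

So the Blocks column printed by fdisk is really the size of the partition. A test: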

$ sudo mount /dev/sda1 /media/disk/
$ df /media/disk/
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1               790556     48176    702220   7% /media/disk
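790556 is the total size of the file system.
The 1606437 printed by the fdisk -b 1024 -l command above is the number of sectors. The calculation goes:

1606437 * 512 / 1024 = 803218.5 (printed by fdisk as 803218+), which is roughly equal to 790556. The partition is larger than the file system because the file system needs space to store some information of its own.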
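The arithmetic of the fdisk printing code can be replayed in Python. This is only a re-statement of the C excerpt above under my own naming (fdisk_blocks is hypothetical); psects is the partition's sector count as fdisk obtains it.

```python
def fdisk_blocks(psects, sector_size):
    """Mimic how fdisk derives the Blocks column from a sector count."""
    pblocks, podd = psects, 0
    if sector_size < 1024:
        # Several sectors make up one 1K block; remember any leftover sector
        pblocks //= 1024 // sector_size
        podd = psects % (1024 // sector_size)
    if sector_size > 1024:
        pblocks *= sector_size // 1024
    # fdisk appends '+' when the size is not a whole number of 1K blocks
    return "%d%s" % (pblocks, "+" if podd else "")


print(fdisk_blocks(1606437, 512))     # /dev/sda1 with the default sector size
print(fdisk_blocks(1606437, 1024))    # /dev/sda1 with -b 1024
print(fdisk_blocks(486785565, 512))   # /dev/sda2 with the default sector size
```

The three results match the Blocks column in the listings above, including the trailing + for the odd half block.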

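Summary

1. To find out how many sectors a partition occupies, fdisk -b 1024 can be used
2. The Blocks column printed by fdisk without the -b option is the size of the partition (in K)
3. I have not yet managed to find the actual code behind the item Kernel buffer cache block size, "block size"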

Wednesday, September 8, 2010

pkg-config

The fuse version on CentOS felt too old, so I installed it from source instead.

After a simple build:
./configure --prefix=/usr/local/fuse

fuse-python-binding then fails to install. The problem is that its setup.py uses pkg-config to obtain the build flags.

pkg-config --list-all |grep fuse
finds nothing: pkg-config cannot locate the *.pc metainformation file it needs.

The path can be supplied by hand:

$ PKG_CONFIG_PATH=/usr/local/fuse/lib/pkgconfig/ pkg-config --list-all |grep fuse
fuse fuse - Filesystem in Userspace

So the following installs it:
$ sudo PKG_CONFIG_PATH=/usr/local/fuse/lib/pkgconfig/ python setup.py install

[jessinio@niowork site-packages]$ /usr/local/python2.6/bin/python -c "import fuse"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "fuse.py", line 27, in <module>
from fuseparts._fuse import main, FuseGetContext, FuseInvalidate
ImportError: libfuse.so.2: cannot open shared object file: No such file or directory

The lib path needs to be added:
$ sudo sh -c "echo /usr/local/fuse/lib >> /etc/ld.so.conf.d/fuse.conf "
$ sudo ldconfig

Saturday, September 4, 2010

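The prefork server model

Web servers generally handle dynamic requests in one of two modes:
1. prefork
2. thread

One uses multiple processes, the other multiple threads. I did not really understand how either is implemented.
Recently our company's Python web service has been using a lot of memory. To get to the bottom of it I needed to study the flup code (Django needs this library). (The code is really nice to read.)

Many programs have a prefork mode. Here is the prefork model:

Each nginx child process calls accept on its own to pick up the requests users send to port 80.
Every child can call accept and receive requests from the same socket because the file descriptors in forked children refer to the same underlying object.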

This is how multiple processes compete for the requests arriving on the socket.
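The idea can be tried out with a few lines of Python. This is only a sketch of the general prefork pattern, not nginx's or flup's code: prefork_demo is a hypothetical name, the listening socket is bound to a kernel-chosen port on 127.0.0.1, and each child serves exactly one connection by replying with its own pid.

```python
import os
import socket


def prefork_demo(num_children=2):
    """Fork children that all call accept() on one inherited listening socket."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    listener.listen(num_children)
    listener.settimeout(5.0)          # keep a stuck child from hanging forever
    port = listener.getsockname()[1]

    children = []
    for _ in range(num_children):
        pid = os.fork()
        if pid == 0:
            # Child: the listening socket was inherited across fork
            try:
                conn, _addr = listener.accept()
                conn.sendall(str(os.getpid()).encode())
                conn.close()
            finally:
                os._exit(0)           # never return into the parent's code
        children.append(pid)

    served_by = []
    for _ in range(num_children):
        client = socket.create_connection(("127.0.0.1", port), timeout=5.0)
        served_by.append(int(client.recv(64).decode()))
        client.close()

    for pid in children:
        os.waitpid(pid, 0)
    listener.close()
    return children, served_by


if __name__ == "__main__":
    print(prefork_demo(2))
```

Which child answers which connection is up to the kernel, but every connection is picked up by one of the forked children rather than by the parent.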

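Each child can in turn use epoll, threads and so on to handle many requests from port 80 concurrently. A fast cgi server works in a similar way.
nginx and the fast cgi server communicate over a socket, using the fast cgi protocol.

Each flup worker serves only one page request at any given moment. Once the request is finished it can accept another one.

flup uses the prefork model, so the logic lives in the PreforkServer class.

The body of the parent process is a loop: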

        # Main loop.
        while self._keepGoing:
            # Maintain minimum number of children.
            while len(self._children) < self._maxSpare:
                if not self._spawnChild(sock): break

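As shown, the parent always wants the number of children to reach a certain value, namely maxSpare
At first glance this looks a bit wild, but the parent does have a policy for reclaiming children:

            # See who and how many children are available.
            availList = filter(lambda x: x[1]['avail'],
                               self._children.items())
            avail = len(availList)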

From this code, the parent keeps collecting the available children. "Available" means children that are not currently working, which can be seen in the child's code:
            # Notify parent we're no longer available.
            self._notifyParent(parent, '\x00')

            # Do the job.
            self._jobClass(clientSock, addr, *self._jobArgs).run()

Here the child tells the parent that it is no longer available before it calls jobClass.run.

The parent tracks the number of "available children" in order to gauge the load and decide whether more children need to be spawned:

            if avail < self._minSpare:
                # Need to spawn more children.
                while avail < self._minSpare and \
                      len(self._children) < self._maxChildren:
                    if not self._spawnChild(sock): break
                    avail += 1

This code uses two values: minSpare and maxChildren.

minSpare is the minimum number of "available children"
maxChildren is the maximum number of children, idle and working combined

If there are more "available children" than maxSpare, the extras are reclaimed:

            elif avail > self._maxSpare:
                # Too many spares, kill off the extras.
                pids = [x[0] for x in availList]
                pids.sort()
                pids = pids[self._maxSpare:]
                for pid in pids:
                    d = self._children[pid]
                    d['file'].close()
                    d['file'] = None
                    d['avail'] = False

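From the code above, the three parameters used when the fast cgi server starts should mean:
minspare: the minimum number of idle processes
maxspare: the normal number of processes, i.e. once fast cgi has started, the number of fastcgi processes is greater than or equal to this value
maxChildren: the maximum number of processes, mainly there to keep memory from being used up.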
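To check my reading of the three parameters, the decisions of one pass of the parent loop can be condensed into a pure function. This is my own simplified model of the excerpts above, not flup code; plan_children and its argument names are hypothetical.

```python
def plan_children(total, avail, min_spare, max_spare, max_children):
    """Return (to_spawn, to_kill) for one pass of the parent loop.

    total: children currently alive; avail: children that are idle.
    """
    # Step 1: top the pool up to max_spare children (newcomers are idle)
    spawn = max(0, max_spare - total)
    total += spawn
    avail += spawn

    kill = 0
    if avail < min_spare:
        # Step 2: too few idle children, spawn more but respect max_children
        spawn += max(0, min(min_spare - avail, max_children - total))
    elif avail > max_spare:
        # Step 3: too many idle children, the extras are shut down
        kill = avail - max_spare
    return spawn, kill


print(plan_children(0, 0, 1, 5, 50))    # fresh start
print(plan_children(10, 0, 2, 5, 50))   # everyone busy
print(plan_children(50, 0, 2, 5, 50))   # busy and already at the limit
print(plan_children(10, 8, 2, 5, 50))   # load has dropped
```

The third case shows maxChildren acting as the hard ceiling: nothing more is spawned even though no child is idle.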

Friday, September 3, 2010

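A parent-class method using the child class's methods and data

I do not really understand OO; I only know how to write a class (to be frank I do not know design patterns, and I only learned what an interface is after studying Go. (-_-)! )

While reading the flup project's code today, there was a piece of code I could not follow. It really came down to my own lack of understanding of OO: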

#coding:utf-8

class Parent(object):
    def __init__(self, name):
        self._name = "\tParent: %s" % name
    def whoami(self):
        print(self._name)
        print("Calling the child's method from a parent method:")
        # self is the Child instance, so this resolves to Child._print
        self._print()


class Child(Parent):
    def __init__(self, name):
        self._nickname = "\tChild: %s" % name
        Parent.__init__(self, "大明")

    def _print(self):
        print(self._nickname)

    def run(self):
        # The two calls below are equivalent
        print("Calling the parent method as Parent.method(self)")
        Parent.whoami(self)
        print("Calling the parent method as self.method()")
        self.whoami()


if __name__ == "__main__":
    c = Child("小明")
    c.run()

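Normally I just write self.whoami(), and I assumed that a parent method cannot use anything belonging to the child. But written as Parent.whoami(self), it becomes clear that my understanding was wrong: in both forms self is the Child instance, so self._print() inside the parent resolves to the child's method.

The root problem is that I had not properly understood the line Parent.__init__(self, "大明") from the book

Hmm.... my level is really too low.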

Sunday, August 15, 2010

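The exception base class change between Python 2.4 and 2.5

At work, a service needed to be migrated from FreeBSD to CentOS. One of its Python scripts behaved strangely:
it worked on FreeBSD but not on CentOS. I finally narrowed it down to the code at the end of the script, which is roughly: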

try:
    if test_condition:
        do_something()
        sys.exit(0)
    else:
        sys.exit(1)
except Exception:
    sys.exit(1)

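On CentOS the Python version is 2.4. FreeBSD has 2.5 installed.
In the old version Exception is the base class, so the SystemExit raised by sys.exit(0) is caught by except Exception, and on 2.4 the script always returns 1.
In the new version BaseException is the base class, so SystemExit and Exception sit at the same level, and the script can return 0

This was the main problem after the migration. In truth, the author of the script probably did not know what sys.exit does and treated sys.exit as if it were os._exit.
Here is a figure from APUE:

When the standard C exit function is called, it does not terminate the program immediately; it first runs some cleanup work, such as flushing buffered data to disk. Only then does it call the real exit function: _exit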

Python's sys.exit raises an exception, SystemExit. Python catches this exception by default, roughly like this:

try:
    our Python code
except SystemExit:
    do some cleanup work, such as flushing
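The behaviour on 2.5 and later can be reproduced with a small wrapper that imitates the end of the script and reports the exit status it would produce. A sketch only: run and its body argument are hypothetical names, and the outer except SystemExit stands in for the interpreter's own handling.

```python
import sys


def run(body):
    """Return the exit status the script's final try/except would produce."""
    try:
        try:
            body()
            sys.exit(0)
        except Exception:
            # On 2.5+ SystemExit is not an Exception, so only real
            # errors raised by body() end up here
            sys.exit(1)
    except SystemExit as exc:
        return exc.code


print(run(lambda: None))    # body succeeds
print(run(lambda: 1 / 0))   # body raises ZeroDivisionError
print(issubclass(SystemExit, Exception))
```

On Python 2.4, where SystemExit derived from Exception, the inner except would also swallow the sys.exit(0), which is exactly the always-returns-1 behaviour described above.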

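Recursion, loops, iteration

At work I needed to traverse directories and perform some operations. Python's built-in os.walk is powerful, but it has no depth parameter like maxdepth.

os.walk is recursive by design. Here is its code: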
    try:
        # Note that listdir and error are globals in this module due
        # to earlier import-*.
        names = listdir(top)
    except error, err:
        if onerror is not None:
            onerror(err)
        return

    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        path = join(top, name)
        if followlinks or not islink(path):
            for x in walk(path, topdown, onerror, followlinks):
                yield x
    if not topdown:
        yield top, dirs, nondirs

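BTW: watch out for the ENAMETOOLONG error on *nix here.

Recursion cannot traverse directories "level by level". It can only follow one path at a time to its end.

A friend said that a loop can do what recursion does. I had never written that kind of code and had always been puzzled by it.
But inspired by http://www.devshed.com/c/a/Python/Basic-Threading-in-Python/2/ ,
I suddenly understood how a loop can take the place of recursion. So I wrote a traversal function of my own using a loop: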

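import os
import Queue  # Python 2 module name; it is called queue in Python 3


def loop_walk(top, n):
    """Walk top level by level, at most n levels deep.

    Yields (dirs, files, error) for each directory visited; a final
    ([], [], None) marks the end of the walk.
    """
    stack = Queue.Queue()      # directory lists of the current level
    sub_stack = Queue.Queue()  # directory lists of the next level
    dirs = []
    files = []
    error = None
    stack.put([top])

    while True:
        if stack.empty():
            # The current level is used up; move on to the next level's queue
            stack, sub_stack = sub_stack, stack
            if n <= 1:
                yield [], [], None
                return
            else:
                n -= 1
            continue
        else:
            try:
                top_list = stack.get(False)
            except Queue.Empty:
                yield [], [], None
                return

        for top in top_list:
            try:
                for item in os.listdir(top):
                    item = os.path.join(top, item)
                    if os.path.isdir(item):
                        dirs.append(item)
                    else:
                        files.append(item)
            except OSError:
                # Something went wrong, e.g. no permission
                #traceback.print_exc()
                error = top
            yield dirs, files, error
            if dirs:
                sub_stack.put(dirs)
            dirs = []
            files = []
            error = None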

The code length doubled.........
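Looking back, the same level-by-level walk can probably be written more compactly with a single collections.deque that carries the depth along with each path. The sketch below is an alternative under my own naming (walk_levels and max_depth are hypothetical), not a drop-in replacement: it yields a depth and a directory path with each result and has no end-of-walk marker.

```python
import collections
import os


def walk_levels(top, max_depth):
    """Yield (depth, dirpath, dirnames, filenames) level by level.

    top itself is depth 1; directories deeper than max_depth are not listed.
    """
    queue = collections.deque([(top, 1)])
    while queue:
        path, depth = queue.popleft()   # FIFO order gives a level-order walk
        try:
            names = sorted(os.listdir(path))
        except OSError:
            continue                    # unreadable directory: skip it
        dirs, files = [], []
        for name in names:
            if os.path.isdir(os.path.join(path, name)):
                dirs.append(name)
            else:
                files.append(name)
        yield depth, path, dirs, files
        if depth < max_depth:
            for name in dirs:
                queue.append((os.path.join(path, name), depth + 1))


if __name__ == "__main__":
    for depth, path, dirs, files in walk_levels("/tmp", 2):
        print(depth, path, len(dirs), len(files))
```

Because the queue is first-in first-out, every directory of one level is listed before any directory of the next level, which is what the recursive os.walk cannot do.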
