Sharpening the Axe Does Not Delay the Woodcutting (磨刀不误砍柴工)

Tuesday, December 14, 2010

ip route

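Notes
What does "route" mean?
I never knew the literal meaning of the word 路由 (route). It is not actually a native Chinese word; Chinese has only the noun "路由单" (an itinerary). Route has two senses:
1. To choose a path
2. The same as "路由单": the list of places a journey passes through

The basis for choosing a path is the destination.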

In the common case, route selection is based completely on the destination address. Conventional (as opposed to policy-based) IP networking relies on only the destination address to select a route for a packet.

But as networks evolved, routing on the destination alone could no longer meet every requirement:

With the prevalence of low cost bandwidth, easily configured VPN tunnels, and increasing reliance on networks, the technique of selecting a route based solely on the destination IP address range no longer suffices for all situations.

Here is how Linux responded to this development:

Since kernel 2.2, linux has supported policy based routing through the use of multiple routing tables and the routing policy database (RPDB). Together, they allow a network administrator to configure a machine select different routing tables and routes based on a number of criteria.

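This roughly comes down to two things:
1. Linux supports multiple routing tables, chosen through the routing policy database (RPDB).
2. Each table has its own independent entries. This is policy-based routing.

Everyday routing uses the destination as the only criterion (for example, what the route command prints). So why does policy-based routing matter?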

In fact, advanced routing could more accurately be called policy-based networking.

The following passage describes the selectors that policy-based routing can use when Linux routes a packet:

Selectors available for use in policy-based routing are attributes of a packet passing through the linux routing code. The source address of a packet, the ToS flags, an fwmark (a mark carried through the kernel in the data structure representing the packet), and the interface name on which the packet was received are attributes which can be used as selectors. By selecting a routing table based on packet attributes, an administrator can have granular control over the network path of any packet.

The selector determines which routing table is used.

Describing Linux route selection in prose is hard to follow; the pseudocode below is clearer:
if packet.routeCacheLookupKey in routeCache:
    route = routeCache[packet.routeCacheLookupKey]
else:
    for rule in rpdb:
        # rule is an RPDB object from the table below
        if packet.rpdbLookupKey in rule:
            # routeTable is a route table object from the table below
            routeTable = rule[lookupTable]
            if packet.routeLookupKey in routeTable:
                route = routeTable[packet.routeLookupKey]
                break  # the first route found wins

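Think of the RPDB as a database that picks the routing table: all the rules live in this database, and each rule has its own attributes (including the packet attributes mentioned above).

Each LookupKey in the pseudocode stands for one specific attribute in the table below, so the pseudocode above really expands into a great many if statements.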
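To make the pseudocode concrete, here is a minimal runnable sketch in Python. Everything in it is a simplification of my own: the table names, the prefixes, the `vpn` gateway 192.168.100.1 and the two rules are hypothetical, the cache key uses only source, destination and fwmark, and only the `from` and `fwmark` selectors are modelled. The real kernel uses more lookup keys than this.

```python
import ipaddress

# Hypothetical routing tables: prefix -> next hop description.
ROUTE_TABLES = {
    "main": {"10.20.129.0/25": "dev eth0", "0.0.0.0/0": "via 10.20.129.1"},
    "vpn": {"0.0.0.0/0": "via 192.168.100.1"},
}
# Hypothetical RPDB. A selector set to None matches every packet.
RPDB = [
    {"priority": 100, "from": "10.20.129.64/26", "fwmark": None, "table": "vpn"},
    {"priority": 32766, "from": None, "fwmark": None, "table": "main"},
]
route_cache = {}


def rule_matches(rule, packet):
    """Check the optional selectors of one rule against the packet."""
    if rule["from"] is not None:
        source = ipaddress.ip_address(packet["src"])
        if source not in ipaddress.ip_network(rule["from"]):
            return False
    if rule["fwmark"] is not None and rule["fwmark"] != packet.get("fwmark"):
        return False
    return True


def lookup_table(table, dst):
    """Longest-prefix match of dst within one routing table."""
    best = None
    for prefix, nexthop in table.items():
        net = ipaddress.ip_network(prefix)
        if ipaddress.ip_address(dst) in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, nexthop)
    return best[1] if best else None


def select_route(packet):
    """Route cache first, then the rules in priority order."""
    key = (packet["src"], packet["dst"], packet.get("fwmark"))
    if key in route_cache:
        return route_cache[key]
    for rule in sorted(RPDB, key=lambda r: r["priority"]):
        if rule_matches(rule, packet):
            route = lookup_table(ROUTE_TABLES[rule["table"]], packet["dst"])
            if route is not None:
                route_cache[key] = route
                return route
    return None


print(select_route({"src": "10.20.129.70", "dst": "8.8.8.8"}))
print(select_route({"src": "10.20.129.19", "dst": "8.8.8.8"}))
```

The two packets have the same destination but different sources, so they end up in different tables. That is exactly what destination-only routing cannot express.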

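* Attributes in italics are optional: they are checked if present and skipped otherwise.

From the above, a route table serves two purposes:
1. It groups entries together.
2. Entries of the same kind share one set of attributes.

The table above also shows that every packet's destination and source are always used for routing, but they are not the only deciding criteria.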

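How a Linux system administrator can inspect the three kinds of data above:
1. The route cache: ip route show cache
2. A single routing table: ip route show table TABLE_NAME
3. The RPDB rules, which name every routing table in use: ip rule show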

Sunday, December 12, 2010

ethernet

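Introduction

I rarely pay attention to the Ethernet layer. The last time was when I was trying to understand LVS. Lately I have been reading <<Guide to IP Layer Network Administration with Linux>>, so I am taking notes and trying things out by hand to make them stick.

The machine under test has only the gateway's hardware address in its ARP table: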
$ arp -n
Address                  HWtype HWaddress           Flags Mask            Iface
10.20.129.1              ether   00:0F:E2:D3:BE:B8   C                     eth0

Run the following:
$ ping 10.20.129.32

Capture the packets that the ping sends:

$ sudo tcpdump -ent -i eth0 arp or icmp

....(truncated).....
00:23:ae:93:d9:26 > Broadcast, ethertype ARP (0x0806), length 42: arp who-has 10.20.129.32 tell 10.20.129.19
00:1e:4f:ad:41:58 > 00:23:ae:93:d9:26, ethertype ARP (0x0806), length 60: arp reply 10.20.129.32 is-at 00:1e:4f:ad:41:58
00:23:ae:93:d9:26 > 00:1e:4f:ad:41:58, ethertype IPv4 (0x0800), length 98: 10.20.129.19 > 10.20.129.32: ICMP echo request, id 26119, seq 1, length 64
00:1e:4f:ad:41:58 > 00:23:ae:93:d9:26, ethertype IPv4 (0x0800), length 98: 10.20.129.32 > 10.20.129.19: ICMP echo reply, id 26119, seq 1, length 64
....(truncated).....

ICMP packets sit above the Ethernet layer. They have to be sent over Ethernet, which requires a hardware address, and the ARP protocol is used to obtain it.
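The "length 42" in the capture can be reproduced by hand: a 14-byte Ethernet header plus a 28-byte ARP payload. The sketch below only builds the bytes of a who-has request and sends nothing; the helper names are my own.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_arp_request(src_mac, src_ip, target_ip):
    """Return a raw Ethernet frame carrying an ARP who-has request."""
    broadcast = b"\xff" * 6
    # Ethernet header: destination MAC, source MAC, ethertype 0x0806 (ARP).
    eth_header = broadcast + mac_to_bytes(src_mac) + struct.pack("!H", 0x0806)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: 1 = request, 2 = reply
        mac_to_bytes(src_mac),
        socket.inet_aton(src_ip),
        b"\x00" * 6,  # target MAC is unknown, which is why we ask
        socket.inet_aton(target_ip),
    )
    return eth_header + arp_payload


frame = build_arp_request("00:23:ae:93:d9:26", "10.20.129.19", "10.20.129.32")
print(len(frame))  # 42
```

The reply in the capture shows length 60 rather than 42, most likely because a frame received from the wire includes the padding that brings it up to the Ethernet minimum size.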

The ARP exchange is equivalent to the following command, which asks the segment for the MAC address that belongs to an IP:
$ sudo arping -I eth0 10.20.129.32
View the ARP table:
$ arp -n
Address                  HWtype HWaddress           Flags Mask            Iface
10.20.129.1              ether   00:0F:E2:D3:BE:B8   C                     eth0
10.20.129.32             ether   00:1E:4F:AD:41:58   C                     eth0
One entry has been added.

The arping -A option: ARP announcement, also known as gratuitous ARP

$ sudo arping -A -c 3 -I eth0 10.20.129.19
tcpdump capture:
00:23:ae:93:d9:26 > Broadcast, ethertype ARP (0x0806), length 42: arp reply 10.20.129.19 is-at 00:23:ae:93:d9:26
00:23:ae:93:d9:26 > Broadcast, ethertype ARP (0x0806), length 42: arp reply 10.20.129.19 is-at 00:23:ae:93:d9:26
00:23:ae:93:d9:26 > Broadcast, ethertype ARP (0x0806), length 42: arp reply 10.20.129.19 is-at 00:23:ae:93:d9:26

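As the capture shows, -A announces the machine's own IP to the whole segment. By default, Linux does not create new ARP entries from such packets.
This is controlled by the arp_accept option, documented as follows: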

arp_accept - BOOLEAN
    Define behavior for gratuitous ARP frames who's IP is not
    already present in the ARP table:
    0 - don't create new entries in the ARP table
    1 - create new entries in the ARP table

For the specific uses of gratuitous ARP packets, see: http://wiki.wireshark.org/Gratuitous_ARP

The arping -D option: Duplicate address detection mode (DAD)

This option is very useful for tracking down IP conflicts on a segment. A real example:

root@jessinio-laptop:~# ifconfig wlan0 |head -n 2
wlan0 Link encap:Ethernet HWaddr 00:16:cf:68:5b:a7
inet addr:192.168.0.106 Bcast:192.168.0.255 Mask:255.255.255.0

root@jessinio-laptop:~# arping -D -I wlan0 192.168.0.106
ARPING 192.168.0.106 from 0.0.0.0 wlan0
Unicast reply from 192.168.0.106 [00:18:41:FE:26:5F] 90.390ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

As shown, 192.168.0.106 is in use by two machines: this machine, 00:16:cf:68:5b:a7, and another one, 00:18:41:FE:26:5F.

Capture:
00:16:cf:68:5b:a7 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.0.106 (ff:ff:ff:ff:ff:ff) tell 0.0.0.0, length 28
00:18:41:fe:26:5f > 00:16:cf:68:5b:a7, ethertype ARP (0x0806), length 42: Reply 192.168.0.106 is-at 00:18:41:fe:26:5f, length 28
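The three ARP variants seen in this post (ordinary request, -A announcement, -D probe) differ only in a few fields. The sketch below decodes a raw 42-byte frame and classifies it. The function names and the classification labels are my own, and it assumes an untagged Ethernet frame.

```python
import socket
import struct


def parse_arp(frame):
    """Decode the ARP fields of a raw Ethernet frame (no VLAN tag assumed)."""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0806:
        raise ValueError("not an ARP frame")
    fields = struct.unpack("!HHBBH6s4s6s4s", frame[14:42])
    opcode, sender_mac, sender_ip, _target_mac, target_ip = fields[4:]
    return {
        "opcode": opcode,
        "sender_mac": ":".join("%02x" % b for b in sender_mac),
        "sender_ip": socket.inet_ntoa(sender_ip),
        "target_ip": socket.inet_ntoa(target_ip),
    }


def classify(arp):
    """Label an ARP packet the way this post uses the terms."""
    # A DAD probe asks about an address while claiming no address itself.
    if arp["opcode"] == 1 and arp["sender_ip"] == "0.0.0.0":
        return "probe"
    # An announcement (gratuitous ARP) names the same IP as sender and target.
    if arp["sender_ip"] == arp["target_ip"]:
        return "announcement"
    return "request" if arp["opcode"] == 1 else "reply"
```

Sending the probe from 0.0.0.0 is what makes -D safe: the machines that hear it do not update their ARP tables with a sender address that might be the conflicting one.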

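Conclusion

I will end with a question: can ICMP tell you whether another machine on the segment is using your own IP? For example, by pinging your own IP.

The answer is no, because the ICMP packets never really leave the machine. They loop back. For example:

$ ping 10.20.129.19
The packets this produces never pass through the Ethernet card. The routing table shows why: 10.20.129.19 is listed as a local address, so the kernel delivers the packets itself: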

$ ip route list table local
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 10.20.129.0 dev eth0 proto kernel scope link src 10.20.129.19
local 10.20.129.19 dev eth0 proto kernel scope host src 10.20.129.19
broadcast 10.20.129.127 dev eth0 proto kernel scope link src 10.20.129.19
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
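The same check can be scripted. The sketch below parses the listing above, pasted in as a string, and reports whether an address falls under a `local` entry. The function names are my own, and in practice the text would come from running `ip route list table local`.

```python
import ipaddress

LOCAL_TABLE = """\
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 10.20.129.0 dev eth0 proto kernel scope link src 10.20.129.19
local 10.20.129.19 dev eth0 proto kernel scope host src 10.20.129.19
broadcast 10.20.129.127 dev eth0 proto kernel scope link src 10.20.129.19
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
"""


def local_prefixes(table_text):
    """Collect the prefixes of every 'local' entry in the listing."""
    prefixes = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "local":
            # A bare address such as 10.20.129.19 is treated as a /32.
            prefixes.append(ipaddress.ip_network(fields[1]))
    return prefixes


def is_local(address, table_text=LOCAL_TABLE):
    """True if the kernel would deliver packets for this address locally."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in local_prefixes(table_text))


print(is_local("10.20.129.19"))  # True
print(is_local("10.20.129.32"))  # False
```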

Friday, November 26, 2010

iptables and traffic accounting

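I needed to temporarily measure the internal and external traffic of a machine in the data center separately. This is best kept out of the application layer, where it would be inefficient. My first thought was tools that work at network layers 2 and 3, where the performance cost is small, such as ntop and other tools built on libpcap.
But iptables keeps packet counters too, because every packet passes through it, and nothing extra needs to be installed.

Add two rules: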
jessinio@jessinio-laptop:~$ sudo iptables -t filter -A INPUT -p all -s 174.121.79.132 -j ACCEPT
jessinio@jessinio-laptop:~$ sudo iptables -t filter -A OUTPUT -p all -d 174.121.79.132 -j ACCEPT

The rules now in place:
jessinio@jessinio-laptop:~$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all -- web124.webfaction.com anywhere

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all -- anywhere             web124.webfaction.com

The counters:
jessinio@jessinio-laptop:~$ sudo iptables -L -n -v
Chain INPUT (policy ACCEPT 11M packets, 5033M bytes)
pkts bytes target     prot opt in     out     source               destination
   10 2088 ACCEPT     all -- *      *       174.121.79.132       0.0.0.0/0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 14M packets, 14G bytes)
pkts bytes target     prot opt in     out     source               destination
   48 25152 ACCEPT     all -- *      *       0.0.0.0/0            174.121.79.132
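To collect these numbers periodically, the listing can be parsed. The sketch below works on a shortened copy of the output above. It assumes that every rule line has a target column and that the K/M/G suffixes are powers of 1000; I have not verified the latter against the iptables source, and running iptables with -x avoids the question by printing exact values.

```python
SAMPLE = """\
Chain INPUT (policy ACCEPT 11M packets, 5033M bytes)
 pkts bytes target     prot opt in     out     source               destination
   10  2088 ACCEPT     all  --  *      *       174.121.79.132       0.0.0.0/0

Chain OUTPUT (policy ACCEPT 14M packets, 14G bytes)
 pkts bytes target     prot opt in     out     source               destination
   48 25152 ACCEPT     all  --  *      *       0.0.0.0/0            174.121.79.132
"""

# Assumed multipliers for abbreviated counters.
UNITS = {"K": 1000, "M": 1000 ** 2, "G": 1000 ** 3, "T": 1000 ** 4}


def to_int(text):
    """Convert '2088' or '11M' into an integer."""
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def rule_counters(listing):
    """Return one dict per rule line of an 'iptables -L -n -v' listing."""
    rows = []
    chain = None
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Chain":
            chain = fields[1]
        elif fields[0] != "pkts":  # skip the column header
            rows.append({
                "chain": chain,
                "pkts": to_int(fields[0]),
                "bytes": to_int(fields[1]),
                "source": fields[7],
                "destination": fields[8],
            })
    return rows


for row in rule_counters(SAMPLE):
    print(row["chain"], row["pkts"], row["bytes"])
```

Since the ACCEPT rules match one external host only, subtracting their counters from the chain totals gives the traffic for everything else.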

Thursday, November 18, 2010

seteuid

# ps axjf|grep -v grep|grep ftp
    1 13871 13871 13871 ?           -1 Ss       0   0:00 /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
13871 14146 14146 14146 ?           -1 Ss      99   0:00 \_ /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
14146 14148 14146 14146 ?           -1 S      509   0:00      \_ /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf

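A process running as UID 99 can spawn a process running as UID 509. It turns out that after fork, the child can still call seteuid to switch to another UID, which I did not know before. The reason is that seteuid changes only the effective UID: the real and saved UIDs stay 0, so the process is still allowed to switch back to 0 and from there to any other UID. Test code (it must be started as root):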

#!/usr/bin/python
#coding:utf-8

import os
import time

# A single process can drop from 0 to another effective UID and come back
os.seteuid(99)
os.seteuid(0)

os.seteuid(99)
pid = os.fork()
# child
if pid == 0:
    # The child can still use seteuid to return to 0
    os.seteuid(0)
    time.sleep(10)
else:
    print(pid)
    time.sleep(10)

Result:

[jessinio@niowork tmp]$ ps axuf|grep root.py
nobody   28559 0.0 0.0 74192 2912 pts/11   T    13:56   0:00 |   \_ python root.py
root     28560 0.0 0.0 74188 1756 pts/11   T    13:56   0:00 |   |        \_ python root.py
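Why the child is allowed back to 0 becomes clearer when the real, effective and saved UIDs are looked at together. The sketch below is a simplified model of the seteuid permission check, not the kernel's code: capabilities are ignored and `can_switch_to` is a name of my own. `os.getresuid` is the real call that reports the three IDs.

```python
import os


def uid_state():
    """Return the real, effective and saved UIDs of this process."""
    ruid, euid, suid = os.getresuid()
    return {"real": ruid, "effective": euid, "saved": suid}


def can_switch_to(target_uid, state):
    """Simplified seteuid rule: root may pick any UID, others only one they hold."""
    if state["effective"] == 0:
        return True
    return target_uid in (state["real"], state["effective"], state["saved"])


# State of the test script after os.seteuid(99), assuming it was started by root.
after_drop = {"real": 0, "effective": 99, "saved": 0}
print(can_switch_to(0, after_drop))    # True
print(can_switch_to(509, after_drop))  # False: it has to return to 0 first
print(uid_state())
```

A process that wants to give up root for good has to change the real and saved UIDs as well, for example with setuid while still privileged.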

Thursday, November 4, 2010

Using strace to find the root cause

Today I wanted to back up an svn repository, but ran into something spooky:
$ sudo -u daemon HOME=/tmp /usr/local/subversion/bin/svnsync sync file:///data/repos/xxoo
svnsync: Revprop change blocked by pre-revprop-change hook (exit code 255) with no output.

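It looks as if the svn pre-revprop-change hook is at fault. But no matter how I tinkered with the pre-revprop-change hook code, nothing helped.

I then suspected the environment variables, and even tried sudo's -E option, to no avail.

With nothing else left, I turned to strace: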
$ sudo -u daemon HOME=/tmp strace -f /usr/local/subversion/bin/svnsync sync file:///data/repos/xxoo 2>&1 |less

The -f option is essential: svnsync spawns child processes, and -f makes strace follow them as well.

One trace entry looked very suspicious:
[pid 6548] chdir(".") = -1 EACCES (Permission denied)

Ah, a detail. The working directory was ~, which svnsync, now running as daemon, had no permission to access.

This way there is no problem:
$ cd / && sudo -u daemon HOME=/tmp /usr/local/subversion/bin/svnsync sync file:///data/repos/xxoo

It really does come down to details.
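The condition that strace exposed can also be checked directly by making the same chdir(".") call that failed in the trace. This is a small sketch, with a function name of my own, that would have to run as the target user (for example under sudo -u daemon) from the directory in question.

```python
import os


def cwd_is_accessible():
    """Repeat the chdir('.') call seen in the trace and report whether it works."""
    try:
        os.chdir(".")
        return True
    except OSError:
        # EACCES here means this user cannot enter the current directory.
        return False


if not cwd_is_accessible():
    print("current directory is not accessible; cd somewhere else first")
```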

Thursday, October 14, 2010

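devfs and udev

Is /dev a disk-based or a kernel-based file system?
I had never sorted this question out.
From my own experience and the book <<Linux操作系统之奥秘>>, /dev is clearly disk-based. I have never used devfs.

The Linux 2.4 kernel era used the devfs file system. The devfs code has since been removed from the Linux 2.6 kernel.

Even finding devfs documentation is not easy; the articles on its author's old blog can no longer be found.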

Google turned up:
* http://www.linuxjournal.com/article/6035
* http://www.ibm.com/developerworks/linux/library/l-devfs.html
Judging by their dates these are quite old. To confirm the era, I checked the kernel timeline, which confirms that they come from the same period:
* http://en.wikipedia.org/wiki/Linux_kernel#Timeline

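I have never used devfs. To get at the truth, I borrowed a Red Hat 8 Linux environment from a friend, only to find that devfs was not compiled into the kernel by default: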
# cat /boot/config-2.4.18-14 |grep DEVFS
# CONFIG_DEVFS_FS is not set

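Yet /dev on Red Hat 8 really does contain 18 thousand entries (rather astonishing).
Is /dev a disk-based or a kernel-based file system? Without a devfs environment I cannot see the answer for myself, so I can only look through whatever documentation I can find. Some documents refer to devfs as a pseudo filesystem (for example: http://www.linux.org/docs/ldp/howto/SCSI-2.4-HOWTO/devfs.html)

The article http://www.linuxjournal.com/article/6035 describes the benefits of using devfs:
1. The system manages the files under /dev automatically
2. It allows a read-only mount, and allows /dev to be created on a non-Unix file system

The article makes a point of mentioning non-Unix file systems, because a device entry carries some extra information (its type and its major/minor numbers). Here is an attempt to create a device entry on a FAT32 file system: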

[jessinio@niowork NO_NAME]$ sudo mknod dev_entry c 240 1
mknod: `dev_entry': Operation not permitted

So devfs must be a file system that lives in memory.
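The extra information that a device entry carries, and that FAT32 has nowhere to store, can be read back with stat. A minimal sketch follows; `describe_device` is a name of my own and /dev/null is just a convenient entry that exists on every system I know of.

```python
import os
import stat


def describe_device(path):
    """Return the device type and major/minor numbers stored in the entry."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "char"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        kind = "other"
    return {
        "kind": kind,
        "major": os.major(st.st_rdev),
        "minor": os.minor(st.st_rdev),
    }


print(describe_device("/dev/null"))
```

These are the same three values that were passed to mknod above (c 240 1): the type, the major number and the minor number.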

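devfs is obsolete and has been replaced by udev. What does udev offer over devfs?
For a detailed account of the advantages, read the article written by the udev author: http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev_vs_devfs

Two points left the deepest impression on me:
1. The entries for device files under /dev can be named freely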
2. allow everyone to not care about major/minor numbers

At this point, attention has to turn to sysfs.

Monday, October 11, 2010

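Problems caused by a large number of getdents calls combined with a large number of files

The website was very slow and I was asked for an explanation, so I logged in to the machine and ran top: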

PID USER      PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
29131 liangqin 16   0 15068 3444 816 R 11.7 0.0   0:00.72 top
16371 kmmaster 16   0 181m 8484 3128 D 5.2 0.1   0:01.20 httpd
5726 kmmaster 15   0 182m 9352 3564 S 3.9 0.1   0:02.24 httpd
32548 kmmaster 16   0 183m 9.8m 3348 D 3.9 0.1   0:06.16 httpd
6199 kmmaster 16   0 183m 9444 3452 D 2.6 0.1   0:01.89 httpd
7697 kmmaster 16   0 182m 8848 3184 D 2.6 0.1   0:01.94 httpd
10536 kmmaster 16   0 181m 8692 3332 D 2.6 0.1   0:01.93 httpd
15102 kmmaster 16   0 181m 8420 3060 D 2.6 0.1   0:01.37 httpd
17993 kmmaster 16   0 181m 8708 3292 D 2.6 0.1   0:04.16 httpd
23185 kmmaster 16   0 182m 9420 3428 D 2.6 0.1   0:00.61 httpd
30189 kmmaster 16   0 181m 8724 3308 D 2.6 0.1   0:05.91 httpd
30337 kmmaster 16   0 183m 9.9m 3448 D 2.6 0.1   0:02.71 httpd

Running strace on the httpd processes in D state shows them all calling I/O functions such as getdents, stat and unlink, for example:
stat("/tmp/sess_bc45d5d1dd8739acceff8a3e0fec0585", {st_mode=S_IFREG|0600, st_size=0, ...}) = 0

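Tools such as ls and find cannot list this directory (/tmp) either. They freeze and also enter D state.

I first thought of using Python's os.listdir, but that function returns its list only after reading every entry in the directory, so it would also take a very long time.

So I used the following C code: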
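#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
        /* Print entries one at a time instead of building the whole list. */
        DIR *dirp = opendir("/tmp");
        struct dirent *entry;
        long long int count = 0;        /* must start at zero */

        if (dirp == NULL) {
                perror("opendir");
                return 1;
        }
        while ((entry = readdir(dirp)) != NULL) {
                printf("%s\n", entry->d_name);
                count++;
        }
        closedir(dirp);
        printf("%lld\n", count);
        return 0;
}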
This shows that the directory holds 221346 entries in total.
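For comparison, Python 3.5 and later (newer than this post) provide os.scandir, which yields entries lazily in the same spirit as the readdir loop. This is a minimal sketch with a function name of my own; the context-manager form needs Python 3.6 or later.

```python
import os


def count_entries(path):
    """Count directory entries without building a list of all the names."""
    total = 0
    with os.scandir(path) as entries:
        for _entry in entries:
            total += 1
    return total


print(count_entries("/tmp"))
```

Unlike the C loop, os.scandir does not report the `.` and `..` entries, so on the same directory its count would be lower by two.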

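There are 3K httpd processes! Every process that needs a session has to read the entries under /tmp. Does this behavior have a big impact on 3K httpd processes?

So I wrote a test script to find out whether a large number of concurrent readdir calls affects the processes: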
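#coding:utf-8
import time
import os

# Build a chain of processes: each parent stops forking, each child forks again
for i in range(300):
    pid = os.fork()
    if pid > 0:
        break

# Every process in the chain lists /tmp 30 times
for r in range(30):
    time.sleep(0.1)
    os.listdir("/tmp")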

The code is simple, but against a directory with 200K entries it is very effective: a large number of processes in D state appear:
(........truncated......)
905       7072 0.1 0.0 84244 11628 pts/2    D+   14:45   0:00 python listdir.py
905       7073 0.1 0.1 85136 12448 pts/2    D+   14:45   0:00 python listdir.py
905       7074 0.1 0.1 85136 12628 pts/2    D+   14:45   0:00 python listdir.py
905       7075 0.1 0.1 86304 13588 pts/2    D+   14:45   0:00 python listdir.py
905       7076 0.1 0.0 84764 12264 pts/2    D+   14:45   0:00 python listdir.py
905       7077 0.1 0.0 84504 12036 pts/2    D+   14:45   0:00 python listdir.py
905       7078 0.1 0.1 85136 12672 pts/2    D+   14:45   0:00 python listdir.py
905       7079 0.1 0.1 85916 13360 pts/2    D+   14:45   0:00 python listdir.py
905       7080 0.1 0.1 85916 13448 pts/2    D+   14:45   0:00 python listdir.py
905       7081 0.2 0.1 89728 16964 pts/2    D+   14:45   0:00 python listdir.py
905       7082 0.2 0.1 88268 15776 pts/2    D+   14:45   0:00 python listdir.py
905       7083 0.2 0.1 88268 15596 pts/2    D+   14:45   0:00 python listdir.py
905       7084 0.2 0.1 87344 14820 pts/2    D+   14:45   0:00 python listdir.py
905       7085 0.1 0.1 85136 12628 pts/2    D+   14:45   0:00 python listdir.py
905       7086 0.2 0.1 86304 13812 pts/2    D+   14:45   0:00 python listdir.py
(......truncated.......)

The machine's memory usage climbed rapidly.