Author Archive

Who is switching our processes on Linux

October 8th, 2010

When running Linux servers we often need to know who is doing context switches, and why. Context switches are expensive; here are some numbers measured with LMbench:
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host      OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
my174.cm4 Linux 2.6.18- 6.1100 7.0200 6.1100 8.7400 7.7200 8.96000 9.62000

On my fairly high-end server a context switch costs around 8 microseconds. For a high-performance server that is not acceptable, so we want to get as much done as possible within each time slice instead of wasting it on needless switching.
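If you want to reproduce this kind of measurement, LMbench's lat_ctx benchmark is what produces these numbers; a quick sketch (the process counts and working-set sizes below are only examples):

# context-switch latency between 2 processes with a 0 KB working set
lat_ctx -s 0 2
# and between 8 processes, each touching 16 KB between switches
lat_ctx -s 16 8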

Curiosity killed the cat, so let's find out who is switching our processes:

[root@my174 admin]# dstat 1
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  0   0 100   0   0   0|   0     0 | 796B 1488B|   0     0 |1004   128 
  0   0 100   0   0   0|   0     0 | 280B  728B|   0     0 |1005   114 
  0   0 100   0   0   0|   0     0 | 280B  728B|   0     0 |1005   128 
  0   0 100   0   0   0|   0     0 | 280B  728B|   0     0 |1005   114 
  0   0 100   0   0   0|   0   320k| 280B  728B|   0     0 |1008   143 
...

We can see that csw is around 120/s, but tools like dstat or vmstat don't tell us who is doing the damage. Fine, let's do it ourselves and bring out our beloved SystemTap!

[root@my174 admin]# cat >cswmon.stp
#! /usr/bin/env stap
# cswmon.stp - count context switches between task pairs and print the
# top 20 every $1 seconds (usage: stap cswmon.stp <interval_in_seconds>)

global csw_count
global idle_count

probe scheduler.cpu_off {
  csw_count[task_prev, task_next]++
  idle_count+=idle
}


function fmt_task(task_prev, task_next)
{
   return sprintf("%s(%d)->%s(%d)",
                                task_execname(task_prev), 
                                task_pid(task_prev), 
                                task_execname(task_next), 
                                task_pid(task_next))
}

function print_cswtop () {
  printf ("%45s %10s\n", "Context switch", "COUNT")
  foreach ([task_prev, task_next] in csw_count- limit 20) {
    printf("%45s %10d\n", fmt_task(task_prev, task_next), csw_count[task_prev, task_next])
  }
  printf("%45s %10d\n", "idle", idle_count)

  delete csw_count
  delete idle_count
}

probe timer.s($1) {
  print_cswtop ()
  printf("--------------------------------------------------------------\n")
}
CTRL+D

Every interval, this script prints the top 20 process pairs that switched the most, together with their pids. Let's look at the result:

[root@my174 admin]# stap cswmon.stp 5
                               Context switch      COUNT
                swapper(0)->systemtap/11(908)        500
                systemtap/11(908)->swapper(0)        498
                swapper(0)->fct1-worker(2492)         50
                fct1-worker(2492)->swapper(0)         50
                swapper(0)->fct0-worker(2191)         50
                fct0-worker(2191)->swapper(0)         50
                      swapper(0)->bond0(3432)         50
                      bond0(3432)->swapper(0)         50
                      stapio(879)->swapper(0)         26
                      swapper(0)->stapio(879)         25
                      stapio(879)->swapper(0)         19
                      swapper(0)->stapio(879)         17
                   swapper(0)->watchdog/9(31)          5
                   watchdog/9(31)->swapper(0)          5
                    swapper(0)->mysqld(18346)          5
                    mysqld(18346)->swapper(0)          5
                  swapper(0)->watchdog/13(43)          5
                  watchdog/13(43)->swapper(0)          5
                  swapper(0)->watchdog/14(46)          5
                  watchdog/14(46)->swapper(0)          5
                                         idle        859
--------------------------------------------------------------
...

We can see which process switched to which, and how many times. On the last line I print the idle count, i.e. the number of times the system had nothing to do and switched to the idle task (swapper, pid 0) to rest.

With this investigation we can see clearly where the system's overhead is going, which makes it much easier to pin down problems.
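As a quick cross-check without SystemTap, sysstat's pidstat can report per-process switch rates; a sketch (the 1-second interval is just an example):

# voluntary (cswch/s) and involuntary (nvcswch/s) context switches per task
pidstat -w 1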
Have fun!

Who is consuming our cache on Linux

September 25th, 2010

On Linux, file and device accesses are normally cached to speed things up; that is the system's default behaviour. The cache costs memory, and although that memory can be released on demand with a command like echo 3 > /proc/sys/vm/drop_caches, sometimes we still want to understand who is consuming it.
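For reference, a minimal sequence for releasing the cache and watching the effect is shown below (run as root; flushing dirty pages with sync first is usually a good idea):

free -m                            # note the "cached" column
sync                               # write dirty pages back first
echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes
free -m                            # "cached" should now be much smaller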

Now, let's look at the current memory usage:

[root@my031045 ~]# free
             total       used       free     shared    buffers     cached
Mem:      24676836     626568   24050268          0      30884     508312
-/+ buffers/cache:      87372   24589464
Swap:      8385760 

With the mighty SystemTap we can use a stap script to find out who is consuming our cache:

# this one-liner shows who is adding data to the page cache
[root@my031045 ~]# stap -e 'probe vfs.add_to_page_cache {printf("dev=%d, devname=%s, ino=%d, index=%d, nrpages=%d\n", dev, devname, ino, index, nrpages )}'
...
dev=2, devname=N/A, ino=0, index=2975, nrpages=1777
dev=2, devname=N/A, ino=0, index=3399, nrpages=2594
dev=2, devname=N/A, ino=0, index=3034, nrpages=1778
dev=2, devname=N/A, ino=0, index=3618, nrpages=2595
dev=2, devname=N/A, ino=0, index=1694, nrpages=106
dev=2, devname=N/A, ino=0, index=1703, nrpages=107
dev=2, devname=N/A, ino=0, index=1810, nrpages=210
dev=2, devname=N/A, ino=0, index=1812, nrpages=211
...

Now let's copy a big file:

[chuba@my031045 ~]$ cp huge_foo.file  bar

# and we can watch the file's contents being pumped into the cache:
...
dev=8388614, devname=sda6, ino=2399271, index=39393, nrpages=39393
dev=8388614, devname=sda6, ino=2399271, index=39394, nrpages=39394
dev=8388614, devname=sda6, ino=2399271, index=39395, nrpages=39395
dev=8388614, devname=sda6, ino=2399271, index=39396, nrpages=39396
dev=8388614, devname=sda6, ino=2399271, index=39397, nrpages=39397
dev=8388614, devname=sda6, ino=2399271, index=39398, nrpages=39398
dev=8388614, devname=sda6, ino=2399271, index=39399, nrpages=39399
dev=8388614, devname=sda6, ino=2399271, index=39400, nrpages=39400
dev=8388614, devname=sda6, ino=2399271, index=39401, nrpages=39401
dev=8388614, devname=sda6, ino=2399271, index=39402, nrpages=39402
dev=8388614, devname=sda6, ino=2399271, index=39403, nrpages=39403
dev=8388614, devname=sda6, ino=2399271, index=39404, nrpages=39404
dev=8388614, devname=sda6, ino=2399271, index=39405, nrpages=39405
dev=8388614, devname=sda6, ino=2399271, index=39406, nrpages=39406
dev=8388614, devname=sda6, ino=2399271, index=39407, nrpages=39407
dev=8388614, devname=sda6, ino=2399271, index=39408, nrpages=39408
dev=8388614, devname=sda6, ino=2399271, index=39409, nrpages=39409
dev=8388614, devname=sda6, ino=2399271, index=39410, nrpages=39410
dev=8388614, devname=sda6, ino=2399271, index=39411, nrpages=39411
...

Besides that, suppose we want to know who is currently using the system's cache, and how many pages each file occupies. There is a script that can do exactly this; many thanks to 子团 for letting me use his code.

[chuba@my031045 ~]# stap -g viewcache.stp

In another shell:
[chuba@my031045 ~]# dmesg
...
inode: 116397109, num: 5
inode: 116397111, num: 2
inode: 116397112, num: 1
inode: 116397149, num: 2
inode: 116397152, num: 1
inode: 116397336, num: 2
inode: 116397343, num: 1
inode: 116397371, num: 4
inode: 116397372, num: 2
...

It shows very clearly how many pages each inode occupies; convert the inode to a file name with a tool and you know which file is costing how much memory.

Download viewcache.stp

A couple of small tips:

From inode to file name:
find / -inum your_inode

From file name to inode:
stat -c "%i" your_filename
or: ls -i your_filename

Applying this, we immediately find out which file is occupying so much cache:

[chuba@my031045 ~]$ sudo find / -inum 2399248
/home/chuba/kernel-debuginfo-2.6.18-164.el5.x86_64.rpm
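To go straight from the viewcache output to the biggest consumers, something like the following works; a sketch that parses the "inode: N, num: M" lines shown above:

# top 10 inodes by number of cached pages, from the dmesg output above
dmesg | awk '/^inode:/ {gsub(",",""); print $4, $2}' | sort -rn | head
# then resolve each inode with: find /your/filesystem -inum <inode>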

Have fun.

References:
On the difference between the page cache and the buffer cache, this article has the best summary: http://blog.chinaunix.net/u/1595/showart.php?id=2209511

Postscript:
Linux has a system call that reports the state of pages: mincore - determine whether pages are resident in memory.
Someone also wrote a script, fincore, that makes this more convenient: download fincore.

Later 子团 told me about this tool as well: https://code.google.com/p/linux-ftools/

Introduction to fio, an I/O benchmarking tool

September 25th, 2010

Home page: http://freshmeat.net/projects/fio/

fio is an I/O tool meant to be used both for benchmark and stress/hardware verification. It has support for 13 different types of I/O engines (sync, mmap, libaio, posixaio, SG v3, splice, null, network, syslet, guasi, solarisaio, and more), I/O priorities (for newer Linux kernels), rate I/O, forked or threaded jobs, and much more. It can work on block devices as well as files. fio accepts job descriptions in a simple-to-understand text format. Several example job files are included. fio displays all sorts of I/O performance information. It supports Linux, FreeBSD, NetBSD, OS X, and OpenSolaris.

On Ubuntu you can install it simply with apt-get install fio.

The tool's biggest strengths are that it is easy to use and that it supports a very wide range of I/O engines, covering pretty much every file access pattern we are likely to meet:
sync: Basic read(2) or write(2) I/O. lseek(2) is used to position the I/O location.
psync: Basic pread(2) or pwrite(2) I/O.
vsync: Basic readv(2) or writev(2) I/O. Will emulate queuing by coalescing adjacent I/Os into a single submission.
libaio: Linux native asynchronous I/O.
posixaio: glibc POSIX asynchronous I/O using aio_read(3) and aio_write(3).
mmap: File is memory mapped with mmap(2) and data copied using memcpy(3).
splice: splice(2) is used to transfer the data and vmsplice(2) to transfer data from user space to the kernel.
syslet-rw: Use the syslet system calls to make regular read/write asynchronous.
sg: SCSI generic sg v3 I/O.
net: Transfer over the network. filename must be set appropriately to `host/port' regardless of data direction. If receiving, only the port argument is used.
netsplice: Like net, but uses splice(2) and vmsplice(2) to map data and send/receive.
guasi: The GUASI I/O engine is the Generic Userspace Asynchronous Syscall Interface approach to asynchronous I/O.

It also lets you control the I/O depth, which is very helpful when benchmarking disks, and its reporting of the results is very clear.

A typical invocation looks like this:
fio --filename=/dev/sdc1 --direct=1 --rw=randread --bs=4k --size=60G --numjobs=64 --runtime=10 --group_reporting --name=fileXXX
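The same workload can also be written as a job file, which fio reads in a simple ini-style format. A minimal sketch (the ioengine and iodepth lines are my additions to show how the I/O depth mentioned above is set; device name and sizes are only examples, not a recommendation):

; randread.fio - 4k random reads against a raw device
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=16
runtime=10
group_reporting

[randread]
filename=/dev/sdc1
size=60G
numjobs=64

Run it with: fio randread.fio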

Have fun.


Introduction to nmon, a very handy performance monitoring tool for Linux

September 25th, 2010

The nmon tool is designed for AIX and Linux performance specialists to use for monitoring and analyzing performance data, including:
* CPU utilization
* Memory use
* Kernel statistics and run queue information
* Disks I/O rates, transfers, and read/write ratios
* Free space on file systems
* Disk adapters
* Network I/O rates, transfers, and read/write ratios
* Paging space and paging rates
* CPU and AIX specification
* Top processors
* IBM HTTP Web cache
* User-defined disk groups
* Machine details and resources
* Asynchronous I/O — AIX only
* Workload Manager (WLM) — AIX only
* IBM TotalStorage® Enterprise Storage Server® (ESS) disks — AIX only
* Network File System (NFS)
* Dynamic LPAR (DLPAR) changes — only pSeries p5 and OpenPower for either AIX or Linux

On Ubuntu you can install it simply with apt-get -y install nmon. Its biggest strength is that all the performance data you need day to day is there, presented graphically, easy to interpret and very rich in information.
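Besides the interactive screen, nmon can also record data to a file for later analysis; a sketch (interval and snapshot count are just examples):

# one snapshot every 10 seconds, 360 snapshots (one hour), written to a .nmon file
nmon -f -s 10 -c 360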

For details see the article http://www.ibm.com/developerworks/aix/library/au-analyze_aix/ which has screenshots and explains things very clearly.

Have fun, everyone.


Investigating CPU topology

September 25th, 2010

When writing multi-core programs (Erlang programs, for example) we need to understand the CPU topology: how logical CPUs map to physical CPUs, as well as the CPU's internal hardware parameters, such as the sizes of the L1 and L2 caches.

On Linux, /proc/cpuinfo provides some of this information, but it is not very complete. /sys/devices/system/cpu/ also exposes the topology, but it is rather hard to read.
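For reference, the raw data can be pulled straight out of sysfs; a sketch (cpu0 is just an example, and which files exist depends on the kernel version):

# which physical package and core does logical cpu0 belong to?
cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
cat /sys/devices/system/cpu/cpu0/topology/core_id
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# per-level cache parameters for cpu0 (index0 = L1d, index1 = L1i, ...)
grep . /sys/devices/system/cpu/cpu0/cache/index*/{level,type,size}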

Often we need a more specialized tool, and Intel provides exactly that. See: http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/

Just download it, compile it, and run it.

[admin@my174 cpu-topology]$ ./cpu_topology64.out


gcc mudflap for detecting out-of-bounds memory access

September 25th, 2010

References: http://www.redhat.com/magazine/015jan06/features/valgrind/
http://www.stlinux.com/devel/debug/mudflap

When we write large server programs in C we inevitably run into memory errors; the typical ones are leaks, out-of-bounds access and random scribbling over memory. On Linux, valgrind is a very good tool and catches most of these, but for the more subtle out-of-bounds problems valgrind is sometimes powerless. For example:

[admin@my174 ~]$ cat bug.c

int a[10];
int b[10];
int main(void) {
   return a[11];  /* out-of-bounds read: a has only 10 elements */
}
[admin@my174 ~]$ gcc -g -o bug bug.c
[admin@my174 ~]$ valgrind ./bug
==5791== Memcheck, a memory error detector.
==5791== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==5791== Using LibVEX rev 1658, a library for dynamic binary translation.
==5791== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==5791== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==5791== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==5791== For more details, rerun with: -v
==5791== 
==5791== 
==5791== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 1)
==5791== malloc/free: in use at exit: 0 bytes in 0 blocks.
==5791== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==5791== For counts of detected errors, rerun with: -v
==5791== All heap blocks were freed -- no leaks are possible.
[admin@my174 ~]$ 

valgrind reports that everything is fine.

[admin@my174 ~]$ gcc -o bug bug.c -g -fmudflap -lmudflap
[admin@my174 ~]$ ./bug
*******
mudflap violation 1 (check/read): time=1285386334.204054 ptr=0x700e00 size=48
pc=0x2b6c3013c4c1 location=`bug.c:5 (main)'
      /usr/lib64/libmudflap.so.0(__mf_check+0x41) [0x2b6c3013c4c1]
      ./bug(main+0x7a) [0x400952]
      /lib64/libc.so.6(__libc_start_main+0xf4) [0x39ea21d994]
Nearby object 1: checked region begins 0B into and ends 8B after
mudflap object 0x16599370: name=`bug.c:1 a'
bounds=[0x700e00,0x700e27] size=40 area=static check=3r/0w liveness=3
alloc time=1285386334.204025 pc=0x2b6c3013bfe1
number of nearby objects: 1

mudflap catches it without any trouble.

[admin@my174 ~]$ gcc -v

gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)

Of course this example is trivial and a typical server is far more complex; mudflap's runtime overhead is also very high, but when trying to pin down this kind of bug it is well worth a try.
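mudflap's runtime behaviour can be tuned through the MUDFLAP_OPTIONS environment variable; a sketch of two options I find handy (see the mudflap documentation for the full list):

# report unfreed heap objects at exit and break into gdb on a violation
MUDFLAP_OPTIONS="-print-leaks -viol-gdb" ./bug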
Have fun!


iozone, a file system benchmarking tool

September 24th, 2010

IOzone home page: http://www.iozone.org/
IOzone is a filesystem benchmark tool. The benchmark generates and measures a variety of file operations. Iozone has been ported to many machines and runs under many operating systems.

Benchmark Features:
* ANSII C source
* POSIX async I/O
* Mmap() file I/O
* Normal file I/O
* Single stream measurement
* Multiple stream measurement
* Distributed fileserver measurements (Cluster)
* POSIX pthreads
* Multi-process measurement
* Excel importable output for graph generation
* Latency plots
* 64bit compatible source
* Large file compatible
* Stonewalling in throughput tests to eliminate straggler effects
* Processor cache size configurable
* Selectable measurements with fsync, O_SYNC
* Builds for: AIX, BSDI, HP-UX, IRIX, FreeBSD, Linux, OpenBSD, NetBSD, OSFV3, OSFV4, OSFV5, SCO OpenServer, Solaris, MAC OS X, Windows (95/98/Me/NT/2K/XP)

Its focus is very clear: benchmarking the file system. Unlike the usual I/O benchmarking tools such as sysbench, fio and iometer, it works mainly by simulating the different ways users access files, typically the following:
(0=write/rewrite, 1=read/re-read, 2=random-read/write
3=Read-backwards, 4=Re-write-record, 5=stride-read, 6=fwrite/re-fwrite
7=fread/Re-fread, 8=random_mix, 9=pwrite/Re-pwrite, 10=pread/Re-pread
11=pwritev/Re-pwritev, 12=preadv/Re-preadv)
so as to separate the cost of accessing the file system's metadata from the cost of accessing its data, and thereby reflect the file system's performance.

On Ubuntu you can install it simply with apt-get -y install iozone.

It has two modes: 1. a throughput test mode; 2. testing how the file system responds to different combinations of record size and file size.

Here are the parameters I have used for the throughput test mode:
iozone -t -l 1 -u 16 -L 64 -S 8192 -b fio.xls -R -M -s 10G -r 32k -I -T -C -j 32 -+p 60
Parameter explanation:
-t -> throughput test
-s 10G -> file size set to 10 GB
-M -> include machine info: Linux my174.cm4 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009
-r 32k -> record size 32 KB
-I -> O_DIRECT feature enabled
-T -> use POSIX pthreads for the throughput test
-C -> show bytes transferred by each child
-S 8192 -> processor cache size set to 8192 KB
-L 64 -> processor cache line size set to 64 bytes
-j 32 -> file stride size set to 32 * record size
-l 1 -> minimum number of threads = 1
-u 16 -> maximum number of threads = 16
-R -> Excel chart generation enabled
-b fio.xls -> name of the binary-format Excel file to generate
-+p 60 -> percentage of reads in the mixed test is 60

Parameters for testing how the file system responds to different combinations of record size and file size:
TODO
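Until I fill that in, here is a sketch of iozone's automatic mode, which sweeps record and file sizes by itself; the test selection and size limits below are only examples, not a recommendation:

# -a: auto mode (all record sizes), -n/-g: min/max file size,
# -i 0 -i 1: write/rewrite and read/re-read only, -R -b: Excel-importable report
iozone -a -n 64m -g 4g -i 0 -i 1 -R -b sweep.xls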

Have fun.
