Source Code Analysis | 系统技术非业余研究

Introducing Zopfli, an Open-Source Compression Algorithm

March 1st, 2013 Yu Feng 1 comment

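Original article. If you repost it, please credit: reposted from 系统技术非业余研究

Google recently released Zopfli, a brand-new open-source compression algorithm. The original post linked to the official home page and to the related documentation.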

Zopfli is a new deflate compatible compressor that was inspired by compression improvements
developed originally for the lossless mode of WebP image compression. Being compatible with
deflate makes Zopfli compatible with zlib and gzip. Most internet browsers support deflate
decompression, and it has a wide range of other applications. This means that Zopfli-compatible
decompression is readily widely available.

Two key characteristics:
1. The output produced by Zopfli is 3.7–8.3 % smaller than that of gzip 9.
2. Zopfli is 81 times slower than the fastest measured algorithm gzip 9.

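Its biggest selling point is that the compressed data stays compatible with existing formats: the output is plain deflate (gzip/zlib), so any standard decompressor in use today can unpack it. That looks like a good fit for data stored by web servers, where it lowers cost. The gain is only about 3 to 8 percent, but at large data volumes that is still substantial.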
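The compatibility claim can be illustrated without Zopfli itself. The sketch below uses Python's standard zlib module as a stand-in (Zopfli is not in the standard library, so no Zopfli call is made here): two different compression effort levels produce different output sizes, yet the same stock decompressor restores both. The sample data and the helper name compress_and_check are made up for illustration.

```python
import zlib

def compress_and_check(data, level):
    """Compress with zlib at the given level and verify a stock decoder restores it."""
    packed = zlib.compress(data, level)
    # The decoder does not need to know which encoder or effort level was used.
    assert zlib.decompress(packed) == data
    return len(packed)

# Hypothetical sample payload, chosen only because it compresses well.
sample = b"the quick brown fox jumps over the lazy dog. " * 2000
fast_size = compress_and_check(sample, 1)
best_size = compress_and_check(sample, 9)
print(f"original: {len(sample)} bytes, level 1: {fast_size} bytes, level 9: {best_size} bytes")
```

The point is the same one the post makes about Zopfli: spending more CPU time on the encoder side changes only the size of the output, not what the receiving side has to support.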

Download the source and build it to get the zopfli binary:

$ ./zopfli  -h
Usage: zopfli [OPTION]... FILE
  -h    gives this help
  -c    write the result on standard output, instead of disk filename + '.gz'
  -v    verbose mode
  --gzip  output to gzip format (default)
  --deflate  output to deflate format instead of gzip
  --zlib  output to zlib format instead of gzip
  --i5  less compression, but faster
  --i10  less compression, but faster
  --i15  default compression, 15 iterations
  --i25  more compression, but slower
  --i50  more compression, but slower
  --i100  more compression, but slower
  --i250  more compression, but slower
  --i500  more compression, but slower
  --i1000  more compression, but slower

Have fun.

Categories: Tool Introductions, Source Code Analysis Tags: zopfli

nicstat: A Handy Tool for Network Traffic Statistics

February 27th, 2013 Yu Feng 6 comments

A while ago I saw nicstat mentioned in Brendan Gregg's Linux Performance Analysis and Tools slides. I looked into it, found it to be a nice tool, and am sharing it here.

nicstat is to network interfaces as “iostat” is to disks, or “prstat” is to processes.

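nicstat was originally a tool for displaying NIC traffic on Solaris; Tim Cook ported it to Linux (the original post linked to the official site). Compared with netstat, it has the following key features: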

Reports bytes in & out as well as packets.
Normalizes these values to per-second rates.
Reports on all interfaces (while iterating)
Reports Utilization (rough calculation as of now)
Reports Saturation (also rough)
Prefixes statistics with the current time

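Let's try it out, starting with installation. The source can be downloaded from the link in the original post; the latest version at the time of writing is 1.92.
After unpacking, note that this version builds for 32-bit Linux by default, so on a 64-bit system Makefile.Linux needs a small change: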

$ uname -r
2.6.32-131.21.1.tb477.el6.x86_64
$ diff Makefile.Linux64 Makefile.Linux
17c17
< CFLAGS =      $(COPT) -m32
---
> CFLAGS =      $(COPT)

$ sudo make -f Makefile.Linux install  
sudo install -o root -g root -m 4511 `./nicstat.sh --bin-name` /usr/local/bin/nicstat
sudo install -o bin -g bin -m 555 enicstat /usr/local/bin
sudo install -o bin -g bin -m 444 nicstat.1 /usr/local/share/man/man1/nicstat.1

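nicstat and enicstat are now installed and ready to use.

The usage documentation is in the man page: man nicstat
On Linux, nicstat needs to read information such as the NIC's speed, so it has to run with root privileges.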
Read more…

Categories: Linux, Tool Introductions, Source Code Analysis Tags: nicstat, network traffic

Putting mmap's MAP_POPULATE Flag to Good Use

January 19th, 2013 Yu Feng 1 comment

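A while ago, while studying MySQL and the kernel, I came across two NUMA-related articles that Twitter's lead DBA wrote while wrestling with a NUMA imbalance problem. They are technically very solid (the original post linked to them).

The solution to the NUMA imbalance problem is excerpted below: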

A more thorough solution

The original post also only addressed only one part of the solution: using interleaved allocation.
A complete and reliable solution actually requires three things, as we found when implementing this change for production systems at Twitter:

1. Forcing interleaved allocation with numactl --interleave=all. This is exactly as described previously, and works well.

2. Flushing Linux’s buffer caches just before mysqld startup with sysctl -q -w vm.drop_caches=3. This helps to ensure allocation fairness, even if the daemon is
restarted while significant amounts of data are in the operating system buffer cache.

3. Forcing the OS to allocate InnoDB’s buffer pool immediately upon startup, using MAP_POPULATE where supported (Linux 2.6.23+), and falling back to memset otherwise. This forces the NUMA node allocation decisions to be made immediately, while the buffer cache is still clean from the above flush.

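The original post linked to the concrete implementation.
It shows how to use mmap's MAP_POPULATE flag to pre-allocate anonymous pages, which is a very good idea.

Let's see what man mmap says: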

MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on
the file. Later accesses to the mapping will not be blocked by page faults. MAP_POPULATE is
only supported for private mappings since Linux 2.6.23.

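This flag has existed for a long time, but few programs actually use it. I read this part of the code closely last year while analyzing the kernel's memory subsystem. In certain scenarios the feature is quite beneficial.

Memory obtained from mmap is normally either anonymous pages or a mapping of a file. When that range of linear addresses is accessed and the required page is not in memory, a page fault occurs and the kernel allocates physical memory; for a file-backed mapping it also reads the file contents in along the way. In a high-performance server, this allocate-on-access behavior becomes a problem.

The problem shows up in two ways:
1. By the time the memory is actually allocated, system memory is already in a messy state, so there is no telling which NUMA node the kernel will allocate from. In extreme cases, when memory runs short, pages get swapped out, and how long that takes is very hard to control. The placement of the allocation cannot be specified precisely either.
2. The time spent reading the file is very hard to control; the system may stall waiting for the I/O to complete.

If, while system memory is still fairly clean (for example right after boot, or right after setting vm.drop_caches=3), we prepare the memory or data we need in the layout we intend, then this up-front work does take a long time, but what we get in return is predictability afterwards.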

The original post linked to sample code that uses mmap's MAP_POPULATE flag.

If your system does not support this flag, calling memset(ptr, '\0', size); is also a good option.
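
As a rough illustration of the technique (not the code the post linked to, which is not reproduced here), the sketch below maps anonymous private memory with MAP_POPULATE where available and otherwise touches one byte per page, in the spirit of the memset fallback. It assumes a Unix-like system; Python exposes mmap.MAP_POPULATE only on Linux with Python 3.10 or later, which is why the flag is looked up with getattr. The function name alloc_prefaulted and the 16-page size are hypothetical.

```python
import mmap

def alloc_prefaulted(size):
    """Map `size` bytes of anonymous private memory with its pages already faulted in."""
    populate = getattr(mmap, "MAP_POPULATE", 0)    # 0 when the constant is unavailable
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | populate
    buf = mmap.mmap(-1, size, flags=flags, prot=mmap.PROT_READ | mmap.PROT_WRITE)
    if not populate:
        # Fallback in the spirit of memset(ptr, '\0', size): writing one byte
        # per page forces the kernel to back each page with physical memory now.
        for offset in range(0, size, mmap.PAGESIZE):
            buf[offset] = 0
    return buf

pool = alloc_prefaulted(16 * mmap.PAGESIZE)
print(len(pool), "bytes mapped and prefaulted")
pool.close()
```

Either way the cost of the page faults is paid once at allocation time, while memory is still clean, instead of being spread unpredictably across later accesses.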

Summary: high-performance servers are full of details and demand real technical depth!

Have fun!

Categories: Linux, Source Code Analysis Tags: MAP_POPULATE, memset, mmap, twitter

Where Did Linux's Used Memory Actually Go?

January 19th, 2013 Yu Feng 44 comments

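A few days ago 纯上 asked a question:

ps aux shows an RSS of less than 30M, yet free shows that 7 to 8G of memory is already in use and the machine has started to swap. Is the physical memory accounting in ps aux missing some memory? How can I find out where the used memory reported by free has gone?

He is not the only one to run into this; 子嘉 hit the same problem earlier. Memory accounting always ends up as a muddle, so today let's work it out properly!

This is how we usually check how much memory is left: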

$free -m
             total       used       free     shared    buffers     cached
Mem:         48262       7913      40349          0         14        267
-/+ buffers/cache:       7631      40631
Swap:         2047        336       1711

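How should this output be read? The figure in the original post explains it quite clearly!

Addendum (several readers reported that the figure is hard to read; please refer to http://www.redbooks.ibm.com/redpapers/pdfs/redp4285.pdf, pages 46-47)

In the output above, total memory is 48262M and 7913M is used. Of that, buffers plus cache account for 14+267=281M. Memory of this kind is reclaimable, so although 7913M is in use, the buffer/cache portion can be released if we really need it.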
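
The arithmetic behind the "-/+ buffers/cache" row can be sketched in a few lines. This is only a restatement of the calculation described above using the numbers from the free -m output; the function name is hypothetical.

```python
def buffers_cache_row(used, free, buffers, cached):
    """Return (used_by_applications, available_to_applications), all values in MB."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable

# Numbers taken from the free -m output shown above.
apps_used, apps_free = buffers_cache_row(used=7913, free=40349, buffers=14, cached=267)
print(apps_used, apps_free)
```

This gives 7632 and 40630, within 1M of the 7631 and 40631 that free itself printed; the small difference is presumably rounding, since free works from kB values before converting to MB.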

Let's demonstrate:

$ sudo sysctl vm.drop_caches=3
vm.drop_caches = 3
$ free -m
             total       used       free     shared    buffers     cached
Mem:         48262       7676      40586          0          3         41
-/+ buffers/cache:       7631      40631
Swap:         2047        336       1711

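Most of the buffer/cache has been cleared, leaving only 44M in use, so used now stands at 7676M.
At this point a few concepts should be clear:
1. How much total memory there is.
2. Buffer/cache memory can be released.
3. What used memory means.

Even so, we still need to track down where the used space (7637M) has actually gone.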
Read more…

Categories: Linux, Source Code Analysis Tags: free, pagetable, rss, slabinfo, statm

An In-Depth Look: Is irqbalance Actually Useful?

January 17th, 2013 Yu Feng 8 comments

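The irqbalance project home page was linked from the original post.

irqbalance optimizes interrupt distribution. It automatically collects system data to analyze usage patterns and, depending on system load, puts itself in Performance mode or Power-save mode. In Performance mode, irqbalance spreads interrupts as evenly as possible across the CPU cores to make full use of multiple cores and improve performance.
In Power-save mode, irqbalance concentrates interrupts on the first CPU so that the other idle CPUs can stay asleep longer, reducing power consumption.

On RHEL distributions this daemon is enabled at boot by default. How do we check its status?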

# service irqbalance status
irqbalance (pid PID) is running…

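In practice, however, our dedicated applications are usually pinned to specific CPUs, so we do not really need it. If it is already running, it can be stopped with the following command: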

# service irqbalance stop
Stopping irqbalance: [ OK ]

Or simply disable it at boot altogether:

# chkconfig irqbalance off

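Next, let's analyze how irqbalance works, so that we know exactly when to use it and when not to.

Since irqbalance exists to optimize interrupt distribution, we start with interrupts. This is a long article, so take a deep breath and let's go!

For background on SMP IRQ affinity, see the article linked from the original post.
The key excerpt: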

SMP affinity is controlled by manipulating files in the /proc/irq/ directory.
In /proc/irq/ are directories that correspond to the IRQs present on your
system (not all IRQs may be available). In each of these directories is
the “smp_affinity” file, and this is where we will work our magic.

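Put plainly, you write the bitmask of the CPUs you want an IRQ bound to into the file /proc/irq/N/smp_affinity! For how to set interrupt affinity by hand, see my two earlier posts (linked from the original).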
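
As a small worked example of the mask itself: bit k of the mask stands for CPU k, and the file takes the value in hexadecimal. The sketch below only computes the string; it assumes a machine with at most 32 CPUs, where a single hex word is enough, and the helper names are hypothetical.

```python
def cpus_to_affinity_mask(cpus):
    """Build the hex bitmask string for a list of CPU numbers (bit k set means CPU k)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")

def affinity_mask_to_cpus(mask_hex):
    """Inverse helper: list the CPU numbers whose bits are set in a hex mask string."""
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]

print(cpus_to_affinity_mask([0]))           # CPU 0 only
print(cpus_to_affinity_mask([1, 3]))        # CPUs 1 and 3
print(affinity_mask_to_cpus("f"))           # CPUs 0 through 3

# Applying a mask needs root and a real IRQ number N, for example (not run here):
#   with open("/proc/irq/N/smp_affinity", "w") as f:
#       f.write(cpus_to_affinity_mask([2]))
```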

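Now for some background concepts. Let's look at CPU topology, starting with how the components of an Intel CPU relate to each other:

A NUMA node consists of one or more sockets together with the local memory attached to them. A multi-core socket contains multiple cores. If the CPU supports HT (Hyper-Threading), the OS additionally sees each core as 2 logical processors.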
Read more…

Categories: Linux, Source Code Analysis Tags: irq, irqbalance, smp_affinity, topology

Exploring a Low-Cost, High-Performance Architecture for MySQL Cloud Data Services

October 25th, 2012 Yu Feng 9 comments

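Original article: http://www.alibabatech.org/article/detail/3405/0?ticket=d69f07f8-b60b-43f8-9572-7d795bb8429d
Author: 鸣嵩
Slides (PPT): download link in the original post

This article was published in issue 10, 2012 of 《程序员》 (Programmer magazine).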

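MySQL is a low-cost, high-performance, reliable and open-source database product that is used very widely at Internet companies; Taobao, for example, runs several thousand MySQL servers. NoSQL has grown quickly over the past two years, with new products appearing constantly, but using NoSQL in business applications places fairly high demands on developers, whereas MySQL has mature middleware and operations tooling and has already formed a healthy ecosystem. For now, therefore, MySQL plays the leading role and NoSQL a supporting one.
Over the past year we (the Alibaba Group core systems database team) have done a great deal of work on a MySQL hosting platform, designing and implementing a system called UMP (Unified MySQL Platform) that provides low-cost, high-performance MySQL cloud data services. Developers request MySQL instance resources from the platform and access their data through a single entry point that the platform provides. Internally, UMP maintains and manages resource pools and offers, transparently to users, a range of services including master-slave hot standby, data backup, migration, disaster recovery, read/write splitting, and database and table sharding. The platform lowers cost by running multiple MySQL instances on one physical machine, and it implements resource isolation, allocating and limiting CPU, memory and I/O resources on demand. It also supports scaling capacity up and down dynamically as a user's business grows, without interrupting the data service.

Evolution of the architecture
The first version of UMP was based on mysql-proxy 0.8. We fixed a number of bugs, modified the state machine in the proxy plugin that manages user connections and database connections, and wrote Lua scripts that fetch user authentication information and backend database addresses from a central database, authenticate users, establish connections to the backend databases, and forward packets.

Figure 1: The first version of UMP, built on MySQL Proxy

While developing and deploying the first version, we gradually became aware of several problems: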
Read more…

Categories: Erlang Exploration, Miscellaneous, Source Code Analysis Tags: lvs, mysql, proxy, ump

Linux TASK_IO_ACCOUNTING: What It Does and How to Use It

March 11th, 2012 Yu Feng 4 comments

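In the past, we mostly relied on iostat to understand system I/O, but its granularity only goes down to the individual device. Often we want to know how much I/O each process or thread issues. Before Linux 2.6.20 there was no way to do this other than with tools such as systemtap, because the kernel did not expose such statistics. For disktop, which reports read/write I/O statistics per device and per application, see my earlier post (linked from the original).

Browsing the code with lxr confirms that TASK_IO_ACCOUNTING was introduced in Linux 2.6.20. It exports the I/O activity of every thread and process through /proc/pid/io, which makes life much easier for users. Note that RHEL 5U4 is based on the 2.6.18 kernel but has this feature backported. The feature in turn gave rise to tools for inspecting per-process I/O activity, such as pidstat and iotop. The original post showed screenshots of both tools at work:

pidstat shows thread I/O activity as a hierarchy

iotop shows thread I/O activity as a flat list

strace shows that both tools take their I/O activity data from /proc/pid/io, so let's take a look at this file: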
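
As a rough sketch of what consuming this file looks like (not how pidstat or iotop are actually implemented), /proc/pid/io is a short list of "name: value" lines with the fields rchar, wchar, syscr, syscw, read_bytes, write_bytes and cancelled_write_bytes. The parser below turns that text into a dictionary; the sample values and function names are made up for illustration.

```python
def parse_proc_io(text):
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:                                  # ignore lines without a colon
            stats[key.strip()] = int(value)
    return stats

def read_proc_io(pid="self"):
    """Read the counters of a live process (needs Linux with task I/O accounting enabled)."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())

# Made-up sample in the file's format, so the parser can be tried anywhere.
SAMPLE_IO = """rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
"""

io_stats = parse_proc_io(SAMPLE_IO)
print(io_stats["read_bytes"], io_stats["write_bytes"])
```

Sampling read_proc_io for the same pid twice and subtracting the counters gives per-interval I/O for that process, which is the kind of figure the tools above display.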
Read more…

Categories: Tool Introductions, Source Code Analysis Tags: iotop, pidstat, TASK_IO_ACCOUNTING
