Yu Feng | Non-Amateur Research on Systems Technology

Clearing Up Commonly Confused Concepts About gen_tcp Receive Buffers

May 14th, 2013 Yu Feng 1 comment

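Each TCP connection in Erlang is represented by a corresponding gen_tcp object, which is simply a port that implements Erlang's networking logic. Its implementation lives in erts/emulator/drivers/common/inet_drv.c.

According to the inet:setopts documentation, there are three buffer-related options, and they are quite confusing: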

{buffer, Size}
Determines the size of the user-level software buffer used by the driver. Not to be confused with sndbuf and recbuf options which correspond to the kernel socket buffers. It is recommended to have val(buffer) >= max(val(sndbuf),val(recbuf)). In fact, the val(buffer) is automatically set to the above maximum when sndbuf or recbuf values are set.

{recbuf, Size}
Gives the size of the receive buffer to use for the socket.

{sndbuf, Size}
Gives the size of the send buffer to use for the socket.

Of these, sndbuf and recbuf are fairly easy to understand: they set the kernel send and receive buffers of the socket handle owned by gen_tcp. The code confirms this:

/* inet_drv.c */
#define INET_OPT_SNDBUF     6   /* set send buffer size */
#define INET_OPT_RCVBUF     7   /* set receive buffer size */
static int inet_set_opts(inet_descriptor* desc, char* ptr, int len)
{
...
        case INET_OPT_SNDBUF:    type = SO_SNDBUF;
            DEBUGF(("inet_set_opts(%ld): s=%d, SO_SNDBUF=%d\r\n",
                    (long)desc->port, desc->s, ival));
            break;
        case INET_OPT_RCVBUF:    type = SO_RCVBUF;
            DEBUGF(("inet_set_opts(%ld): s=%d, SO_RCVBUF=%d\r\n",
                    (long)desc->port, desc->s, ival));
            break;
...
        res = sock_setopt           (desc->s, proto, type, arg_ptr, arg_sz);
...
}
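For comparison, here is a minimal sketch of what setting recbuf amounts to at the socket level, outside of the Erlang driver. The helper name set_kernel_recbuf is hypothetical and not part of inet_drv.c; it just issues the same kind of SO_RCVBUF call and reads the value back. The size the kernel reports may differ from the request (Linux, for example, doubles the requested value for bookkeeping).

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not from inet_drv.c): asks the kernel for a receive
 * buffer of `bytes` on socket `fd` and returns the size the kernel reports
 * afterwards, or -1 if either call fails. */
static int set_kernel_recbuf(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes) != 0)
        return -1;

    int actual = 0;
    socklen_t len = sizeof actual;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) != 0)
        return -1;

    return actual; /* may be larger or smaller than the request */
}
```

Calling set_kernel_recbuf(fd, 65536) on a fresh TCP socket shows the size the kernel actually settled on, which is the number that matters when comparing it against the driver-level buffer option.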

So what is buffer, and how do the three options relate to each other? The documentation says:
It is recommended to have val(buffer) >= max(val(sndbuf),val(recbuf)). In fact, the val(buffer) is automatically set to the above maximum when sndbuf or recbuf values are set.

Categories: Erlang Exploration, Source Code Analysis Tags: buffer, gen_tcp, recbuf, sndbuf

A Fast Way to Compute log2

May 9th, 2013 Yu Feng 8 comments

Excerpted from erl_mseg.c:

static const int debruijn[32] = {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

#define LOG2(X) (debruijn[((Uint32)(((X) & -(X)) * 0x077CB531U)) >> 27])

For your reference!
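To make the trick easier to try out, here is a small sketch that wraps the same table and constant in a function. The name lowest_set_bit_index is my own, not from erl_mseg.c. Note that X & -X isolates the lowest set bit, so the result is an exact log2 only when the input is a power of two; for other values it gives the index of the lowest set bit.

```c
#include <assert.h>
#include <stdint.h>

/* Same de Bruijn lookup table as in the excerpt above. */
static const int debruijn[32] = {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

/* Hypothetical wrapper: returns the index of the lowest set bit of x.
 * For a power of two this equals log2(x). x must be non-zero. */
static int lowest_set_bit_index(uint32_t x)
{
    assert(x != 0);
    uint32_t isolated = x & (0u - x);  /* keep only the lowest set bit */
    /* Multiplying by the de Bruijn constant moves a unique 5-bit pattern
     * into the top bits; that pattern indexes the table. */
    return debruijn[(uint32_t)(isolated * 0x077CB531U) >> 27];
}
```

For example, lowest_set_bit_index(8) is 3 and lowest_set_bit_index(0x80000000u) is 31, while lowest_set_bit_index(12) is 2 because the lowest set bit of 12 is 4.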

Categories: Source Code Analysis Tags: debruijn, log2

The Fully Connected Erlang Cluster Problem and Its Solution

May 2nd, 2013 Yu Feng 10 comments

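By default an Erlang cluster is fully connected: when a node joins the cluster, the node that introduced it recommends that every other node in the cluster actively establish a connection with the newcomer.
The result is shown in the figure below:

More specifically, the net_kernel module is responsible for establishing, checking, and tearing down the channels between nodes, and it provides the monitor_node semantics.
Quoting http://www.erlang.org/doc/man/erlang.html: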

monitor_node(Node, Flag) -> true
Types:
Node = node()
Flag = boolean()
Monitors the status of the node Node. If Flag is true, monitoring is turned on; if Flag is false, monitoring is turned off.

Making several calls to monitor_node(Node, true) for the same Node is not an error; it results in as many, completely independent, monitorings.

If Node fails or does not exist, the message {nodedown, Node} is delivered to the process. If a process has made two calls to monitor_node(Node, true) and Node terminates, two nodedown messages are delivered to the process. If there is no connection to Node, there will be an attempt to create one. If this fails, a nodedown message is delivered.

Nodes connected through hidden connections can be monitored as any other node.

Failure: badarg if the local node is not alive.

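Other modules such as global, pg2, and mnesia all build their higher-level logic on the semantics that monitor_node provides. The introduction mechanism mentioned above is implemented by the global module: its purpose is to provide a cluster-wide mapping between names and processes, so it needs full connectivity.

Anyone who has taken secondary-school math knows that the total number of channels in the system is N*(N-1)/2. As N grows, this number rises sharply; see the figure below.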
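As a quick sanity check of the formula, here is a tiny sketch (the function name full_mesh_links is made up for illustration) that computes the number of channels in a fully connected cluster of a given size.

```c
#include <stdint.h>

/* Hypothetical helper: number of pairwise channels in a fully connected
 * cluster of `nodes` nodes, i.e. N*(N-1)/2. */
static uint64_t full_mesh_links(uint64_t nodes)
{
    if (nodes < 2)
        return 0;  /* zero or one node: nothing to connect */
    return nodes * (nodes - 1) / 2;
}
```

With 10 nodes this gives 45 channels, with 100 nodes 4950, and with 1000 nodes 499500, so a tenfold increase in nodes costs roughly a hundredfold increase in channels.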

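This is devastating for cluster size. That many connections consume a lot of resources. Worse, to detect whether nodes are alive, Erlang has to send heartbeat packets periodically, one tick per minute, which produces a flood of network traffic.

So how do we avoid this?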

Categories: Erlang Exploration, Source Code Analysis Tags: monitor_node, full connectivity

Congestion on the Erlang Cluster RPC Channel and Its Solution

May 2nd, 2013 Yu Feng 1 comment

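This time we will not discuss how to avoid full connectivity; instead we look at a problem with the inter-node channel itself.

We know that message sending in Erlang is transparent: just call Pid!Msg, and the VM and the cluster infrastructure guarantee that the message reaches the message queue of the target process. This is a semantic guarantee. If that Pid lives on another node, the message travels over the inter-node rpc channel. The rpc module builds remote function calls on top of this semantic.

The community currently tends to recommend layering Erlang services, so interaction between layers mostly goes through rpc. Layered architectures like the one in the figure below are becoming more common, and when large numbers of messages flow between nodes, the channel inevitably becomes congested.

Congestion causes the sending process to be suspended. rpc is a single process (a gen_server), so once it is suspended, rpc calls stop working. Apart from RPC, of course, plain Pid!Msg sends can still proceed in parallel.
This blocking greatly affects the system's response time (rt), with a large impact on performance and user experience.

So how do we locate and solve this problem? Erlang thoughtfully provides a complete set of tools:

First, detecting the problem: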

erlang:system_monitor(MonitorPid, Options) -> MonSettings

busy_dist_port
If a process in the system gets suspended because it sends to a process on a remote node whose inter-node communication was handled by a busy port, a message {monitor, SusPid, busy_dist_port, Port} is sent to MonitorPid. SusPid is the pid that got suspended when sending through the inter-node communication port Port.

For example, riak_sysmon uses the following code:

    BusyDistPortP = get_busy_dist_port(),
    Opts = lists:flatten(
             [[{long_gc, GcMsLimit} || lists:member(gc, MonitorProps)
                                           andalso GcMsLimit > 0],
              [{large_heap, HeapWordLimit} || lists:member(heap, MonitorProps)
                                                  andalso HeapWordLimit > 0],
              [busy_port || lists:member(port, MonitorProps)
                                andalso BusyPortP],
              [busy_dist_port || lists:member(dist_port, MonitorProps)
                                     andalso BusyDistPortP]]),
    _ = erlang:system_monitor(self(), Opts),

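When we receive {monitor, SusPid, busy_dist_port, Port} messages, we can confirm that the system is running into this blocking problem.

So how do we solve it?

The community recognized this problem long ago, so dist_buf_busy_limit was designed as a configurable value.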

Categories: Erlang Exploration, Source Code Analysis Tags: +zdbbl, busy_dist_port, dist_buf_busy_limit, system_monitor

Thoughts on WhatsApp's Deep Use of Erlang

April 30th, 2013 Yu Feng 13 comments

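After all these years, while the community is still debating whether Erlang is a niche language and voicing all sorts of doubts, WhatsApp has already pushed Erlang to its limits.

What is WhatsApp? See its official website: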

WhatsApp Messenger is a cross-platform mobile messaging app which allows you to exchange messages without having to pay for SMS.

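What is even more amusing is that the lead developer, Rick Reed (rr@whatsapp.com), previously worked at Yahoo! and SGI and has a deep background in system performance.

When developing the push server in 2012 (efsf2012-whatsapp-scaling):

Joined WhatsApp in 2011, New to Erlang

A complete novice.

When developing the multimedia support system in 2013 (reed-efsf2013-whatsapp):

Joined server team at WhatsApp in 2011, No prior Erlang experience

Two to three years later he is already an Erlang expert of the highest caliber.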

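Judging from his two slide decks, he has pushed Erlang's strengths to the limit, making use of its best parts (the VM, the cluster infrastructure, the mnesia database) and eliminating a great many problems with data scaling, memory pools, and locks. The techniques and fixes he mentions are well worth studying.

We have used most of these solutions ourselves in day-to-day work. But he organized them systematically and applied them in a commercial system, which is a very big leap.

Here are a few figures, which I hope will make those who still doubt Erlang think again:

WhatsApp's backend architecture is built primarily on Erlang: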


Categories: Erlang Exploration, Architecture Tags: tuning, whatsapp, 调优 (tuning)

lz4: Extremely Fast Compression algorithm

March 15th, 2013 Yu Feng 9 comments

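Recently the lz4 compression algorithm has been used in quite a few projects, especially storage-related ones. What are its characteristics?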

LZ4 is a very fast lossless compression algorithm, providing compression speed at 300 MB/s per core, scalable with multi-cores CPU. It also features an extremely fast decoder, with speeds up and beyond 1GB/s per core, typically reaching RAM speed limits on multi-core systems.

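This property is very useful where high-throughput compression is needed: a small CPU cost buys higher storage density.

Official site: https://code.google.com/p/lz4/. Its performance figures, excerpted: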

Name              Ratio   C.speed (MB/s)   D.speed (MB/s)
LZ4 (r59)         2.084   330              915
LZO 2.05 1x_1     2.038   311              480
QuickLZ 1.5 -1    2.233   257              277
Snappy 1.0.5      2.024   227              729
LZF               2.076   197              465
FastLZ            2.030   190              420
zlib 1.2.5 -1     2.728   39               195
LZ4 HC (r66)      2.712   18               1020
zlib 1.2.5 -6     3.095   14               210

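More benchmarks can be found here and here.

It also has a high-compression variant:

LZ4 HC – High Compression Mode of LZ4

One reason for its speed can be seen in the source file lz4.c:

This technique is called "The Blocking Technique"; see the figure:

Worth considering for use in our projects.

Have fun!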

Categories: Linux, Miscellaneous, Source Code Analysis Tags: lz4

How to Find Out Which Process Is Writing to a File on Linux

March 12th, 2013 Yu Feng 36 comments

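This evening Shuohai asked:

A file is being written by some process and I want to find that process. The file keeps growing, but I cannot find who is writing to it. I tried lsof and still could not find it.

This problem is quite common, and there are surely many ways to solve it. Here I offer a fairly intuitive one.

On Linux every file is stored on some block device and has a corresponding inode, so through vfs.write we can find out who keeps writing to a specific inode on a specific device.
Fortunately, the systemtap package ships with inodewatch.stp, located in /usr/local/share/doc/systemtap/examples/io, which serves exactly this purpose.
Let's look at the code: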

$ cat inodewatch.stp 
#! /usr/bin/env stap

probe vfs.write, vfs.read
{
  # dev and ino are defined by vfs.write and vfs.read
  if (dev == MKDEV($1,$2) # major/minor device
      && ino == $3)
    printf ("%s(%d) %s 0x%x/%u\n",
      execname(), pid(), probefunc(), dev, ino)
}

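The script is used as follows: stap inodewatch.stp major minor ino

Now let's set up a scenario: dd keeps writing to a file. We look up the file's ino and the major and minor numbers of the device it lives on, then run the stap script to get the answer.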
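One way to obtain the three arguments is to ask stat() for them. The following is only a sketch for Linux with glibc; the helper name inodewatch_args is hypothetical and not part of systemtap. It reads st_dev (the device that holds the file) and st_ino, then splits the device number with the major() and minor() macros.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper: looks up the device major/minor numbers and the
 * inode number of `path`, i.e. the three arguments inodewatch.stp expects.
 * Returns 0 on success, -1 if stat() fails. */
static int inodewatch_args(const char *path, unsigned *maj, unsigned *min,
                           unsigned long long *ino)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;

    *maj = major(st.st_dev);  /* device containing the file */
    *min = minor(st.st_dev);
    *ino = (unsigned long long)st.st_ino;
    return 0;
}
```

The three values can then be printed and passed on the command line in the order major, minor, ino.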
