Source Code Analysis | 系统技术非业余研究

Bottlenecks in Network-Intensive Erlang Servers and How to Address Them

November 11th, 2013 Yu Feng 2 comments

Original article. When reposting, please credit the source: 系统技术非业余研究

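We recently set out to tune the performance of our IO-intensive Erlang server in detail, with the goal of raising throughput from 400,000 packets per second to 600,000. That requires a solid understanding of how the process scheduler and the IO scheduler work, plus fine-tuning of their behavior, so I spent quite some time reading the relevant documents and code.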

The two most valuable papers were:
1. Characterizing the Scalability of Erlang VM on Many-core Processors
2. Evaluate the benefits of SMP support for IO-intensive Erlang applications

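According to lcnt, our current performance bottlenecks are:

1. Lock contention on the scheduler run queues (see the figure in the original post).

2. Erlang has only a single poll set, so heavy IO becomes a bottleneck. The conclusion on p. 46 of "Evaluate the benefits of SMP support for IO-intensive Erlang applications" is excerpted as follows: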
Read more…

Categories: Exploring Erlang, Source Code Analysis, Network Programming, Tuning Tags: migrated, system_info(scheduling_statistics), total_scheduling_statistics

Getting More Detailed Information About a Binary

November 7th, 2013 Yu Feng 1 comment

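We use the binary data type a great deal, and how efficiently it is handled directly affects server performance. The chapter "Constructing and matching binaries" in the official Erlang documentation explains in detail how to use binaries efficiently.

ERTS also has an undocumented internal option that reveals more detail, demonstrated below: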

$ erl
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> erts_debug:set_internal_state(available_internal_state, true).
false

=ERROR REPORT==== 6-Nov-2013::23:55:47 ===
Process <0.31.0> enabled access to the emulator internal state.
NOTE: This is an erts internal test feature and should *only* be used by OTP test-suites.

2> erts_debug:get_internal_state({binary_info, <<1:99999>>}).    
{refc_binary,12500,{binary,25000},3}

3> erts_debug:get_internal_state({binary_info, <<"XYZ">>}).    
{refc_binary,3,{binary,256},3}

So how do we read the returned value? Let's go to the source!

BIF_RETTYPE erts_debug_get_internal_state_1(BIF_ALIST_1)
{
...
                        pb = (ProcBin *) binary_val(real_bin);
                        val = pb->val;
                        (void) erts_bld_uint(NULL, &hsz, pb->size);
                        (void) erts_bld_uint(NULL, &hsz, val->orig_size);
                        hp = HAlloc(BIF_P, hsz);

                        /* Info about the Binary* object */
                        SzTerm = erts_bld_uint(&hp, NULL, val->orig_size);
                        res = TUPLE2(hp, am_binary, SzTerm);
                        hp += 3;

                        /* Info about the ProcBin* object */
                        SzTerm = erts_bld_uint(&hp, NULL, pb->size);
                        res = TUPLE4(hp, AM_refc_binary, SzTerm,
                                     res, make_small(pb->flags));

...
}

The source shows the format of the return value:
proc binary: {refc_binary, pb_size, {binary, orig_size}, pb_flags}
heap binary: heap_binary

#define PB_IS_WRITABLE 1 /* Writable (only one reference to ProcBin) */
#define PB_ACTIVE_WRITER 2 /* There is an active writer */
Here pb_flags is a combination of the two flags above.

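This information lets us verify that a binary does reserve some extra space: in the session above, <<"XYZ">> is 3 bytes long but sits in a 256-byte allocation.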
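To make the tuple easier to read, here is a minimal sketch that labels each field and computes the reserved space. The module name binary_info_demo, the function describe/1 and the property names are made up for this illustration; only the tuple layout and the two flag values come from the source quoted above.

```erlang
-module(binary_info_demo).
-export([describe/1]).

%% Flag values copied from the ERTS source quoted above.
-define(PB_IS_WRITABLE, 1).
-define(PB_ACTIVE_WRITER, 2).

%% Turn the raw binary_info result into a labelled property list.
%% The property names are hypothetical, chosen for readability.
describe(heap_binary) ->
    [{kind, heap_binary}];
describe({refc_binary, PbSize, {binary, OrigSize}, PbFlags}) ->
    [{kind, refc_binary},
     {size, PbSize},                 % bytes visible through the ProcBin
     {allocated, OrigSize},          % bytes in the underlying Binary object
     {reserved, OrigSize - PbSize},  % spare room left for appends
     {writable, (PbFlags band ?PB_IS_WRITABLE) =/= 0},
     {active_writer, (PbFlags band ?PB_ACTIVE_WRITER) =/= 0}].
```

Feeding it the result for <<"XYZ">> from the session above reports 253 reserved bytes with both flags set.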

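Summary: there are quite a few more undocumented ways like this to get at internal information; the original post links to a reference.

Have fun!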

Categories: Exploring Erlang, Source Code Analysis Tags: binary_info, get_internal_state

The Bottleneck in Getting the Current Time in Erlang, and a Solution

November 4th, 2013 Yu Feng 4 comments

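High-performance network servers usually involve many time-related scenarios and operations, such as timers, reading the time at which an event occurred, logging, and so on.

Erlang provides two ways to get the time:
1. erlang:now()
2. os:timestamp()
Once we have the time, we usually format it with calendar:now_to_universal_time into a human-readable form like {{2013,11,4},{8,46,20}}.

Because time calls are very frequent and usually sit on the critical path, their efficiency and performance are well worth digging into.

Let's first look at the documentation for these two functions:
erlang:now: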

now() -> Timestamp

Types:

Timestamp = timestamp()
timestamp() =
{MegaSecs :: integer() >= 0,
Secs :: integer() >= 0,
MicroSecs :: integer() >= 0}
Returns the tuple {MegaSecs, Secs, MicroSecs} which is the elapsed time since 00:00 GMT, January 1, 1970 (zero hour) on the assumption that the underlying OS supports this. Otherwise, some other point in time is chosen. It is also guaranteed that subsequent calls to this BIF returns continuously increasing values. Hence, the return value from now() can be used to generate unique time-stamps, and if it is called in a tight loop on a fast machine the time of the node can become skewed.

It can only be used to check the local time of day if the time-zone info of the underlying operating system is properly configured.

If you do not need the return value to be unique and monotonically increasing, use os:timestamp/0 instead to avoid some overhead.

os:timestamp:

timestamp() -> Timestamp

Types:

Timestamp = erlang:timestamp()
Timestamp = {MegaSecs, Secs, MicroSecs}
Returns a tuple in the same format as erlang:now/0. The difference is that this function returns what the operating system thinks (a.k.a. the wall clock time) without any attempts at time correction. The result of two different calls to this function is not guaranteed to be different.

The most obvious use for this function is logging. The tuple can be used together with the function calendar:now_to_universal_time/1 or calendar:now_to_local_time/1 to get calendar time. Using the calendar time together with the MicroSecs part of the return tuple from this function allows you to log timestamps in high resolution and consistent with the time in the rest of the operating system.
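The logging use described in the documentation can be sketched as follows. The module name ts_demo and its two functions are hypothetical; the calls to os:timestamp/0, calendar:now_to_universal_time/1 and io_lib:format/2 are standard.

```erlang
-module(ts_demo).
-export([format/1, log_stamp/0]).

%% Format a {MegaSecs, Secs, MicroSecs} tuple as a UTC string with
%% microsecond resolution, e.g. for a log line prefix.
format({_Mega, _Sec, Micro} = Ts) ->
    {{Y, Mo, D}, {H, Mi, S}} = calendar:now_to_universal_time(Ts),
    lists:flatten(
      io_lib:format("~4..0w-~2..0w-~2..0w ~2..0w:~2..0w:~2..0w.~6..0w",
                    [Y, Mo, D, H, Mi, S, Micro])).

%% Uses os:timestamp/0, which avoids the uniqueness overhead of
%% erlang:now/0 when a strictly increasing value is not needed.
log_stamp() ->
    format(os:timestamp()).
```

Calling ts_demo:format({1383, 554780, 123}) yields the calendar time used as the example earlier in this post, 2013-11-04 08:46:20, followed by the microsecond part.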

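But things are not that simple!

Erlang supports a time correction mechanism: put simply, when the system time changes abruptly, it still keeps time-based logic working normally. For the implementation details, see the post "服务器时间校正思考" (Thoughts on server time correction).

Time correction complicates matters. How can it be disabled?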

+c
Disable compensation for sudden changes of system time.

Normally, erlang:now/0 will not immediately reflect sudden changes in the system time, in order to keep timers (including receive-after) working. Instead, the time maintained by erlang:now/0 is slowly adjusted towards the new system time. (Slowly means in one percent adjustments; if the time is off by one minute, the time will be adjusted in 100 minutes.)

When the +c option is given, this slow adjustment will not take place. Instead erlang:now/0 will always reflect the current system time. Note that timers are based on erlang:now/0. If the system time jumps, timers then time out at the wrong time.
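The "one percent" rule quoted above is easy to turn into arithmetic. This is only a back-of-the-envelope sketch based on the documentation text, not on the emulator's actual correction code; the module and function names are made up.

```erlang
-module(time_adjust_demo).
-export([catchup_ms/1]).

%% At a 1% adjustment rate, every elapsed millisecond absorbs 0.01 ms
%% of the offset, so absorbing the whole offset takes 100 times as long.
catchup_ms(OffsetMs) when is_integer(OffsetMs) ->
    abs(OffsetMs) * 100.
```

For a one-minute jump (60000 ms) this gives 6000000 ms, that is, the 100 minutes mentioned in the documentation.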

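Precisely because time correction exists, the server has to adjust its time every so often, while many threads may be reading the time at the same moment. To keep the time consistent, a lock is needed to protect it.
Let's look at the relevant code: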
Read more…

Categories: Exploring Erlang, Source Code Analysis, Tuning Tags: erlang:now, lcnt, os:timestamp, rebar

The inet Driver Adds a New {active,N} Socket Option

November 3rd, 2013 Yu Feng 5 comments

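Network servers implemented in Erlang perform very well. A typical server such as a proxy can handle 400,000 packets in and out, with connection counts in the tens of thousands. Of course, this networking capability owes a lot to the underlying epoll implementation. When gen_tcp receives a complete packet from the kernel protocol stack, there are three ways it can notify us; see the inet:setopts documentation: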

{active, true | false | once}
If the value is true, which is the default, everything received from the socket will be sent as messages to the receiving process. If the value is false (passive mode), the process must explicitly receive incoming data by calling gen_tcp:recv/2,3 or gen_udp:recv/2,3 (depending on the type of socket).

If the value is once ({active, once}), one data message from the socket will be sent to the process. To receive one more message, setopts/2 must be called again with the {active, once} option.

When using {active, once}, the socket changes behaviour automatically when data is received. This can sometimes be confusing in combination with connection oriented sockets (i.e. gen_tcp) as a socket with {active, false} behaviour reports closing differently than a socket with {active, true} behaviour. To make programming easier, a socket where the peer closed and this was detected while in {active, false} mode, will still generate the message {tcp_closed,Socket} when set to {active, once} or {active, true} mode. It is therefore safe to assume that the message {tcp_closed,Socket}, possibly followed by socket port termination (depending on the exit_on_close option) will eventually appear when a socket changes back and forth between {active, true} and {active, false} mode. However, when peer closing is detected is all up to the underlying TCP/IP stack and protocol.

Note that {active,true} mode provides no flow control; a fast sender could easily overflow the receiver with incoming messages. Use active mode only if your high-level protocol provides its own flow control (for instance, acknowledging received messages) or the amount of data exchanged is small. {active,false} mode or use of the {active, once} mode provides flow control; the other side will not be able send faster than the receiver can read.

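The most efficient mode is of course {active, true}, because a connection needs only a single epoll_ctl call to register the socket's read event. But this mode has a fatal flaw: incoming packets are delivered to us as messages, entirely asynchronously. Under normal conditions that is fine, but a service facing the Internet is at real risk. Under attack, when the peer sends a flood of packets, our system asynchronously receives a flood of messages that may exceed what our process can handle. Worst of all, we cannot make the packets stop, and in the end the server crashes for lack of memory.

So in practice we all use {active,once} to control the rate at which packets are received. That avoids the safety problem but introduces a performance problem: every {active,once} means one epoll_ctl call. If you strace the program you will see a huge number of epoll_ctl calls, roughly as many per second as the QPS. Another issue aggravates the slowdown: Erlang has only one thread reaping epoll_wait events, so if a large volume of epoll_ctl calls holds up event reaping, network throughput drops sharply. Multiple reaping threads are officially planned for a future release, but are not available yet.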

So the question is how to balance performance and safety. Erlang comes to the rescue with this change:

inet driver add {active,N} socket option for TCP, UDP, and SCTP

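This feature is available in R16B03.

The idea behind the solution is simple:
{active, true} has a safety problem and {active, once} is too slow. With {active,N} we arm the socket once to receive N message packets, amortizing the cost of epoll_ctl, which greatly relieves the performance pressure.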
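A receive loop built on this idea might look like the sketch below. It assumes an OTP release whose inet:setopts/2 supports {active, N} and delivers {tcp_passive, Socket} when the count runs out; the module active_n_demo and the loopback test harness in run/0 are invented for illustration.

```erlang
-module(active_n_demo).
-export([recv_loop/3, run/0]).

%% Collect data until the peer closes. The socket is re-armed only once
%% per N data messages, instead of once per message as with {active, once}.
recv_loop(Sock, N, Acc) ->
    receive
        {tcp, Sock, Data} ->
            recv_loop(Sock, N, [Data | Acc]);
        {tcp_passive, Sock} ->
            %% The message count reached zero: ask for N more messages.
            ok = inet:setopts(Sock, [{active, N}]),
            recv_loop(Sock, N, Acc);
        {tcp_closed, Sock} ->
            iolist_to_binary(lists:reverse(Acc))
    end.

%% Loopback harness: send five 1-byte packets and read them with N = 2.
run() ->
    Lo = {127, 0, 0, 1},
    {ok, L} = gen_tcp:listen(0, [binary, {active, false}, {ip, Lo}]),
    {ok, Port} = inet:port(L),
    {ok, C} = gen_tcp:connect(Lo, Port, [binary, {active, false}]),
    {ok, S} = gen_tcp:accept(L),
    ok = inet:setopts(S, [{active, 2}]),
    [ok = gen_tcp:send(C, <<"x">>) || _ <- lists:seq(1, 5)],
    ok = gen_tcp:close(C),
    Result = recv_loop(S, 2, []),
    gen_tcp:close(S),
    gen_tcp:close(L),
    Result.
```

Choosing N is a trade-off: a larger N means fewer epoll_ctl calls, but also a longer stretch in which the process cannot slow the sender down.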
Read more…

Categories: Exploring Erlang, Source Code Analysis, Tuning Tags: {active,N}, gen_tcp, inet:setopts

R16B03 Adds the Super Carrier to Reduce mmap System Calls

November 3rd, 2013 Yu Feng Comments off

The Erlang memory allocation framework in one sentence, quoted from the erts_alloc documentation:

erts_alloc is an Erlang Run-Time System internal memory allocator library. erts_alloc provides the Erlang Run-Time System with a number of memory allocators.

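As you can see, Erlang's memory allocation system is very complex and deeply layered. ERTS developers work against the erts_alloc interface. For example, the code that allocates a port-related data structure looks like this: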

pdhp = erts_alloc(ERTS_ALC_T_PORT_DATA_HEAP,
sizeof(ErtsPortDataHeap) + hsize*(sizeof(Eterm)-1));

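It is very simple to use. But Erlang is a message-passing system: every message sent needs memory allocated, and memory is released during automatic GC. On a typical server such as a proxy, allocations and frees of the binary data type alone reach 100 million per day, so allocator efficiency matters a great deal. That is why Erlang uses a very elaborate memory allocation system to meet this need, as shown in the figure below:

[Figure: erlang_memory_overview]

Roughly speaking, the memory allocators buy memory wholesale from sys_alloc and mseg_alloc and then retail it to end users. sys_alloc is libc's malloc and mseg_alloc is mmap; through these two interfaces memory is requested from the operating system in bulk. Let's zoom in on the relevant part of the figure above:

Today's topic is the part in the red box. The Erlang system prefers to get memory from mmap, because the process is more controllable than going through libc or tcmalloc. So when an Erlang application uses memory intensively and its demand fluctuates a lot, it frequently has to buy memory from the operating system and hand it back. The buying usually goes through mmap, which is why we see so many mmap system calls when we strace beam.

An mmap system call has to enter the kernel and come back out. The kernel maintains a tree (a red-black tree, for example) in kernel space to manage virtual memory. When the number of system calls is very high, the overhead shows. Since mmap manages memory with a tree in kernel space, why not maintain one ourselves inside the Erlang memory allocator? The algorithm is the same, but the cost of entering and leaving the kernel goes away. Based on this idea, rickard-sverker recently added the super carrier to Erlang R16B03; the original post links to the details.

The super carrier works by requesting a large amount of memory from the kernel in one go and managing it itself, further reducing the number of mmap calls beyond what the simple segment cache in mseg_alloc already achieves.

Here is the documentation for using the super carrier: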

+MMscmgc
Set super carrier max guaranteed no of carriers. This parameter defaults to 65536. This parameter determines an amount of pre-allocated structures that is needed in order to keep track of different areas in the super carrier. When the system runs out of such structures it may crash due to an out of memory condition.
+MMsco true|false
Set super carrier only flag. This flag defaults to true. When a super carrier is used and this flag is true, the system will crash when a carrier request cannot be satisfied by the super carrier. When the flag is false the system will try to create requested carrier by other means.

NOTE: Setting this flag to false may not be supported on all systems. This flag will in that case be ignored.

NOTE: The super carrier cannot be enabled nor disabled on halfword heap systems. This flag will be ignored on halfword heap systems.
+MMscrpm true|false
Set super carrier reserve physical memory flag. This flag defaults to true. When this flag is true, physical memory will be reserved for the whole super carrier at once when it is created. The reservation will after that be left unchanged. When this flag is set to false only virtual address space will be reserved for the super carrier upon creation. The system will attempt to reserve physical memory upon carrier creations in the super carrier, and attempt to unreserve physical memory upon carrier destructions in the super carrier.

NOTE: What reservation of physical memory actually means highly depends on the operating system, and how it is configured. For example, different memory overcommit settings on Linux drastically change the behaviour. Also note, setting this flag to false may not be supported on all systems. This flag will in that case be ignored.

NOTE: The super carrier cannot be enabled nor disabled on halfword heap systems. This flag will be ignored on halfword heap systems.
+MMscs
Set super carrier size (in MB). The super carrier size defaults to zero; i.e, the super carrier is by default disabled. The super carrier is a large continuous area in the virtual address space. The system will always try to create new carriers in the super carrier.

NOTE: The super carrier cannot be enabled nor disabled on halfword heap systems. This flag will be ignored on halfword heap systems.

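There are two key parameters: +MMscs controls the total amount of memory requested from the kernel in one go, and +MMscrpm controls whether the requested memory is backed right away (whether physical memory is reserved immediately).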
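Since +MMscs takes its size in MB, a small helper can avoid unit mistakes when building the command line. This is just a sketch: the module super_carrier_demo and the function erl_flags/2 are hypothetical, and only the two flag names come from the documentation above.

```erlang
-module(super_carrier_demo).
-export([erl_flags/2]).

%% Build the erl arguments for a super carrier of SizeGiB gigabytes.
%% +MMscs expects megabytes, so the size is converted here.
erl_flags(SizeGiB, ReservePhysical)
  when is_integer(SizeGiB), SizeGiB > 0, is_boolean(ReservePhysical) ->
    SizeMB = SizeGiB * 1024,
    lists:flatten(
      io_lib:format("+MMscs ~w +MMscrpm ~w", [SizeMB, ReservePhysical])).
```

For a 16 GB carrier that only reserves address space up front, erl_flags(16, false) produces the arguments to append to the erl command line.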

Let's demonstrate the super carrier by giving ERTS 16 GB of memory in one go. The beam used is built from the erlang/otp master branch on GitHub as of 2013/11/02:
Read more…

Categories: Exploring Erlang, Source Code Analysis, Tuning Tags: +MMscrpm, +MMscs, super carrier

R16B03 Provides long_schedule to Monitor Behavior That Blocks the Schedulers

October 30th, 2013 Yu Feng Comments off

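A key capability of Erlang is soft real-time behavior, which can noticeably improve an application's QoS. Why can it achieve soft real-time? See the translated article linked from the original post. Two things, if handled badly, break Erlang's fair scheduling:
1. BIFs, which rely on the trap mechanism to yield.
2. NIFs, which rely on consuming reductions to yield.

Both mechanisms allow users to extend the Erlang VM with their own C code. That code runs in the VM's scheduler threads, so if it does too much work in a single call, or deadlocks, it blocks the scheduler and hangs the VM, which is a fairly serious problem.

The latest release, R16B03, thoughtfully provides long_schedule monitoring so that users can spot and fix this problem early. Here is the description of long_schedule: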

erlang:system_monitor(Arg) -> MonSettings

{long_schedule, Time}
If a process or port in the system runs uninterrupted for at least Time wall clock milliseconds, a message {monitor, PidOrPort, long_schedule, Info} is sent to MonitorPid. PidOrPort is the process or port that was running and Info is a list of two-element tuples describing the event. In case of a pid(), the tuples {timeout, Millis}, {in, Location} and {out, Location} will be present, where Location is either an MFA ({Module, Function, Arity}) describing the function where the process was scheduled in/out, or the atom undefined. In case of a port(), the tuples {timeout, Millis} and {port_op,Op} will be present. Op will be one of proc_sig, timeout, input, output, event or dist_cmd, depending on which driver callback was executing. proc_sig is an internal operation and should never appear, while the others represent the corresponding driver callbacks timeout, ready_input, ready_output, event and finally outputv (when the port is used by distribution). The Millis value in the timeout tuple will tell you the actual uninterrupted execution time of the process or port, which will always be >= the Time value supplied when starting the trace. New tuples may be added to the Info list in the future, and the order of the tuples in the list may be changed at any time without prior notice.

This can be used to detect problems with NIF’s or drivers that take too long to execute. Generally, 1 ms is considered a good maximum time for a driver callback or a NIF. However, a time sharing system should usually consider everything below 100 ms as “possible” and fairly “normal”. Schedule times above that might however indicate swapping or a NIF/driver that is misbehaving. Misbehaving NIF’s and drivers could cause bad resource utilization and bad overall performance of the system.
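A minimal monitor process based on the description above could look like this. The module long_schedule_demo and the log format are made up for illustration; erlang:system_monitor/2 and the message shape are as documented. Note that there is only one system monitor per node, so this replaces any previous setting.

```erlang
-module(long_schedule_demo).
-export([start/1, loop/0]).

%% Spawn a monitor process and register it for long_schedule events.
%% ThresholdMs is the minimum uninterrupted run time to report.
start(ThresholdMs) when is_integer(ThresholdMs), ThresholdMs > 0 ->
    Pid = spawn(fun loop/0),
    _Previous = erlang:system_monitor(Pid, [{long_schedule, ThresholdMs}]),
    Pid.

loop() ->
    receive
        {monitor, PidOrPort, long_schedule, Info} ->
            %% Info is a list of tuples whose order is not guaranteed,
            %% so look the timeout up by key instead of by position.
            Millis = proplists:get_value(timeout, Info),
            error_logger:warning_msg(
              "long_schedule: ~p ran for ~p ms, details: ~p~n",
              [PidOrPort, Millis, Info]),
            loop();
        stop ->
            ok
    end.
```

Starting it with long_schedule_demo:start(100) follows the 100 ms guidance in the quoted text; a lower threshold such as 1 ms is more useful when hunting a misbehaving NIF or driver.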

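The commit is on GitHub (linked from the original post), and its test case demonstrates this nicely.

Summary: the system monitor can reveal many potential problems in the VM and deserves more exploration.

Have fun!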

Categories: Exploring Erlang, Source Code Analysis Tags: long_schedule, system_monitor

R16B Port Parallelism Explained in Detail

October 20th, 2013 Yu Feng Comments off

One of the big highlights of the R16B release was its port parallelism mechanism. The official release notes say:

— Latency of signals sent from processes to ports — Signals
from processes to ports where previously always delivered
immediately. This kept latency for such communication to a
minimum, but it could cause lock contention which was very
expensive for the system as a whole. In order to keep this
latency low also in the future, most signals from processes
to ports are by default still delivered immediately as long
as no conflicts occur. Such conflicts include not being able
to acquire the port lock, but also include other conflicts.
When a conflict occur, the signal will be scheduled for
delivery at a later time. A scheduled signal delivery may
cause a higher latency for this specific communication, but
improves the overall performance of the system since it
reduce lock contention between schedulers. The default
behavior of only scheduling delivery of these signals on
conflict can be changed by passing the +spp command line flag
to erl(1). The behavior can also be changed on port basis
using the parallelism option of the open_port/2 BIF.

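Jeff Martin also published an article on QCon that specifically mentions this; the original post links to the English and Chinese versions.

So what exactly is R16B port parallelism? Put simply, it is this erl option: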

+spp Bool
Set default scheduler hint for port parallelism. If set to true, the VM will schedule port tasks when it by this can improve the parallelism in the system. If set to false, the VM will try to perform port tasks immediately and by this improve latency at the expense of parallelism. If this flag has not been passed, the default scheduler hint for port parallelism is currently false. The default used can be inspected in runtime by calling erlang:system_info(port_parallelism). The default can be overriden on port creation by passing the parallelism option to open_port/2

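What does it do? Every port has a lock that keeps the messages sent to it in order. When several processes send to the same port, they have to queue until the earlier messages have been handled. That is fairly normal behavior. But Erlang's design philosophy is messages and asynchronous communication, and it is a pity for a process to waste its time standing in a queue. Hence port parallelism: when a process finds that it would have to queue, it hands the message to the port scheduler and goes on with its own work. Messages are asynchronous anyway, and the process trusts the port scheduler to deliver it. Having taken the message, the port scheduler schedules the port at a suitable moment to carry out the actual task.

Here is a real-life analogy. Say I go to the post office to send a parcel with a courier such as SF Express. After I hand it over I get a tracking number; the courier notifies me afterwards about the parcel, and I can also use the number to check its status myself. When I arrive, only one clerk is working the SF Express counter and the queue of senders is already long. I have two choices: 1. Join the back of the queue. 2. Ask a post office employee (a security guard, say, perhaps for a small tip) to take my parcel now and send it when the counter is quieter, so that I can leave. I pay a little more, but I spend less time, and the tip is easily earned back.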

Port parallelism works on a similar principle. There are two ways to enable it:
1. Globally: erl +spp Bool
2. Per port: pass the {parallelism, true} option to open_port(PortName, PortSettings).
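The per-port form can be sketched as below. This assumes a Unix-like system where the external program cat is available; the module port_par_demo and its helper functions are invented for the example.

```erlang
-module(port_par_demo).
-export([echo/1, default_hint/0]).

%% The node-wide default set by +spp (false unless the flag was passed).
default_hint() ->
    erlang:system_info(port_parallelism).

%% Open a port with the parallelism hint enabled, send Data through
%% "cat" (assumed to exist on this system) and collect the echo.
echo(Data) when is_binary(Data) ->
    Port = open_port({spawn, "cat"}, [binary, stream, {parallelism, true}]),
    true = port_command(Port, Data),
    Reply = collect(Port, byte_size(Data), <<>>),
    true = port_close(Port),
    Reply.

%% The echo may arrive in several chunks, so read until all bytes are back.
collect(_Port, 0, Acc) ->
    Acc;
collect(Port, Left, Acc) ->
    receive
        {Port, {data, Chunk}} ->
            collect(Port, Left - byte_size(Chunk),
                    <<Acc/binary, Chunk/binary>>)
    after 5000 ->
        erlang:error(timeout)
    end.
```

With a single sender as here the hint makes no visible difference; it matters when many processes contend for the same port lock.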

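But everything has two sides. What should we watch out for once this option is on?

Back to the parcel example: if everyone did what I did and left their parcels with the security guard, what would happen when there are many people? The guard would end up with a pile of parcels, his boss would certainly be annoyed, so the guard has to cap the number of parcels he accepts and refuse anything beyond that. That cap is the scheduler's watermark. The SF Express clerk has a watermark too; otherwise, could they cope if everyone in Hangzhou came to send parcels?

So what are these two watermarks? My earlier post on gen_tcp send buffers and watermarks explains this clearly; here is a brief recap:

1. The port's own watermark. For inet_tcp, for example, it is: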
#define INET_HIGH_WATERMARK (1024*8) /* 8k pending high => busy */
#define INET_LOW_WATERMARK (1024*4) /* 4k pending => allow more */

These watermarks can be set through inet:setopts options:
{low_watermark, Size}
{high_watermark, Size} (TCP/IP sockets)

2. The MSGQ high and low watermarks are also 8K and 4K, with a minimum of 1 and no upper bound. There are options for setting them as well:
{high_msgq_watermark, Size}
{low_msgq_watermark, Size}
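Setting and reading back both pairs of watermarks might look like this sketch on a loopback TCP socket. The module watermark_demo and the chosen sizes are arbitrary examples; the option names are the ones listed above.

```erlang
-module(watermark_demo).
-export([run/0]).

%% Raise both pairs of watermarks on a client socket and read them back.
%% The high values are set before the low ones so that low never
%% exceeds high at any point.
run() ->
    Lo = {127, 0, 0, 1},
    {ok, L} = gen_tcp:listen(0, [binary, {active, false}, {ip, Lo}]),
    {ok, Port} = inet:port(L),
    {ok, C} = gen_tcp:connect(Lo, Port, [binary, {active, false}]),
    ok = inet:setopts(C, [{high_watermark, 65536},
                          {low_watermark, 32768},
                          {high_msgq_watermark, 65536},
                          {low_msgq_watermark, 32768}]),
    {ok, Opts} = inet:getopts(C, [high_watermark, low_watermark,
                                  high_msgq_watermark, low_msgq_watermark]),
    gen_tcp:close(C),
    gen_tcp:close(L),
    Opts.
```

Larger watermarks let senders run further ahead before they are suspended, at the cost of more buffered memory per socket.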

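That post also explains what "a signal delivery" means. A message sent to a port only matters once it has been passed on for processing, and that delivery is in fact call_driver_outputv, which invokes the port's own driver_outputv callback to do the real work. Put plainly, port parallelism controls when call_driver_outputv is called: instead of always calling it directly, the call is left to the port scheduler thread to make at a suitable moment when conditions are not right.

Summary: port parallelism can greatly increase the throughput of a VM with a large number of ports, and helps a lot for port-intensive or network-intensive applications (a gen_tcp socket is a port).

Have fun!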

Categories: Exploring Erlang, Source Code Analysis, Tuning Tags: +spp, parallelism, port, watermark
