Archive for the '调优' (Tuning) Category


April 28th, 2014 1 comment

Original article. If you repost it, please credit: reposted from 系统技术非业余研究

Permalink: Tuning the Erlang memory system

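Lukas Larsson, a core VM developer, has been very active lately and has done a lot of work on the Erlang memory system, including contributions to the recon project.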

He recently gave a talk at the Erlang Factory conference, "Memory Allocators in the VM, Memory Management: Battle Stories"; see here.



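So we need an expert's experience to get us up to speed quickly. His slides are no longer offered for download, so I saved a copy here. It covers principles, methods, and case studies, and it is very good.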


Post Footer automatically generated by wp-posturl plugin for wordpress.

Categories: Erlang探索 (Erlang Exploration), 调优 (Tuning)


November 14th, 2013 No comments


Permalink: Quantifying the cost of Erlang process scheduling

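We all know that one of Erlang's basic philosophies is "small messages, big computation". Put simply, a message should carry all the information the computation needs, and the computation should be as large as possible, ideally far outweighing the cost of passing the message. But why? Erlang message sending is very efficient; see this article: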

Roughly speaking, I’m seeing 3.4 million deliveries per second one-way, and 1.4 million roundtrips per second (2.8 million deliveries per second) in a ping-pong setup in the same environment as previously – a 2.8GHz Pentium 4 with 1MB cache.


$ erl 
Erlang R15B03 (erts-  [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> ipctest:pingpong().

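One complete round trip involves: 1. the ping process runs; 2. the ping process is switched out while it waits for the pong message; 3. the pong process runs; 4. the pong process is switched out while it waits for the ping message. The round trip therefore involves two Erlang process scheduling operations.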
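The ipctest module used above is not shown in this excerpt. As a rough idea of what such a ping-pong measurement can look like, here is a minimal sketch; the module name pingpong_sketch and the function run/1 are hypothetical and are not the author's code.

```erlang
-module(pingpong_sketch).
-export([run/1]).

%% Run N ping-pong round trips between the caller and a helper process.
%% Returns {N, ElapsedMicroseconds}.
run(N) when is_integer(N), N > 0 ->
    Pong = spawn_link(fun pong/0),
    {Micros, ok} = timer:tc(fun() -> ping(Pong, N) end),
    Pong ! stop,
    {N, Micros}.

%% Each iteration sends one message and then blocks in receive,
%% so the ping process is scheduled out once per round trip.
ping(_Pong, 0) ->
    ok;
ping(Pong, N) ->
    Pong ! {ping, self()},
    receive
        pong -> ping(Pong, N - 1)
    end.

%% The pong process answers every ping and is scheduled out while waiting.
pong() ->
    receive
        {ping, From} ->
            From ! pong,
            pong();
        stop ->
            ok
    end.
```

Dividing the elapsed time by N gives a per-round-trip figure that includes both message sends and both scheduling operations.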


$ cat sch.stp
global total, coll_sch, sch
global exclude_sys_schedule

probe process("beam.smp").function("schedule") {
      # Record when this scheduler thread entered schedule().
      sch[tid()] = gettimeofday_ns();
}

probe process("beam.smp").function("schedule").return {
      tid = tid();
      e = gettimeofday_ns() - sch[tid];
      # Optionally record calls longer than 10 ms as zero cost.
      if (exclude_sys_schedule && e > 10 * 1000 * 1000 ) coll_sch <<< 0;
      else coll_sch <<< e;
}

function print_colls () {
      prt_line = 0;
      if(@count(coll_sch) >0) {
            printf("total %d, avg %d ns\n", total, @avg(coll_sch));
            printf("===========erts schedule(ns)===========\n");
            prt_line = 1;
      }

      if(prt_line) printf("--------------------------------------------------------------\n");
      delete coll_sch;
      delete sch;
      delete total;
}

probe timer.s(1) {
      # Print and reset the statistics once per second.
      print_colls();
}

probe begin {
      # First command-line argument: 1 to exclude long schedule() calls.
      exclude_sys_schedule = $1;
}

$ PATH=/usr/local/lib/erlang/erts-$PATH sudo stap sch.stp 1

Read more…


Bottlenecks of network-intensive Erlang servers and how to address them

November 11th, 2013 2 comments


Permalink: Bottlenecks of network-intensive Erlang servers and how to address them

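Recently we needed to do careful performance work on our IO-intensive Erlang server program, raising throughput from 400,000 packets per second toward a target of 600,000. That requires a solid understanding of how the process and IO schedulers work, plus fine-tuning of their behavior, so I spent quite some time reading the relevant documents and code.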

1. Characterizing the Scalability of Erlang VM on Many-core Processors (see here)
2. Evaluate the benefits of SMP support for IO-intensive Erlang applications (see here)

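According to lcnt, our current performance bottlenecks are:

1. Lock contention on the scheduler run queues, as shown in the figure below:

2. Erlang has only a single poll set, so heavy IO becomes a performance bottleneck. The conclusion on p. 46 of "Evaluate the benefits of SMP support for IO-intensive Erlang applications" is excerpted below: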
Read more…



November 4th, 2013 4 comments


Permalink: The bottleneck in getting the current time in Erlang, and how to solve it


1. erlang:now()
2. os:timestamp()


erlang:now (see here)

now() -> Timestamp


Timestamp = timestamp()
timestamp() =
{MegaSecs :: integer() >= 0,
Secs :: integer() >= 0,
MicroSecs :: integer() >= 0}
Returns the tuple {MegaSecs, Secs, MicroSecs} which is the elapsed time since 00:00 GMT, January 1, 1970 (zero hour) on the assumption that the underlying OS supports this. Otherwise, some other point in time is chosen. It is also guaranteed that subsequent calls to this BIF returns continuously increasing values. Hence, the return value from now() can be used to generate unique time-stamps, and if it is called in a tight loop on a fast machine the time of the node can become skewed.

It can only be used to check the local time of day if the time-zone info of the underlying operating system is properly configured.

If you do not need the return value to be unique and monotonically increasing, use os:timestamp/0 instead to avoid some overhead.

os:timestamp (see here)

timestamp() -> Timestamp


Timestamp = erlang:timestamp()
Timestamp = {MegaSecs, Secs, MicroSecs}
Returns a tuple in the same format as erlang:now/0. The difference is that this function returns what the operating system thinks (a.k.a. the wall clock time) without any attempts at time correction. The result of two different calls to this function is not guaranteed to be different.

The most obvious use for this function is logging. The tuple can be used together with the function calendar:now_to_universal_time/1 or calendar:now_to_local_time/1 to get calendar time. Using the calendar time together with the MicroSecs part of the return tuple from this function allows you to log timestamps in high resolution and consistent with the time in the rest of the operating system.
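The logging use described in the quoted documentation can be sketched as follows. The module name ts_log and its functions are hypothetical; only os:timestamp/0 and calendar:now_to_universal_time/1 come from the documentation above.

```erlang
-module(ts_log).
-export([format/1, now_string/0]).

%% Format a {MegaSecs, Secs, MicroSecs} tuple as a UTC calendar time
%% with microsecond resolution.
format({_Mega, _Secs, Micro} = Ts) ->
    {{Y, Mo, D}, {H, Mi, S}} = calendar:now_to_universal_time(Ts),
    lists:flatten(
      io_lib:format("~4..0w-~2..0w-~2..0w ~2..0w:~2..0w:~2..0w.~6..0w",
                    [Y, Mo, D, H, Mi, S, Micro])).

%% Wall-clock timestamp string taken from the operating system,
%% without the uniqueness guarantee (and overhead) of erlang:now/0.
now_string() ->
    format(os:timestamp()).
```

Because os:timestamp/0 makes no uniqueness or monotonicity promise, two log lines may carry the same string; that is acceptable for logging but not for generating unique IDs.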




+c
Disable compensation for sudden changes of system time.

Normally, erlang:now/0 will not immediately reflect sudden changes in the system time, in order to keep timers (including receive-after) working. Instead, the time maintained by erlang:now/0 is slowly adjusted towards the new system time. (Slowly means in one percent adjustments; if the time is off by one minute, the time will be adjusted in 100 minutes.)

When the +c option is given, this slow adjustment will not take place. Instead erlang:now/0 will always reflect the current system time. Note that timers are based on erlang:now/0. If the system time jumps, timers then time out at the wrong time.

Read more…


The inet driver adds a new {active,N} socket option

November 3rd, 2013 5 comments


Permalink: The inet driver adds a new {active,N} socket option


{active, true | false | once}
If the value is true, which is the default, everything received from the socket will be sent as messages to the receiving process. If the value is false (passive mode), the process must explicitly receive incoming data by calling gen_tcp:recv/2,3 or gen_udp:recv/2,3 (depending on the type of socket).

If the value is once ({active, once}), one data message from the socket will be sent to the process. To receive one more message, setopts/2 must be called again with the {active, once} option.

When using {active, once}, the socket changes behaviour automatically when data is received. This can sometimes be confusing in combination with connection oriented sockets (i.e. gen_tcp) as a socket with {active, false} behaviour reports closing differently than a socket with {active, true} behaviour. To make programming easier, a socket where the peer closed and this was detected while in {active, false} mode, will still generate the message {tcp_closed,Socket} when set to {active, once} or {active, true} mode. It is therefore safe to assume that the message {tcp_closed,Socket}, possibly followed by socket port termination (depending on the exit_on_close option) will eventually appear when a socket changes back and forth between {active, true} and {active, false} mode. However, when peer closing is detected is all up to the underlying TCP/IP stack and protocol.

Note that {active,true} mode provides no flow control; a fast sender could easily overflow the receiver with incoming messages. Use active mode only if your high-level protocol provides its own flow control (for instance, acknowledging received messages) or the amount of data exchanged is small. {active,false} mode or use of the {active, once} mode provides flow control; the other side will not be able send faster than the receiver can read.

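The most efficient mode is of course {active, true}, because with this implementation a connection needs only one epoll_ctl call to register the socket's read event. But this mode has a fatal flaw. Received packets are delivered to us as messages, entirely asynchronously. Under normal conditions that is fine, but if our service faces the Internet the risk is high: under attack, when the peer sends a flood of packets, our system asynchronously receives a flood of messages, possibly more than our process can handle. Worst of all, we have no way to make the packets stop, and in the end the server crashes for lack of memory.

So in practice we use {active, once} to control the rate at which packets are received. This avoids the safety problem but introduces a performance problem: every time {active, once} is set, epoll_ctl is called once. If you strace the program you will see a large number of epoll_ctl calls, roughly as many per second as the QPS. Another issue makes this degradation worse: Erlang has only one thread that harvests epoll_wait events, so if a large number of epoll_ctl calls block event harvesting, network throughput drops sharply. Harvesting with multiple threads is officially planned for a future release, but it is not available yet.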


inet driver add {active,N} socket option for TCP, UDP, and SCTP


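{active, true} has a safety problem and {active, once} is too slow. With {active, N} we set the option once and receive N message packets, which amortizes the cost of epoll_ctl and greatly relieves the performance pressure.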
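A minimal sketch of a receive loop built on {active, N} follows. It assumes an OTP release that includes the option (OTP 17 or later) and relies on the {tcp_passive, Socket} message that the socket sends when its counter reaches zero. The module name active_n_demo and the loopback demo/0 are hypothetical illustrations, not code from the original post.

```erlang
-module(active_n_demo).
-export([demo/0, recv_loop/3]).

%% Receive until the peer closes. Each time the socket reports that its
%% {active, N} budget is used up, re-arm it with one setopts call, so the
%% re-arm cost is paid once per N messages instead of once per message.
%% Returns {ok, TotalBytes}.
recv_loop(Socket, N, Acc) ->
    receive
        {tcp, Socket, Data} ->
            recv_loop(Socket, N, Acc + byte_size(Data));
        {tcp_passive, Socket} ->
            ok = inet:setopts(Socket, [{active, N}]),
            recv_loop(Socket, N, Acc);
        {tcp_closed, Socket} ->
            {ok, Acc};
        {tcp_error, Socket, Reason} ->
            {error, Reason}
    end.

%% Loopback demonstration: send 10 x 5 bytes to ourselves and count them.
demo() ->
    Loopback = {127, 0, 0, 1},
    {ok, L} = gen_tcp:listen(0, [binary, {active, false}, {ip, Loopback}]),
    {ok, Port} = inet:port(L),
    {ok, C} = gen_tcp:connect(Loopback, Port, [binary, {active, false}]),
    {ok, S} = gen_tcp:accept(L),
    ok = inet:setopts(S, [{active, 2}]),
    [ok = gen_tcp:send(C, <<"hello">>) || _ <- lists:seq(1, 10)],
    ok = gen_tcp:close(C),
    Result = recv_loop(S, 2, 0),
    gen_tcp:close(S),
    gen_tcp:close(L),
    Result.
```

The demo counts bytes rather than messages because TCP may coalesce several sends into one delivery. In a real server the value of N is a tuning knob: larger values amortize more but let more unprocessed messages queue up.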
Read more…



November 3rd, 2013 6 comments


Permalink: Investigating Erlang scheduler utilization

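Erlang's schedulers are very efficient, with roughly 80% utilization on 128 cores. Even so, because of constraints in CPU and memory architecture, the scheduler implementation still contains many locks. To avoid core-scaling problems, erts usually does not sit blocked on a lock; it uses more optimistic lock-free algorithms instead, which leads to a fair amount of CPU spinning.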


Since R15, Erlang has provided a way to measure scheduler utilization: erlang:statistics(scheduler_wall_time).


statistics(Item :: scheduler_wall_time) ->
[{SchedulerId, ActiveTime, TotalTime}] | undefined


SchedulerId = integer() >= 1
ActiveTime = TotalTime = integer() >= 0

Returns a list of tuples with {SchedulerId, ActiveTime, TotalTime}, where SchedulerId is an integer id of the scheduler, ActiveTime is the duration the scheduler has been busy, TotalTime is the total time duration since scheduler_wall_time activation. The time unit is not defined and may be subject to change between releases, operating systems and system restarts. scheduler_wall_time should only be used to calculate relative values for scheduler-utilization. ActiveTime can never exceed TotalTime.

The definition of a busy scheduler is when it is not idle or not scheduling (selecting) a process or port, meaning; executing process code, executing linked-in-driver or NIF code, executing built-in-functions or any other runtime handling, garbage collecting or handling any other memory management. Note, a scheduler may also be busy even if the operating system has scheduled out the scheduler thread.

Returns undefined if the system flag scheduler_wall_time is turned off.

The list of scheduler information is unsorted and may appear in different order between calls.

Using scheduler_wall_time to calculate scheduler utilization.

> erlang:system_flag(scheduler_wall_time, true).
> Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)), ok.
Some time later we will take another snapshot and calculate scheduler-utilization per scheduler.

> Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)), ok.
> lists:map(fun({{I, A0, T0}, {I, A1, T1}}) ->
{I, (A1 - A0)/(T1 - T0)} end, lists:zip(Ts0,Ts1)).
Using the same snapshots to calculate a total scheduler-utilization.

> {A, T} = lists:foldl(fun({{_, A0, T0}, {_, A1, T1}}, {Ai,Ti}) ->
{Ai + (A1 - A0), Ti + (T1 - T0)} end, {0, 0}, lists:zip(Ts0,Ts1)), A/T.

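One thing to note: "scheduler_wall_time is by default disabled. Use erlang:system_flag(scheduler_wall_time, true) to enable it." The reason is that collecting these statistics at run time affects performance. Also, the per-scheduler entries are returned in no particular order, so they need to be sorted.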
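The shell snippets from the documentation can be wrapped into a small module so that sorting and the division are handled in one place. This is a sketch; the module name sched_util and its functions are hypothetical, and the zero-interval guard is an addition of this sketch rather than part of the documented example.

```erlang
-module(sched_util).
-export([per_scheduler/2, total/2, sample/1]).

%% Utilization per scheduler between two scheduler_wall_time snapshots.
%% Snapshots are sorted first because their order is not guaranteed.
per_scheduler(Ts0, Ts1) ->
    [{I, ratio(A1 - A0, T1 - T0)}
     || {{I, A0, T0}, {I, A1, T1}} <- lists:zip(lists:sort(Ts0),
                                                lists:sort(Ts1))].

%% Utilization summed over all schedulers.
total(Ts0, Ts1) ->
    {A, T} = lists:foldl(
               fun({{_, A0, T0}, {_, A1, T1}}, {Ai, Ti}) ->
                       {Ai + (A1 - A0), Ti + (T1 - T0)}
               end,
               {0, 0},
               lists:zip(lists:sort(Ts0), lists:sort(Ts1))),
    ratio(A, T).

%% Enable the flag, take two snapshots Millis apart, and report both views.
sample(Millis) ->
    erlang:system_flag(scheduler_wall_time, true),
    Ts0 = erlang:statistics(scheduler_wall_time),
    timer:sleep(Millis),
    Ts1 = erlang:statistics(scheduler_wall_time),
    {per_scheduler(Ts0, Ts1), total(Ts0, Ts1)}.

%% Guard against a zero-length interval (sketch-level safety net).
ratio(_Active, 0) -> 0.0;
ratio(Active, Total) -> Active / Total.
```

For example, sched_util:sample(1000) returns the per-scheduler list and the overall figure for a one-second window.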

percept2 provides percept2_sampling to help us visualize this utilization. A demonstration follows:


$ erl -pa percept2/ebin
Erlang R15B03 (erts-  [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> percept2:start_webserver(8933).
2> percept2_sampling:start([all], 60000, ".").






R16B03 adds a super carrier to reduce mmap system calls

November 3rd, 2013 No comments


Permalink: R16B03 adds a super carrier to reduce mmap system calls


erts_alloc is an Erlang Run-Time System internal memory allocator library. erts_alloc provides the Erlang Run-Time System with a number of memory allocators.


pdhp = erts_alloc(ERTS_ALC_T_PORT_DATA_HEAP,
sizeof(ErtsPortDataHeap) + hsize*(sizeof(Eterm)-1));

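It is very simple to use. But Erlang is a system built on message passing: every message passed needs a memory allocation, and memory has to be freed during automatic GC. On a typical server such as a proxy, allocations and frees of the binary data type alone reach 100 million per day, so the efficiency of the memory allocator is especially important. Erlang therefore uses a very elaborate memory allocation system to meet this need, shown in the figure below: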


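Roughly speaking, the memory allocators buy memory wholesale from sys_alloc and mseg_alloc and then retail it to end users. sys_alloc is libc's malloc and mseg_alloc is mmap; memory is requested from the operating system in bulk through these two interfaces. Let's zoom in on the relevant part of the figure above: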


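What we are discussing today is the part in the red box. The Erlang system prefers to request memory through mmap because the process is more controllable than with libc or tcmalloc. So when an Erlang application uses memory intensively and its demand fluctuates widely, it frequently has to buy memory from and return memory to the operating system. The buying is usually done through mmap, which is why we see so many mmap system calls when we strace beam.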

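We know that an mmap system call has to enter the kernel and come back out. In kernel space, the kernel maintains a tree (for example, a red-black tree) to manage virtual memory. When the number of system calls is very large, the overhead shows. Since mmap manages memory with a tree in kernel space, why can't we maintain one ourselves inside the Erlang memory allocator? The algorithm is the same, but the cost of entering and leaving the kernel is reduced. Based on this idea, rickard-sverker recently added the super carrier to Erlang R16B03; see here for details.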

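The super carrier works by requesting a large amount of memory from the kernel in one go and managing it itself, which further reduces the number of mmap calls, although the simple segment cache in mseg_alloc already helps somewhat.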
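The idea of reserving one large region up front and then carving it up in user space can be illustrated with a toy model. The sketch below is not the erts implementation (which is written in C and uses trees); it is a hypothetical first-fit allocator over a list of free extents, with made-up names, meant only to show that allocation and release become pure bookkeeping once the region is reserved.

```erlang
-module(super_carrier_sketch).
-export([new/1, alloc/2, free/3]).

%% The reserved region is modelled as a sorted list of free
%% {Offset, Length} extents. Initially the whole region is free.
new(Size) when is_integer(Size), Size > 0 ->
    [{0, Size}].

%% First-fit allocation. No system call is involved: we only update
%% the free list. Returns {ok, Offset, NewFree} or {error, enomem}.
alloc(Len, Free) when is_integer(Len), Len > 0 ->
    alloc(Len, Free, []).

alloc(_Len, [], _Acc) ->
    {error, enomem};
alloc(Len, [{Off, L} | Rest], Acc) when L >= Len ->
    {ok, Off, lists:reverse(Acc, remainder(Off, L, Len) ++ Rest)};
alloc(Len, [Extent | Rest], Acc) ->
    alloc(Len, Rest, [Extent | Acc]).

%% Whatever is left of an extent after carving Len units from its start.
remainder(Off, L, Len) when L > Len -> [{Off + Len, L - Len}];
remainder(_Off, _L, _Len) -> [].

%% Give an extent back and merge it with adjacent free neighbours.
free(Off, Len, Free) ->
    coalesce(lists:sort([{Off, Len} | Free])).

coalesce([{O1, L1}, {O2, L2} | T]) when O1 + L1 =:= O2 ->
    coalesce([{O1, L1 + L2} | T]);
coalesce([H | T]) ->
    [H | coalesce(T)];
coalesce([]) ->
    [].
```

The {error, enomem} case corresponds to the situation the flags below deal with: what should happen when a carrier request cannot be satisfied from the pre-reserved region.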


Set super carrier max guaranteed no of carriers. This parameter defaults to 65536. This parameter determines an amount of pre-allocated structures that is needed in order to keep track of different areas in the super carrier. When the system runs out of such structures it may crash due to an out of memory condition.
+MMsco true|false
Set super carrier only flag. This flag defaults to true. When a super carrier is used and this flag is true, the system will crash when a carrier request cannot be satisfied by the super carrier. When the flag is false the system will try to create requested carrier by other means.

NOTE: Setting this flag to false may not be supported on all systems. This flag will in that case be ignored.

NOTE: The super carrier cannot be enabled nor disabled on halfword heap systems. This flag will be ignored on halfword heap systems.
+MMscrpm true|false
Set super carrier reserve physical memory flag. This flag defaults to true. When this flag is true, physical memory will be reserved for the whole super carrier at once when it is created. The reservation will after that be left unchanged. When this flag is set to false only virtual address space will be reserved for the super carrier upon creation. The system will attempt to reserve physical memory upon carrier creations in the super carrier, and attempt to unreserve physical memory upon carrier destructions in the super carrier.

NOTE: What reservation of physical memory actually means highly depends on the operating system, and how it is configured. For example, different memory overcommit settings on Linux drastically change the behaviour. Also note, setting this flag to false may not be supported on all systems. This flag will in that case be ignored.

NOTE: The super carrier cannot be enabled nor disabled on halfword heap systems. This flag will be ignored on halfword heap systems.
+MMscs <size in MB>
Set super carrier size (in MB). The super carrier size defaults to zero; i.e, the super carrier is by default disabled. The super carrier is a large continuous area in the virtual address space. The system will always try to create new carriers in the super carrier.

NOTE: The super carrier cannot be enabled nor disabled on halfword heap systems. This flag will be ignored on halfword heap systems.


Let's demonstrate the super carrier. We give erts 16 GB of memory in one go; the beam used is the erlang/otp master branch on GitHub as of 2013/11/02:
Read more…
