Archive

Archive for November, 2013

A Call for Talent

November 23rd, 2013 Comments off


As an excellent engineer, you do not lack talent; what you lack are god-like teammates, challenging world-class technical problems, and a big stage on which to show what you can do. Join the Alibaba core systems database development team: everything you are missing is here. Come on, poke here and give us a chance to get to know you: http://blog.yufeng.info/job



Quantifying the Cost of Erlang Process Scheduling

November 14th, 2013 Comments off


We all know that one of Erlang's basic philosophies is "small messages, big computation": as far as possible, carry all the information the computation needs inside the message, and then do as much computation as possible, ideally far more than the cost of passing the message. But why? Erlang message passing is very efficient; see this article:

Roughly speaking, I’m seeing 3.4 million deliveries per second one-way, and 1.4 million roundtrips per second (2.8 million deliveries per second) in a ping-pong setup in the same environment as previously – a 2.8GHz Pentium 4 with 1MB cache.

Let's look at concrete numbers from a quick demo on my machine:

$ erl 
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> ipctest:pingpong().
832296.5692402497
2> 

That is roughly 830,000 ping-pong messages per second. The test program involves two Erlang processes, ping and pong.
One complete round trip consists of: 1. the ping process runs; 2. ping waits for the pong message and is switched out; 3. pong runs; 4. pong waits for the ping message and is switched out. So each round trip involves two Erlang process schedulings.
This is a typical Erlang usage pattern, and the question now is: what exactly does one Erlang process scheduling cost?
Looking at the erts implementation, Erlang calls the schedule() function to pick the next process to run, and the actual swap-in/swap-out cost is low, so let's measure the overhead of schedule() itself.
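The ipctest module itself is not shown in the post; purely as a rough illustration, a ping-pong benchmark along these lines would produce that kind of number (the module and function names here are my assumptions, not the original code):

%% Hypothetical sketch of a ping-pong benchmark like the one above.
-module(ipctest).
-export([pingpong/0]).

-define(N, 100000).

pingpong() ->
    Pong = spawn(fun pong/0),
    T0 = os:timestamp(),
    ok = ping(Pong, ?N),
    T1 = os:timestamp(),
    Pong ! stop,
    ?N / (timer:now_diff(T1, T0) / 1000000).   %% round trips per second

ping(_Pong, 0) ->
    ok;
ping(Pong, N) ->
    Pong ! {ping, self()},
    receive pong -> ok end,
    ping(Pong, N - 1).

pong() ->
    receive
        {ping, From} -> From ! pong, pong();
        stop -> ok
    end.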

As usual, let's bring out our trusty systemtap and write a short probe script first:

$ cat sch.stp
global total, coll_sch, sch
global exclude_sys_schedule

probe process("beam.smp").function("schedule") {
      sch[tid()] = gettimeofday_ns();
      total++;
}

probe process("beam.smp").function("schedule").return {
      tid = tid();
      e = gettimeofday_ns() - sch[tid];
      if (exclude_sys_schedule && e > 10 * 1000 * 1000 ) coll_sch <<< 0;
      else coll_sch <<< e;
}

function print_colls () {
      prt_line = 0;
      if(@count(coll_sch) >0) {
            printf("total %d, avg %d ns\n", total, @avg(coll_sch));
            printf("===========erts schedule(ns)===========\n");
            print(@hist_log(coll_sch));
            prt_line = 1;
      }

      if(prt_line) printf("--------------------------------------------------------------\n");
      delete coll_sch;
      delete sch;
      delete total;
}

probe timer.s(1) {
      print_colls();
}

probe begin {
      exclude_sys_schedule = $1
      println("x:");
}

$ PATH=/usr/local/lib/erlang/erts-5.9.3.1/bin/:$PATH sudo stap sch.stp 1
x:

When a scheduler is not busy, or after it has scheduled enough processes, it needs to harvest epoll events, i.e. it calls sys_schedule. That call usually takes on the order of milliseconds, so we exclude it (via the script's first argument) to keep it from badly skewing the average.
Read more…


Bottlenecks in Network-Intensive Erlang Servers and How to Tackle Them

November 11th, 2013 2 comments


Recently we needed to do fine-grained performance tuning of our IO-intensive Erlang server, pushing it from 400k packets per second towards a target of 600k. That requires a solid understanding of how the process and IO schedulers work, plus careful tuning of their behaviour, so I spent quite a bit of time reading the relevant papers and code.

The two most valuable papers are:
1. Characterizing the Scalability of Erlang VM on Many-core Processors, see here
2. Evaluate the benefits of SMP support for IO-intensive Erlang applications, see here

According to lcnt (a short usage sketch follows at the end of this excerpt), our current performance bottlenecks are:

1. Lock contention on the scheduler run queues, see the figure below:
lcnt_rq_conflict

2. Erlang has only a single poll set, and heavy IO turns it into a bottleneck; the conclusion from p. 46 of "Evaluate the benefits of SMP support for IO-intensive Erlang applications" is quoted below:
Read more…
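For reference (not part of the original post), lock contention such as the run-queue conflicts in item 1 above is typically located with the lcnt tool, roughly as in this sketch; it requires an emulator built with lock-counting support:

%% Sketch only: put the server under load between clear and collect.
lcnt:start(),
lcnt:clear(),              %% reset the counters, then generate load
timer:sleep(60000),
lcnt:collect(),            %% take a snapshot of the counters
lcnt:conflicts(),          %% print locks sorted by number of collisions
lcnt:inspect(run_queue).   %% drill into the run-queue lock class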


Getting More Detailed Information About a binary

November 7th, 2013 1 comment


We use the binary data structure heavily, and how efficiently we use it directly affects server performance. The "Constructing and matching binaries" chapter of the official Erlang documentation gives a very detailed explanation of how to use binaries efficiently.

In addition, erts provides an undocumented option that exposes more of the internals. Demo:

$ erl
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> erts_debug:set_internal_state(available_internal_state, true).
false

=ERROR REPORT==== 6-Nov-2013::23:55:47 ===
Process <0.31.0> enabled access to the emulator internal state.
NOTE: This is an erts internal test feature and should *only* be used by OTP test-suites.

2> erts_debug:get_internal_state({binary_info, <<1:99999>>}).    
{refc_binary,12500,{binary,25000},3}

3> erts_debug:get_internal_state({binary_info, <<"XYZ">>}).    
{refc_binary,3,{binary,256},3}

So how do we interpret the return value? Time to read the source!

BIF_RETTYPE erts_debug_get_internal_state_1(BIF_ALIST_1)
{
...
                        pb = (ProcBin *) binary_val(real_bin);
                        val = pb->val;
                        (void) erts_bld_uint(NULL, &hsz, pb->size);
                        (void) erts_bld_uint(NULL, &hsz, val->orig_size);
                        hp = HAlloc(BIF_P, hsz);

                        /* Info about the Binary* object */
                        SzTerm = erts_bld_uint(&hp, NULL, val->orig_size);
                        res = TUPLE2(hp, am_binary, SzTerm);
                        hp += 3;

                        /* Info about the ProcBin* object */
                        SzTerm = erts_bld_uint(&hp, NULL, pb->size);
                        res = TUPLE4(hp, AM_refc_binary, SzTerm,
                                     res, make_small(pb->flags));

...
}

The source shows the format of the return value:
refc (proc) binary: {refc_binary, pb_size, {binary, orig_size}, pb_flags}
heap binary: heap_binary

#define PB_IS_WRITABLE 1 /* Writable (only one reference to ProcBin) */
#define PB_ACTIVE_WRITER 2 /* There is an active writer */
pb_flags is a combination (bitwise OR) of the two flags above.

From this information we can verify that binaries do reserve extra space (note the orig_size of 256 for the 3-byte binary above).
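As a small illustration (a hypothetical helper, not from the post), we can print binary_info before and after appending to a binary to watch the reserved space and the writable flag:

%% Hypothetical helper: observe reserved space and the writable flag.
-module(bininfo).
-export([demo/0]).

demo() ->
    erts_debug:set_internal_state(available_internal_state, true),
    B0 = <<0:800>>,        %% 100 bytes, large enough to be a refc binary
    io:format("before append: ~p~n",
              [erts_debug:get_internal_state({binary_info, B0})]),
    B1 = <<B0/binary, 1>>, %% appending may allocate extra reserved space
    io:format("after append:  ~p~n",
              [erts_debug:get_internal_state({binary_info, B1})]).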

Summary: there are quite a few more undocumented ways like this to get at internal information; see here.

Have fun!


The Bottleneck of Getting the Current Time in Erlang, and Its Solution

November 4th, 2013 4 comments


A high-performance network server typically involves a large number of time-related scenarios and operations: timers, reading the time at which an event occurred, logging, and so on.

Erlang provides two ways to get the current time:
1. erlang:now()
2. os:timestamp()
Once we have the timestamp, we usually format it with calendar:now_to_universal_time/1 into a human-readable form such as {{2013,11,4},{8,46,20}}.

Because these time calls are made very frequently, and usually sit on the critical path, their efficiency and performance are well worth digging into.
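As a rough way to compare the two calls under concurrency, here is a minimal, hypothetical micro-benchmark sketch (the timebench module and its names are mine, not from the post); spawning many processes that hammer erlang:now/0 should expose the cost of its internal locking compared with os:timestamp/0:

%% Hypothetical sketch: time Calls invocations from Procs concurrent processes.
-module(timebench).
-export([run/2]).

run(Procs, Calls) ->
    [{Name, bench(Fun, Procs, Calls)} ||
        {Name, Fun} <- [{erlang_now,   fun erlang:now/0},
                        {os_timestamp, fun os:timestamp/0}]].

bench(Fun, Procs, Calls) ->
    Parent = self(),
    T0 = os:timestamp(),
    Pids = [spawn_link(fun() -> loop(Fun, Calls), Parent ! {done, self()} end)
            || _ <- lists:seq(1, Procs)],
    [receive {done, Pid} -> ok end || Pid <- Pids],
    timer:now_diff(os:timestamp(), T0) / 1000.   %% total wall time in ms

loop(_Fun, 0) -> ok;
loop(Fun, N) -> Fun(), loop(Fun, N - 1).

For example, timebench:run(16, 1000000) on a 16-core box would run one million calls in each of 16 processes.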

Let's first look at what the documentation says about these two functions:
erlang:now, see here

now() -> Timestamp

Types:

Timestamp = timestamp()
timestamp() =
{MegaSecs :: integer() >= 0,
Secs :: integer() >= 0,
MicroSecs :: integer() >= 0}
Returns the tuple {MegaSecs, Secs, MicroSecs} which is the elapsed time since 00:00 GMT, January 1, 1970 (zero hour) on the assumption that the underlying OS supports this. Otherwise, some other point in time is chosen. It is also guaranteed that subsequent calls to this BIF returns continuously increasing values. Hence, the return value from now() can be used to generate unique time-stamps, and if it is called in a tight loop on a fast machine the time of the node can become skewed.

It can only be used to check the local time of day if the time-zone info of the underlying operating system is properly configured.

If you do not need the return value to be unique and monotonically increasing, use os:timestamp/0 instead to avoid some overhead.

os:timestamp, see here

timestamp() -> Timestamp

Types:

Timestamp = erlang:timestamp()
Timestamp = {MegaSecs, Secs, MicroSecs}
Returns a tuple in the same format as erlang:now/0. The difference is that this function returns what the operating system thinks (a.k.a. the wall clock time) without any attempts at time correction. The result of two different calls to this function is not guaranteed to be different.

The most obvious use for this function is logging. The tuple can be used together with the function calendar:now_to_universal_time/1 or calendar:now_to_local_time/1 to get calendar time. Using the calendar time together with the MicroSecs part of the return tuple from this function allows you to log timestamps in high resolution and consistent with the time in the rest of the operating system.

But things are not that simple!

Erlang supports a time-correction mechanism: simply put, when the system time jumps suddenly, the VM can still maintain sane time semantics. For the implementation details see this earlier post: 服务器时间校正思考 (thoughts on server time correction).

Time correction complicates things. How can it be disabled? With the +c emulator flag:

+c
Disable compensation for sudden changes of system time.

Normally, erlang:now/0 will not immediately reflect sudden changes in the system time, in order to keep timers (including receive-after) working. Instead, the time maintained by erlang:now/0 is slowly adjusted towards the new system time. (Slowly means in one percent adjustments; if the time is off by one minute, the time will be adjusted in 100 minutes.)

When the +c option is given, this slow adjustment will not take place. Instead erlang:now/0 will always reflect the current system time. Note that timers are based on erlang:now/0. If the system time jumps, timers then time out at the wrong time.

Precisely because of this time-correction mechanism, the VM has to adjust the time from time to time, while at the same moment many threads may be reading it. To keep the time consistent, a lock is needed to protect it.
Let's look at the relevant code:
Read more…


The inet Driver Adds an {active,N} Socket Option

November 3rd, 2013 5 comments


Network servers written in Erlang can reach very high performance: a typical server such as a proxy can push 400k packets per second in and out, with connection counts in the tens of thousands. Of course, this owes a lot to the underlying epoll implementation. When gen_tcp receives a complete packet from the kernel protocol stack, there are three ways we can be notified; see the inet:setopts documentation:

{active, true | false | once}
If the value is true, which is the default, everything received from the socket will be sent as messages to the receiving process. If the value is false (passive mode), the process must explicitly receive incoming data by calling gen_tcp:recv/2,3 or gen_udp:recv/2,3 (depending on the type of socket).

If the value is once ({active, once}), one data message from the socket will be sent to the process. To receive one more message, setopts/2 must be called again with the {active, once} option.

When using {active, once}, the socket changes behaviour automatically when data is received. This can sometimes be confusing in combination with connection oriented sockets (i.e. gen_tcp) as a socket with {active, false} behaviour reports closing differently than a socket with {active, true} behaviour. To make programming easier, a socket where the peer closed and this was detected while in {active, false} mode, will still generate the message {tcp_closed,Socket} when set to {active, once} or {active, true} mode. It is therefore safe to assume that the message {tcp_closed,Socket}, possibly followed by socket port termination (depending on the exit_on_close option) will eventually appear when a socket changes back and forth between {active, true} and {active, false} mode. However, when peer closing is detected is all up to the underlying TCP/IP stack and protocol.

Note that {active,true} mode provides no flow control; a fast sender could easily overflow the receiver with incoming messages. Use active mode only if your high-level protocol provides its own flow control (for instance, acknowledging received messages) or the amount of data exchanged is small. {active,false} mode or use of the {active, once} mode provides flow control; the other side will not be able to send faster than the receiver can read.

The most efficient mode is of course {active, true}, because with it a connection needs only one epoll_ctl call to register the socket's read event. But it has a fatal drawback: incoming packets are delivered to us as messages, completely asynchronously. Under normal conditions that is fine, but if our service faces the Internet there is a serious risk: under attack, when the peer sends a flood of packets, our system asynchronously receives a flood of messages that may exceed what our process can handle. Worst of all, we have no way to make the packets stop, and the end result is that the server crashes from lack of memory. So in practice we use {active, once} to control the rate at which packets are received. That avoids the safety problem but brings a performance problem: every {active, once} means one epoll_ctl call. If you strace the program you will see a huge number of epoll_ctl calls, roughly as many per second as the server's QPS. Another issue aggravates this degradation: Erlang has only one thread harvesting epoll_wait events, and if the flood of ctl calls blocks the harvesting, network throughput drops dramatically. Future releases officially plan to support harvesting with multiple threads, but that is not available yet.
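For contrast, a typical {active, once} receive loop looks roughly like this sketch (handle/1 is a placeholder, not from the post); every pass re-arms the socket and therefore costs one epoll_ctl:

%% Sketch: classic {active, once} loop, one epoll_ctl per packet.
loop_once(Socket) ->
    ok = inet:setopts(Socket, [{active, once}]),
    receive
        {tcp, Socket, Data} ->
            handle(Data),          %% handle/1 is a placeholder
            loop_once(Socket);
        {tcp_closed, Socket} ->
            ok
    end.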

So now the question is how to balance performance and safety. Erlang comes to the rescue; see here:

inet driver add {active,N} socket option for TCP, UDP, and SCTP

This feature is available as of R16B03.

The idea behind the solution is simple:
{active, true} has the safety problem and {active, once} is too slow, so with {active, N} we arm the socket once to receive N packets, amortizing the epoll_ctl cost and greatly relieving the performance pressure.
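A receive loop using {active, N} might then look like this sketch (again handle/1 is a placeholder): the socket only needs to be re-armed when the counter runs out and the {tcp_passive, Socket} notification arrives, so the epoll_ctl cost is spread over N packets:

%% Sketch: {active, N} loop, re-arming only when the socket turns passive.
%% Assumes the socket was initially set to {active, 100}.
loop_n(Socket) ->
    receive
        {tcp, Socket, Data} ->
            handle(Data),                               %% placeholder
            loop_n(Socket);
        {tcp_passive, Socket} ->
            ok = inet:setopts(Socket, [{active, 100}]),
            loop_n(Socket);
        {tcp_closed, Socket} ->
            ok
    end.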
Read more…


Investigating Erlang Scheduler Utilization

November 3rd, 2013 6 comments


The Erlang scheduler is very efficient, with roughly 80% utilization even on 128 cores. Even so, because of the constraints of the CPU and memory architecture, the scheduler implementation still contains a lot of locks. To avoid core-scalability problems, erts usually does not just block on a lock and wait; instead it uses more optimistic lock-free algorithms, which leads to a fair amount of CPU spinning.

So how do we evaluate scheduler efficiency? We can look at the system level, e.g. top, to see how busy each scheduler thread is. But that is only the surface: a scheduler may just be spinning while waiting for a lock. The most reliable approach is to accumulate the time the scheduler actually spends doing work, which reflects its efficiency much more truthfully.

Since R15, Erlang provides a way to investigate scheduler utilization: erlang:statistics(scheduler_wall_time).

Let's look at its documentation:

statistics(Item :: scheduler_wall_time) ->
[{SchedulerId, ActiveTime, TotalTime}] | undefined

Types:

SchedulerId = integer() >= 1
ActiveTime = TotalTime = integer() >= 0

Returns a list of tuples with {SchedulerId, ActiveTime, TotalTime}, where SchedulerId is an integer id of the scheduler, ActiveTime is the duration the scheduler has been busy, TotalTime is the total time duration since scheduler_wall_time activation. The time unit is not defined and may be subject to change between releases, operating systems and system restarts. scheduler_wall_time should only be used to calculate relative values for scheduler-utilization. ActiveTime can never exceed TotalTime.

The definition of a busy scheduler is when it is not idle or not scheduling (selecting) a process or port, meaning; executing process code, executing linked-in-driver or NIF code, executing built-in-functions or any other runtime handling, garbage collecting or handling any other memory management. Note, a scheduler may also be busy even if the operating system has scheduled out the scheduler thread.

Returns undefined if the system flag scheduler_wall_time is turned off.

The list of scheduler information is unsorted and may appear in different order between calls.

Using scheduler_wall_time to calculate scheduler utilization.

> erlang:system_flag(scheduler_wall_time, true).
false
> Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)), ok.
ok
Some time later we will take another snapshot and calculate scheduler-utilization per scheduler.

> Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)), ok.
ok
> lists:map(fun({{I, A0, T0}, {I, A1, T1}}) ->
{I, (A1 - A0)/(T1 - T0)} end, lists:zip(Ts0,Ts1)).
[{1,0.9743474730177548},
{2,0.9744843782751444},
{3,0.9995902361669045},
{4,0.9738012596572161},
{5,0.9717956667018103},
{6,0.9739235846420741},
{7,0.973237033077876},
{8,0.9741297293248656}]
Using the same snapshots to calculate a total scheduler-utilization.

> {A, T} = lists:foldl(fun({{_, A0, T0}, {_, A1, T1}}, {Ai,Ti}) ->
{Ai + (A1 - A0), Ti + (T1 - T0)} end, {0, 0}, lists:zip(Ts0,Ts1)), A/T.
0.9769136803764825

Note that "scheduler_wall_time is by default disabled. Use erlang:system_flag(scheduler_wall_time, true) to enable it." The reason is that the runtime has to do extra bookkeeping, which costs performance. Also, the per-scheduler entries in the returned list come back in arbitrary order and need to be sorted.
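Wrapping the documentation example above into a small helper (a sketch; the sched_util name is mine) makes it easy to sample utilization over a given interval:

%% Sketch: sample scheduler_wall_time over Ms milliseconds and return
%% per-scheduler utilization plus the total.
-module(sched_util).
-export([sample/1]).

sample(Ms) ->
    erlang:system_flag(scheduler_wall_time, true),
    Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(Ms),
    Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    PerSched = [{I, (A1 - A0) / (T1 - T0)}
                || {{I, A0, T0}, {I, A1, T1}} <- lists:zip(Ts0, Ts1)],
    {A, T} = lists:foldl(fun({{_, A0, T0}, {_, A1, T1}}, {Ai, Ti}) ->
                                 {Ai + (A1 - A0), Ti + (T1 - T0)}
                         end, {0, 0}, lists:zip(Ts0, Ts1)),
    {PerSched, A / T}.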

percept2 provides percept2_sampling to help visualize this utilization; demo below:

We start percept2_sampling to collect one minute of system data, then view it through the web interface:

$ erl -pa percept2/ebin
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> percept2:start_webserver(8933).
{started,"rds064076",8933}
2> percept2_sampling:start([all], 60000, ".").
<0.57.0>

The UI looks like this:
Percept2_sampling

The scheduler utilization looks like this:
sug

We can see that scheduler 3 is fairly busy while the others are mostly idle.

Have fun!
