Source Code Analysis | 系统技术非业余研究

The meaning of the Taints entry in crash_dump

July 2nd, 2013 Yu Feng Comments off

Original article. When reposting, please credit: reposted from 系统技术非业余研究

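When an Erlang application runs into trouble, it usually produces a dump file. This file preserves the state at the moment of failure, which helps a great deal with later diagnosis. For example: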

=erl_crash_dump:0.1
Sun Jun 23 15:35:39 2013
Slogan: Kernel pid terminated (application_controller) ({application_start_failure,ump_proxy,{bad_return,{{ump_proxy_app,start,[normal,[]]},{'EXIT',{undef,[{cherly_server,start_link,[ump_proxy_cherly_server
System version: Erlang R15B02 (erts-5.9.2) [64-bit] [smp:16:16] [async-threads:5] [hipe] [kernel-poll:true]
Compiled: Fri Sep 14 13:23:22 2012
Taints: ump_proxy_partitioner_nifs,asn1rt_nif,crypto,dyntrace,ump_la_nifs
Atoms: 37857

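The crash_dump gets straight to the point: it states the reason for the failure, the system version, the compile time, the number of atoms, and something called Taints.
The first few are easy to understand. The atom count makes sense too, since the atom table has a size limit and the VM is certain to crash once it overflows. But what is this Taints entry, and why does it earn such a prominent position?
Taken literally, Taints means contamination, and its content is clearly a list of NIF module names.
Seeing that list of NIFs, the purpose becomes fairly clear. A NIF runs inside the VM, so a bug or problem in it takes the VM down directly. The OTP team trusts its own VM code, so when the system fails it is natural for suspicion to fall first on user-written NIF code.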
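
To make the header concrete, here is a minimal Erlang sketch that pulls the Taints list out of the text of a crash dump. The module name `dump_taints` and the function `taints/1` are hypothetical helpers invented for illustration, not part of OTP, and the sketch assumes the header line has the `Taints: a,b,c` shape shown in the example above.

```erlang
-module(dump_taints).
-export([taints/1]).

%% Hypothetical helper, not part of OTP.
%% Takes the text of a crash dump as a binary and returns the library
%% names listed on its "Taints:" header line, or [] if there is none.
taints(DumpText) when is_binary(DumpText) ->
    Lines = binary:split(DumpText, <<"\n">>, [global]),
    %% Keep only the lines that start with "Taints: " and drop the prefix.
    case [Names || <<"Taints: ", Names/binary>> <- Lines] of
        [First | _] -> binary:split(First, <<",">>, [global]);
        []          -> []
    end.
```

A possible use is `{ok, Bin} = file:read_file("erl_crash.dump"), dump_taints:taints(Bin).`, which for the dump above would yield the five library names as binaries.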

The answer from the Erlang development team:

The idea was a way to see all user libraries that has ever been loaded
and executed by the VM. Currently it only shows NIF libraries, but
driver names will hopefully be added as well in some future release.

/Sverker, Erlang/OTP Ericsson

Let's verify this against the code:
Read more…

Categories: Exploring Erlang, Source Code Analysis Tags: erl_crash_dump, taints

The incarnation problem caused by restarting an Erlang node

June 29th, 2013 Yu Feng 7 comments

This evening mingchaoyan asked the following question online:

=ERROR REPORT==== 2013-06-28 19:57:53 ===
Discarding message {send,<<19 bytes>>} from <0.86.1> to <0.6743.0> in an old incarnation (1) of this node (2)

=ERROR REPORT==== 2013-06-28 19:57:55 ===
Discarding message {send,<<22 bytes>>} from <0.1623.1> to <0.6743.0> in an old incarnation (1) of this node (2)

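After we updated the server at noon, the log filled up with these errors. Have you run into anything similar? Or could you suggest some ideas for locating and fixing the problem? Thanks.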

This is an interesting problem. Combining the log message with the source code, we can quickly find the place that prints it:

/* bif.c */
Sint
do_send(Process *p, Eterm to, Eterm msg, int suspend) {
    Eterm portid;
    /* ... */
    } else if (is_external_pid(to)) {
        dep = external_pid_dist_entry(to);
        if (dep == erts_this_dist_entry) {
            /* The pid names this node, but carries a different creation. */
            erts_dsprintf_buf_t *dsbufp = erts_create_logger_dsbuf();
            erts_dsprintf(dsbufp,
                          "Discarding message %T from %T to %T in an old "
                          "incarnation (%d) of this node (%d)\n",
                          msg,
                          p->id,
                          to,
                          external_pid_creation(to),
                          erts_this_node->creation);
            erts_send_error_to_logger(p->group_leader, dsbufp);
            return 0;
        }
    /* ... */
}

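Two conditions must both hold to trigger this warning:
1. The target pid must be an external_pid.
2. The dist_entry of the external node the pid belongs to is the same as the current node's dist_entry.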
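
The two conditions can be restated as a small decision table. The sketch below is only a model of that logic, not the VM's code: the module `incarnation_model`, the function `route/2`, and the idea of representing a pid's origin as a `{NodeName, Creation}` pair are all assumptions made for illustration.

```erlang
-module(incarnation_model).
-export([route/2]).

%% Hypothetical model, not part of OTP.
%% First argument: {NodeName, Creation} carried by the target pid.
%% Second argument: {NodeName, Creation} of the running node.
route({Node, Creation}, {Node, Creation}) ->
    local;                      % same node, same incarnation
route({Node, _OldCreation}, {Node, _CurrentCreation}) ->
    discard_old_incarnation;    % same node name, different creation
route({_OtherNode, _}, {_ThisNode, _}) ->
    remote.                     % a genuinely different node
```

With the values from the log above, `incarnation_model:route({'a@host', 1}, {'a@host', 2})` falls into the second clause, which corresponds to the "old incarnation (1) of this node (2)" message. The node name `'a@host'` is made up.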

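Searching with Google, I found a report that matches this description closely: see here. The author describes and reproduces the phenomenon well, but does not explain the actual cause.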

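OK, let's follow his approach and reproduce the problem.
Before the demonstration, though, let's firm up the basics. First we need to understand the format of a pid;
see this article:

The core content about pids is excerpted below: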

Printed process ids < A.B.C > are composed of [6]:
A, the node number (0 is the local node, an arbitrary number for a remote node)
B, the first 15 bits of the process number (an index into the process table) [7]
C, bits 16-18 of the process number (the same process number as B) [7]

Next, see section 9.10 of the Erlang External Term Format documentation,
which describes the layout of PID_EXT:

Bytes:  1     N      4    4        1
Field:  103   Node   ID   Serial   Creation
Table 9.16:
Encode a process identifier object (obtained from spawn/3 or friends). The ID and Creation fields works just like in REFERENCE_EXT, while the Serial field is used to improve safety. In ID, only 15 bits are significant; the rest should be 0.
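
The layout can be inspected from the shell by encoding a pid with `term_to_binary/1` and reading the fixed-width fields from the end of the binary. This is a minimal sketch: the module `pid_fields` and the function `decode/1` are hypothetical names, not part of OTP. It also accepts tag 88 (NEW_PID_EXT), which newer releases emit in place of PID_EXT and which widens Creation to 4 bytes.

```erlang
-module(pid_fields).
-export([decode/1]).

%% Hypothetical helper, not part of OTP.
%% Returns the ID, Serial and Creation fields of a pid as encoded by
%% term_to_binary/1. The node atom is skipped by counting from the end,
%% because the trailing fields have a fixed width.
decode(Pid) when is_pid(Pid) ->
    %% 131 is the version byte that starts every external term.
    <<131, Tag, Rest/binary>> = term_to_binary(Pid),
    case Tag of
        103 ->  % PID_EXT: 4-byte ID, 4-byte Serial, 1-byte Creation
            NodeLen = byte_size(Rest) - 9,
            <<_Node:NodeLen/binary, Id:32, Serial:32, Creation:8>> = Rest,
            [{id, Id}, {serial, Serial}, {creation, Creation}];
        88 ->   % NEW_PID_EXT: same layout, but Creation is 4 bytes
            NodeLen = byte_size(Rest) - 12,
            <<_Node:NodeLen/binary, Id:32, Serial:32, Creation:32>> = Rest,
            [{id, Id}, {serial, Serial}, {creation, Creation}]
    end.
```

Calling `pid_fields:decode(self()).` on a node before and after a restart is one way to watch the Creation field change, which is the value the "old incarnation" message reports.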

The erlang coredump problem

June 27th, 2013 Yu Feng 2 comments

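This morning 成立涛 asked:

:) We have had several outages recently. The node simply vanished for no apparent reason. There was no crash dump, and we have no clues at all.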

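We know that while the Erlang VM is running normally, if it detects an abnormal condition in the Erlang program or runs short of a VM resource such as memory, it produces an erl_crash.dump file. That file describes the cause of the crash and its context very clearly, so locating the problem is easy. But the VM itself is implemented in C. If the VM implementation has a bug, or the system uses NIFs you wrote yourself, it is easy to bring the VM down. Once the VM itself has died, it no longer has any chance to produce erl_crash.dump.
What should be produced in that case is an operating-system core. If coredumps happen to be disabled on the system, the node will appear to vanish for no reason.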
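
One quick way to see whether a running node would leave a core file is to ask the shell for the core-size limit it inherits from the VM. The sketch below is a rough check under stated assumptions: the module `core_check` and its functions are hypothetical names, and it assumes a Unix-like host whose `sh` supports the `ulimit -c` builtin.

```erlang
-module(core_check).
-export([limit/0, enabled/0]).

%% Hypothetical helper, not part of OTP. Assumes a Unix-like host
%% whose sh supports the "ulimit -c" builtin. The shell started by
%% os:cmd/1 inherits its resource limits from the beam process.
limit() ->
    Out = os:cmd("ulimit -c"),
    %% Keep only the first line of the output, without the newline.
    lists:takewhile(fun(C) -> C =/= $\n end, Out).

%% A limit of "0" means the kernel will not write a core file.
enabled() ->
    limit() =/= "0".
```

If `core_check:enabled()` returns false, running `ulimit -c unlimited` in the shell that starts `erl` raises the limit for that node.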

Here is one of our own cases. We used a NIF in our Erlang system, and the NIF was not thread-safe, so it misbehaved at runtime and brought down beam:
Read more…

Categories: Exploring Erlang, Linux, Source Code Analysis Tags: coredump

Three ways to produce a crashdump

June 16th, 2013 Yu Feng 2 comments

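A crashdump is as precious to an Erlang system as a core is to a C/C++ program: it provides the most detailed material for fixing system problems. Erlang also thoughtfully ships the web-based crashdump_viewer to help users read the data. Start it with: crashdump_viewer:start().

The crashdump text file records a large amount of system information, which is irreplaceable for analyzing the system's performance and state and for troubleshooting. So it is often useful to obtain a crashdump file while the system is running normally.

Besides waiting for the system to hit a problem and produce a crashdump on its own, there are two ways to produce one manually.

The methods are:
1. erlang:halt("abort").
2. In the Erlang shell, press CTRL+C and then type an uppercase "A".

Demonstration: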

$ erl
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> 
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
A

Crash dump was written to: erl_crash.dump
Crash dump requested by userAborted

$ erl
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> erlang:halt("abort").

Crash dump was written to: erl_crash.dump
abort

Have fun!

Categories: Exploring Erlang, Source Code Analysis Tags:

Analysis of the gen_tcp send buffer and watermark problem

May 15th, 2013 Yu Feng 7 comments

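A while ago someone asked this question online:

On the server side I set: gen_tcp:listen(8000, [{active, false}, {recbuf,1}, {buffer,1}]).
On the client side I set: gen_tcp:connect("localhost", 8000, [{active, false}, {high_watermark,2}, {low_watermark,1}, {sndbuf,1}, {buffer,1}]).
My client sends one byte per gen_tcp:send() call. The first 6 bytes return ok, and the 7th blocks.
The server receives one byte per gen_tcp:recv(_,0) call. After it has received three bytes, the client's 7th send returns.
As I understand it, the server should be able to take 2 bytes plus the one byte in sndbuf, so the client should block on the 4th byte. But that is not what happens. Please help analyze.

This problem really is fairly complex. It involves gen_tcp's send buffer, its receive buffer, the watermarks, and more. The receive-buffer side is covered fairly clearly in this post and this post, so today we focus on the send buffer and the watermarks.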
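
The scenario in the question can be reproduced on one machine with a small probe. This is a sketch under stated assumptions, not the author's test: the module `watermark_probe` is a hypothetical name, the `send_timeout` option and the cap of 100000 sends are added here so the probe terminates instead of blocking forever, and the server side never calls recv. The number of sends that succeed depends on the operating system's minimum buffer sizes, so no particular count is promised.

```erlang
-module(watermark_probe).
-export([run/0]).

%% Hypothetical probe, not part of OTP.
%% Counts how many 1-byte sends succeed before gen_tcp:send/2 blocks.
run() ->
    {ok, Listen} = gen_tcp:listen(0, [{active, false}, {recbuf, 1}, {buffer, 1}]),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port,
                                   [{active, false},
                                    {high_watermark, 2}, {low_watermark, 1},
                                    {sndbuf, 1}, {buffer, 1},
                                    %% Assumption: turn "blocks" into a
                                    %% visible {error, timeout} after 1 s.
                                    {send_timeout, 1000}]),
    Result = count_sends(Client, 0, 100000),
    gen_tcp:close(Client),
    gen_tcp:close(Listen),
    Result.

%% Send one byte at a time until a send fails or the cap is reached.
count_sends(_Sock, N, Max) when N >= Max ->
    {limit, N};
count_sends(Sock, N, Max) ->
    case gen_tcp:send(Sock, <<"x">>) of
        ok               -> count_sends(Sock, N + 1, Max);
        {error, timeout} -> {blocked_after, N};
        {error, Reason}  -> {error, Reason, N}
    end.
```

Running `watermark_probe:run().` with different `sndbuf`, `high_watermark` and `low_watermark` values gives a quick feel for which layer absorbs the bytes before the sender stalls.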

Before starting the analysis, we need to get familiar with a few gen_tcp options. For more, see here:
Read more…

Categories: Exploring Erlang, Source Code Analysis Tags: +spp, delay_send, gen_tcp, watermark, 水位线

The gen_tcp half-closed connection problem

May 14th, 2013 Yu Feng Comments off

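I posted this on the javaeye forum a long time ago and am copying it here so it doesn't get lost:

Original: http://erlang.group.iteye.com/group/wiki/1422-gen_tcp-half-closed

When the TCP peer calls shutdown(RD/WR), the owning process by default receives a {tcp_closed, Socket} message. If that is not the behavior you want, read on: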

shutdown(Socket, How) -> ok | {error, Reason}
Types:
Socket = socket()
How = read | write | read_write
Reason = posix()

Immediately close a socket in one or two directions.
How == write means closing the socket for writing, reading from it is still possible.
To be able to handle that the peer has done a shutdown on the write side, the {exit_on_close, false} option is useful.

Simply call inet:setopts(Socket, [{exit_on_close, false}]). After that the socket is no longer forced to exit when the peer closes.
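
The effect of the option can be seen in a loopback session. The sketch below is an illustration under stated assumptions: the module `half_close_demo` is a hypothetical name, both ends run in one process in passive mode, and the 5-second timeouts are added only so a failure does not hang the shell.

```erlang
-module(half_close_demo).
-export([run/0]).

%% Hypothetical demo, not part of OTP.
%% The client shuts down its write side; the server, with
%% exit_on_close set to false, can still send a reply.
run() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port, [binary, {active, false}]),
    {ok, Server} = gen_tcp:accept(Listen, 5000),
    ok = inet:setopts(Server, [{exit_on_close, false}]),
    %% Client half-closes: it will send nothing more but can still read.
    ok = gen_tcp:shutdown(Client, write),
    %% In passive mode the server sees the half-close as {error, closed}.
    {error, closed} = gen_tcp:recv(Server, 0, 5000),
    %% The server socket is still usable for writing.
    ok = gen_tcp:send(Server, <<"bye">>),
    {ok, <<"bye">>} = gen_tcp:recv(Client, 0, 5000),
    ok = gen_tcp:close(Server),
    ok = gen_tcp:close(Client),
    ok = gen_tcp:close(Listen),
    ok.
```

Passive mode is used here so that the half-close shows up as a return value from recv. In active mode the same event arrives as the {tcp_closed, Socket} message mentioned above.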

Have fun!

Categories: Exploring Erlang, Source Code Analysis Tags: exit_on_close, gen_tcp

Index of Erlang gen_tcp related problems

May 14th, 2013 Yu Feng 2 comments

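gen_tcp is the most central module for building network applications in Erlang, and in practice many problems come up when using it. I have compiled the ones my team and I have hit in the past, so that readers can find the right remedy for their case.

The following are the posts related to gen_tcp, tcp, and port:

To be continued. Additions are welcome!

Have fun!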

Categories: Exploring Erlang, Source Code Analysis Tags: gen_tcp, port, TCP
