The incarnation problem caused by restarting an Erlang node

June 29th, 2013 Yu Feng

Original article. If you repost it, please credit the source: 系统技术非业余研究

Tonight mingchaoyan asked the following question online:

=ERROR REPORT==== 2013-06-28 19:57:53 ===
Discarding message {send,<<19 bytes>>} from <0.86.1> to <0.6743.0> in an old incarnation (1) of this node (2)

=ERROR REPORT==== 2013-06-28 19:57:55 ===
Discarding message {send,<<22 bytes>>} from <0.1623.1> to <0.6743.0> in an old incarnation (1) of this node (2)

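After we updated our servers at noon, the log filled up with these errors. Have you run into anything similar? Or could you suggest how to locate and fix the problem? Thanks.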

This is an interesting question. Starting from the log message and searching the source code, we can quickly find the place that prints it:

/*bif.c*/
Sint
do_send(Process *p, Eterm to, Eterm msg, int suspend) {
    Eterm portid;
...
} else if (is_external_pid(to)) {
        dep = external_pid_dist_entry(to);
        if(dep == erts_this_dist_entry) {
            erts_dsprintf_buf_t *dsbufp = erts_create_logger_dsbuf();
            erts_dsprintf(dsbufp,
                          "Discarding message %T from %T to %T in an old "
                          "incarnation (%d) of this node (%d)\n",
                          msg,
                          p->id,
                          to,
                          external_pid_creation(to),
                          erts_this_node->creation);
            erts_send_error_to_logger(p->group_leader, dsbufp);
            return 0;
        }
..
}

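Two conditions must hold for this warning to be triggered:
1. The target pid must be an external pid.
2. The dist_entry of the external node that the pid belongs to is the same as the dist_entry of the current node.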

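Searching Google, I found a report that closely matches this description (linked in the original post). Its author describes and reproduces the symptom well, but does not explain the actual cause.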

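OK, let's follow his approach and reproduce the problem.
Before the demo, though, let's review the basics. First we need to understand the format of a pid. See the article linked in the original post; its key points about pids are excerpted below: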

Printed process ids < A.B.C > are composed of [6]:
A, the node number (0 is the local node, an arbitrary number for a remote node)
B, the first 15 bits of the process number (an index into the process table) [7]
C, bits 16-18 of the process number (the same process number as B) [7]

Next, see section 9.10 of the Erlang External Term Format document, which describes the layout of PID_EXT:

1     N      4    4        1
103   Node   ID   Serial   Creation
Table 9.16:
Encode a process identifier object (obtained from spawn/3 or friends). The ID and Creation fields works just like in REFERENCE_EXT, while the Serial field is used to improve safety. In ID, only 15 bits are significant; the rest should be 0.
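To make the layout concrete, here is a minimal C sketch that pulls the fields out of a PID_EXT binary as produced by term_to_binary (leading version byte 131, then tag 103). It only handles the case where the node name is encoded as ATOM_EXT (tag 100 followed by a 2-byte length), which is what the shell session later in this post produces. The pid_ext struct and the function names are made up for this illustration and are not part of ERTS.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the decoded fields; not an ERTS type. */
typedef struct {
    char     node[256];   /* node name, NUL-terminated */
    uint32_t id;
    uint32_t serial;
    uint8_t  creation;
} pid_ext;

/* Read a 32-bit big-endian integer. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Decode <<131, 103, 100, Len:16, Name:Len/binary, ID:32, Serial:32, Creation:8>>.
 * Only the ATOM_EXT (100) node encoding is handled.
 * Returns 0 on success, -1 on malformed or unsupported input.
 */
int decode_pid_ext(const uint8_t *buf, size_t len, pid_ext *out)
{
    size_t name_len;

    if (len < 5 || buf[0] != 131 || buf[1] != 103 || buf[2] != 100)
        return -1;
    name_len = ((size_t)buf[3] << 8) | buf[4];
    if (name_len >= sizeof out->node || len != 5 + name_len + 9)
        return -1;

    memcpy(out->node, buf + 5, name_len);
    out->node[name_len] = '\0';
    out->id       = be32(buf + 5 + name_len);
    out->serial   = be32(buf + 5 + name_len + 4);
    out->creation = buf[5 + name_len + 8];
    return 0;
}
```

Fed the 25 bytes that term_to_binary(self()) prints in the first shell session of this post, the sketch yields node a@rds064076, ID 37, Serial 0 and Creation 1, which lines up with the printed pid <0.37.0>.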

Here we see a field called Creation. How come we have never noticed it before?
The Erlang documentation tells us:

creation
Returns the creation of the local node as an integer. The creation is changed when a node is restarted. The creation of a node is stored in process identifiers, port identifiers, and references. This makes it (to some extent) possible to distinguish between identifiers from different incarnations of a node. Currently valid creations are integers in the range 1..3, but this may (probably will) change in the future. If the node is not alive, 0 is returned.

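Tracing where creation comes from, we find that it originates in epmd. More concretely, every time a node starts it registers its name with epmd, and epmd returns a creation value to the node. net_kernel then records this creation through the setnode BIF in the node's erts_this_dist_entry->creation: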

/* erl_node_tables.c */
void
erts_set_this_node(Eterm sysname, Uint creation)
{
...
    erts_this_dist_entry->sysname = sysname;
    erts_this_dist_entry->creation = creation;
...
}

/*epmd_srv.c  */
...
        /* When reusing we change the "creation" number 1..3 */

        node->creation = node->creation % 3 + 1;
...

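The code above shows that creation takes the values 1 to 3 and is incremented, with wrap-around, on every registration. A node that is not distributed has creation 0.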
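The wrap-around is easy to check by hand. The following C sketch applies the same arithmetic as the epmd_srv.c line quoted above; the function names are invented for this example and do not exist in epmd.

```c
/* Same arithmetic as the epmd_srv.c line quoted above: 1 -> 2 -> 3 -> 1. */
unsigned next_creation(unsigned creation)
{
    return creation % 3 + 1;
}

/* Creation a node name would get after `restarts` re-registrations
 * (hypothetical helper, assuming epmd keeps reusing the same entry). */
unsigned creation_after(unsigned creation, unsigned restarts)
{
    while (restarts-- > 0)
        creation = next_creation(creation);
    return creation;
}
```

One consequence of the arithmetic: after three re-registrations the value is back where it started, so under the assumption stated in the comment a creation in the range 1..3 can only tell apart the three most recent incarnations of a node. That matches the "to some extent" wording in the documentation quoted earlier.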

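Now that we know where creation comes from, let's look at the DistEntry data structure, which essentially represents how a distributed node interacts with the outside world.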

typedef struct dist_entry_ {
...
    Eterm sysname;      /* name@host atom for efficiency */
    Uint32 creation;    /* creation of connected node */
    Eterm cid;          /* connection handler (pid or port), NIL == free */
...
} DistEntry;

The three fields shown above are the most important ones; cid stands for the port (the TCP channel between nodes).

We know that an external pid is constructed by binary_to_term; the code lives in the dec_pid function in external.c.

static byte*
dec_pid(ErtsDistExternal *edep, Eterm** hpp, byte* ep, ErlOffHeap* off_heap, Eterm* objp)
{
 ...
    /*
     * We are careful to create the node entry only after all
     * validity tests are done.
     */
    node = dec_get_node(sysname, cre);

    if(node == erts_this_node) {
        *objp = make_internal_pid(data);
    } else {
        ExternalThing *etp = (ExternalThing *) *hpp;
        *hpp += EXTERNAL_THING_HEAD_SIZE + 1;

        etp->header = make_external_pid_header(1);
        etp->next = off_heap->first;
        etp->node = node;
        etp->data.ui[0] = data;

        off_heap->first = (struct erl_off_heap_header*) etp;
        *objp = make_external_pid(etp);
    }
...
}
static ERTS_INLINE ErlNode* dec_get_node(Eterm sysname, Uint creation)
{
    switch (creation) {
    case INTERNAL_CREATION:
        return erts_this_node;
    case ORIG_CREATION:
        if (sysname == erts_this_node->sysname) {
            creation = erts_this_node->creation;
        }
    }
    return erts_find_or_insert_node(sysname,creation);
}

If creation is 0, it must be the local node; otherwise a matching node is looked up by sysname and creation.
On with the code:

typedef struct erl_node_ {
  HashBucket hash_bucket;       /* Hash bucket */
  erts_refc_t refc;             /* Reference count */
  Eterm sysname;                /* name@host atom for efficiency */
  Uint32 creation;              /* Creation */
  DistEntry *dist_entry;        /* Corresponding dist entry */
} ErlNode;

/* erl_node_tables.c */
ErlNode *erts_find_or_insert_node(Eterm sysname, Uint creation)
{    
    ErlNode *res;
    ErlNode ne;
    ne.sysname = sysname;
    ne.creation = creation;

    erts_smp_rwmtx_rlock(&erts_node_table_rwmtx);
    res = hash_get(&erts_node_table, (void *) &ne);
    if (res && res != erts_this_node) {
        erts_aint_t refc = erts_refc_inctest(&res->refc, 0);
        if (refc < 2) /* New or pending delete */
            erts_refc_inc(&res->refc, 1);
    }
    erts_smp_rwmtx_runlock(&erts_node_table_rwmtx);
    if (res)
        return res;

    erts_smp_rwmtx_rwlock(&erts_node_table_rwmtx);
    res = hash_put(&erts_node_table, (void *) &ne);
    ASSERT(res);
    if (res != erts_this_node) {
        erts_aint_t refc = erts_refc_inctest(&res->refc, 0);
        if (refc < 2) /* New or pending delete */
            erts_refc_inc(&res->refc, 1);
    }
    erts_smp_rwmtx_rwunlock(&erts_node_table_rwmtx);
    return res;  
}

static int
node_table_cmp(void *venp1, void *venp2)
{
    return ((((ErlNode *) venp1)->sysname == ((ErlNode *) venp2)->sysname
             && ((ErlNode *) venp1)->creation == ((ErlNode *) venp2)->creation)
            ? 0
            : 1);
}

static void*
node_table_alloc(void *venp_tmpl)
{
    ErlNode *enp;

    if(((ErlNode *) venp_tmpl) == erts_this_node)
        return venp_tmpl;

    enp = (ErlNode *) erts_alloc(ERTS_ALC_T_NODE_ENTRY, sizeof(ErlNode));

    node_entries++;

    erts_refc_init(&enp->refc, -1);
    enp->creation = ((ErlNode *) venp_tmpl)->creation;
    enp->sysname = ((ErlNode *) venp_tmpl)->sysname;
    enp->dist_entry = erts_find_or_insert_dist_entry(((ErlNode *) venp_tmpl)->sysname);

    return (void *) enp;
}

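erts_find_or_insert_node looks up a node by the combination of sysname and creation. If no node is found, it creates a new one and puts it into erts_node_table, whose entries are of type ErlNode. An ErlNode carries three key pieces of information: 1. sysname 2. creation 3. dist_entry. When a new node is created, what goes into dist_entry?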

The key line is this one:
enp->dist_entry = erts_find_or_insert_dist_entry(((ErlNode *) venp_tmpl)->sysname);
This dist_entry is looked up by sysname alone, not by the combination of sysname and creation.
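The asymmetry between the two keys is the heart of the matter, so here is a deliberately simplified C model of it. It is a sketch, not the ERTS implementation: plain linked lists replace the hash tables, there is no locking or reference counting, and the type and function names (dist, node, is_old_incarnation and so on) are invented for this illustration. Node names must outlive the tables; the sketch assumes string literals.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for DistEntry and ErlNode; not the ERTS definitions. */
typedef struct dist { const char *sysname; struct dist *next; } dist;
typedef struct node {
    const char  *sysname;
    unsigned     creation;
    dist        *dist_entry;
    struct node *next;
} node;

static dist *dists;
static node *nodes;

/* Dist entries are keyed by sysname only. */
dist *find_or_insert_dist(const char *sysname)
{
    dist *d;
    for (d = dists; d; d = d->next)
        if (strcmp(d->sysname, sysname) == 0)
            return d;
    d = malloc(sizeof *d);
    if (!d) abort();
    d->sysname = sysname;
    d->next = dists;
    dists = d;
    return d;
}

/* Nodes are keyed by (sysname, creation). */
node *find_or_insert_node(const char *sysname, unsigned creation)
{
    node *n;
    for (n = nodes; n; n = n->next)
        if (n->creation == creation && strcmp(n->sysname, sysname) == 0)
            return n;
    n = malloc(sizeof *n);
    if (!n) abort();
    n->sysname = sysname;
    n->creation = creation;
    n->dist_entry = find_or_insert_dist(sysname);  /* creation is ignored here */
    n->next = nodes;
    nodes = n;
    return n;
}

/* Mirrors the do_send test: a node other than ours that shares our dist entry. */
int is_old_incarnation(const node *this_node, const node *target)
{
    return target != this_node && target->dist_entry == this_node->dist_entry;
}
```

In this model, looking up the local node name with a stale creation returns a different node object that nevertheless points at the same dist entry as the current node, which is exactly the combination that the check in do_send reports as an old incarnation.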

This is where the trouble starts. Let's look closely at the dec_pid code again:

    node = dec_get_node(sysname, cre);
    if(node == erts_this_node) {
        *objp = make_internal_pid(data);
    } else {
        ...
        etp->node = node;
        ...
        *objp = make_external_pid(etp);
    }

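Because the creation differs, a lookup with the same sysname does not find the current node. The newly created node, however, gets the dist_entry that belongs to the current node.
The external pid object that is built contains this newly created node.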

So consider the three lines in the send path that lead to the warning:

    } else if (is_external_pid(to)) {
        dep = external_pid_dist_entry(to);
        if(dep == erts_this_dist_entry) {

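The external_pid_dist_entry macro takes the node out of the external pid and then the dist_entry out of that node. Unfortunately this dist_entry is the same as erts_this_dist_entry, and that is how we end up with the error above.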

After all that analysis things are finally taking shape. Time for a sip of water!
With this background in place we can run the demo:

$ erl -sname a
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
(a@rds064076)1> term_to_binary(self()).
<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,
  0,0,37,0,0,0,0,1>>
(a@rds064076)2> erlang:system_info(creation).
1

The last byte of the binary is the creation, and it matches the value returned by erlang:system_info(creation).
Now restart node a; this time creation should be 2.

$ erl -sname a
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
(a@rds064076)1> term_to_binary(self()).
<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,
  0,0,37,0,0,0,0,2>>
(a@rds064076)2> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,2>>).
<0.37.0>
(a@rds064076)3> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>).
<0.37.0>
(a@rds064076)4> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,1>>).
<0.37.0>
(a@rds064076)5> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,1>>)==self(). 
false
(a@rds064076)6> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,2>>)==self().
true
(a@rds064076)7> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>)==self().
false
(a@rds064076)8> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>)!ok.     
ok
(a@rds064076)9> erlang:system_info(creation).
2
=ERROR REPORT==== 28-Jun-2013::23:10:58 ===
Discarding message ok from <0.37.0> to <0.37.0> in an old incarnation (3) of this node (2)

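The demo shows that creation really is incremented by 1 on each restart, cyclically, and that although the pids print identically, they are in fact different pids because of creation.
At this point we roughly understand the cause and effect. But we have not yet answered the question asked above.
In his cluster only one node was restarted, and yet he got a whole screen of warnings.
Note: a whole screen!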

I designed another test case to dig deeper into this problem.
Before we start, we need the following program:

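$ cat test.erl
-module(test).
-export([start/0]).

%% Register the calling process as `test` and enter the receive loop.
start() ->
    register(test, self()),
    loop(undefined).

%% {set, Msg} stores Msg as the new state; {get, From} sends the
%% current state back to From and keeps it unchanged.
loop(State) ->
    receive
        {set, Msg} ->
            loop(Msg);
        {get, From} ->
            From ! State,
            loop(State)
    end.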

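The purpose of this code:
once the test:start process is running, it registers itself under the name test on the target node and accepts two kinds of messages, get and set. set stores the value supplied by the user, and get returns the stored value.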

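Our test case goes like this:
Start nodes a and b, then from node b use spawn to start the test:start process on node a; that process holds our information, which is the pid of the shell process on node b.
Then we simulate node b crashing and restarting, and fetch the previously saved pid back from the test process on node a. This pid prints the same as the pid of the newly started shell, but the two should not be truly identical, because their creations differ.
With that explained, let's run it: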

$ erl -name a@127.0.0.1
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
(a@127.0.0.1)1>

Good, node a is ready. Next, start node b and save the pid of its shell process on node a.

$ erl -name b@127.0.0.1
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
(b@127.0.0.1)1> R=spawn('a@127.0.0.1', test, start,[]).
<6002.42.0>
(b@127.0.0.1)2> self().
<0.37.0>
(b@127.0.0.1)3> R!{set, self()}.   
{set,<0.37.0>}
(b@127.0.0.1)4> R!{get, self()}.
{get,<0.37.0>}
(b@127.0.0.1)5> flush().
Shell got <0.37.0>
ok
(b@127.0.0.1)6> 
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
^C

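Now exit node b to simulate a crash, start it again, fetch the previously saved pid and compare it with the pid of the current shell: the two turn out not to be identical.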

$ erl -name b@127.0.0.1
Erlang R15B03 (erts-5.9.3.1) [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1 (abort with ^G)
(b@127.0.0.1)1> {test, 'a@127.0.0.1'}!{get, self()}.
{get,<0.37.0>}
(b@127.0.0.1)2> flush().
Shell got <0.37.0>
ok
(b@127.0.0.1)3> {test, 'a@127.0.0.1'}!{get, self()}, receive X->X end.
<0.37.0>
(b@127.0.0.1)4> T=v(-1).
<0.37.0>
(b@127.0.0.1)5> T==self().
false
(b@127.0.0.1)6> T!ok.
ok
(b@127.0.0.1)7>
=ERROR REPORT==== 28-Jun-2013::23:24:00 ===
Discarding message ok from <0.37.0> to <0.37.0> in an old incarnation (2) of this node (3)

Sending a message to the pid that was saved before the restart triggers the warning.

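This scenario is very common in distributed systems: the pids of cooperating processes are stored in systems on other nodes. When some of those processes crash and restart and the stored pids are fetched back, they turn out to be no longer valid.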

At this point we should have a good answer to the question asked above.

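So what is the solution?
Our system should call monitor_node on the other nodes it depends on and handle the nodedown message. When a node goes down, the processes associated with that node should be removed promptly, because they are effectively dead.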
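The bookkeeping behind this advice can be sketched independently of Erlang. The C sketch below, written in the language of the ERTS excerpts in this post, keeps a small table of remembered remote processes tagged with their node name and drops every entry of a node when that node is reported down. In a real Erlang system the trigger would be the {nodedown, Node} message delivered after erlang:monitor_node(Node, true); everything else here (the registry type, the integer handle standing in for a stored pid, the fixed capacity) is an assumption made for the illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SAVED 64

/* A remembered remote process: the node it lived on plus an opaque handle. */
typedef struct {
    char node[64];
    int  handle;       /* stands in for a stored pid */
} saved_proc;

typedef struct {
    saved_proc items[MAX_SAVED];
    size_t     count;
} registry;

/* Remember a process; returns 0 on success, -1 if full or the name is too long. */
int registry_add(registry *r, const char *node, int handle)
{
    if (r->count == MAX_SAVED || strlen(node) >= sizeof r->items[0].node)
        return -1;
    strcpy(r->items[r->count].node, node);
    r->items[r->count].handle = handle;
    r->count++;
    return 0;
}

/* Called on nodedown: drop every entry of that node, return how many were removed. */
size_t registry_nodedown(registry *r, const char *node)
{
    size_t kept = 0, i;
    for (i = 0; i < r->count; i++)
        if (strcmp(r->items[i].node, node) != 0)
            r->items[kept++] = r->items[i];
    i = r->count - kept;
    r->count = kept;
    return i;
}
```

With this kind of cleanup in place, a restarted node never gets handed a pid from its previous incarnation, so the discarded sends and the warnings that come with them do not occur in the first place.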

Summary: even a warning that looks harmless can hide a serious problem.

Have fun.

Categories: Erlang exploration, source code analysis Tags: creation, incarnation, system_info

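Comments (5)

huabin zhang
July 16th, 2013 at 15:38 | #1

I learned a lot. I have been following your blog for a long time, and this part was so good that I could not help surfacing to say so.

Yu Feng Reply:
July 16th, 2013 at 8:22 pm
Thanks for the kind words.

宋枭
April 25th, 2014 at 17:38 | #2

Great article, it cleared up my confusion.

shuqin
June 26th, 2014 at 23:46 | #3

In what situation would you need to store cooperating processes on another node? I cannot think of a use case.

Yu Feng Reply:
June 27th, 2014 at 10:32 pm
There are many, subscribers for example.

hejavac
September 5th, 2014 at 16:35 | #4

Nice, I ran into the same problem. Looks like I still have a long way to go in my own learning.

DenoFiend
November 26th, 2014 at 14:36 | #5

After restarting three times, this warning is no longer reported.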