系统技术非业余研究heart低级bug修复过程

Home > Erlang探索, 源码分析 > heart低级bug修复过程

heart低级bug修复过程

October 16th, 2013 Yu Feng

原创文章，转载请注明： 转载自系统技术非业余研究

昨天晚上@华侨E 同学在微博上问了个问题：

想跟你探讨下Erlang heart的一个问题？就是打开启heart的时候，调用heart:set_cmd/1了设置自启命令后，如果这个命令字符长度大于128以上的时候，再调用heart:get_cmd/0时就会获取不到上面设置的命令，接着引起系统挂起，并且与beam通讯60秒超时，然后发生重启，看了heart.c的代码也没发现什么问题。不知道你有什么思路?

heart是Erlang系统可靠性最后的防线，如果有问题后果很严重的, 晚节不保。

我们马上来重现下这个问题：

$ erl -heart
heart_beat_kill_pid = 29045
Erlang R17A (erts-5.11) [source-18d4e3e] [64-bit] [smp:16:16] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.11  (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2> heart:set_cmd(Cmd).
ok
3> heart:get_cmd().   
heart: Wed Oct 16 10:18:20 2013: heart-beat time-out, no activity for 63 seconds
Killed
$ sh: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: command not found
heart: Wed Oct 16 10:18:21 2013: Executed "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" -> 32512. Terminating.

果然调用heart:get_cmd的进程被挂起，63秒后系统vm进程被heart杀掉，试图重新启动一个新进程。上面的实验可以验证2个事情：
1. heart:get_cmd调用在Cmd超过128长度的时候被挂起
2. heart:set_cmd的结果是对的，我们设定的Cmd即使超过128也是正常的。
3. heart重启机制是正常的。

我们接着调查，祭出我们的利器 dbg ，来帮忙看下为什么get_cmd被挂起。
我们用dbg来跟踪下heart模块的函数调用情况：

$ erl -heart
heart_beat_kill_pid = 24249
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2>  heart:set_cmd(Cmd).
ok
3> dbg:tracer(),dbg:p(all, [call]),dbg:tpl(heart, [{'_', [], [{return_trace}]}]). 
{ok,[{matched,nonode@nohost,22},{saved,1}]}
4> erlang:whereis(heart).
(<0.4.0>) 
5> (<0.4.0>) call heart:loop(<0.0.0>,#Port<0.11>,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
(<0.4.0>) call heart:send_heart_beat(#Port<0.11>)
(<0.4.0>) returned from heart:send_heart_beat/1 -> {<0.4.0>,{command,[2]}}

5> (<0.4.0>) call heart:loop(<0.0.0>,#Port<0.11>,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
(<0.4.0>) call heart:send_heart_beat(#Port<0.11>)
(<0.4.0>) returned from heart:send_heart_beat/1 -> {<0.4.0>,{command,[2]}}

5> heart:get_cmd().
(<0.31.0>) call heart:get_cmd()
(<0.31.0>) call heart:wait()
(<0.4.0>) call heart:get_heart_cmd(#Port<0.11>)
heart: Wed Oct 16 10:32:24 2013: heart-beat time-out, no activity for 63 seconds
Killed
$ sh: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: command not found
heart: Wed Oct 16 10:32:25 2013: Executed "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" -> 32512. Terminating.

从dbg的输出，我们很清晰的看到调用线程(<0.31.0>)堵在heart:wait()上，而heart这个进程(<0.4.0>)堵在heart:get_heart_cmd(#Port<0.11>)上，而且在堵之前，heart进程会定期发心跳包。正是这个心跳包保证heart机制的正常运作，在heart进程被
get_heart_cmd堵死之后，心跳包没了，导致63秒后重启的悲剧。

我们接着核对下代码：

wait() ->
    receive
	{heart, Res} ->
	    Res
    end.

get_heart_cmd(Port) ->
    Port ! {self(), {command, [?GET_CMD]}},
    receive
	{Port, {data, [?HEART_CMD | Cmd]}} ->
	    {ok, Cmd}
    end.

get_heart_cmd在给Port发了get_cmd命令以后就等在port回应上了，但是明显，port没有回复任何消息。
我们进一步再查看heart port的代码，见erts/etc/common/heart.c

/*
 * message loop
 */
static int
message_loop(erlin_fd, erlout_fd)
     int   erlin_fd, erlout_fd;
{
...
	case GET_CMD:
				/* send back command string */
			        {
				    char *env = NULL;
				    char *command
					= (cmd[0]
					   ? (char *)cmd
					   : (env = get_env(HEART_COMMAND_ENV)));
				    /* Not set and not in env  return "" */
				    if (!command) command = "";
				    heart_cmd_reply(erlout_fd, command);
				    free_env_val(env);
				}
			        break;
...
}
/*
 * send back current command
 *
 * Sends an HEART_CMD.
 */
static int
heart_cmd_reply(int fd, char *s)
{
  struct msg m;
  int len = strlen(s);

  /* if s >= MSG_BODY_SIZE, return a write
   * failure immediately.
   */
  if (len >= sizeof(m.fill))
      return -1;

  m.op = HEART_CMD;
  m.len = htons(len + 1);	/* Include Op */
  strcpy((char*)m.fill, s);

  return write_message(fd, &m);
}

static int
write_message(fd, mp)
  int   fd;
  struct msg *mp;
{
  int   len;
  char* tmp;

  tmp = (char*) &(mp->len);
  len = (*tmp * 256) + *(tmp+1); 
  if ((len == 0) || (len > MSG_BODY_SIZE)) {
    return MSG_HDR_SIZE;
  }				/* cc68k wants (char *) */
  if (write(fd, (char *) mp, len + MSG_HDR_SIZE) != len + MSG_HDR_SIZE) {
    return -1;
  }
  return len + MSG_HDR_SIZE;
}

get_cmd关联函数只有这3个，很明显的看的出来，在长度超过128时候， char* tmp会出问题。
我们写个代码来验证下：

#include <arpa/inet.h>
#include <stdio.h>
void test(unsigned short mp_len)
{
int   len;
unsigned char* tmp;
// char* tmp;
tmp = (char*) &(mp_len);
len = (*tmp * 256) + *(tmp+1);

fprintf(stdout, "%d\n", len);
}

int main(int argc, char**argv)
{
  test( htons(128 + 1));
}

看下在128长度下len 二种情况下的值：

#unsigned char* tmp版本
$ gcc x.c && ./a.out
129

#char* tmp版本
$ gcc x.c && ./a.out
-127

所以len变成-127的情况下，下面语句必然出错，port没按预期的写出回包：

 if (write(fd, (char *) mp, len + MSG_HDR_SIZE) != len + MSG_HDR_SIZE) {
    return -1;
  }

找到原因，我们修正重新编译，验证下：

$ bin/erl -heart
heart_beat_kill_pid = 7986
Erlang R17A (erts-5.11) [source-18d4e3e] [64-bit] [smp:16:16] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.11  (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2> heart:set_cmd(Cmd).
ok
3> heart:get_cmd().   
{ok,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"}
4>

这把正常工作了，这个bug存在很广，分布在erlang的各个版本，大家要小心下。

小结：dbg很无敌。

祝玩得开心！

Post Footer automatically generated by wp-posturl plugin for wordpress.

Categories: Erlang探索, 源码分析 Tags: get_cmd, heart

Comments (2)

fxmy

October 16th, 2013 at 22:37 | #1

Reply | Quote

受益良多
话说这种signed unsigned引起的符号问题真是数不胜数 🙂
huabin zhang

October 26th, 2013 at 05:39 | #2

Reply | Quote

这个定位问题的过程以及背后的debug思路对我非常有帮助，都是经验啊。
多谢！

Yu Feng Reply:
October 28th, 2013 at 3:06 pm
多谢支持！

Comments are closed.

决定vheap大小的golden ratio算法（1.61803398875）鉴赏调查使用binary最多TOPN进程

系统技术非业余研究

heart低级bug修复过程

buy me a coffee.

Recent Posts

Recent Comments

Categories

Blogroll

Archives

Meta

系统技术非业余研究

heart低级bug修复过程

Share this:

buy me a coffee.

Recent Posts

Recent Comments

Categories

Blogroll

Archives

Meta