Home > Erlang探索, 源码分析 > heart低级bug修复过程

heart低级bug修复过程

原创文章,转载请注明: 转载自系统技术非业余研究

本文链接地址: heart低级bug修复过程

昨天晚上@华侨E 同学在微博上问了个问题:

想跟你探讨下Erlang heart的一个问题?就是打开启heart的时候,调用heart:set_cmd/1了设置自启命令后,如果这个命令字符长度大于128以上的时候,再调用heart:get_cmd/0时就会获取不到上面设置的命令,接着引起系统挂起,并且与beam通讯60秒超时,然后发生重启,看了heart.c的代码也没发现什么问题。不知道你有什么思路?

heart是Erlang系统可靠性最后的防线,如果有问题后果很严重的, 晚节不保。

我们马上来重现下这个问题:

$ erl -heart
heart_beat_kill_pid = 29045
Erlang R17A (erts-5.11) [source-18d4e3e] [64-bit] [smp:16:16] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.11  (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2> heart:set_cmd(Cmd).
ok
3> heart:get_cmd().   
heart: Wed Oct 16 10:18:20 2013: heart-beat time-out, no activity for 63 seconds
Killed
$ sh: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: command not found
heart: Wed Oct 16 10:18:21 2013: Executed "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" -> 32512. Terminating.

果然调用heart:get_cmd的进程被挂起,63秒后系统vm进程被heart杀掉,试图重新启动一个新进程。上面的实验可以验证2个事情:
1. heart:get_cmd调用在Cmd超过128长度的时候被挂起
2. heart:set_cmd的结果是对的,我们设定的Cmd即使超过128也是正常的。
3. heart重启机制是正常的。

我们接着调查,祭出我们的利器 dbg ,来帮忙看下为什么get_cmd被挂起。
我们用dbg来跟踪下heart模块的函数调用情况:

$ erl -heart
heart_beat_kill_pid = 24249
Erlang R15B03 (erts-5.9.3.1)  [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2>  heart:set_cmd(Cmd).
ok
3> dbg:tracer(),dbg:p(all, [call]),dbg:tpl(heart, [{'_', [], [{return_trace}]}]). 
{ok,[{matched,nonode@nohost,22},{saved,1}]}
4> erlang:whereis(heart).
(<0.4.0>) 
5> (<0.4.0>) call heart:loop(<0.0.0>,#Port<0.11>,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
(<0.4.0>) call heart:send_heart_beat(#Port<0.11>)
(<0.4.0>) returned from heart:send_heart_beat/1 -> {<0.4.0>,{command,[2]}}

5> (<0.4.0>) call heart:loop(<0.0.0>,#Port<0.11>,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
(<0.4.0>) call heart:send_heart_beat(#Port<0.11>)
(<0.4.0>) returned from heart:send_heart_beat/1 -> {<0.4.0>,{command,[2]}}

5> heart:get_cmd().
(<0.31.0>) call heart:get_cmd()
(<0.31.0>) call heart:wait()
(<0.4.0>) call heart:get_heart_cmd(#Port<0.11>)
heart: Wed Oct 16 10:32:24 2013: heart-beat time-out, no activity for 63 seconds
Killed
$ sh: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: command not found
heart: Wed Oct 16 10:32:25 2013: Executed "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" -> 32512. Terminating.

从dbg的输出,我们很清晰的看到调用线程(<0.31.0>)堵在heart:wait()上,而heart这个进程(<0.4.0>)堵在heart:get_heart_cmd(#Port<0.11>)上,而且在堵之前,heart进程会定期发心跳包。正是这个心跳包保证heart机制的正常运作,在heart进程被
get_heart_cmd堵死之后,心跳包没了,导致63秒后重启的悲剧。

我们接着核对下代码:

wait() ->
    receive
	{heart, Res} ->
	    Res
    end.

get_heart_cmd(Port) ->
    Port ! {self(), {command, [?GET_CMD]}},
    receive
	{Port, {data, [?HEART_CMD | Cmd]}} ->
	    {ok, Cmd}
    end.

get_heart_cmd在给Port发了get_cmd命令以后就等在port回应上了,但是明显,port没有回复任何消息。
我们进一步再查看heart port的代码, 见erts/etc/common/heart.c

/*
 * message loop
 */
static int
message_loop(erlin_fd, erlout_fd)
     int   erlin_fd, erlout_fd;
{
...
	case GET_CMD:
				/* send back command string */
			        {
				    char *env = NULL;
				    char *command
					= (cmd[0]
					   ? (char *)cmd
					   : (env = get_env(HEART_COMMAND_ENV)));
				    /* Not set and not in env  return "" */
				    if (!command) command = "";
				    heart_cmd_reply(erlout_fd, command);
				    free_env_val(env);
				}
			        break;
...
}
/*
 * send back current command
 *
 * Sends an HEART_CMD.
 */
static int
heart_cmd_reply(int fd, char *s)
{
  struct msg m;
  int len = strlen(s);

  /* if s >= MSG_BODY_SIZE, return a write
   * failure immediately.
   */
  if (len >= sizeof(m.fill))
      return -1;

  m.op = HEART_CMD;
  m.len = htons(len + 1);	/* Include Op */
  strcpy((char*)m.fill, s);

  return write_message(fd, &m);
}

static int
write_message(fd, mp)
  int   fd;
  struct msg *mp;
{
  int   len;
  char* tmp;

  tmp = (char*) &(mp->len);
  len = (*tmp * 256) + *(tmp+1); 
  if ((len == 0) || (len > MSG_BODY_SIZE)) {
    return MSG_HDR_SIZE;
  }				/* cc68k wants (char *) */
  if (write(fd, (char *) mp, len + MSG_HDR_SIZE) != len + MSG_HDR_SIZE) {
    return -1;
  }
  return len + MSG_HDR_SIZE;
}

get_cmd关联函数只有这3个,很明显的看的出来,在长度超过128时候, char* tmp会出问题。
我们写个代码来验证下:

#include <arpa/inet.h>
#include <stdio.h>
void test(unsigned short mp_len)
{
int   len;
unsigned char* tmp;
// char* tmp;
tmp = (char*) &(mp_len);
len = (*tmp * 256) + *(tmp+1);

fprintf(stdout, "%d\n", len);
}

int main(int argc, char**argv)
{
  test( htons(128 + 1));
}

看下在128长度下len 二种情况下的值:

#unsigned char* tmp版本
$ gcc x.c && ./a.out
129

#char* tmp版本
$ gcc x.c && ./a.out
-127

所以len变成-127的情况下,下面语句必然出错,port没按预期的写出回包:

 if (write(fd, (char *) mp, len + MSG_HDR_SIZE) != len + MSG_HDR_SIZE) {
    return -1;
  }

找到原因,我们修正重新编译,验证下:

$ bin/erl -heart
heart_beat_kill_pid = 7986
Erlang R17A (erts-5.11) [source-18d4e3e] [64-bit] [smp:16:16] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.11  (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2> heart:set_cmd(Cmd).
ok
3> heart:get_cmd().   
{ok,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"}
4>

这把正常工作了,这个bug存在很广,分布在erlang的各个版本,大家要小心下。

小结:dbg很无敌。

祝玩得开心!

Post Footer automatically generated by wp-posturl plugin for wordpress.

Categories: Erlang探索, 源码分析 Tags: ,
  1. fxmy
    October 16th, 2013 at 22:37 | #1

    受益良多
    话说这种signed unsigned引起的符号问题真是数不胜数 🙂

    [Reply]

  2. huabin zhang
    October 26th, 2013 at 05:39 | #2

    这个定位问题的过程以及背后的debug思路对我非常有帮助,都是经验啊。
    多谢!

    [Reply]

    Yu Feng Reply:

    多谢支持!

    [Reply]

  1. No trackbacks yet.