heart低级bug修复过程
原创文章,转载请注明: 转载自系统技术非业余研究
本文链接地址: heart低级bug修复过程
昨天晚上@华侨E 同学在微博上问了个问题:
想跟你探讨下Erlang heart的一个问题?就是打开启heart的时候,调用heart:set_cmd/1了设置自启命令后,如果这个命令字符长度大于128以上的时候,再调用heart:get_cmd/0时就会获取不到上面设置的命令,接着引起系统挂起,并且与beam通讯60秒超时,然后发生重启,看了heart.c的代码也没发现什么问题。不知道你有什么思路?
heart是Erlang系统可靠性最后的防线,如果有问题后果很严重的, 晚节不保。
我们马上来重现下这个问题:
$ erl -heart
heart_beat_kill_pid = 29045
Erlang R17A (erts-5.11) [source-18d4e3e] [64-bit] [smp:16:16] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.11 (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2> heart:set_cmd(Cmd).
ok
3> heart:get_cmd().
heart: Wed Oct 16 10:18:20 2013: heart-beat time-out, no activity for 63 seconds
Killed
$ sh: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: command not found
heart: Wed Oct 16 10:18:21 2013: Executed "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" -> 32512. Terminating.
果然调用heart:get_cmd的进程被挂起,63秒后系统vm进程被heart杀掉,试图重新启动一个新进程。上面的实验可以验证2个事情:
1. heart:get_cmd调用在Cmd超过128长度的时候被挂起
2. heart:set_cmd的结果是对的,我们设定的Cmd即使超过128也是正常的。
3. heart重启机制是正常的。
我们接着调查,祭出我们的利器 dbg ,来帮忙看下为什么get_cmd被挂起。
我们用dbg来跟踪下heart模块的函数调用情况:
$ erl -heart
heart_beat_kill_pid = 24249
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.9.3.1 (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2> heart:set_cmd(Cmd).
ok
3> dbg:tracer(),dbg:p(all, [call]),dbg:tpl(heart, [{'_', [], [{return_trace}]}]).
{ok,[{matched,nonode@nohost,22},{saved,1}]}
4> erlang:whereis(heart).
(<0.4.0>)
5> (<0.4.0>) call heart:loop(<0.0.0>,#Port<0.11>,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
(<0.4.0>) call heart:send_heart_beat(#Port<0.11>)
(<0.4.0>) returned from heart:send_heart_beat/1 -> {<0.4.0>,{command,[2]}}
5> (<0.4.0>) call heart:loop(<0.0.0>,#Port<0.11>,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
(<0.4.0>) call heart:send_heart_beat(#Port<0.11>)
(<0.4.0>) returned from heart:send_heart_beat/1 -> {<0.4.0>,{command,[2]}}
5> heart:get_cmd().
(<0.31.0>) call heart:get_cmd()
(<0.31.0>) call heart:wait()
(<0.4.0>) call heart:get_heart_cmd(#Port<0.11>)
heart: Wed Oct 16 10:32:24 2013: heart-beat time-out, no activity for 63 seconds
Killed
$ sh: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: command not found
heart: Wed Oct 16 10:32:25 2013: Executed "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" -> 32512. Terminating.
从dbg的输出,我们很清晰的看到调用线程(<0.31.0>)堵在heart:wait()上,而heart这个进程(<0.4.0>)堵在heart:get_heart_cmd(#Port<0.11>)上,而且在堵之前,heart进程会定期发心跳包。正是这个心跳包保证heart机制的正常运作,在heart进程被
get_heart_cmd堵死之后,心跳包没了,导致63秒后重启的悲剧。
我们接着核对下代码:
wait() ->
receive
{heart, Res} ->
Res
end.
get_heart_cmd(Port) ->
Port ! {self(), {command, [?GET_CMD]}},
receive
{Port, {data, [?HEART_CMD | Cmd]}} ->
{ok, Cmd}
end.
get_heart_cmd在给Port发了get_cmd命令以后就等在port回应上了,但是明显,port没有回复任何消息。
我们进一步再查看heart port的代码, 见erts/etc/common/heart.c
/*
* message loop
*/
static int
message_loop(erlin_fd, erlout_fd)
int erlin_fd, erlout_fd;
{
...
case GET_CMD:
/* send back command string */
{
char *env = NULL;
char *command
= (cmd[0]
? (char *)cmd
: (env = get_env(HEART_COMMAND_ENV)));
/* Not set and not in env return "" */
if (!command) command = "";
heart_cmd_reply(erlout_fd, command);
free_env_val(env);
}
break;
...
}
/*
* send back current command
*
* Sends an HEART_CMD.
*/
static int
heart_cmd_reply(int fd, char *s)
{
struct msg m;
int len = strlen(s);
/* if s >= MSG_BODY_SIZE, return a write
* failure immediately.
*/
if (len >= sizeof(m.fill))
return -1;
m.op = HEART_CMD;
m.len = htons(len + 1); /* Include Op */
strcpy((char*)m.fill, s);
return write_message(fd, &m);
}
static int
write_message(fd, mp)
int fd;
struct msg *mp;
{
int len;
char* tmp;
tmp = (char*) &(mp->len);
len = (*tmp * 256) + *(tmp+1);
if ((len == 0) || (len > MSG_BODY_SIZE)) {
return MSG_HDR_SIZE;
} /* cc68k wants (char *) */
if (write(fd, (char *) mp, len + MSG_HDR_SIZE) != len + MSG_HDR_SIZE) {
return -1;
}
return len + MSG_HDR_SIZE;
}
get_cmd关联函数只有这3个,很明显的看的出来,在长度超过128时候, char* tmp会出问题。
我们写个代码来验证下:
#include <arpa/inet.h>
#include <stdio.h>
void test(unsigned short mp_len)
{
int len;
unsigned char* tmp;
// char* tmp;
tmp = (char*) &(mp_len);
len = (*tmp * 256) + *(tmp+1);
fprintf(stdout, "%d\n", len);
}
int main(int argc, char**argv)
{
test( htons(128 + 1));
}
看下在128长度下len 二种情况下的值:
#unsigned char* tmp版本 $ gcc x.c && ./a.out 129 #char* tmp版本 $ gcc x.c && ./a.out -127
所以len变成-127的情况下,下面语句必然出错,port没按预期的写出回包:
if (write(fd, (char *) mp, len + MSG_HDR_SIZE) != len + MSG_HDR_SIZE) {
return -1;
}
找到原因,我们修正重新编译,验证下:
$ bin/erl -heart
heart_beat_kill_pid = 7986
Erlang R17A (erts-5.11) [source-18d4e3e] [64-bit] [smp:16:16] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.11 (abort with ^G)
1> Cmd=string:copies("a",128).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
2> heart:set_cmd(Cmd).
ok
3> heart:get_cmd().
{ok,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"}
4>
这把正常工作了,这个bug存在很广,分布在erlang的各个版本,大家要小心下。
小结:dbg很无敌。
祝玩得开心!
Post Footer automatically generated by wp-posturl plugin for wordpress.
受益良多
话说这种signed unsigned引起的符号问题真是数不胜数 🙂
这个定位问题的过程以及背后的debug思路对我非常有帮助,都是经验啊。
多谢!
Yu Feng Reply:
October 28th, 2013 at 3:06 pm
多谢支持!