Erlang coredump problems

June 27th, 2013

Original article. When reposting, please credit: reposted from 系统技术非业余研究

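This morning 成立涛 asked:

: :) We have had several outages recently. The node simply disappears for no apparent reason. There is no crash dump, and we have no clues at all.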

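When the Erlang VM is running normally and it detects an error in the Erlang program, or runs short of a resource such as memory, it writes an erl_crash.dump file. That file describes the cause of the crash and its context very clearly, so the problem is easy to locate. The VM itself, however, is implemented in C. If the VM has a bug, or the system uses a hand-written NIF, it is easy to bring down the VM itself. Once the VM is dead, it has no chance to write erl_crash.dump.
What should be produced in that case is an operating-system core file. If core dumps happen to be disabled on the system, the node appears to vanish for no reason.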

Here is an excerpt from one of our own cases. Our Erlang system used a NIF that was not thread-safe, so it ran into trouble at runtime and took down beam:

*** glibc detected *** …/ump_proxy/erts-5.9.2/bin/beam.smp: double free or corruption (fasttop): 0x00002aaad8006780 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3998a7245f]
/lib64/libc.so.6(cfree+0x4b)[0x3998a728bb]
/home/admin/rds2tae/clusters/ump_proxy/lib/cherly-0.12.8/priv/cherly.so(lru_remove_and_destroy+0x19)[0x2aaab846d849]
/home/admin/rds2tae/clusters/ump_proxy/lib/cherly-0.12.8/priv/cherly.so(cherly_remove+0x75)[0x2aaab846b0c5]
/home/admin/rds2tae/clusters/ump_proxy/lib/cherly-0.12.8/priv/cherly.so(cherly_put+0x193)[0x2aaab846b3e3]
/home/admin/rds2tae/clusters/ump_proxy/lib/cherly-0.12.8/priv/cherly.so[0x2aaab846b73d]
/home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp(process_main+0x6774)[0x53b104]
/home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp[0x4a62e3]
/home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp[0x5b4fc9]
/lib64/libpthread.so.0[0x399920673d]
/lib64/libc.so.6(clone+0x6d)[0x3998ad44bd]
======= Memory map: ========
00400000-00603000 r-xp 00000000 68:09 179078506 /home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp
00803000-00857000 rw-p 00203000 68:09 179078506 /home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp
00857000-0086e000 rw-p 00857000 00:00 0
1d843000-1dc54000 rw-p 1d843000 00:00 0 [heap]
406fb000-406fc000 ---p 406fb000 00:00 0

399e403000-399e404000 rw-p 00003000 68:02 192528 /lib64/libgthread-2.0.so.0.1200.3
2aaaaaaac000-2aaaabff1000 rw-p 2aaaaaaac000 00:00 0
2aaaac1e5000-2aaaac2e6000 rw-p 2aaaac1e5000 00:00 0
2aaaac3e6000-2aaaac4e7000 rw-p 2aaaac3e6000 00:00 0
2aaaac6[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed
heart: Sun Jun 23 07:41:32 2013: Erlang has closed.
heart: Sun Jun 23 07:41:34 2013: Executed "/home/admin/rds2tae/clusters/ump_proxy/bin/ump_proxy start". Terminating.

=====
===== LOGGING STARTED Sun Jun 23 07:41:34 CST 2013
=====
Exec: /home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/erlexec -boot /home/admin/rds2tae/clusters/ump_proxy/releases/2.3.6/ump_proxy -mode embedded -config /home/admin/rds2tae/clusters/ump_proxy/etc/sys.config -args_file /home/admin/rds2tae/clusters/ump_proxy/etc/vm.args -- console
Root: /home/admin/rds2tae/clusters/ump_proxy
heart_beat_kill_pid = 17047
Erlang R15B02 (erts-5.9.2) [64-bit] [smp:16:16] [async-threads:5] [hipe] [kernel-poll:true]

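The log shows that our VM crashed, and it also shows why: glibc detected a double free inside cherly.so. The heart program then restarted the system within a few seconds.
Because the service itself was not affected, the monitoring system only recorded that the VM had crashed once, and the system did not leave enough clues behind.
With OS core dumps enabled, we can observe what actually happened.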

First, let's verify the effect of having core dumps disabled versus enabled:

$ cat x.c
int main(int argc, char* argv[])
{
  *(char*)0x000  =0;
  return 0;
}
$ gcc -g x.c
$ ./a.out 
Segmentation fault
$ ulimit  -c 999999999
$ ls -al core.*
-rw------- 1 chuba users 184320 Jun 27 11:45 core.23021
$ gdb ./a.out core.23021 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/chuba/a.out...done.
[New Thread 23021]
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/a6/816913e0668c79e9ac0c257a1d28cdffe82e4a
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000400484 in main (argc=1, argv=0x7ffff9931d48) at x.c:3
3         *(char*)0x000  =0;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64
(gdb) 

As you can see, once the limit is raised with ulimit -c and the program is run again, we get a coredump file named core.23021, and gdb shows us why the program crashed.
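
Whether a core file appears depends on the limit inherited by the crashing process. The sketch below is one way to check that limit from a shell before starting a node. `core_limit` and `core_dumps_enabled` are helper names made up for this example, and the `core_pattern` check assumes Linux.

```shell
#!/bin/bash
# Sketch: report whether processes started from this shell may write core files.
# core_limit and core_dumps_enabled are hypothetical helper names.

core_limit() {
    # Soft limit on core file size: a number of blocks, or "unlimited".
    ulimit -c
}

core_dumps_enabled() {
    # A limit of 0 means the kernel will not write a core file.
    [ "$(core_limit)" != "0" ]
}

if core_dumps_enabled; then
    echo "core dumps enabled (limit: $(core_limit))"
else
    echo "core dumps disabled; raise the limit with ulimit -c before starting the node"
fi

# On Linux, core_pattern decides where the core file goes and how it is named.
if [ -r /proc/sys/kernel/core_pattern ]; then
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
fi
```

The limit is per process and inherited by children, so it has to be raised in the shell or script that launches the node, not in an unrelated terminal.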

Erlang can also force a coredump, which lets us verify that the whole setup works. Let me demonstrate:

$ erl
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1 (abort with ^G)
2> os:getpid().
"2294"
3> erlang:halt(abort).
Aborted (core dumped)
$ ls -al core.*
-rw------- 1 chuba users 290553856 Jun 27 11:22 core.2294

Using erlang:halt(abort) to force a VM failure simulates a production fault, which gives us a chance to design the system so that it catches such failures.
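
Such a drill can also be run unattended. The following is a minimal sketch, assuming `erl` is on the PATH and that the kernel's core_pattern writes core files into the working directory (on some distributions they are piped to a crash handler instead). `force_vm_abort` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch of a core-dump drill. force_vm_abort is a hypothetical helper name.
# Usage: force_vm_abort SCRATCH_DIR

force_vm_abort() {
    local dir=${1:-.}

    if ! command -v erl >/dev/null 2>&1; then
        echo "erl not found in PATH" >&2
        return 127
    fi

    (
        cd "$dir" || exit 1
        # Raise the soft limit for this subshell only; this fails if the
        # hard limit is lower.
        ulimit -c unlimited || exit 1
        # Start a throwaway VM and abort it right away.
        erl -noshell -eval 'erlang:halt(abort).'
    )

    # Assumption: the core file is named core or core.PID and lands in $dir.
    ls -l "$dir"/core* 2>/dev/null || {
        echo "no core file in $dir; check ulimit -c and core_pattern" >&2
        return 1
    }
}
```

Calling `force_vm_abort` with an existing scratch directory should list a fresh core file if the machine is configured correctly, and a non-zero return value is the signal that a real VM crash would leave no trace.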

Erlang also provides a way to debug these failures: cerl, which comes with a set of powerful gdb commands to help users investigate. Let me demonstrate:

$ bin/cerl -break main
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/chuba/.kerl/builds/r16b01_rc1/otp_src_git/bin/x86_64-unknown-linux-gnu/beam.smp...done.
%---------------------------------------------------------------------------
% Use etp-help for a command overview and general help.
%
% To use the Erlang support module, the environment variable ROOTDIR
% must be set to the toplevel installation directory of Erlang/OTP,
% so the etp-commands file becomes:
%     $ROOTDIR/erts/etc/unix/etp-commands
% Also, erl and erlc must be in the path.
%---------------------------------------------------------------------------
etp-set-max-depth 20
etp-set-max-string-length 100
--------------- System Information ---------------
OTP release: R16B01
ERTS version: 5.10.2
Compile date: Sat Jun 15 13:49:06 2013
Arch: x86_64-unknown-linux-gnu
Endianess: Little
Word size: 64-bit
Halfword: no
HiPE support: yes
SMP support: yes
Thread support: yes
Kernel poll: Supported
Debug compiled: no
Lock checking: no
Lock counting: no
System not initialized
--------------------------------------------------
(gdb) r
Starting program: /home/chuba/.kerl/builds/r16b01_rc1/otp_src_git/bin/x86_64-unknown-linux-gnu/beam.smp -- -root /home/chuba/.kerl/builds/r16b01_rc1/otp_src_git -progname /home/chuba/.kerl/builds/r16b01_rc1/otp_src_git/bin/cerl -- -home /home/chuba --
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff4cff700 (LWP 10751)]
...
[New Thread 0x7fffe8cf0700 (LWP 10779)]
[New Thread 0x7fffe82ef700 (LWP 10780)]
Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:16:16] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.2  (abort with ^G)
1> erlang:halt(abort).

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffee6f9700 (LWP 10770)]
0x000000322aa32885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64
(gdb) bt
#0  0x000000322aa32885 in raise () from /lib64/libc.so.6
#1  0x000000322aa34065 in abort () from /lib64/libc.so.6
#2  0x0000000000450f9b in erl_exit_vv (n=-2147483647, flush_async=<value optimized out>, fmt=0x5d05d4 "", 
    args1=0x7fffee6f8c00, args2=0x7fffee6f8be0) at beam/erl_init.c:1788
#3  0x0000000000451197 in erl_exit (n=10745, fmt=0x6 <Address 0x6 out of bounds>) at beam/erl_init.c:1798
#4  0x000000000047fdca in halt_1 (A__p=0x7ffff4f40390, BIF__ARGS=0x7ffff6218480) at beam/bif.c:3909
#5  0x00000000005391b7 in process_main () at beam/beam_emu.c:3364
#6  0x00000000004a55e3 in sched_thread_func (vesdp=0x7ffff4201cc0) at beam/erl_process.c:5738
#7  0x00000000005b67a6 in thr_wrapper (vtwd=0x7fffffffdd70) at pthread/ethread.c:106
#8  0x000000322ae077f1 in start_thread () from /lib64/libpthread.so.0
#9  0x000000322aae570d in clone () from /lib64/libc.so.6
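
The same kind of backtrace can be pulled from a core file after the fact, in the same way as `gdb ./a.out core.23021` earlier. A sketch, with `core_backtrace` as a hypothetical helper name; the executable passed in must be the same beam binary that produced the core:

```shell
#!/bin/bash
# Sketch: print the backtrace of every thread recorded in a core file.
# core_backtrace is a hypothetical helper name.
# Usage: core_backtrace BEAM_EXECUTABLE CORE_FILE

core_backtrace() {
    local exe=$1 core=$2

    if [ ! -r "$exe" ] || [ ! -r "$core" ]; then
        echo "usage: core_backtrace BEAM_EXECUTABLE CORE_FILE" >&2
        return 2
    fi

    # -batch runs the given commands and exits instead of opening a prompt.
    gdb -batch -ex 'thread apply all bt' "$exe" "$core"
}
```

Printing all threads matters for the SMP emulator, because the thread that crashed is not necessarily the one gdb selects first.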

Summary: a heartbeat and a logging system are essential; they help improve the stability of the system.
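
As an illustration of how these pieces might fit into a start script, here is a sketch under stated assumptions: `start_with_heart`, the node name `example_node` and the arguments are invented for this example, while `-heart`, `HEART_COMMAND` and `ERL_CRASH_DUMP` are standard Erlang/OTP settings. A real release would normally start through its own generated script rather than a bare `erl`.

```shell
#!/bin/bash
# Sketch of a start wrapper. start_with_heart and example_node are
# hypothetical; adapt the erl command line to the real release.
# Usage: start_with_heart RESTART_COMMAND [CRASH_DUMP_FILE]

start_with_heart() {
    local restart_cmd=$1
    local crash_dump=${2:-erl_crash.dump}

    if [ -z "$restart_cmd" ]; then
        echo "usage: start_with_heart RESTART_COMMAND [CRASH_DUMP_FILE]" >&2
        return 2
    fi

    # Let the kernel write a core file if the VM itself dies.
    ulimit -c unlimited || echo "warning: could not raise the core limit" >&2

    # Command that heart runs once the VM stops responding.
    export HEART_COMMAND="$restart_cmd"
    # Where the VM writes its crash dump for failures it can still report.
    export ERL_CRASH_DUMP="$crash_dump"

    exec erl -heart -noinput -sname example_node
}
```

With this arrangement a failure the VM can report leaves erl_crash.dump, a failure that kills the VM leaves a core file, and in both cases heart brings the node back.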

Have fun.

  1. xiongli
    March 19th, 2015 at 10:58 | #1

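    Hello 霸爷!
    I have been stress-testing an Erlang node recently. After a while the node dies, and neither an erlang_dump file nor a system core file is produced. Where should I start looking for the cause?
    $ ulimit -c
    unlimited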

    Yu Feng Reply:

    First confirm that OS core dumps are really enabled, then have a look at the crash dump articles on my blog.
