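Using strace to find out why a process hangs
Recently I ran into a process that hung, but I could not reliably reproduce the problem while debugging: it only triggered after the process had been running for a while under certain conditions. When the process is still running and the scene must not be disturbed, we can attach to it with gdb -p and strace -p.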
First, use ps auxf to see how far the process has gotten:
We can see that it ran docker exec -i 178.20.1.229_0115034556 ls and then hung.
Next, use strace to see which system call this operation is stuck in:
To find out what descriptor 19 refers to, look in /proc/pid/fd:
Descriptor 19 turns out to be a pipe, so the process is blocked reading from a pipe.
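The /proc/pid/fd lookup described above can also be scripted. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper name list_fds is hypothetical and not part of the original investigation.

```python
import os


def list_fds(pid="self"):
    """Return {fd_number: link_target} for a process, like `ls -l /proc/<pid>/fd`."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


if __name__ == "__main__":
    for fd, target in sorted(list_fds().items()):
        print(fd, "->", target)
```

Passing the PID of the hung process instead of the default "self" lists that process's descriptors; this normally requires root or the same user as the target process.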
/************************************************/
The divider above marks a later recurrence of the problem. This time we first used ps auxf to find the process ID and how far the process had gotten. The process ID is 27678, and it is stuck at docker exec:
root 27678 0.3 0.4 512172 16500 Sl python /wns/cloud/app/com_host/main.pyc
root 25011 0.0 0.0 4332 652 S \_ /bin/sh -c docker exec -i mongo_docker_master ls
root 25014 0.0 0.2 136592 10600 Sl \_ docker exec -i mongo_docker_master ls
Tracing further with strace -p 27678 shows that it is blocked in read on file descriptor 14:
root@localhost:/# strace -p 27678
Process 27678 attached
read(14,
Next, cd into /proc/27678/, where we can inspect the process state:
root@localhost:/proc/27678# cat status
Name: python
State: S (sleeping)
Tgid: 27678
Ngid: 0
Pid: 27678
PPid: 27677
Then look at the process's kernel stack for debugging information; wchan gives the kernel function in which the process is sleeping or waiting:
root@localhost:/proc/27678# cat stack
[<ffffffff811a91ab>] pipe_wait+0x6b/0x90
[<ffffffff811a9c04>] pipe_read+0x344/0x4f0
[<ffffffff811a00bf>] do_sync_read+0x7f/0xb0
[<ffffffff811a0681>] vfs_read+0xb1/0x130
[<ffffffff811a1110>] SyS_read+0x80/0xe0
[<ffffffff818d4c49>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
root@localhost:/proc/27678# cat wchan
pipe_wait
Now check what file descriptor 14 of the process refers to. It is a pipe:
root@localhost:/proc/27678# ls -l ./fd
total 0
lr-x------ 1 root root 64 Mar 26 17:19 0 -> pipe:[30690124]
l-wx------ 1 root root 64 Mar 26 17:19 1 -> pipe:[30690125]
lrwx------ 1 root root 64 Mar 26 17:19 10 -> socket:[30691732]
lr-x------ 1 root root 64 Mar 26 17:19 11 -> /dev/urandom
lrwx------ 1 root root 64 Mar 26 17:19 12 -> socket:[30719611]
lrwx------ 1 root root 64 Mar 26 17:19 13 -> socket:[30719610]
lr-x------ 1 root root 64 Mar 26 17:19 14 -> pipe:[38483750]
At this point we can be sure that main spawned a child process to run the shell command docker exec -i mongo_docker_master ls, that it communicates with the child through a pipe, and that it is stuck reading from that pipe.
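The status and wchan checks from /proc can be collected in one step. This is only a sketch under the assumption of a Linux /proc filesystem; read_state is a hypothetical helper name, and /proc/pid/stack is left out because reading it usually requires root.

```python
def read_state(pid="self"):
    """Return the State field and the wchan symbol of a process, read from /proc."""
    state = None
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key == "State":
                state = value.strip()
                break
    try:
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip()
    except OSError:
        # wchan may be unavailable, depending on kernel configuration and permissions.
        wchan = None
    return {"state": state, "wchan": wchan}


if __name__ == "__main__":
    print(read_state())
```

For the process in this article, read_state(27678) would be expected to report a sleeping state and pipe_wait, matching the cat status and cat wchan output shown earlier.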
We can also use lsof to locate the problem here. It shows that FD 14 of process 27678 is a pipe; in the FD column, u means the file is open for reading and writing, and r means it is open for reading:
sangfor ~ # lsof -d 14
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
mongod 1907 root 14u REG 251,0 36864 130683 /wns/data/mongodb/db/collection-7--588642557116981989.wt
syslog-ng 3446 root 14u unix 0xffff88012227d800 0t0 40557736 /dev/log
dockerd 4025 root 14u unix 0xffff8800b8d5d800 0t0 13941 /run/docker/libnetwork/a73bd949b5fbb89c2b8bec3b4ac6af0a948a944958c8b037d9e6c9b324b44331.sock
docker-co 9382 root 14u 0000 0,9 0 9553 anon_inode
docker-co 21204 root 14u 0000 0,9 0 9553 anon_inode
python 27678 root 14r FIFO 0,8 0t0 38483750 pipe
Alternatively, list everything that process 27678 has open; again, FD 14 is a pipe:
sangfor ~ # lsof -p 27678
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 27678 root 0r FIFO 0,8 0t0 30690124 pipe
python 27678 root 1w FIFO 0,8 0t0 30690125 pipe
python 27678 root 2w FIFO 0,8 0t0 30690126 pipe
python 27678 root 3u 0000 0,9 0 9553 anon_inode
python 27678 root 4u 0000 0,9 0 9553 anon_inode
python 27678 root 5u pack 30691718 0t0 unknown type=SOCK_RAW
python 27678 root 6w REG 251,0 76106652 130565 /wns/data/com_host/etc/config/err.log
python 27678 root 7u IPv4 30691716 0t0 TCP Sangfor:53102->Sangfor:42457 (ESTABLISHED)
python 27678 root 8u IPv4 30691717 0t0 TCP Sangfor:42457->Sangfor:53102 (ESTABLISHED)
python 27678 root 9u IPv4 30691731 0t0 TCP db.sdwan:54072->sdwan.io:27017 (ESTABLISHED)
python 27678 root 10u IPv4 30691732 0t0 TCP db.sdwan:54074->sdwan.io:27017 (ESTABLISHED)
python 27678 root 11r CHR 1,9 0t0 30690329 /dev/urandom
python 27678 root 12u IPv4 30719611 0t0 TCP db.sdwan:51404->db.sdwan:37017 (ESTABLISHED)
python 27678 root 13u IPv4 30719610 0t0 TCP db.sdwan:47124->db.sdwan:27017 (ESTABLISHED)
python 27678 root 14r FIFO 0,8 0t0 38483750 pipe
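Once the pipe's node number is known (38483750 above), a natural follow-up is to ask which processes hold that same pipe, since one of them is the writer that main is waiting on. The original investigation does not show this step; the sketch below is one possible way to do it on Linux, and pipe_holders is a hypothetical helper name.

```python
import os


def pipe_holders(inode):
    """Return {pid: [fd, ...]} for every visible process that has pipe:[inode] open."""
    wanted = f"pipe:[{inode}]"
    holders = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = f"/proc/{entry}/fd"
        try:
            names = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for name in names:
            try:
                if os.readlink(os.path.join(fd_dir, name)) == wanted:
                    holders.setdefault(int(entry), []).append(int(name))
            except OSError:
                continue  # descriptor closed while we were scanning
    return holders


if __name__ == "__main__":
    import sys
    print(pipe_holders(int(sys.argv[1])))
```

Run as root so that other processes' descriptors are visible, for example with 38483750 as the argument. Whether a given descriptor is the read or the write end can then be read off the permissions in ls -l /proc/pid/fd (lr-x for read, l-wx for write), as in the listing shown earlier.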
