strace 是 Linux 平台和一些类 Unix 平台调试工具,可以方便的查看程序执行时的系统调用和接收到的信号。它几乎存在于所有的 Linux 系统中并且可以解决非常多的问题。
strace -p pid 跟踪某个后台进程 strace -o filename 跟踪结果输出到文件 strace -T 记录每个系统调用花费的时间 strace -c -p pid 记录时间,调用次数,错误数,CPU 时钟等 strace -t (或者 -tt)记录每个系统调用发生是的时间(时分秒的格式) strace -f 跟踪子进程 strace -e trace=nanosleep 只记录相关的系统调用信息。 -e trace=network 只记录和网络api相关的系统调用 -e trace=file 只记录涉及到文件名的系统调用 -e trace=desc 只记录涉及到文件句柄的系统调用
有多个版本的库安装在系统中,不知道系统加载的是哪个库或配置文件时:
$ strace php 2>&1 | grep php.ini open("/usr/bin/php.ini", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory) open("/etc/php5/cli/php.ini", O_RDONLY|O_LARGEFILE) = 3 lstat64("/etc/php5/cli/php.ini", {st_mode=S_IFREG|0644, st_size=68105, ...}) = 0
由上可知,php 读取的是 /etc/php5/cli/php.ini.
-e 参数可以指定想查看的系统调用
$ strace -e open php
当一个进程突然消耗大量的 CPU 或看上去死掉了的时候,先找出进程的 pid 然后:
$ strace -p 22598 Process 1124 attached - interrupt to quit accept(0, $ strace -p 4456 Process 4456 attached - interrupt to quit waitpid(4464,
第一个进程正在等待建立网络连接,第二个在等待另一个进程改变状态。 上面 2 个一个是正常的等待请求的 nginx,另一个是 flock 进程。
strace -p 可以省去大量猜测的工作。
在编译应用时可以打开 profiling,它可以提供详细的信息,特别是你代码的哪一部分消耗了时间。更有用的是可以快速使用 strace 查看某个进程在花时间做什么并以此来诊断问题。CPU 大量消耗到底是因为在干活还是别的什么原因。
$ strace -c -p 23134 Process 23134 attached - interrupt to quit Process 23134 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 96.62 0.001000 31 32 poll 3.38 0.000035 0 134 clock_gettime 0.00 0.000000 0 1 restart_syscall 0.00 0.000000 0 6 close 0.00 0.000000 0 18 time 0.00 0.000000 0 5 nanosleep 0.00 0.000000 0 6 rt_sigaction 0.00 0.000000 0 12 rt_sigprocmask 0.00 0.000000 0 12 fcntl64 0.00 0.000000 0 6 socket 0.00 0.000000 0 6 6 connect 0.00 0.000000 0 6 send 0.00 0.000000 0 7 recv 0.00 0.000000 0 6 getsockopt ------ ----------- ----------- --------- --------- ---------------- 100.00 0.001035 257 6 total
使用 -c -p 后等待足够的时间,然后 ctrl-c 中断,strace 会打印出像上面的结果。
查找为什么一个进程不能连接指定服务器通常是非常痛苦的。有可能是 DNS 失败,连接被挂起,服务器返回一些异常值等待。 有人可能会想到另一个强大的工具 tcpdump 可以分析这些问题,但是它的参数和输出比较复杂。当你想在成百的连接数据库进程在做什么的时候是非常费力的事情,但是 strace 同样可以给出简单明白的结果,它只和返回进程的系统调用想着的数据。可以让它变的很轻松。
$ strace -e trace=network nc tjs.im 80 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) socket(PF_NETLINK, SOCK_RAW, 0) = 3 bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0 getsockname(3, {sa_family=AF_NETLINK, pid=23806, groups=00000000}, [12]) = 0 sendto(3, "\24\0\0\0\26\0\1\3\376e\327R\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20 recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"D\0\0\0\24\0\2\0\376e\327R\376\\\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 148 recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"@\0\0\0\24\0\2\0\376e\327R\376\\\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 192 recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\376e\327R\376\\\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("109.74.194.20")}, 16) = 0 send(3, "k3\1\0\0\1\0\0\0\0\0\0\3tjs\2im\0\0\34\0\1", 24, MSG_NOSIGNAL) = 24 recvfrom(3, "k3\201\200\0\1\0\0\0\1\0\0\3tjs\2im\0\0\34\0\1\300\f\0\6\0\1\0\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("109.74.194.20")}, [16]) = 83 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("109.74.194.20")}, 16) = 0 send(3, "}D\1\0\0\1\0\0\0\0\0\0\3tjs\2im\7members\6lino"..., 43, MSG_NOSIGNAL) = 43 recvfrom(3, "}D\205\203\0\1\0\0\0\0\0\0\3tjs\2im\7members\6lino"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("109.74.194.20")}, [16]) = 43 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("109.74.194.20")}, 16) = 0 send(3, "1\337\1\0\0\1\0\0\0\0\0\0\3tjs\2im\0\0\1\0\1", 24, MSG_NOSIGNAL) = 24 recvfrom(3, "1\337\201\200\0\1\0\1\0\0\0\0\3tjs\2im\0\0\1\0\1\300\f\0\1\0\1\0\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("109.74.194.20")}, [16]) = 40 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("178.79.161.160")}, 16) = -1 EINPROGRESS (Operation now in progress) getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 = ?
比如服务器上同时开启几个子进程的时候,比如 nginx
strace $(pidof nginx | sed 's/\([0-9]*\)/-p \1/g') Process 1021 attached - interrupt to quit Process 1020 attached - interrupt to quit Process 1019 attached - interrupt to quit Process 1018 attached - interrupt to quit Process 1017 attached - interrupt to quit [pid 1017] rt_sigsuspend([] <unfinished ...> [pid 1018] gettimeofday({1389866457, 845838}, NULL) = 0 [pid 1018] epoll_wait(9, <unfinished ...> [pid 1019] gettimeofday({1389866457, 846204}, NULL) = 0 [pid 1019] epoll_wait(11, <unfinished ...> [pid 1020] gettimeofday({1389866457, 846547}, NULL) = 0 [pid 1020] epoll_wait(13, <unfinished ...> [pid 1021] gettimeofday({1389866457, 846859}, NULL) = 0 [pid 1021] epoll_wait(15, <unfinished ...> [pid 1018] <... epoll_wait resumed> {}, 512, 500) = 0 [pid 1019] <... epoll_wait resumed> {}, 512, 500) = 0 [pid 1018] gettimeofday({1389866458, 347297}, NULL) = 0 [pid 1019] gettimeofday( <unfinished ...> [pid 1018] epoll_wait(9, <unfinished ...> [pid 1019] <... gettimeofday resumed> {1389866458, 347790}, NULL) = 0 [pid 1020] <... epoll_wait resumed> {}, 512, 500) = 0 [pid 1019] epoll_wait(11, <unfinished ...> [pid 1020] gettimeofday({1389866458, 348711}, NULL) = 0 [pid 1020] epoll_wait(13, <unfinished ...> [pid 1018] <... epoll_wait resumed> {}, 512, 500) = 0 ...