I’m used to debugging issues with logs or metrics when they are presented to me on a lovely dashboard with an intuitive UI. However, if for some reason the dashboard isn’t being populated or a particular service’s logs are unavailable, debugging gets trickier. These instances are usually few and far between, but they do happen, and being familiar with tools to debug what’s happening to a process on a host is pretty valuable during those times.
When I’m debugging something that the logs or metrics aren’t surfacing, I ssh into hosts. Of course, this isn’t scalable or elegant or any of the myriad things the internet has told us, but for ad-hoc analysis, this works surprisingly well for me.
Just like, you know, with print statements and debugging.
Let me make it very clear right now that I’m not an SRE or an Operations engineer. I’m primarily a developer who also happens to deploy the code I write and debug it when things go wrong. As often as not, when I’m on a host I’ve never been on before, the hardest thing for me is finding things. Like, for instance, what port is a process listening on? Or more importantly, what file descriptor is a particular daemon logging to? And even when I do manage to find answers to these questions by dint of a mix of ls and lots and lots of wishful grepping, many a time the “answers” I get surface zero information or are just plain incorrect.
If this were a talk by Raymond Hettinger, the core CPython developer, this would be the moment when the audience would be expected to say there must be a better way.
And there is.
What’s become my go-to tool for finding things is a pretty nifty tool called lsof. lsof (pronounced el-soff, though some tend to be partial towards liss-off or just el-es-o-eff) is an incredibly useful command that lists all open files.
lsof is great for finding things because in Unix everything is a file.
lsof is an astonishingly versatile debugging tool that can quite easily replace netstat and friends in one’s workflow.
Options … an embarrassment of riches
A veteran SRE who had been SREing for decades before the term “SRE” was even coined once told me — “I stopped learning options for lsof once I had what I needed. Learn the most important ones and that’s all you’ll mostly ever need.”
lsof comes with an extensive list of options.
```
NAME
       lsof - list open files

SYNOPSIS
       lsof [ -?abChKlnNOPRtUvVX ] [ -A A ] [ -c c ] [ +c c ] [ +|-d d ]
            [ +|-D D ] [ +|-e s ] [ +|-f [cfgGn] ] [ -F [f] ] [ -g [s] ]
            [ -i [i] ] [ -k k ] [ +|-L [l] ] [ +|-m m ] [ +|-M ] [ -o [o] ]
            [ -p s ] [ +|-r [t[m<fmt>]] ] [ -s [p:s] ] [ -S [t] ] [ -T [t] ]
            [ -u s ] [ +|-w ] [ -x [fl] ] [ -z [z] ] [ -Z [Z] ] [ -- ] [names]
```
The man page would be the best reference if you’re interested in what each option does. The ones I most commonly use are the following:
-u — This lists all files opened by a specific user. The following example counts the files held open by the user cindy:

```
cindy@ubuntu:~$ lsof -u cindy | wc -l
248
```
In general, when a value is preceded by a ^, it implies a negation. So if we want to know the number of files on this host opened by all users except cindy:

```
cindy@ubuntu:~$ lsof -u^cindy | wc -l
38193
```
-U — This option selects all Unix Domain Socket files.
```
cindy@ubuntu:~$ lsof -U | head -5
COMMAND PID USER   FD   TYPE             DEVICE SIZE/OFF  NODE NAME
init      1 root    7u  unix 0xffff88086a171f80      0t0 24598 @/com/ubuntu/upstart
init      1 root    9u  unix 0xffff88046a22b480      0t0 22701 socket
init      1 root   10u  unix 0xffff88086a351180      0t0 39003 @/com/ubuntu/upstart
init      1 root   11u  unix 0xffff880469006580      0t0 16510 @/com/ubuntu/upstart
```
-c — This lists all files held open by processes whose executing command begins with the characters of c. For example, to see the first 15 files held open by all Python processes running on a given host:

```
cindy@ubuntu:~$ lsof -cpython | head -15
COMMAND     PID USER  FD  TYPE DEVICE SIZE/OFF       NODE NAME
python2.7 16905 root  cwd  DIR    9,1     4096  271589387 /home/cindy/sourcebox
python2.7 16905 root  rtd  DIR    9,1     4096       2048 /
python2.7 16905 root  txt  REG    9,1  3345416  268757001 /usr/bin/python2.7
python2.7 16905 root  mem  REG    9,1    11152 1610852447 /usr/lib/python2.7/lib-dynload/resource.x86_64-linux-gnu.so
python2.7 16905 root  mem  REG    9,1   101240 1610899495 /lib/x86_64-linux-gnu/libresolv-2.19.so
python2.7 16905 root  mem  REG    9,1    22952 1610899509 /lib/x86_64-linux-gnu/libnss_dns-2.19.so
python2.7 16905 root  mem  REG    9,1    47712 1610899515 /lib/x86_64-linux-gnu/libnss_files-2.19.so
python2.7 16905 root  mem  REG    9,1    33448 1610852462 /usr/lib/python2.7/lib-dynload/_multiprocessing.x86_64-linux-gnu.so
python2.7 16905 root  mem  REG    9,1    54064 1610852477 /usr/lib/python2.7/lib-dynload/_json.x86_64-linux-gnu.so
python2.7 16905 root  mem  REG    9,1    18936 1610619044 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
python2.7 16905 root  mem  REG    9,1    30944 1207967802 /usr/lib/x86_64-linux-gnu/libffi.so.6.0.1
python2.7 16905 root  mem  REG    9,1   136232 1610852472 /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
python2.7 16905 root  mem  REG    9,1    77752 1610852454 /usr/lib/python2.7/lib-dynload/parser.x86_64-linux-gnu.so
python2.7 16905 root  mem  REG    9,1   387256 1610620979 /lib/x86_64-linux-gnu/libssl.so.1.0.0
```
More interestingly, if you have a bunch of Python 2.7 and Python 3.6 processes running on a host, you can find the list of files held open by the non-Python 2.7 processes:

```
cindy@ubuntu:~$ lsof -cpython -c^python2.7 | head -10
COMMAND   PID USER  FD  TYPE DEVICE SIZE/OFF       NODE NAME
python  20017 root  cwd  DIR    9,1     4096       2048 /
python  20017 root  rtd  DIR    9,1     4096       2048 /
python  20017 root  txt  REG    9,1  3345416  268757001 /usr/bin/python2.7
python  20017 root  mem  REG    9,1    11152 1610852447 /usr/lib/python2.7/lib-dynload/resource.x86_64-linux-gnu.so
python  20017 root  mem  REG    9,1     6256  805552236 /usr/lib/python2.7/dist-packages/_psutil_posix.x86_64-linux-gnu.so
python  20017 root  mem  REG    9,1    14768  805552237 /usr/lib/python2.7/dist-packages/_psutil_linux.x86_64-linux-gnu.so
python  20017 root  mem  REG    9,1    10592  805451779 /usr/lib/python2.7/dist-packages/Crypto/Util/strxor.x86_64-linux-gnu.so
python  20017 root  mem  REG    9,1    11176 1744859170 /usr/lib/python2.7/dist-packages/Crypto/Cipher/_ARC4.x86_64-linux-gnu.so
python  20017 root  mem  REG    9,1    23560 1744859162 /usr/lib/python2.7/dist-packages/Crypto/Cipher/_Blowfish.x86_64-linux-gnu.so
```
+d — This helps you search for all open instances of a directory and its top-level files and directories.

```
cindy@ubuntu:~$ lsof +d /usr/bin | head -4
COMMAND   PID USER  FD TYPE DEVICE SIZE/OFF      NODE NAME
circusd  1351 root txt  REG    9,1  3345416 268757001 /usr/bin/python2.7
docker   1363 root txt  REG    9,1 19605520 270753792 /usr/bin/docker
runsvdir 1597 root txt  REG    9,1    17144 272310314 /usr/bin/runsvdir
```
-d — By far the option I most commonly use, next only to -p. This option specifies a list of comma-separated file descriptors to include or exclude. From the docs:
The list is an exclusion list if all entries of the set begin with '^'. It is an inclusion list if no entry begins with '^'. Mixed lists are not permitted.

A file descriptor number range may be in the set as long as neither member is empty, both members are numbers, and the ending member is larger than the starting one - e.g., ''0-7'' or ''3-10''. Ranges may be specified for exclusion if they have the '^' prefix - e.g., ''^0-7'' excludes all file descriptors 0 through 7.

Multiple file descriptor numbers are joined in a single ORed set before participating in AND option selection.

When there are exclusion and inclusion members in the set, lsof reports them as errors and exits with a non-zero return code.
-p — I can’t recall a time I’ve used lsof without this option, which lists all files held open by a given PID.
On Ubuntu, to find out all files held open by, say, pid 1, you’d run lsof -p 1. The output on my MacBook Air looks a little different, but the command is the same.
-P — This option inhibits the conversion of port numbers to port names for network files. It is also useful when port name lookup is not working properly.
This can be used in combination with another option, -n, which inhibits the conversion of network numbers to host names for network files and is likewise useful when host name lookup is not working properly. Inhibiting both conversions can sometimes make lsof run faster.
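Put together, that looks like this (a sketch; -i on its own selects all network files):

```shell
# All network files, with numeric ports (-P) and numeric hosts (-n).
# Handy when DNS is slow or broken, and often noticeably faster.
lsof -i -P -n
```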
-i — This option selects the listing of files any of whose Internet address matches the address specified in i. If no address is specified, this option selects the listing of all Internet and network files.
With lsof, one can, for instance, look at the TCP connections your Slack or Dropbox client has open. For fun, try seeing how many connections your Chrome tabs (each tab is a standalone process) have open.

```
lsof -i -a -u $USER | grep Slack
```

Swapping Slack for Dropbox in the grep above shows the TCP sockets opened by your local Dropbox client. lsof also allows one to look at open UDP connections with lsof -iUDP, while lsof -i 6 will get you the list of open IPv6 connections.
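The -i argument is quite expressive: IP version, protocol, host, and port can all be combined (a sketch; the ports here are arbitrary examples):

```shell
# The address form is [46][protocol][@hostname|hostaddr][:service|port].
lsof -iUDP        # all open UDP sockets
lsof -i 4         # IPv4 network files only
lsof -iTCP:22     # TCP sockets on port 22 (hypothetically, sshd)
lsof -i4UDP:53    # IPv4 UDP on port 53 (hypothetically, a local resolver)
```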
-t — This option suppresses all other information except the process IDs, and I often use it when I want to pipe the PIDs to some other command.

```
cindy@ubuntu:~$ lsof -t /var/log/dummy_svc.log
1235
```
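The usual reason I want bare PIDs is to feed them straight to another command. Here is a sketch that simulates a daemon with tail -f (the paths are made up):

```shell
# Simulate a daemon that keeps a log file open.
touch /tmp/dummy_svc.log
tail -f /tmp/dummy_svc.log &
sleep 1

# Find whoever has the file open and signal it, in one line.
kill $(lsof -t /tmp/dummy_svc.log)
```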
Generally, lsof will OR the results when more than one option is used. Specifying the -a option will give you a logical AND of the results instead. Of course, there are several exceptions to this rule and, again, the man page is your friend here, but the TL;DR is:
Normally list options that are specifically stated are ORed - i.e., specifying the -i option without an address and the -u foo option produces a listing of all network files OR files belonging to processes owned by user ''foo''. The exceptions are:

1. the '^' (negated) login name or user ID (UID), specified with the -u option;
2. the '^' (negated) process ID (PID), specified with the -p option;
3. the '^' (negated) process group ID (PGID), specified with the -g option;
4. the '^' (negated) command, specified with the -c option;
5. the '^' (negated) TCP or UDP protocol state names, specified with the -s [p:s] option.

Since they represent exclusions, they are applied without ORing or ANDing and take effect before any other selection criteria are applied.

The -a option may be used to AND the selections. For example, specifying -a, -U, and -u foo produces a listing of only UNIX socket files that belong to processes owned by user ''foo''.
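Concretely, the difference between OR and AND with the same two selectors looks like this (a sketch using the current user):

```shell
# Default (OR): all network files PLUS all files opened by $USER.
lsof -i -u "$USER" | wc -l

# With -a (AND): only network files that belong to $USER.
# This count can never exceed the one above.
lsof -a -i -u "$USER" | wc -l
```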
A war story … of sorts
OK, so I’m stretching the truth here by calling this a “war story”, but it was still a time when lsof came in handy.
A couple of weeks ago, I had to stand up a single instance of a new service in a test environment. The test service in question wasn’t hooked up to the production monitoring infrastructure. I was trying to debug why a freshly launched process wasn’t registering itself with Consul and was therefore undiscoverable by another service. Now I don’t know about you, but if something isn’t working as expected, I look at the logs of the service I’m trying to debug, and in most cases the logs point to the root cause right away.
The service in question was being run under circus, a socket and process manager. Logs for processes run under circus are stored in a specific location on the host — let’s call it /var/log/circusd. Newer services on the host run under a different process manager called s6, which logs to a different namespace. Then there are the logs generated by socklog/svlogd, which again live somewhere else. In short, there’s no dearth of logs, and the problem was simply to find which file my crashing process was logging to.
Since I knew the process I was trying to debug was running under circus, tailing /var/log/circusd/whatever_tab_completion_suggested would allow me to look at the stdout and stderr streams for this process. Except tailing the logs showed absolutely nothing. It quickly became evident I was looking at the wrong log file, and sure enough, upon closer inspection, there were two files under /var/log/circusd, one called stage-svcname-stderr.log and the other called staging-svcname.stderr.log, and tab completion was picking the wrong one.
One way to see which file the process in question was actually logging to is to run lsof filename, which displays all the processes that have an open file descriptor to that file. It turned out no running process was holding on to the log file I was tailing — which meant it was safe for deletion. Tailing the other file immediately showed me why the process was crashing (and circus was restarting it after each crash, leading to a crash loop).
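Recreating the situation makes for a neat demo (hypothetical paths standing in for the real ones, with tail -f playing the part of circus):

```shell
# Two confusingly similar log files; only one is actually in use.
touch /tmp/stage-svcname-stderr.log /tmp/staging-svcname.stderr.log
tail -f /tmp/staging-svcname.stderr.log &
sleep 1

lsof /tmp/staging-svcname.stderr.log  # shows the tail process holding it open
lsof /tmp/stage-svcname-stderr.log    # prints nothing and exits non-zero:
                                      # no process has it open, safe to delete
```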
The more I use lsof, the more it replaces a bunch of other tools in my workflow and surfaces more actionable information. A far more interesting post would be one on how lsof works internally — but that post is a WIP right now.