Diving deeper: strace

Today I want to talk about a sneaky little Linux command, one of those that are extremely helpful additions in any developer’s toolbox, but that way too often are sadly forgotten. I’ve been using it recently to pinpoint the exact cause of a failing endpoint call in one of the various partitions that make up the KPN API. Later on I will show you how, but first — to refresh our memory a bit — let’s start by going over some basic theory.

The strace command

You don’t need to resort to Google to understand what strace does: you can simply look at it’s man page. I checked it on a Ubuntu server and its description paragraph recites, among the rest:

“It [strace] intercepts and records the system calls which are called by a process and the signals which are received by a process.”

We can feed it a binary and it will follow its execution from start to end outputting all the system calls that are made in the process. So basically it tells us all the inner stuff the kernel is doing when executing the target program. It starts to become clear how useful this can be when it comes down to troubleshooting. But its real power is unleashed when combined with a bunch of options and/or with other executables (mainly grep). Let’s take a look at some examples.

The vanilla command

I invite you to run strace against whatever Linux command you can think of. It will shed some lights on what exactly is going on at a very low level and it is generally useful to get an idea of what are the most used system calls in any program. You will be surprised by how many things are going on under the hood when even the simplest program is running. For example, try to:

strace ls

Such a simple command, that only lists the content of the current directory, is actually doing a lot of things to get its job done! Can you follow along this humongous list of calls? A good exercise would be to go check out the manual page for all the calls that you don’t know. It will generally help you understand the OS you are dealing with every day a bit better.

Following a specific process by its ID

Sometimes you want to follow what happens when a program is already running. In this case you can leverage some powerful strace option to follow a process simply knowing its PID. You just have to run a:

ps aux | grep <your-program-regex-here>

To get the PID and then use it with strace using the -p option!

Ever wondered what Spotify is doing at after you launch it? Just do:

ps aux | grep spotify

and then:

strace -p <PID-from-previous-command>

What about Chrome? You can play around and check any process you’d like. You can always get some interesting insight!

Tracking specific syscalls

For sure, the integral strace output can be daunting. Let’s say you already suspect your executable is failing on a file-writing operation (so on open and read calls maybe). You can use the handy -e option to narrow down the output of strace to only the call(s) that you want to inspect.

strace -e trace=open,read <your_executable>

There are countless filtering options at your disposal: you can trace only a specified set of system calls, you can trace only calls that take a file name as argument (using -e trace=file, which could have been useful as well in the previous example) and many many others. As usual, I invite you to check the man page at the -e section to find out more.

Get reports on execution

Especially good to figure out how memory/CPU expensive a program is, the -c option is your best friend when you want to have an eagle eye look at the overall performance of the target command.

I run it again on the ls command, just as an example. This is the output:

You can see that the main part of the ls execution is spent on allocating memory, opening files and setting the correct access protection on them. Don’t be fooled though, as ls doesn’t open the files in the current directory. In fact, if you run

strace -e open,write ls

you will immediately notice that the only files being opened are Linux libraries to access and get/set permissions on files (like libseLinux and libacl).

Save results to a file

Often the output of strace will be a bit daunting and having the option to access it from a file would make it more handy to analyse. Luckily, there is an option that has us covered: -o. Just run

strace -o strace_out.txt <your_executable>

and you are set! Now you can simply open the generated file with a text editor of your choice!

A real case scenario

In my case, the situation — as mentioned before — was the following: an API call in the KPN infrastructure was supposed to write a file in a specific directory but was failing without reporting any specific error. Because of a misconfiguration, Sentry was not reporting the problem.

What to do? Well, there are a lot of ways to proceed from here, the most obvious one being going to check out the code for the failing endpoint and pinpoint where it might be failing. But before doing such a time-consuming task I wanted to have a quick and accurate overview of where the issue was. So I ran strace. I had to combine a few of the tricks I exposed above in a simple handy command.

First of all I looked up what was the PID of the uwsgi worker that was running the application. Once I figured that out I launched strace to follow it (with the -p option).

The first results were encouraging:

epoll_wait(13, {}, 1, 1000) = 0
wait4(-1, 0x7fffce7ff06c, WNOHANG, NULL) = 0
epoll_wait(13, {}, 1, 1000) = 0
wait4(-1, 0x7fffce7ff06c, WNOHANG, NULL) = 0
epoll_wait(13, {{EPOLLIN, {u32=5, u64=5}}}, 1, 1000) = 1

Basically uwsgi was waiting for something to happen. Cool, that sounded logical. So I needed to make something happen. In another terminal, I reached the failing endpoint to have it run its code. Unfortunately I was seeing no relevant reaction in the strace panel beside some reading from the application source code directory.

What was I doing wrong? I thought that the inner intricacies of uwsgi may have it spawn additional processes to take care of specific tasks. So I decided to add the -f option to the previous command to follow all forked processes that derive from the main vassal. And finally, among all the other information:

open("/internal/directory/structure/filename.csv", O_RDONLY| O_TRUNC |O_CREAT) = -13

There it was. A sys call exiting with code -13 which is just Linux jargon for “Permission Denied”. This endpoint was trying to write a file in a directory where it was not allowed to. The following fix was pretty easy: a simple tweak of the secondary group of the user running the application did the trick, after a quick check with another system administrator to make sure that allowing such application to write there was ok. And voilà, a mysterious problem was solved in a matter of minutes, thanks to strace!

Conclusion

Of course there is much more that you can do if you really want to leverage all the goodies of this powerful command. The ones referenced here are just the options that I use more often and should serve to fire your own curiosity. I invite you to go check out the man page once more to have an idea of everything you can do, it will be time well spent. Happy debugging!

— 
 Follow Elements on Facebook, Twitter and LinkedIn!


Originally published at www.elements.nl on February 13, 2018.