Debugging using strace

Sometimes you just gotta pop the hood

Aug 1, 2017

If you’re not familiar with strace, this article is for you. strace is one of the best tools for debugging a running program. Because strace works by exposing the system calls a program makes, you need some familiarity with system calls first. So before we get into strace, let’s cover some common ones.

Sys Calls

A syscall is “the fundamental interface between an application and the Linux kernel.” If you check the man page you can see that there are a lot of system calls, but instead of covering them all, I’ll list the more common ones you’ll see.

open(): System call used to open or create a file
read(): System call used to read from a file
write(): System call used to write to a file
connect(): System call used to open connections to other applications/websites/etc.
futex(): System call used to make a program wait until a condition is true, or to implement a lock in memory.

The last one is, in my opinion, pretty important when debugging. Often when a program is hung, it’s because it is waiting on a futex call that cannot be satisfied (and unlike connect or read, the word futex does not suggest at first glance that a program is waiting). I have included the link to each call’s man page if you want to read further. I wanted to introduce these calls before getting into strace because when you go under the hood of an application, all you see are syscalls, and they can point to an issue if deciphered correctly.
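To see these calls in the wild, here is a small sketch that runs any command under strace and prints only the calls from the list above. The names trace_common and SYSCALLS are made up for this example. One assumption worth flagging: on current Linux systems the C library usually implements open() through the openat syscall, so the filter names openat instead of open.

```shell
# Hypothetical helper, not part of strace itself.
# Syscalls from the list above. openat stands in for open because modern
# C libraries usually issue openat when a program calls open().
SYSCALLS="openat,read,write,connect,futex"

# Run a command under strace, printing only the syscalls named in SYSCALLS.
trace_common() {
    if ! command -v strace >/dev/null 2>&1; then
        echo "strace is not installed" >&2
        return 127
    fi
    strace -e trace="$SYSCALLS" "$@"
}
```

Running something like trace_common cat /etc/hostname should show, among the library-loading noise, an openat on the file, a read of its contents, and a write to the terminal.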

Strace

strace is the tool I like to reach for first when working with a troublesome process. It gives you a lot of information and can point you in the right direction, whether the issue is a file that can’t be read or written, a connection problem, or a memory issue. Me yapping about it won’t help you much, so I want to walk through some use cases that show how strace works.

A quick warning

Before we explore strace I want to talk about some drawbacks. There are many commands that should not be run in production, and this is one of them. There are two reasons: process pausing and performance. strace uses ptrace() to attach to the process, which pauses the process not only when you first attach but also at every system call. That pausing adds tremendous overhead for the application, which leads to my second point: performance suffers.

I want to demonstrate the performance impact strace can have on a process by running a quick test.

As you can see, dd completed in 0.268796 seconds.

Now running it with strace (tracing only a call that never happens, so nothing gets printed), we can see it took 13.7205 seconds. That’s about 51 times slower, or roughly a 5,000% increase! Needless to say, this would be devastating for any application. This is a worst-case scenario, but I wanted to show you the impact strace can have on a running application, and why you should only run it against an application that already has a problem. If you would like some further reading about performance, check out this article.
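The exact dd command is not reproduced here, so the following is only a sketch of how such a comparison could be run. The dd parameters and the choice of accept as the "call that won’t happen" are assumptions, and compare_overhead and slowdown are made-up names. The slowdown helper simply does the arithmetic on the two timings quoted above.

```shell
# Hypothetical benchmark; the dd parameters are assumptions, not the
# command used for the numbers in the text.
compare_overhead() {
    # Baseline: one-byte reads and writes, so almost all time is syscalls.
    dd if=/dev/zero of=/dev/null bs=1 count=500000

    # Same workload under strace. Filtering on accept (which dd never calls)
    # prints nothing, yet by default every syscall still stops for strace.
    strace -e trace=accept dd if=/dev/zero of=/dev/null bs=1 count=500000
}

# Express two timings (in seconds) as a slowdown factor and percent increase.
slowdown() {
    awk -v base="$1" -v traced="$2" 'BEGIN {
        printf "%.1fx slower, %.0f%% increase\n", traced / base, (traced - base) / base * 100
    }'
}

slowdown 0.268796 13.7205   # prints "51.0x slower, 5004% increase"
```

Call compare_overhead on a test machine (not in production) and feed the two times dd reports into slowdown to get figures for your own system.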

Pinning strace to a running process

One of the most practical uses of strace is pinning (attaching) it to the problem process. I’d say 90% of the time I’m using strace, I’m using this method. To start, you need the PID of the process that you want to trace. Below is a one-liner I found that greps for the process name and passes every matching PID to strace, so you don’t have to waste time finding the PID yourself.

ps auxw | grep <process name> | awk '{print " -p " $2}' | xargs strace -o /tmp/trace.txt

One thing to note: make sure that you do not strace your own grep command, since it also shows up in the ps output. You can use a simple regex trick to get around this. Instead of grep /bin/nginx, write grep "/bin/[n]ginx". The pattern still matches /bin/nginx, but the grep process’s own command line now contains the literal text /bin/[n]ginx, which the pattern does not match, so grep no longer shows up in your results.
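The bracket trick is easy to get wrong by hand, so here is a minimal sketch that builds the pattern for you. The helper names bracket_pattern and pid_args are made up for this example, and the last line runs the pattern against two canned lines rather than real ps output.

```shell
# Hypothetical helpers; the names are made up for this sketch.

# Wrap the first character of a name in brackets: nginx -> [n]ginx.
bracket_pattern() {
    printf '[%s]%s\n' "$(printf '%s' "$1" | cut -c1)" "$(printf '%s' "$1" | cut -c2-)"
}

# Print "-p PID" for every process whose ps line matches the name,
# without matching the grep that does the searching.
pid_args() {
    ps auxw | grep "$(bracket_pattern "$1")" | awk '{print "-p", $2}'
}

# Demonstration on canned lines instead of real ps output:
# only the second line contains the text "nginx" uninterrupted.
printf '%s\n' 'grep [n]ginx' 'nginx: master process' | grep "$(bracket_pattern nginx)"
```

With these in place, pid_args nginx | xargs strace -o /tmp/trace.txt does the same job as the one-liner above. If pgrep is installed, it is another way to look up PIDs by name.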

What’s taking so long!

Another useful application of strace is seeing what a process is spending all its time on. You can use the same command as above, but add the -c flag, which makes strace count calls, errors, and time per system call instead of printing each call. Your command will look like this.

ps auxw | grep <process name> | awk '{print " -p " $2}' | xargs strace -c -o /tmp/trace.txt

When you feel like you’ve run it long enough, go ahead and interrupt it (Ctrl-C) and check out the output file. It will look similar to this.

Well, almost similar. This output is for the ls command.
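The summary that -c writes is a table with the columns % time, seconds, usecs/call, calls, errors, and syscall, sorted by time by default. If you already have a summary file and would rather rank by number of calls, a small filter does it. This is only a sketch: top_syscalls is a made-up name, and it assumes the usual layout where calls is the fourth column and the syscall name is the last.

```shell
# Hypothetical helper: list the busiest syscalls from a "strace -c -o FILE" summary.
# $1 = summary file, $2 = how many rows to show (default 5).
top_syscalls() {
    # Keep data rows only: first field is numeric and the row is not the total.
    awk '$1 ~ /^[0-9.]+$/ && $NF != "total" { print $4, $NF }' "$1" |
        sort -rn | head -n "${2:-5}"
}
```

Something like strace -c -o /tmp/ls.txt ls followed by top_syscalls /tmp/ls.txt 3 would print the three most frequent calls with their counts. strace can also sort the summary itself if you pass -S calls.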

Not another connection issue

Lastly, I want to cover how you can debug connectivity issues using strace. While there are plenty of tools out there that allow you to do this, I like using strace because it’s clean and there is little to no fluff. I’ve set up a VM that cannot connect to the internet. Let’s debug it and see why. First we’ll run strace on nc to see what’s going on when it connects out to the internet. Using the -e flag, we can specify which system calls we want to see.

strace -e poll,connect,select,recvfrom,sendto -o trace.txt nc yahoo.com 80

Our output will look like this:

First we see the connect call trying to connect to /var/run/nscd/socket. This is the Name Service Cache Daemon, which caches name lookups for services such as LDAP or NIS (also known as YP). After trying this twice and failing, it moves on to DNS (htons(53) means it’s connecting out on port 53, which is DNS). However, we can see that it retries this and fails too. That suggests a network configuration problem, for example something wrong with our ifcfg-eth0 file. In this case, I just never turned on eth0 :)
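When a trace gets long, it can help to pull out just the connect calls that failed. The sketch below does that with grep. failed_connects is a made-up helper name, and it assumes the trace was written to a file with -o as in the command above. It skips EINPROGRESS, which is the normal result when connect is called on a non-blocking socket rather than a real failure.

```shell
# Hypothetical helper: show connect calls that returned an error.
# $1 = trace file written by "strace -o".
failed_connects() {
    # strace reports a failed syscall as "= -1 ERRNAME (description)".
    grep 'connect(' "$1" | grep -e '= -1 ' | grep -v 'EINPROGRESS'
}
```

Running failed_connects trace.txt on the trace from this section should list the nscd and DNS attempts described above and nothing else.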

After quickly running ifup eth0 and rerunning strace, we can see there is a big difference!

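We can see that after trying NSCD we successfully connect to DNS. It sends out a DNS query for yahoo.com, visible in the poll and sendto calls, and gets the answer in a corresponding recvfrom call. To confirm that the connection to the web server got under way, we can look for EINPROGRESS in the trace. EINPROGRESS is an error code returned by connect rather than a call of its own: on a non-blocking socket it signals that the process was not blocked, and that the connection is being set up while the process continues. This can be seen here: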

Close()

To summarize, strace is a powerful tool that we can use to debug troublesome processes by reading their system calls. This is great in cases where we have programs running out of control, or where we cannot connect out to the internet. You must be mindful, though, not to run it casually on processes in production, as you can seriously hinder the performance of the application you’re tracing. All in all, this is a great tool to keep up your sleeve.