Inspecting rogue PHP processes

Nick Peirson
6 min readSep 21, 2018

--

If you’ve worked with PHP you’ve likely encountered the situation where you’ve had a PHP process running that wasn’t behaving quite as expected. Often this is reproducible and can be investigated in a development environment with Xdebug, or the odd die('here'); if you don’t have Xdebug to hand.

Even if it’s not readily reproducible, you may have a tool hooked up that produces profiles from your production environment (e.g. Tideways, XHProf, Blackfire, NewRelic, etc), allowing you to capture a profile that allows you to determine what’s happening.

What about the situation where the process doesn’t complete? It gets stuck waiting for a remote service to respond, or goes into an unforeseen endless loop in some obscure codepath? The profile never gets saved as the process never completes, so it never shows up in our profiling tools. It only happens once in a blue moon, but the stuck processes subtly build up, consuming your PHP-FPM worker pool or Apache threads, taking your service down at the most inopportune times.

This is where you need to be able to jump in to a running process and find out exactly where it’s got itself stuck.

An oldie but a goodie

So you’ve identified a rogue process and got it’s PID. What now? Well, assuming you’re running on Linux, you can ask the process what system calls it’s making using the strace command.

Lets see how that can help us:

# strace -s128 -p6720
poll([{fd=8, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=8, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=8, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)

So, what happened there? We ran strace telling it to set the maximum string size to 128 (-s128) and to trace the process with pid 6720 (-p6720). Setting the maximum string size isn’t necessary but often makes the output more readable so I tend to include the option by default when running strace.

strace then started outputting all of the system calls that the process is making. Here we can see that the process is calling poll. Assuming you’re not intimately familiar with Linux system calls, a quick google (or man 2 poll) turns up the following:

NAME
poll, ppoll — wait for some event on a file descriptor

SYNOPSIS
#include <poll.h>

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

This looks like what we saw in the output of strace, so we now know that our process is waiting for an event on a file descriptor. Huzzah!

How does that help us? Well, we can use lsof -p6702 to see what file descriptors the process has open, look for file descriptor 8 (fd=8 in the strace output) and see what it’s connected to. This might be a local file, a TCP connection to a database, or some other resource that your script is dependant on. This doesn’t give us a line number to go look at in our PHP code, but this is still a valuable starting point to determining what’s going wrong and should point us in the right direction.

Quite often you will see output in strace that you can directly link back to your code, i.e. a specific database query that’s only called in one place, so you know where to go look in the code to start investigating.

Strace 2.0

strace is a powerful tool, but it can be a bit laborious to wade through the output, googling to decipher system calls, and firing up separate tools to dereference identifiers.

©2014 RenOfSwagzareth (http://fav.me/d7j9hha)

Enter sysdig. sysdig is ‘a simple tool for deep system visibility, with native support for containers’. This is the workhorse that digs into your system to tell you what’s going on. Typically you would run csysdig, which is ‘a simple, intuitive, and fully customizable curses UI for sysdig’, to do realtime analysis, or you would use sysdig to produce a capture (e.g. sudo sysdig -w capture.scap) and view it in the Sysdig Inspect gui tool.

In the context of the above investigation we did in strace, this can be done in csysdig pretty trivially. When you first run it it gives you a list of running process in a similar manner to top. You can then filter to a specific process and dig into it’s system calls, what files it has open, what IO it’s doing and a bunch of other things you may not even have known your process was doing.

It’s by far the most comprehensive tool for investigating errant processes that I’ve used. If you’ve previously used strace to dig into processes I recommend taking csysdig out for a spin, or Sysdig Inspect for those more comfortable in a gui.

Show me the stack trace!

Chances are you can get pretty far with the likes of strace, sysdig, et. al., but maybe you don’t know where to start with system calls, or tools operating in and around the level of the linux kernel are a little daunting, or you do have a handle on them, but are struggling to relate them back to your PHP code.

If that’s the case, then you need phpspy. phpspy is a sampling profiler that will connect to a running PHP process and produce stack traces at every sample interval.

This is a fairly new tool and I did find a few rough edges when using it, but the results were still very impressive. At the time of writing there aren’t any packaged releases or binaries to download, but the process to build it is straightforward and is documented in the repository’s readme file. When I built it it did throw a few compilation warnings, but it happily built and ran.

Once you’ve got it built, give it a try by finding a PHP process and hooking up to it, e.g. phpspy -V 71 -p 6720 -l 5. This runs phpspy telling it to expect PHP version 7.1 (-V 71), to attach to process with PID 6720 (-p 6720), and limit the output to five samples (-l 5).

This will spit out 5 stack traces showing you what the process was doing each time it was sampled. That is basically black magic as far as I’m concerned. Never again will a process be stuck with me left wondering what it was doing in those final moments before I issue a kill -9 to put it out of its misery.

As if that party piece wasn’t enough, you can pipe the output through a couple of perl scripts (bundled with phpspy) to produce flame graphs, e.g. phpspy -V 71 -p 6720 -l 1000 | ./stackcollapse-phpspy.pl | ./flamegraph.pl > flame.svg. This allows you to look at aggregated statistics over time, which can help identify what’s consuming resources if it’s not clear from a few sample stack traces.

What did we learn?

You’ve now got a handle on how to dig into a process that’s behaving strangely, even if you didn’t instrument it with a profiling tool before you started it. This is a valuable debugging technique which can get to the root cause of a stuck process. Going forward, you should be able to avoid techniques that mask these types of issues, like a watchdog script that looks for a long running process getting stuck and restarts it.

It’s a great feeling being able to report to your boss or your client that you’ve identified and fixed the issue, rather than having to go cap in hand each time it happens, not knowing the cause.

This really just scratches the surface of the information you can get back from a running process. There are other tools to investigate, like ltrace that gives you library calls for a process, or dtrace if you’re on a BSD based system, like MacOS for example. (There is also a Linux port of dtrace, but I don’t know what the state of it is.)

Some of these tools may seem a little overwhelming at first but there’s loads of usage examples on the web. Chances are someone has done what you’re trying to do so even if you can’t figure out exactly what arguments you need to feed to strace to get what you want, it shouldn’t be more than a quick search away.

Happy debugging!

--

--

Nick Peirson

Technical excellence in PHP and associated technologies