Blazing fast Python

Profiling Python applications using Pyflame

Dane Hillard
Dec 19, 2018 · 5 min read

Perhaps you’ve faced the fortunate challenge of scaling up a Python application to accommodate a steadily increasing user base. Though most cloud hosting providers make it easier than ever to throw more hardware at a problem, there comes a point when the cost outweighs convenience.

Around the time scaling horizontally starts looking less attractive, developers turn to performance tuning to make applications more efficient. The Python community offers a number of tools in this arena: from the built-in timeit module to profilers like cProfile, there are quick ways to measure the difference between a particular piece of code and its alternatives.
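
For a quick sense of how this works, here is a minimal timeit sketch comparing two hypothetical ways of building the same list; the functions are placeholders for whatever alternatives you're weighing:

import timeit

def with_loop():
    result = []
    for n in range(1000):
        result.append(n * 2)
    return result

def with_comprehension():
    return [n * 2 for n in range(1000)]

# timeit.timeit accepts a callable and runs it `number` times,
# returning the total elapsed time in seconds.
print("loop:          ", timeit.timeit(with_loop, number=10_000))
print("comprehension: ", timeit.timeit(with_comprehension, number=10_000))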

Although profiling tools help you see important information about which calls in your application are time-consuming, it's difficult to exercise an application during local development the same way your users exercise it in real life. The solution to bridging this gap? Profile in production!

Pyflame

Pyflame is a profiling tool from Uber that uses ptrace to sample the call stack of a running Python process, which means you can profile an application in production without modifying its code or restarting it. After running Pyflame and a bit of post-processing you'll be able to generate visualizations like this:

[Flame graph visualization. The y-axis is the call stack; the x-axis represents the proportion of CPU time spent in a particular call.]

But how?

First, clone and build Pyflame from source:

$ git clone https://github.com/uber/pyflame
$ cd pyflame/
$ ./autogen.sh
$ ./configure
$ make

The pyflame executable can be found in the src directory after the build.

Now that you've built the tool, it's ready to analyze your Python application. Start by snagging the process ID (PID) of the application you're interested in; for a Django application, for instance, this would probably be the PID of one of the WSGI workers serving the app (pgrep -f can help you find it). Once you have a PID handy, you'll run pyflame with options for:

  1. How long to sample the process, in seconds
  2. How often to sample the stack of the process, in seconds
  3. Removing the idle time of the application from the data to simplify the resulting visualization
  4. The name of the output file
  5. The PID of the process

This will look something like

$ ./src/pyflame -s 300 -r 0.01 -x -o my-app-20181212.samp -p 9876

Pyflame will sample the process with PID 9876 every ten milliseconds for five minutes, producing the data file my-app-20181212.samp. Reading the raw file yourself isn't especially enlightening, but fortunately it's in just the right format for FlameGraph.
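
For reference, FlameGraph's input is one call stack per line: semicolon-separated frames followed by a count of how many samples landed in that stack. A purely illustrative line (the frame names here are hypothetical) might look like:

handle_request;render_page;fetch_settings 412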

FlameGraph

Brendan Gregg's FlameGraph turns those folded stack samples into an interactive SVG:

$ git clone https://github.com/brendangregg/FlameGraph
$ cd FlameGraph/
$ cat /path/to/my-app-20181212.samp | ./flamegraph.pl > my-app.svg

Open up the my-app.svg file in a browser and you’ll see something resembling the screenshot from earlier. You can hover over any segment to see what call it is and how much of the CPU time it represents. Clicking on a segment zooms into it so that you only see it and its children. Most usefully, you can perform a regex-capable search using the “Search” button in the top right. Searching for a particular pattern will highlight any matches:

[Flame graph with search results highlighted. Pink segments match the search query.]

By exploring a flame graph, you can more easily find slow and common calls in your application that aren’t always obvious from poring over loads of raw text. And by identifying these calls, you can start to understand how you might effectively pare things down into a leaner application.

The pink segments above represent instances of a particular method call within our application; the call isn't particularly expensive, but it occurs many times on nearly every request. We realized that although we had code that appeared to cache these calls, its TTL was set to only one second. Raising the TTL to ten seconds reduced how often the call occurred, which the "after" graph confirmed:

[Flame graph after the change. Less pink means less overall CPU time spent on matching calls.]
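
For readers unfamiliar with this style of caching, here is a minimal, hypothetical sketch of a TTL cache decorator. This is not our actual code (the names are invented); it just illustrates why the TTL value matters:

import functools
import time

def ttl_cache(ttl_seconds):
    """Memoize a function's results, expiring each entry after ttl_seconds."""
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None:
                value, stored_at = hit
                if now - stored_at < ttl_seconds:
                    return value  # still fresh; skip the underlying call
            value = func(*args)
            cache[args] = (value, now)
            return value
        return wrapper
    return decorator

# With a one-second TTL, nearly every request paid for the underlying call;
# a ten-second TTL lets many more requests share the cached result.
@ttl_cache(ttl_seconds=10)
def lookup_config(key):
    ...  # stand-in for the repeated call highlighted in the flame graph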

So what?

[Monitoring graph: requests/minute in yellow, CPU load in green. The big dip at 10:40 is an artifact between deployments.]

This drop in CPU load translated to increased throughput and a significant drop in response time:

[Graphs of throughput and response time before and after the change.]

Before this change, our application was largely CPU bound. We had tuned Apache to throttle connections in tandem with this behavior so that instances weren't overwhelmed. After the change, there was enough headroom to reduce the application's footprint by two server instances. We're now connection bound and can look into opening up that throttling, which could allow us to shed another instance or two. Across all our applications, especially those requiring significant computing resources, these kinds of optimizations can add up to thousands of dollars in savings annually.

Pyflame is helpful not just for identifying optimization opportunities but also for confirming hypotheses. Optimizing calls that appear frequently or prominently in the flame graph can yield tangible benefits, which ultimately reduce cost. I recommend adding this tool to your arsenal for performance testing and monitoring.


To see more about flame graphs, Pyflame, and profiling tools, check out Brendan Gregg's flame graph resources and the Pyflame documentation.
