API Profiling at Pinterest
Anika Mukherji | API Intern
When I walked into Pinterest on the first day of my internship and learned I’d be focusing on profiling the API Gateway service (the core backend service of the Pinterest product), my only thought was “What is profiling?” Profiling is often shoved aside as a side project or lower-priority concern, and it’s often not taught in college CS courses. Essentially, writing services comes first, and profiling them is a distant second (if it happens at all). Moreover, profiling is not always treated as a precursor to optimization, which can lead to wasted time optimizing code that doesn’t significantly affect performance in production. Despite all that, profiling is a critical step in the software development process for building a truly performant system.
Before my arrival at Pinterest, a basic webapp had been built to accompany a regularly scheduled CPU profiling job (and consequent flamegraph generation) for all of our Node and Python services. My primary goal for the summer was to expand this tool to support our API Gateway service while keeping it flexible enough to support other services in the future, with the ultimate goal of profiling all Pinterest services. The three arms of functionality I worked on were memory profiling, endpoint operational cost calculation, and dead code detection.
Solving for increased optimization
I primarily worked on optimizations, including expanding resource tracking and profiling tooling. In terms of performance in production, our evaluation of resource utilization for the API Gateway service was limited to CPU usage. There was a need for a holistic assessment of which parts of the API Gateway service were performant, and which parts of the codebase needed quality improvement. With that information, developer resources could be allocated to the least performant endpoints, and we could improve the overall process of optimization.
What exactly is profiling?
Software profiling is a type of dynamic program analysis that aims to facilitate optimization by collecting statistics associated with the execution of the software. Common profiling measurements include CPU usage, memory usage, and the frequency of function calls. Essentially, profiling scripts execute in tandem with the program being profiled for a certain duration (or for its entire execution), and they output a profile (i.e., a summary) of the relevant statistics afterwards. The recorded metrics can then be used to evaluate and analyze how the program behaves.
There are two common types of approaches to profiling:
Event-Based Profiling:
- Track all occurrences of certain events (such as function calls, returns, and thrown exceptions)
- Deterministic (more accurate)
- Heavy overhead (slower, more likely to impact profiled process)
- Example Python packages include: cProfile/profile, pstats, line_profiler
Statistical Profiling:
- Sample data by probing call stack periodically
- Non-deterministic (less accurate, though sampling noise can be reduced by profiling longer or more often)
- Low overhead (faster, less likely to impact profiled process)
- Example Python packages include: vmprof, tracemalloc, statprof, pyflame
We opted for statistical profiling on our production machines because of its lower overhead. Because the job runs regularly over long periods of time, accuracy improves without the heavy overhead that would add response latency. Profiling is important, but it should never harm production performance.
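For context, here is a minimal sketch of the event-based approach using the standard-library cProfile and pstats modules (the profiled function is just a stand-in for real application code). The statistical packages listed above instead sample the call stack from outside the hot path, which is what keeps their overhead low.

```python
# Minimal sketch of event-based profiling with cProfile/pstats.
# handle_request() is a stand-in for real application code.
import cProfile
import pstats

def handle_request():
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()      # record every function call and return
handle_request()
profiler.disable()

# Print the ten most expensive entries by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```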
Memory profiling
TL;DR: tracemalloc to track memory blocks
Our API Gateway service is written in Python, so the most apparent solution was to use an existing Python package to gather memory stack traces. Python 3’s tracemalloc package was the most appealing, with one large problem: we still use Python 2.7. While our Python 3 migration is underway, it’ll be many months until that project is completed. This incompatibility forced us to patch and distribute our own copy of Python, in addition to using the backported pytracemalloc package. It’s just another reminder that updating to the latest version of Python is ideal for both performance and access to the latest tooling.
The basic approach here was to run a script on a remote node (one of our API production hosts) that sends signals 15 minutes apart that trigger signal handlers (functions registered to execute when a certain signal arrives).
Signals were a fitting choice because they don’t add any overhead when not running the signal handler and because we don’t want to enable profiling all the time on all the machines. (Even a 0.1% overhead at scale is expensive.) We decided to overload the SIGRTMIN+N signals to start and stop the profiling job on a received signal. The stack traces are collected and saved to a temporary file within /tmp/. Another script is run on the remote host to produce a flamegraph, and then all files are saved to a persistent datastore and sourced by our Profiler webapp.
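In case it’s useful, here’s a minimal sketch of that handler registration using the tracemalloc API; the signal offsets, frame depth, and snapshot path below are placeholders rather than our actual configuration.

```python
# Sketch: toggle memory profiling on a running process via real-time signals.
# Signal offsets, frame depth, and output path are illustrative only.
import signal
import tracemalloc

SNAPSHOT_PATH = "/tmp/memory_profile.snapshot"  # hypothetical temp file

def start_profiling(signum, frame):
    # Begin recording allocation stack traces (25 frames deep is an assumption).
    tracemalloc.start(25)

def stop_profiling(signum, frame):
    # Dump the collected stack traces, then stop tracing to drop the overhead.
    tracemalloc.take_snapshot().dump(SNAPSHOT_PATH)
    tracemalloc.stop()

# Overload two SIGRTMIN+N signals to start and stop the profiling job.
signal.signal(signal.SIGRTMIN + 4, start_profiling)
signal.signal(signal.SIGRTMIN + 5, stop_profiling)
```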
Operational cost calculations
TL;DR: Finding the expensive endpoints (and their owners!)
The calculation of endpoint operational costs required combining two kinds of data: resource utilization and request metrics. Our resource utilization information is reported monthly in two units: USD and instance hours.
Using request counts, the relative popularity of each endpoint can be calculated. This popularity is used as a weight to divide the total resources used by the API Gateway service. Since most of our request data is in units of requests per minute, I decided to break cost down to that time scale as well. And since each API endpoint has an owner, average operational costs for each owning team are also calculated.
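As a rough illustration (the function, endpoint names, and fields here are hypothetical, not our actual pipeline), the allocation boils down to a weighted split of the monthly bill by request share:

```python
# Hypothetical sketch: split a monthly resource bill across endpoints
# in proportion to each endpoint's share of request traffic.
def allocate_endpoint_costs(requests_per_endpoint, monthly_cost_usd,
                            minutes_per_month=30 * 24 * 60):
    total_requests = sum(requests_per_endpoint.values())
    costs = {}
    for endpoint, count in requests_per_endpoint.items():
        share = count / total_requests            # endpoint popularity as a weight
        monthly_usd = monthly_cost_usd * share
        costs[endpoint] = {
            "monthly_usd": monthly_usd,
            "usd_per_minute": monthly_usd / minutes_per_month,
        }
    return costs

# Example: two hypothetical endpoints sharing a $10,000/month bill.
print(allocate_endpoint_costs({"/v3/users": 900_000, "/v3/pins": 100_000}, 10_000))
```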
The ability to identify the most costly endpoints, as well as the engineers and teams who own them, encourages ownership and proactive monitoring of their performance. It’s important to note that these calculated metrics aren’t absolute sources of truth; their significance lies in how they compare to one another. The main objective is identifying underperforming outliers, not quantifying exact monetary impact.
This approach is naïve in that it doesn’t properly account for CPU time or make distinctions between costly handlers (endpoint-specific functions in the API Gateway) and costly requests. For example: requests can trigger asynchronous tasks that aren’t necessarily attributed to the API Gateway service; the same endpoint can have a different cost structure depending on its parameters (as can different handlers); and downstream service processing isn’t associated with a given API request.
We could address these deficiencies by creating an integration test rig that runs a set of known, production-like requests and measures CPU time spent relative to the application’s baseline. We could further maximize the impact of this by incorporating it into our continuous integration process, giving developers key insights into the performance impact of their code changes. Additionally, tracing requests via a Request-ID would enable more holistic coverage of our overall architecture.
Dead code detection
TL;DR: Uncovering abandoned code (and deleting it)
Unused and unowned code is a problem. Old experiments, old tests, old files, etc. can rapidly clutter repositories and binaries if they’re able to fly under the radar. Discovering which lines of which files are never executed in production is both useful and easily actionable. In pursuit of identifying this dead code hiding in our service, I employed a standard Python test coverage tool.
While the primary use of a test coverage tool is to discover which lines of code are missed by unit and integration tests, we run a recurring job that executes the same tool on a randomly selected production machine to see which lines of code are “missed” in production. Because the job runs several times a day, the lines missed in every run for a given day are surfaced. An annotated version of each file makes it easy to see which lines are “dead” and whom to contact about whether the code should be removed.
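As a point of reference, here is a minimal sketch of driving a coverage tool in-process, assuming coverage.py (one standard choice); the data file path and the fixed sampling window are placeholders for how the production job actually hooks into the service’s lifecycle.

```python
# Sketch: measure which lines execute in production, assuming coverage.py.
# The data file path and sampling window are illustrative placeholders.
import time
import coverage

cov = coverage.Coverage(data_file="/tmp/prod_coverage.dat")
cov.start()

# ... let the service handle production traffic for the sampling window ...
time.sleep(15 * 60)

cov.stop()
cov.save()

# Lines reported as "missing" never ran during the window; lines missed in
# every run over a day become dead-code candidates.
cov.report(show_missing=True)
```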
This is a fairly naïve implementation for beginning to detect dead code. The codebase in question may be used by multiple services and jobs, and determining the dead code common to all of them is a complex problem that still needs to be addressed more carefully. It’s also fairly expensive, as it uses an event-based collection technique rather than statistical sampling.
What’s next
TL;DR: it’s all for optimization
I don’t have much experience with “big data,” but after building these tools and starting to run the jobs regularly, I was bombarded with large influxes of data. My gut reaction was to shove it all into the webapp and leave developers to figure out what was useful (more is better, right?). However, I quickly learned that while this data made sense to me as someone who had spent weeks generating it, it was opaque and arguably impenetrable for engineers who hadn’t used flamegraphs before or lacked context on operational cost. Simply disseminating the raw data was far from optimal.
It became clear that the new features I created would most likely have the following primary uses, so these were the key insights to surface:
- Finding files and functions that use the most memory
- Helping engineers see how expensive their API endpoints are
- Providing a starting point for cleaning dead code out of our repositories
- Finding the most popular and costly parts of the API
To spread awareness of the tool around the company, I held an engineering-wide workshop with flamegraph-reading and other profiling analysis activities. In just two days, two different potential optimizations (single-line changes) were found and implemented, saving the company a significant amount of annual spend.
At a surface level, these use cases provide a wide range of insights into how the API uses resources and which parts of the codebase see little use in production. The bird’s-eye view, however, is much more exciting and motivating. Not all parts of the codebase are created equal: some functions are executed far more often than others. Spending too many hours on rarely executed endpoints is a poor use of developer resources and a poor strategy for optimizing performance; in other words, blind optimization is not really optimization.