Performance Profiling

Jeremy Cowles
8 min read · Mar 18, 2018


I’ve done a lot of performance profiling over the years in somewhat disparate areas (rendering, parallel CPU & GPU compute, web load testing, file system I/O, distributed systems, mobile devices), but I’ve never really reflected on my process. I initially intended to do this just as a personal exercise, but then realized it might be helpful to share.

Rather than domain-specific profiling techniques, this is a collection of high-level strategies that can be applied to any domain.

Why am I Optimizing?

Often I start with a problem in hand such as, “this scene doesn’t render fast enough”, but it’s important to be formal about the big-picture goals. For example, “I want this scene to render faster because multiple consecutive dropped frames hurt the user experience”, and even higher-level goals such as, “a good VR experience is important because Tilt Brush is often the first VR experience a person has.”

This may seem silly or obvious, but if you’re like me, once profiling and optimization begin, they can be seductive, enticing you into more and more optimization. In those moments, it’s important to use high-level goals as a guiding light to help focus on what to optimize, how to optimize, and perhaps most importantly, when to stop.

Choosing Metrics

Profiling systems often provide a fire hose of data, and selecting the most effective metrics can be deceptively tricky. For example, when measuring frame rate, should we optimize the average frame time, median frame time, average frames per second, percentage of dropped frames, or total time without frame drops? You probably shouldn’t answer without first returning to the question, “why am I optimizing?”

Snapdragon Profiler, Fire Hose of Metrics

In this example, I want to know that the user is having a good VR experience, so will median frame time be representative of that? Probably not. The percentage of dropped frames is probably a much better metric.

When faced with a large set of metrics, my approach is to record them all and then run several tests with good and bad performance profiles; I then sift through the results to identify which metrics are most tightly correlated with the good and bad test cases.
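
As a sketch of that sifting step, the idea is simply to correlate each candidate metric against a good/bad label for each run; the metric names and values below are invented for illustration, and numpy is used only for convenience:

```python
# Sketch: rank candidate metrics by how strongly they correlate with
# runs labeled good (0) or bad (1). Metric names and values are invented.
import numpy as np

runs = {
    "median_frame_ms":   [11.0, 11.2, 11.1, 11.4, 11.3, 11.2],
    "avg_frame_ms":      [11.3, 11.5, 12.8, 13.9, 11.6, 14.2],
    "dropped_frame_pct": [0.1,  0.3,  4.0,  9.5,  0.2,  12.0],
}
labels = np.array([0, 0, 1, 1, 0, 1])  # 0 = good run, 1 = bad run

for name, values in runs.items():
    r = np.corrcoef(np.array(values, dtype=float), labels)[0, 1]
    print(f"{name:20s} correlation with bad runs: {r:+.2f}")
```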

Finally, understanding the relationships between metrics lets you compute derived metrics; for example, raw frame times plus the frame budget give you the percentage of dropped frames.
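
To make that concrete, here’s a minimal sketch that derives the dropped-frame percentage from raw frame times, assuming a 90 Hz headset (an ~11.1 ms budget); the budget is just an example value:

```python
# Sketch: derive the dropped-frame percentage from raw frame times.
# The 90 Hz / ~11.1 ms budget is an assumed example value.
FRAME_BUDGET_MS = 1000.0 / 90.0

def dropped_frame_pct(frame_times_ms):
    dropped = sum(1 for t in frame_times_ms if t > FRAME_BUDGET_MS)
    return 100.0 * dropped / len(frame_times_ms)

samples = [10.8, 11.0, 10.9, 16.7, 11.2, 22.4, 10.7, 11.0]
print(f"{dropped_frame_pct(samples):.1f}% of frames missed the budget")
```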

Maintenance Cost

If long-term maintenance is important, which it almost always is, each optimization should be weighed against the cost it imposes on future work. A 10x improvement can have immediate benefits, but if it also carries a 10x slowdown on all future work, it may not be worthwhile. Premature optimization rightfully gets a lot of attention, but the cost of necessary optimization is also important to consider.

Measure Every Improvement

Of course this should be obvious, but at times it’s extremely tempting to commit an optimization because “I know it’s a good thing.” I try to measure even the most obvious optimizations, because:

  1. The measured improvement should be recorded as part of a source code commit message.
  2. The improvement may be better than I expected or the optimization may be surprisingly ineffective.
  3. The number may be useful to revisit later, e.g. when trying to identify future regressions.

The time it takes to measure performance is always worthwhile, and after all, you can’t call it an “improvement” unless you know it improves something.
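
For quick checks like this, something as simple as timeit is often enough; the two functions below are placeholders standing in for the before and after versions of the code:

```python
# Sketch: quick before/after measurement of an "obvious" optimization.
# old_version and new_version are placeholders for the real code paths.
import timeit

def old_version():
    return sum(i * i for i in range(10_000))

def new_version():
    return sum(i * i for i in range(10_000))  # imagine the optimized variant

old_t = min(timeit.repeat(old_version, number=100, repeat=5))
new_t = min(timeit.repeat(new_version, number=100, repeat=5))
print(f"old: {old_t * 1e3:.2f} ms, new: {new_t * 1e3:.2f} ms, "
      f"speedup: {old_t / new_t:.2f}x")
```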

Estimations, Simulations & Prototypes

Before optimizing, it’s useful to know the “speed of light” of the system; that is, the fastest / smallest / best outcome I can possibly hope to achieve. After profiling and identifying a bottleneck, I start to make plans for how it might be addressed.

Rather than jumping right into an implementation, I try to estimate the impact of the change vs. the current system and the speed of light. It’s also important to analyze the complexity of the improvement: is this a constant improvement or an asymptotic one? If it’s a constant improvement, is it worthwhile, or will the gain quickly become ineffective against the known complexity curve?
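
As an illustration, here’s the kind of back-of-envelope “speed of light” estimate I mean for a memory-bound pass; every number in the sketch is an assumed example value, not a measurement:

```python
# Sketch: back-of-envelope "speed of light" for a memory-bound pass.
# Every number below is an assumed example value.
bytes_touched = 512 * 1024 * 1024      # data read + written by the pass
peak_bandwidth = 25 * 1024**3          # usable memory bandwidth, bytes/sec
speed_of_light_ms = bytes_touched / peak_bandwidth * 1000.0

measured_ms = 180.0                    # what the profiler currently reports
print(f"speed of light: {speed_of_light_ms:.1f} ms, measured: {measured_ms:.1f} ms")
print(f"roughly {measured_ms / speed_of_light_ms:.0f}x away from the best possible case")
```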

Once the analysis is done and the change seems worthwhile, I still hold back on the full implementation; instead, I try to target a representative prototype. This can be as simple as disabling code to simulate the final solution, or a quick-and-dirty implementation of the actual optimization. The key is to verify the analysis and prove that the optimization works before committing the time it takes to make the change production ready. That is, leverage the 80/20 rule to your advantage.

Recording Data

I tend to record numbers formally even for simple experiments; however, that doesn’t mean they have to go into a database. Often I just scratch them into my notebook or store them in a spreadsheet. Whatever method is used, there are several critical items which I always try to include:

  • The value of each metric (of course). I take care when rounding and aggregating to be sure every value is handled consistently across runs.
  • The unit of measure (seconds, megabytes, etc.). This is an incredibly simple thing to record, and it guarantees your data will still make sense after your memory has faded.
  • The date of the experiment. Again, this is simple, but important.
  • A description of the experiment itself. That is, what was the state of the system when the numbers were recorded? I typically try to find a few descriptive words for the aspects of the system I’m varying. For formal tests, a build stamp should be recorded.
  • Aggregation details; for example, if the number was averaged over 10 runs, this should be documented with the data. Aggregates should be consistent over all trials that are being directly compared.
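
Here’s a minimal sketch of how such a record might look; the field names and the CSV destination are just one possible shape, not a prescribed format:

```python
# Sketch: one row per measurement, carrying value, unit, date, experiment
# description, build stamp, and aggregation details. Field names are illustrative.
import csv
from dataclasses import asdict, dataclass, field, fields
from datetime import date

@dataclass
class Measurement:
    metric: str        # e.g. "dropped_frame_pct"
    value: float
    unit: str          # e.g. "percent", "ms", "MB"
    experiment: str    # a few words describing what was varied
    build_stamp: str   # build or commit identifier
    aggregation: str   # e.g. "mean of 10 runs"
    recorded_on: str = field(default_factory=lambda: date.today().isoformat())

row = Measurement("dropped_frame_pct", 2.3, "percent",
                  "baked lighting, dynamic shadows off",
                  "build-1234", "mean of 10 runs")

with open("results.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(Measurement)])
    if f.tell() == 0:          # new file: write the header once
        writer.writeheader()
    writer.writerow(asdict(row))
```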

Quiescence

Quiescence is a state of quietness or inactivity. Performance tests almost never run as a completely isolated system, so there are always external factors which will affect test results. When running tests, I try to achieve a quiescent state (as much as possible) in the host system. The exact approach must be decided on a case-by-case basis, but here are some common examples I’ve seen:

  • Stop unrelated background services.
  • Restart the process after every test run.
  • Load resources and wait for any asynchronous processing to finish.
  • Disable dynamic clock rates of CPUs and GPUs.
  • Run on a dedicated machine with no shared hardware (non-VM).
  • Warm caches by running the test twice in the same process.
  • Alternatively, explicitly flush caches before the process runs.
  • Explicitly trigger garbage collection before running the test.

For all of the above, you should ask yourself whether it makes sense for what you’re trying to achieve. In some cases you’ll want combinations of the items listed above; for example, you may want to warm caches, but you may also want a cold-cache measurement to set “worst case” expectations.
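
As a sketch of the last few items in the list above (warming caches with an untimed pass and explicitly triggering garbage collection), where `run_scenario` is a placeholder for whatever is actually being measured:

```python
# Sketch: warm caches with an untimed pass, then trigger garbage collection
# and keep the collector out of the timed region. run_scenario is a placeholder.
import gc
import time

def run_scenario():
    return sum(i * i for i in range(100_000))

run_scenario()              # warm-up: fills caches, triggers lazy initialization

gc.collect()                # start from a clean heap...
gc.disable()                # ...and keep the collector quiet during the measurement
try:
    start = time.perf_counter()
    run_scenario()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
finally:
    gc.enable()

print(f"warm, quiesced run: {elapsed_ms:.3f} ms")
```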

Reproducible Experiments

It’s easy to get into a groove where I’m quickly iterating, changing a test case, recording numbers, and rapidly changing the code without saving anything reproducible. I try to avoid staying in this mode too long and instead use this quick iteration to find useful experiments. Once a useful experiment is found, I check it in as a formal test which can be used by others to reproduce the findings. In addition to other engineers, reproducible tests can be used by an automated system to monitor progress and prevent regressions over time, after focus has shifted to other tasks.

Automation & Regression

For any serious optimization effort, you need automated tests. Any important optimization should have a regression test to ensure the optimization doesn’t regress. The properties of a good automated performance profiling system are as follows (a sketch of the core baseline check appears after the list):

  • Stores critical data noted above, including a build stamp.
  • Recorded data is immutable.
  • Runs tests in a quiescent state.
  • Sends notifications of test failures.
  • User defined thresholds for failure conditions.
  • Automatically identifies noisy / inconsistent tests.
  • Ability to enable/disable tests temporarily.
  • Tests can be run continuously or based on triggers, such as new changes.
  • Provides visualizations of metric data.
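
As mentioned above, here’s a minimal sketch of the baseline threshold check at the heart of such a system; the metric names, baselines, and thresholds are invented for illustration:

```python
# Sketch: flag a regression when a new measurement exceeds its stored baseline
# by a user-defined threshold. Metric names, baselines, and thresholds are invented.
BASELINES = {
    # metric -> (baseline value, allowed relative regression)
    "dropped_frame_pct": (2.0, 0.05),
    "load_time_s":       (3.4, 0.10),
}

def check_regressions(results, build_stamp):
    failures = []
    for metric, new_value in results.items():
        baseline, threshold = BASELINES[metric]
        if new_value > baseline * (1.0 + threshold):
            failures.append(f"{build_stamp}: {metric} regressed "
                            f"({new_value:.2f} vs baseline {baseline:.2f})")
    return failures

print(check_regressions({"dropped_frame_pct": 2.6, "load_time_s": 3.3}, "build-1235"))
```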

Targeting & Balancing Resources

Often there are many resources in the system which work together to produce the final effect. When designing a performance test, I try to consider which resources are being stressed and which resources I’m trying to target. For example, when testing the performance of a fragment shader, it’s important to consider the number of vertices in the test mesh, since the vertex shader may actually become the bottleneck and hide the performance of the fragment shader.

It’s important to consider this both for the initial construction of the test and for what will happen if performance improves or degrades, since the bottleneck may transition from one resource to another during the course of optimization.

Synthetic and Micro Benchmarks

A synthetic test creates a scenario that is representative of real performance, but one the user will never actually run. A micro benchmark is a synthetic test focused on an extremely tightly scoped aspect of the system.

The risk with synthetic tests is that they may not be representative of actual performance. By extension, this is also true of micro benchmarks; however, micro benchmarks carry an additional risk. Since a micro benchmark executes in isolation from the rest of the system, it will often show small changes as massive improvements.

For example, I may have improved cache coherency of a tight loop in isolation, yielding a 10x speedup, but when run in the context of the entire system, the optimization may evaporate or the gain may simply not be relevant if, say, the loop is only 0.01% of the overall relevant workload.
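
A quick sanity check here is Amdahl’s law: scale the local win by the fraction of the real workload it touches. A small sketch using the 10x / 0.01% numbers from the example above:

```python
# Sketch: Amdahl's law -- scale a local speedup by the fraction of the
# overall workload it actually touches.
def overall_speedup(fraction, local_speedup):
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

print(overall_speedup(0.0001, 10.0))   # ~1.0001x: a 10x win on 0.01% of the work
print(overall_speedup(0.4, 10.0))      # ~1.56x: the same win on 40% of the work
```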

Communication

If I intend to share my progress with others, just recording raw numbers isn’t enough; instead, I provide a written analysis of the numbers and try to present the data visually in a way that’s easy to understand. When writing the analysis, I try to keep it focused and keep in mind a clear idea of what I’m trying to communicate.

Several metrics stacked into a single graph.

Graphs are helpful, but I also evaluate them for readability. For example, does the graph focus on one aspect of the data, and does it clearly promote the idea I’m trying to share? It’s easy (and fun) to create a graph packed full of data, but it may actually be less useful than a simpler, more focused graph with less information.

The same data, focused on a single metric.

When building a report, my workflow is to store data in a spreadsheet and link the relevant bits of it into the actual analysis document. I use the spreadsheet to manipulate the data in various ways and generate graphs. Again, any graphs or tables linked into the document are intentionally optimized for readability. Tools like MS Office or Google Docs let you link and embed data between docs, which makes it easier to keep them in sync.

Conclusion

Optimization is fun, and addictive once you get rolling. I hope sharing my experience is helpful. Feel free to leave comments / questions / corrections here or ping me on Twitter!

