Surfacing performance issues with effective visualization of profiling data


Various visualizations of profiling data

Two principles guide how I think about software development:

  1. Efficiency and performance are increasingly important parts of software development.
  2. It’s on the software, or more specifically, the makers of the software, to create an intuitive experience. In other words, it is NOT on the users to guess what the software wants them to do.

Let’s delve deeper into the second point. I remember back in the day, sitting in my physics class at MIT, listening to the professor talk, understanding every word but not the sentences themselves. It takes a genius to be a professor at MIT, so I must have been the problem.

By now, I have long realized that just because a person is smart, it does not mean they’re good at teaching or explaining things.

The same analogy can be drawn with software.

When I started doing performance work, it was like sitting in a lecture, knowing the words but not the substance all over again. Yes, I’m looking at rectangles and colors, but how do I interpret this graph? How do I solve my problems?

Users shouldn’t have to jump through hoops to understand what is going on. This is true for everyday user experiences, like a checkout flow, but it is also true for complex systems like observability and performance tooling.

Back to the first point: efficiency and performance are increasingly important parts of software development. As such, more and more engineers are turning to profiling to identify areas of optimization. Yet, to many non-experts, profiling itself is an intimidating concept.

I think one of the reasons profiling has this reputation is precisely because it is difficult to create an intuitive user experience for it.

This is where effective visualizations can really shine. The flame graph is usually the visualization engineers are greeted with when they first encounter “profiling.” However, it is not the only visualization tool, nor, as a standalone product, is it the most effective one in many situations.

Flame Graph

A Flame Graph Showing CPU Time by Method

Flame graphs can be visualized top down (as in the image above) or bottom up. Each rectangle is called a frame and represents a function in the stack. If you’re not familiar with flame graphs, here is a good intro.

Flame graphs are great for showing stack traces and summarizing how much resources our code uses. However, they’re limited in many ways. Very often, they’re hard to work with unless you’re already familiar with profiling and know exactly what you’re looking for.
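To make the structure concrete, here is a minimal sketch, in Python, of how collapsed stack samples can be aggregated into the tree a flame graph draws. The function names and sample values are made up for illustration; the width of each frame is proportional to its aggregated value.

```python
def build_flame_tree(samples):
    """Aggregate (stack, value) samples into a nested tree.

    Each sample is a tuple of function names from root to leaf,
    plus a value such as sampled CPU time.
    """
    root = {"value": 0, "children": {}}
    for stack, value in samples:
        node = root
        node["value"] += value
        for frame in stack:
            child = node["children"].setdefault(
                frame, {"value": 0, "children": {}})
            child["value"] += value
            node = child
    return root

# Hypothetical collapsed-stack samples:
samples = [
    (("main", "parse"), 10),
    (("main", "parse", "tokenize"), 30),
    (("main", "render"), 60),
]
tree = build_flame_tree(samples)
# The root frame's width covers all 100 units of sampled time;
# "parse" and its callees account for 40 of them.
assert tree["value"] == 100
assert tree["children"]["main"]["children"]["parse"]["value"] == 40
```

This is only the aggregation step; a real flame graph renderer then sorts children and draws each node as a rectangle scaled to its value.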

I’ve been a frontend engineer on Datadog’s profiling team for 1.5 years. One of the big initiatives of the team is to make profiling more accessible to non-experts.

Over the last year, we have worked on several visualizations aimed at tackling “how can we help users understand what to optimize and why without jumping through hoops?”

One of these visualizations is called “Call Graph.”

Call Graph

Call graphs surface functions with the most self time. Self time is the time spent in the function itself; it excludes the time taken by other functions it calls.

Imagine a profile like this:

Flame Graph of an Imaginary Profile

In this example, the function seeBarbenHeimer is called from multiple places and has a relatively large self time. In a flame graph, self time appears as the difference between the width of a function’s frame and the combined width of the frames below it. From the flame graph alone, it is difficult to see that optimizing this function would have a big effect on performance. A call graph should immediately lead to this revelation:

Call Graph of Imaginary Profile

The size of each node should reflect the amount of function self time.
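To make the distinction between total time and self time concrete, here is a minimal Python sketch of deriving self time from a call tree. The tree shape, the `buyTickets` helper, and all the numbers are hypothetical; only `seeBarbenHeimer` comes from the example above.

```python
def self_time(node):
    """Self time = a frame's total time minus its callees' total time."""
    return node["total"] - sum(c["total"] for c in node["children"])

# Hypothetical call tree (times in ms):
tree = {
    "name": "main", "total": 100,
    "children": [
        {"name": "seeBarbenHeimer", "total": 80,
         "children": [
             {"name": "buyTickets", "total": 30, "children": []},
         ]},
    ],
}

def collect_self_times(node, out):
    """Walk the tree and record each function's self time."""
    out[node["name"]] = self_time(node)
    for child in node["children"]:
        collect_self_times(child, out)
    return out

self_times = collect_self_times(tree, {})
# main: 100 - 80 = 20, seeBarbenHeimer: 80 - 30 = 50, buyTickets: 30.
# A call graph would draw seeBarbenHeimer as the largest node.
```

In a real profile, the same function can appear under many callers, so a call graph would also merge those occurrences before sizing the node.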

Another visualization the team has worked on is called “Timeline.”


Datadog Profiling Timeline View

Timeline shows profiling data by threads over time. It reveals fluctuations in activity, something flame graphs cannot show since flame graphs don’t have a time component. Additionally, by separating out different threads of execution, Timeline can help users diagnose issues involving concurrency.

If you would like a specific example, my colleague Felix wrote a great blog post on how to debug a slow request using Timeline.

Call Graph and Timeline both illustrate the importance of having the right visualizations when tackling performance problems via profiling.

Recently, I’ve prototyped a third visualization called “FlameScope.” I first heard about FlameScope from a Netflix Blog.


FlameScope (PoC)

FlameScope has two components: an interactive subsecond-offset heatmap and a flame graph scoped to a user-selected time window within that heatmap.

Like Timeline, it solves a problem that standalone flame graphs cannot: surfacing small perturbations and variations within a profiled application. Also like Timeline, it lets you select a small, seconds- or subsecond-long timeslice and view the flame graph for just that window.

Here, I selected a 450ms time window of the 60s profile

However, unlike Timeline, which is targeted at advanced profiling users who have a good understanding of the runtime, FlameScope is intended to make the data that powers Timeline more accessible to beginners. Additionally, FlameScope is better at isolating intense periods of work. For example, it better captures patterns in CPU utilization, like periodic background activities that cause temporary latency outliers.

In the screenshot above, the subsecond-offset heatmap reveals a pattern of high CPU activity every few seconds. This pattern is impossible to see with a flame graph alone.
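The heatmap itself is simple to sketch: bucket each sample’s timestamp by whole second (column) and subsecond offset (row), so periodic bursts line up as visible bands. Here is a minimal Python illustration with synthetic timestamps; the grid dimensions and data are assumptions, not Datadog’s or Netflix’s implementation.

```python
def subsecond_heatmap(timestamps, rows=50, duration=60):
    """Bucket sample timestamps (in seconds) into a FlameScope-style grid.

    Columns are whole seconds of the profile; rows are subsecond
    offsets within each second. grid[row][col] counts samples.
    """
    grid = [[0] * duration for _ in range(rows)]
    for t in timestamps:
        col = int(t)              # which second of the profile
        row = int((t - col) * rows)  # offset within that second
        if 0 <= col < duration:
            grid[row][col] += 1
    return grid

# Synthetic workload: a burst of CPU activity 0.5s into every second
# of a 60s profile.
timestamps = [s + 0.5 for s in range(60)]
grid = subsecond_heatmap(timestamps)
# The periodic pattern shows up as a single bright horizontal band:
assert grid[25] == [1] * 60
```

Selecting a region of this grid then scopes the flame graph to only the samples in those cells, which is what makes the pattern actionable.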

FlameScope was a project I worked on during a recent R&D week. The profiling team has R&D week once a quarter. It is a time for engineers to experiment and prototype on what we think can improve the product for our users. It is one of my favorite things about working at Datadog because it’s an entire week of mostly uninterrupted time where we get to step outside the status quo and ask “how can we design a better outcome?”

Reading that Netflix blog post gave me an aha moment of “OH! That’s why flame graphs are so hard to use (if you don’t already know exactly what you’re looking for)!” Hence, the inspiration for my R&D week project.

Not all projects will eventually make it to production. Nevertheless, R&D weeks are never a waste of time because:

  1. They almost always lead to interesting conversations about potential directions, even if a particular project does not make it to general availability.
  2. They’re fantastic learning opportunities.
  3. They’re simply fun.

FlameScope was a project standing on the shoulders of giants. It creatively stitched together and built on top of bits and pieces of existing Datadog tools to surface relevant information in a new way. There are still a few UX questions to work through, namely how to incorporate this visualization into our existing flame graph and timeline workflows for a seamless experience. I’m excited to see where this project goes.

Words are the building blocks of sentences. The flow of sentences influences a student’s ability to understand and parse their instructor’s content. UI elements are the building blocks of user experience. The flow of user experience influences a consumer’s ability to understand and parse the software’s capabilities.

They say a picture is worth a thousand words. Visualizations are powerful ways of presenting information. Flame graphs, call graphs, timelines, and FlameScope are all different ways of visualizing profiling data. Each has its own use cases, scenarios where it can either falter or shine.