Find bottlenecks in just 30 minutes using Jaeger traces

Published in JoomTech · Mar 14, 2023

Author: Artem Klyukvin

Hi, my name's Artem, and I'm a backend developer on the client backend team. One of the important parts of my job is reducing the latency of our backend. That exact issue is what inspired this article, so let's get into it.

In one of our checkout endpoints, the 99th percentile latency breaks the SLO. We need to fix this.

Accordingly, the question arises: How can we quickly and accurately find the cause of slowdowns on a very low-frequency request at the 99th percentile, and what can we do to eliminate it? The answer turned out to be a library for semi-automatically searching for bottlenecks in distributed systems. You can find the relevant GitHub link at the end of the article.

What are the standard tools for finding performance bottlenecks?

Several such tools are already publicly available.

  • Profiling, e.g. with pprof.
  • Building dashboards with performance metrics. For example, Prometheus + Grafana.
  • Studying the execution traces of the requests we are interested in. We obtain these, for example, using Jaeger.

Let's take a closer look at these three paths, including their advantages and disadvantages.

Profiling won’t help in a distributed system

Profiling helps find runtime, CPU, and memory bottlenecks. In the Go ecosystem, the profiler is generally available right out of the box. So if there's a dominant hot spot in the code that's not only going to the database but also mining crypto for a neighboring team, it can easily be found this way.

But a profiler will not help if we are talking about a distributed system with dozens (or even hundreds) of services written in different languages. In such systems, it’s not always possible to uniquely identify the problematic service that needs to be profiled. Therefore, we want to have a tool that can analyse the performance of the entire system as a whole and identify problems across all services at once.

Such a tool is super useful if you need to speed up the backend but the problem turns out to be another team's responsibility. In this scenario, the work goes much more smoothly if the task is formulated as "John, your service slows down on request X to database Y, please try to speed it up" rather than "John, it seems your service is slowing down somewhere, make it faster."

In our case, while speeding up one request, it turned out that a neighboring service was slow because its data lived in an overloaded database. Migrating the data to a less loaded database would help, and since we knew exactly what had to be done, the team responsible for that service was able to carry out the migration, and the optimisation was completed fairly quickly.

Another problem with profilers is that we only get information about how the service performs in general. If we have some critical but low-frequency request that doesn't make a serious contribution to the total resource consumption, the profiler alone won't tell us how to speed it up.

For example, in our backend, some of the most frequent requests are for the product card and the main page. Any bottlenecks there stand out and are easily found! On the other hand, shopping carts, checkouts and user profiles are requested thousands of times less often, so even if these queries have very serious performance problems, they will be almost invisible in profiles. This is especially true if the problems appear only at higher percentiles (for example, p95 or p99): getting a profiler to answer the question "What is slow in a certain query at a certain percentile?" is simply not possible.

Long story short: Profiling can help find global bottlenecks in a particular service, but this tool is not suited to identifying problems in specific user scenarios in distributed systems.

Dashboards with performance metrics — you can’t please everyone

If you already have a dashboard that shows the latency of the problem area, that’s great! But the fact is that in order to build such a dashboard, you first have to find the problem. Building each dashboard requires some effort, and building them in advance for all occasions is more or less unrealistic.

Returning to the example from the previous section, it’s highly improbable that someone would build a dashboard for a very specific request to the exact collection in the database given there are hundreds of collections there. But even if such a dashboard suddenly appeared, how would you know to check it?

Therefore, instead of a bunch of disparate dashboards, it's better to have a tool that passively does the work of collecting all the information (more or less) needed for performance analysis, and then lets us run queries of arbitrary complexity against the collected data and highlight probable problems.

Manual viewing of Jaeger traces is great, but something is missing

Jaeger traces are tree structures, where each node contains the execution time of the corresponding method involved in processing a request.
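As a rough illustration, a trace can be thought of as something like the Scala model below. This is a simplification I'm sketching for the article; the real Jaeger span format has many more fields (tags, references, process info), and the names here are not the actual schema we use.

```scala
// A minimal, illustrative model of a Jaeger trace (not the real Jaeger schema).
case class Span(
  spanId: String,
  parentId: Option[String],  // None for the root span
  operationName: String,     // e.g. the full method name, "checkoutdata.GetBundle"
  startTimeMicros: Long,     // when the operation started, in microseconds
  durationMicros: Long       // how long the operation took
)

case class Trace(traceId: String, endpoint: String, spans: Seq[Span]) {
  // The total request latency is the duration of the root span.
  def rootDurationMicros: Long =
    spans.find(_.parentId.isEmpty).map(_.durationMicros).getOrElse(0L)
}
```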

Collecting such traces allows you to eliminate all the previously mentioned disadvantages of profilers and dashboards.

  • The trace has parameters such as endpoint and execution time. We can focus on just the requests we really need.
  • The trace may contain information about all services involved in processing a request, so we can cover a distributed system with an arbitrary number of components at a glance.
  • Traces are collected passively. If we decide to see why a shopping cart slows down for us, we don’t need to start wrapping it with metrics and building a new dashboard — simply ask Jaeger for the appropriate traces.

But unfortunately, this tool also has problems.

  • Traces can be long (up to several thousand spans), so problems can be difficult to spot by eye.
  • At the higher percentiles (p95, p99) things are tricky. From individual traces it's difficult to tell what slows things down the most: different traces may be slowed by different things (the database, the cache, the search, and so on), so it's hard to know where to concentrate your efforts.
  • Jaeger’s UI does not allow you to conveniently look at the latency of individual subsystems. You have to open the entire trace, and only then look for the necessary calls.
  • There is no way to estimate how much the proposed optimisations will improve the performance of the entire system as a whole.

Returning to the example from the profiling section: although the slowdowns in that collection were the main problem, we didn't find long queries in every trace, and it was difficult to decide unequivocally whether it was worth focusing on migrating the collection rather than, for example, optimising requests to Redis.

Our solution — Jaeger + Spark

Of all the ways to find bottlenecks, the analysis of Jaeger traces seemed to be the most promising. But in order not to have to deal with viewing single traces manually, we decided to upload them and analyse them using third-party tools.

When the question arose of how to build some kind of aggregated statistics on top of traces, the first thing that came to mind was Spark (spoiler: that's what we settled on). At Joom, we use Spark for many different tasks, such as:

  • building dashboards based on client analytics events,
  • processing of historical logs,
  • building pipelines for our ML teams.

So, in order to analyse traces in Spark, you first have to export them. To do this, we configured the Jaeger collector (the service responsible for forwarding traces to long-term storage) to use Kafka as secondary storage. A periodic job then extracts the traces from Kafka and stores them in S3, and from there we can read them in our Spark jobs.
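As a sketch of that last step, reading the exported spans back into a Spark job might look roughly like this; the S3 path, the Parquet format, and the partitioning are assumptions rather than our exact layout:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("jaeger-trace-analysis")
  .getOrCreate()

// Assumed layout: the periodic job writes one row per span, partitioned by date,
// with the trace id and the entry-point endpoint denormalised onto every row.
val spans = spark.read.parquet("s3://traces-bucket/jaeger/date=2023-03-01/")

spans.printSchema()
```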

Now the traces are at our disposal, and all that remains is to decide what to do with them.

A web search for ready-made solutions for batch analysis of execution traces led us to only a single Microsoft article. Despite its effectiveness, the algorithm proposed in the article is quite difficult to implement and would also need to be adapted to our needs.

In this regard, we set ourselves the task of writing a tool that, based on the collected traces, would be able to answer the following questions.

  1. I have a set of requests to a certain endpoint with a certain duration. Which calls in this slice take the most time?
  2. If I manage to speed up certain calls (by a percentage or by a fixed amount), what will the latency percentiles of the slices I am interested in look like?

To answer the first question, we took the simplest path: we select the required trace slice and, within it, sum the durations of all spans grouped by operation name. The operation name can be an arbitrary string, for example the full name of the method. After that, we sort the operations by total execution time in descending order and see which one is potentially the heaviest. This is a rather rough estimate, but for our purposes it turned out to be quite useful.
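In Spark terms this boils down to a filter plus a simple aggregation. The sketch below assumes one row per span with endpoint, rootDurationMicros, operationName, and durationMicros columns; the endpoint name and the latency threshold are purely illustrative:

```scala
import org.apache.spark.sql.functions._

// The slice we care about: one endpoint, only the slow requests.
val slice = spans
  .filter(col("endpoint") === "checkout/start")
  .filter(col("rootDurationMicros") > 500000L)  // requests slower than 500 ms

// Sum span durations per operation name and sort the heaviest operations first.
val heaviestOps = slice
  .groupBy("operationName")
  .agg(sum("durationMicros").as("totalMicros"))
  .orderBy(desc("totalMicros"))

heaviestOps.show(20, truncate = false)
```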

With the second question, things turned out to be somewhat more complicated, since calls can be performed both sequentially and in parallel. Say a call to a third-party service takes 100 ms. At best, a 95% speedup of that call will speed up the entire request by 95 ms.

And at worst, it won’t speed up at all if the trip to this service was done in parallel with an even longer request.

In a trivial case, we can just look into the codebase and find out exactly how the request is made relative to all the other requests in the trace. But if the codebase is big enough, it's easy to miss a goroutine started six frames up the stack and waste time optimising something that won't speed anything up overall.
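To make the intuition concrete, here is a toy illustration (not part of the library) of why savings add up for sequential calls while only the critical path matters for parallel ones:

```scala
// Two sibling calls inside one request, durations in milliseconds.
val serviceCall = 100.0  // the call we want to speed up by 95%
val siblingCall = 150.0  // another call in the same request

// Sequential: the request waits for both calls, so the 95 ms saving is real.
val sequentialBefore = serviceCall + siblingCall         // 250 ms
val sequentialAfter  = serviceCall * 0.05 + siblingCall  // 155 ms

// Parallel: the request waits only for the slower call, so nothing improves.
val parallelBefore = math.max(serviceCall, siblingCall)         // 150 ms
val parallelAfter  = math.max(serviceCall * 0.05, siblingCall)  // still 150 ms
```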

Therefore, when writing our library, we took a more reliable path: using Scala code, we apply each hypothetical optimisation to every trace in the slice. Hypothetical optimisations can take the following forms (a sketch of how they might be modelled follows the list):

  • operation X is accelerated by Y%;
  • operation X is accelerated by Y ms;
  • all X operations after the first one are accelerated by 100% (this formulation is useful in case of caching repeated calls within one trace);
  • …and potentially many more.
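One way to express such rules is a small ADT plus a function that rewrites span durations. The sketch below builds on the hypothetical Span model from earlier and deliberately skips the hard part: after shrinking a span, the real library also has to shift sibling and parent spans to recompute the trace's total duration.

```scala
sealed trait Optimisation
case class SpeedUpByPercent(operation: String, percent: Double) extends Optimisation
case class SpeedUpByMillis(operation: String, millis: Long) extends Optimisation
// All calls to the operation after the first one become free (a perfect cache).
case class CacheRepeats(operation: String) extends Optimisation

def applyOptimisation(spans: Seq[Span], opt: Optimisation): Seq[Span] = opt match {
  case SpeedUpByPercent(op, pct) =>
    spans.map { s =>
      if (s.operationName == op)
        s.copy(durationMicros = (s.durationMicros * (1 - pct / 100)).toLong)
      else s
    }
  case SpeedUpByMillis(op, ms) =>
    spans.map { s =>
      if (s.operationName == op)
        s.copy(durationMicros = math.max(0L, s.durationMicros - ms * 1000))
      else s
    }
  case CacheRepeats(op) =>
    var seen = false
    spans.map { s =>
      if (s.operationName == op) {
        if (seen) s.copy(durationMicros = 0L) else { seen = true; s }
      } else s
    }
}
```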

After that, based on the resulting synthetic traces, we build the latency distribution over the percentiles of interest for each synthetic optimisation. This simulation gives a fairly reliable estimate of what we will gain from optimising one or another component of the system. Now you can actually start refactoring the code!
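As a sketch of that comparison, the before-and-after percentiles can be computed with Spark's approxQuantile. Here traceDurations is an assumed DataFrame with one row per trace, holding the measured root duration (rootMicros) and the duration recomputed from the synthetic trace (simulatedRootMicros):

```scala
// Assumed: traceDurations has columns traceId, rootMicros, simulatedRootMicros.
val percentiles = Array(0.50, 0.75, 0.95, 0.99)

// approxQuantile returns the requested quantiles with a small relative error.
val before = traceDurations.stat.approxQuantile("rootMicros", percentiles, 0.001)
val after  = traceDurations.stat.approxQuantile("simulatedRootMicros", percentiles, 0.001)

percentiles.zip(before.zip(after)).foreach { case (p, (b, a)) =>
  println(f"p${(p * 100).toInt}: ${(1 - a / b) * 100}%.1f%% faster")
}
```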

Let’s get to work!

Let’s use a real example to see how the approach described above helped us find a problem in one of our checkout flow requests.

Searching for optimisation

First of all, we look at the details of what the service does during the execution of the request.

On this graph, operations are plotted along the horizontal axis, and the ratio of their total execution time to the total execution time of all traces in the slice is shown along the vertical axis. That is, on average, each call to the checkoutdata.GetBundle method takes 15% of the total request time. Sometimes this is enough information to start thinking about where to look for bottlenecks. If this isn’t the case, you can request the same diagram for any of the child operations. For example, let’s see what happens inside a checkoutdata.GetBundle call.

So, we see that in the GetBundle request, a significant part of the time is spent going to the DaData address suggestion service.

Looking closely at the code, we noticed that this request doesn't need the entire address: the region alone is enough. That means the call can be cached efficiently. But before jumping into the fray and bolting on caches, let's find out how much that would actually speed up the request.

Simulating optimisations

Let's say we can cache the call to the address suggestion service, and the cache hit rate will be around 95%. We need to evaluate how much such a cache would speed up the analysed request as a whole. For simplicity, we will assume that in all traces the call to DaData is sped up by 95% (to simulate the cache's behaviour more realistically, we would instead speed up the call to DaData by 100% with probability 95% and not speed it up at all with probability 5%). We apply this synthetic optimisation to all the traces of interest and get the following graph.
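For completeness, the more realistic variant from the parenthesis above is also easy to express. This sketch reuses the hypothetical Span model; the DaData operation name is illustrative, not the real span name:

```scala
import scala.util.Random

// Simulate a cache with a 95% hit rate: each DaData call becomes free with
// probability 0.95 and keeps its full duration otherwise.
def simulateCacheHit(spans: Seq[Span], hitRate: Double = 0.95): Seq[Span] =
  spans.map { s =>
    if (s.operationName == "dadata.SuggestAddress" && Random.nextDouble() < hitRate)
      s.copy(durationMicros = 0L)
    else s
  }
```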

The query latency percentiles are plotted along the horizontal axis, and the vertical axis shows by what percentage the corresponding percentile would improve if the hypothetical optimisation were applied.

We see that the greatest effect of this optimisation is concentrated at the lower percentiles (p75, p85), while p99 reacts relatively weakly, improving by only 5%. However, a 10% improvement at p50 looks very promising, so it is worth trying to add caching to this request. In reality, after we added the cache, the p50 latency of this request fell by only about 6–7%.

Other examples

The example described above is just one of many where a Jaeger trace analysis library helped us.

Here are a few more successful optimisations:

  • We found and removed an extra trip to a third-party service in the pricing module. The result: minus 6% on the p50 latency of the main screen.
  • We found an extra trip to the ranking system on the shares screen. The result: minus 40% on the p99 latency of that screen.
  • We found a trip to an overloaded database at checkout (the same one we mentioned when comparing different ways of searching for bottlenecks). The result: minus 12% on the p99 latency of that request.

Conclusion

As a result, by applying the full power of Spark to the traces exported from Jaeger, we were able to create a convenient tool that lets us not only find potential problems in large distributed systems, but also get a detailed estimate of an optimisation's potential before implementing it. If you also use Spark and Jaeger in your work, you may find it useful too. You can find the open-source code on GitHub. If you have any questions, please leave a comment below; I'll be happy to answer!
