Monitoring distributed traces

Ivan “elgris” Kirichenko
5 min read · May 31, 2018


A distributed application is not a simple mesh of intercommunicating services; it's a mesh of problems that throw problems at each other. And we, engineers, must enjoy solving them =].

Problems! Problems everywhere!

You probably know the 4 Golden Signals of monitoring (latency, traffic, errors, saturation), which, if measured properly, provide decent visibility over the app. Applying them to each component is not that difficult: pick a service, measure how long it takes to process a message (request), how many messages (requests) it serves, how many resources it consumes and what errors it reports. Simple.
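For example, with Prometheus (the tool we'll end up picking below), the four signals for a single service could be queried roughly like this. The metric names here are hypothetical, just in the style of common client-library conventions:

```
# Latency: 95th-percentile request duration (assumes a histogram metric)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests that failed
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: e.g. CPU busy fraction (assumes node_exporter metrics)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```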

But what if I'm interested in something that spans multiple services and possibly requires asynchronous processing?

A story of a business transaction

Here we come to the concept of a "business transaction", which is quite a useful entity to monitor. Imagine this situation: the individual services of your distributed app perform blazingly fast according to benchmarks. However, users keep complaining that they have to wait too long to receive their data. Why?

Monitoring the whole message path revealed that at some point a message queue grew waaaay too long. Measuring the individual performance of backend services couldn't catch that fact, but analysing the whole path could.

Let's nickname such a message path a "business transaction". E.g., for a webshop it can be described as the set of operations that end up in the creation of a new order. How do we connect all the involved operations and get observability over "business transactions"? One of the answers is distributed tracing. For more details you can refer to several tools and standards in this area, or to my previous article, where Zipkin-based tracing was applied to an artificial distributed app.

Welcome our lab rat

As an example app I'll continue with the exercise app described in the previous article. For it, let's define a "business transaction" as an action that begins when a user creates/deletes a record in the UI, and ends when it is logged by the log-message-processor component.

Let’s define a single SLO to observe, just to keep the article short:

99% of business transactions in a given day take at most 2 seconds to complete.

Where a "given day" is a 24-hour period starting at 00:00:00 GMT of some date.

Picking the right tools

Fortunately, the batteries are included: the exercise app comes with Zipkin tracing. Unfortunately, Zipkin lacks alerting capabilities, because it's not its job to alert. As suggested, feature-rich APMs should be used to capture, process and reason about traces. In our case, we need to build something APM-ish with what we have :)

What would be a good choice for a monitoring/alerting system? Something popular enough (meaning there is decent documentation and community support), open for configuration and free? Prometheus is what comes to mind first. Let's make it store the duration of traces and a counter of traces with errors.

The only missing piece is a bridge between Zipkin and Prometheus. I couldn't find a tool that I could use immediately (please let me know if it exists!), so I quickly prototyped zpe for this job.

In the end, this is how the collection of trace data and metrics looks:
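Roughly, the flow is: the services report spans to Zipkin, zpe periodically queries Zipkin for traces, and Prometheus scrapes the metrics that zpe exposes. A rough sketch:

```
services --spans--> Zipkin <--queries traces-- zpe <--scrapes /metrics-- Prometheus
```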

Collecting metrics

For monitoring trace latency, a Prometheus Histogram is a proper choice. There is a nice article that provides some details about other SLI definitions with Prometheus. But for now, zpe exposes only histograms.

I set it up to query Zipkin for all the traces related to the service "log-message-processor"; the exposed histogram provides the buckets [200.0, 500.0, 1000.0, 2000.0], where each number means milliseconds. What's critical for our case is the 2000.0ms bucket: to monitor our SLO we need the ratio of traces completing within 2000ms to the total number of traces.
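After a scrape, the histogram turns into series like these in Prometheus. The counts below are made up for illustration (and the exact formatting of the le label depends on the client library); note that histogram buckets are cumulative and always include a +Inf bucket:

```
loggable_activity_durations_histogram_bucket{le="200"}  341
loggable_activity_durations_histogram_bucket{le="500"}  498
loggable_activity_durations_histogram_bucket{le="1000"} 602
loggable_activity_durations_histogram_bucket{le="2000"} 731
loggable_activity_durations_histogram_bucket{le="+Inf"} 790
loggable_activity_durations_histogram_count             790
loggable_activity_durations_histogram_sum               815324
```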

Let's generate some traffic and see how the histogram looks in Prometheus:

each line corresponds to the number of "hits" in a specific histogram bucket

And this diagram shows what we really care about: the percentage of traces that took 2000ms or less. Aggregated over 24 hours, this diagram will represent the SLO defined above.

Prometheus query: loggable_activity_durations_histogram_bucket{le="2000"} / ignoring (le) (loggable_activity_durations_histogram_count) * 100
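That query shows the ratio at each instant. To evaluate the SLO over a given day, one way (a sketch; it glosses over counter resets and the exact day boundary) is to aggregate the counters with increase():

```
sum(increase(loggable_activity_durations_histogram_bucket{le="2000"}[24h]))
  / sum(increase(loggable_activity_durations_histogram_count[24h])) * 100
```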

At this point we can employ Prometheus Alerting and create alert rules that will scream if the SLO is broken. But from what I see, it's broken already, and we're faaaar away from processing even 95% of business transactions within 2000ms.
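A minimal sketch of such an alerting rule (the group name, alert name and the for: duration are my own invention):

```
groups:
  - name: business-transaction-slo
    rules:
      - alert: BusinessTransactionSLOBroken
        # Fires when fewer than 99% of traces over the last 24h
        # completed within 2000ms (the query from above, plus a threshold).
        expr: |
          sum(increase(loggable_activity_durations_histogram_bucket{le="2000"}[24h]))
            / sum(increase(loggable_activity_durations_histogram_count[24h])) * 100 < 99
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "SLO broken: less than 99% of business transactions completed within 2s over the last day"
```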

Let’s figure out what’s wrong

Prometheus told us that there is a problem; now it's time to dig a bit deeper and fix it. Let's see what Zipkin says:

Nice to meet you, slowpokes!

And here's a single trace that clearly shows that log-message-processor itself is the culprit:

After fixing the bug, Prometheus started showing a way more pleasant diagram:

The ratio is going up, giving me hope that the SLO will hold by the end of the day :)

Conclusion

Welcome to the finish line. Today we learned that distributed tracing provides a single unit that can be monitored: a trace. A unit that spans multiple services. A unit that represents a single customer action. An action the customer cares about.

Having such a unit, it's not that difficult to apply the 4 Golden Signals of monitoring and get observability over quite a complex system. Providers of commercial solutions (Datadog, Instana, Lightstep, to name a few) know that fact. However, is there any open-source full-featured APM that does the job? I hope to hear from you in the comments ;)
