Distributed tracing in practice

Ivan “elgris” Kirichenko
Feb 22, 2018 · 10 min read

Hey there! Today we’re going to talk about distributed tracing. This post won’t give you much theory on the topic; instead, it’s full of practice. I’m going to take a distributed, “polyglot”-ish application, pick an appropriate tracing tool and then marry the two.

But what makes distributed tracing useful? In short: it gives you better observability of a distributed system (traces being one of the 3 pillars of observability) and helps you set accurate SLOs for business operations, including asynchronous ones. Details will follow.

Meet our lab rat

The web app to be instrumented is a toy project whose sole purpose is to serve as a playground for trying tools that help build and operate distributed apps: Kubernetes, Docker, distributed tracing, etc. It’s made of several components written in JavaScript (both the web UI and one of the backend services), Go, Java and Python. Here’s a simple diagram of the interaction between the components:

Please check out the GitHub repo if you’d like more details.

Let’s select an instrumentation tool

To instrument the app, we need the right tool. There are several distributed tracing tools in the wild; I narrowed the scope and put these on my plate to choose from:

  1. Zipkin
  2. Jaeger
  3. OpenTracing
  4. OpenCensus

Acceptance criteria:

  1. The tool should define a standard way of using tracing across different languages as well as a standard wire protocol.
  2. The tool should support the languages and frameworks used in the app: JavaScript (both browser and server-side), Java, Go, Python3.
  3. Trace visualisation should be included for free, ideally installable on-premises.

Zipkin

Zipkin is a mature, battle-tested project whose design is based on the Google Dapper paper. If what I heard at the LISA17 conference is true, Zipkin is the most popular distributed tracing system in Canada (according to a survey).

The project provides tons of ways to integrate with the app. Many languages are supported, and integrations with popular frameworks make the process seamless. If you have questions, a large community and numerous examples are there to help.

Zipkin matches all the criteria listed above, although I got confused a couple of times by Zipkin’s libraries and data model. For instance, some libraries use the term traceId, while others call the same thing a span. In my opinion, OpenTracing defines a simpler interface.

Verdict: maybe.

Jaeger

Jaeger is similar to Zipkin. Built with a different technology stack, it serves the same purpose. However, I ruled Jaeger out for a few reasons.

Also, despite its fancier-looking UI, I found the search functionality a little annoying: you have to specify the name of a service and cannot simply search for all traces that match a given set of criteria. With Zipkin that’s possible.

Verdict: clearly no.

OpenTracing

Well, OpenTracing is not a real “tool”, but rather a standard that describes how to make your code collect traces. Managing the traces, transporting them to collectors and storing the collected data is the responsibility of concrete implementations, for instance:

  • Jaeger again, which has already been rejected for this exercise.
  • Zipkin. There is a bridge between Zipkin and the OpenTracing standard, implemented for several languages. Perhaps it could be wired into Python and browser JavaScript as well, but that’s definitely overkill for this exercise.
  • Other tracers are listed on the corresponding page of the OpenTracing docs; however, they are proprietary and cannot easily be installed on my laptop.

Verdict: still going with Zipkin without OpenTracing. Even though OpenTracing provides a simpler interface, I didn’t want to spend extra time integrating a bridge to Zipkin.
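For reference, the vendor-neutral API that OpenTracing defines is quite small. Here is a minimal JavaScript sketch, assuming the opentracing npm package (which ships a no-op tracer by default); span and tag names are purely illustrative:

const opentracing = require('opentracing');

// the no-op tracer from the standard; a real deployment would plug in
// a concrete implementation (Jaeger, a Zipkin bridge, etc.)
const tracer = new opentracing.Tracer();

const span = tracer.startSpan('save_todo');           // start a new span
span.setTag('user', 'johnd');                          // attach metadata

const child = tracer.startSpan('write_to_redis', {    // model a sub-operation
  childOf: span.context()
});
child.finish();

span.finish();                                         // report the span (a no-op here)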

OpenCensus

Another attempt to standardise the collection of traces (and not only traces), including the wire protocol! A very interesting and promising concept, although almost everything is in alpha and under active development at the moment.

I planned to collect tracing data with OpenCensus and then forward it to Zipkin or Jaeger for storage and visualisation. However, the project is developing so quickly that I might have ended up with an outdated set of tools by the end of the exercise.

Verdict: too immature at the moment.

Conclusion

After considering several options, I decided to go with Zipkin, because it was the simplest way forward at the moment. The other options are worth investigating; they look pretty interesting in the long run (especially OpenCensus, which aims to standardise the collection of more than just traces, covering metrics as well).

If you’re planning to implement distributed tracing in your organization, follow the advice I’ve heard many times from different people: start fast, then iterate. That is, start with the simplest solution that requires minimal integration time (e.g. Zipkin), and implement it within your team and perhaps 2–4 other teams who agree to participate in the experiment. By that point you will have built enough expertise in distributed tracing in general and collected enough requirements for the tracing system. Once that’s done, you can rethink your choice and either continue with the tool you picked or switch to something else.

Injecting Zipkin into the app

There is nothing difficult about that, thanks to Zipkin’s extensive documentation. I’ll give a brief explanation for each framework I used.

VueJS, JavaScript (for the web UI)

At first things were a little complicated, but with an interceptor for vue-resource it became a breeze. Please refer to the README for a step-by-step integration guide.
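For illustration, a minimal interceptor that starts a span and propagates it to the backend with the standard B3 headers could look roughly like this. This is a sketch, not the exact code from the repo: it assumes a zipkin-js tracer created elsewhere at startup, and the service name is illustrative.

import Vue from 'vue';
import {Annotation} from 'zipkin';

// assumes `tracer` is a zipkin-js Tracer created at application startup
Vue.http.interceptors.push((request, next) => {
  tracer.setId(tracer.createRootId());   // new trace for this outgoing call
  const traceId = tracer.id;

  // propagate the span context to the backend via B3 headers
  request.headers.set('X-B3-TraceId', traceId.traceId);
  request.headers.set('X-B3-SpanId', traceId.spanId);
  request.headers.set('X-B3-Sampled', '1');

  tracer.recordServiceName('frontend');
  tracer.recordRpc(`${request.method} ${request.url}`);
  tracer.recordAnnotation(new Annotation.ClientSend());

  next(() => {
    // close the client span once the response comes back
    tracer.recordAnnotation(new Annotation.ClientRecv());
  });
});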

SpringBoot, Java (for users-api)

Extremely easy to integrate; just follow the manual.
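For reference, with Spring Cloud Sleuth and the spring-cloud-starter-zipkin dependency on the classpath, the integration mostly boils down to a couple of properties. This is a sketch assuming the Sleuth version current at the time; values are illustrative and exact property names may differ between versions:

# application.properties (illustrative values)
spring.zipkin.baseUrl=http://127.0.0.1:9411/
# report every span; lower this in production
spring.sleuth.sampler.percentage=1.0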

echo, Go (for auth-api)

Actually, Zipkin provides standard middleware and an HTTP client that handle tracing data. The only thing I needed to do was convert the Zipkin middleware into echo-specific middleware, which takes a couple of simple steps:

//...initialise tracer...
e := echo.New()
tracingMiddleware := zipkinhttp.NewServerMiddleware(tracer)
e.Use(echo.WrapMiddleware(tracingMiddleware))

express, JavaScript for Node.js (for todos-api)

The HTTP part was very easy to integrate, because the express framework is covered by a corresponding instrumentation library.
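Roughly, it boils down to creating a tracer once at startup and plugging the express middleware in. Here is a sketch assuming the zipkin, zipkin-context-cls, zipkin-transport-http and zipkin-instrumentation-express packages; the collector endpoint and service name are illustrative:

const {Tracer, BatchRecorder} = require('zipkin');
const CLSContext = require('zipkin-context-cls');
const {HttpLogger} = require('zipkin-transport-http');
const {expressMiddleware} = require('zipkin-instrumentation-express');
const express = require('express');

// keeps the current span context in continuation-local storage
const ctxImpl = new CLSContext('zipkin');

// batches spans and ships them to the Zipkin collector over HTTP
const recorder = new BatchRecorder({
  logger: new HttpLogger({endpoint: 'http://127.0.0.1:9411/api/v1/spans'})
});

const tracer = new Tracer({ctxImpl, recorder, localServiceName: 'todos-api'});

const app = express();
app.use(expressMiddleware({tracer}));   // every incoming request gets a server span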

However, if you take another look at the architecture diagram, you’ll notice that todos-api sends messages to log-message-processor over Redis asynchronously. I don’t want this operation to disappear from the overall picture, which is why a little manual instrumentation is needed:

  1. First, notice that there is a package that provides a CLSContext. It simplifies access to the current span context in the code. Basically, every time todos-api receives an HTTP request from the frontend, it creates a span context and stores it in continuation-local storage, so the span context can be extracted from the storage later. What for?
  2. For serialising it into a JSON object! Then it’s easy to include it in the message and send the message to the queue. All the necessary information, including the traceId and spanId, will be there, so log-message-processor can pick it up. Here’s the method of TodoController that sends messages to Redis:
_logOperation (opName, username, todoId) {
  this._tracer.scoped(() => {
    // get current span context from continuation local storage
    const traceId = this._tracer.id;

    // send the message to the queue
    this._redisClient.publish(this._logChannel, JSON.stringify({
      // include span context into the message
      zipkinSpan: traceId,

      // include actual business data into the message
      opName: opName,
      username: username,
      todoId: todoId,
    }))
  })
}

Python3 (for log-message-processor)

There is a third-party integration library from Yelp. IMHO it looks cumbersome, because the example suggests providing all Zipkin-related configuration parameters every time I need to start a new span. I would rather create a tracer when my application starts, pass it around where needed and create new spans with it, similar to what another instrumentation library for OpenTracing does.

Anyway, for my little exercise py_zipkin is OK, but for bigger projects you may want to wrap it with some bells and whistles that hide the cumbersome parts. In my case the “integration” looks like this:

import os

import requests
from py_zipkin.util import generate_random_64bit_string
from py_zipkin.zipkin import ZipkinAttrs, zipkin_span

zipkin_url = os.environ.get('ZIPKIN_URL', 'http://127.0.0.1:9411/api/v1/spans')

# transport function to be used by the tracer as transport_handler
# see usage below
def http_transport(encoded_span):
    requests.post(
        zipkin_url,
        data=encoded_span,
        headers={'Content-Type': 'application/x-thrift'}
    )

# receive_message_from_redis() and log_message() are the application's own helpers
message = receive_message_from_redis()
span_data = message['zipkinSpan']

with zipkin_span(
    service_name='log-message-processor',
    zipkin_attrs=ZipkinAttrs(
        trace_id=span_data['_traceId']['value'],
        span_id=generate_random_64bit_string(),
        parent_span_id=span_data['_parentId']['value'],
        is_sampled=span_data['_sampled']['value'],
        flags=None
    ),
    span_name='save_log',
    transport_handler=http_transport,
    sample_rate=100
):
    log_message(message)

As discussed earlier, the span data arrives with the message that log-message-processor receives from the queue.
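To make the hand-off concrete, the message that travels through Redis looks roughly like this; the zipkinSpan field mirrors the internal structure of a serialised zipkin-js TraceId (which is why the Python code reads fields such as _traceId), and all values here are illustrative:

{
  "zipkinSpan": {
    "_traceId": {"value": "86154a4ba6e91385"},
    "_parentId": {"value": "c8c50ebd2abc179e"},
    "_sampled": {"value": true}
  },
  "opName": "CREATE",
  "username": "johnd",
  "todoId": "42"
}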

Distributed tracing my app!

All right, let’s take a look at the results of our little experiment. The idea was to get better observability of the system, and here’s what I got after poking around for some time:

What do I see here? All my activity: when (and how fast) I logged in, and what API calls my actions triggered. Scale this to a couple of thousand users and you will feel the power in your hands :).

Let’s take a closer look at a trace of some operation:

First comes the span named todos-api; it in fact represents the interaction between the frontend and todos-api:

Then comes the “whitespace” between the todos-api and log-message-processor spans: the time the message spent in the Redis queue.

Finally, log-message-processor took ridiculously long for some reason. Fortunately, it’s an async operation, so the end user doesn’t feel it.

Looks pretty helpful, doesn’t it? To get even more out of distributed tracing, let’s consider what questions it can help to answer.

1. Which services does my distributed app consist of?

That’s right: when you have collected enough tracing data, you can render it into a huge dependency diagram displaying all your services and the interactions between them. Zipkin gives this to you for free. Some other tools (e.g. the commercial Instana) can even display such a diagram almost in real time.

Dependency diagram by Zipkin

2. Are there any architecture violations?

For instance, this particular toy app requires log-message-processor to record no more than one log entry per operation. Duplicated log entries for the same operation can be confusing, not to mention that they waste storage space.

The current setup makes Redis broadcast messages to all listeners. Thus, if I launch another log-message-processor instance, there will be duplicates. Let’s take a look:

Oops! 2 log writes for the same POST action! Time to trigger an alarm and investigate what’s wrong :).

3. Was there an action that took too long?

I’m not going to define “too long” here, because it means different things for different projects and actually gets us into SLA and SLO topics that deserve another article. Let’s just take a look at a simple example of a CREATE action:

It took ~2000ms in total, and the majority of the time was spent in log-message-processor. That tells us the logging operation should be investigated and, perhaps, improved. Moreover, if the logging operation affects our dear customer, we could define an SLO like “New log messages must become available within 1100ms after the user initiated the CREATE operation” and monitor it with distributed tracing. Setting up an alert that fires when the SLO is breached would be a good idea; unfortunately, Zipkin does not support alerting out of the box.

4. Was there any failed action?

This is an addition to the previous section. If we can monitor the latency of an operation, then we should be able to monitor errors too, right? Absolutely! Let’s simulate a failure by switching off todos-api.

Oops! A broken trace! Let’s take a look at the details:

That’s pretty handy, isn’t it? It shows that something bad happened and where it broke. That’s enough to start fixing :)

Conclusion

  1. Distributed tracing is an extremely helpful tool. If you’re only collecting logs and metrics, you’re enjoying 2 of the 3 observability aspects. Want to know more about your distributed app? Add some tracing!
  2. Distributed tracing allows you to control SLOs that are difficult to monitor with metrics and logs. For instance, when a business operation spans several services and involves async processing (“A record created by the user must be available to read back within 2000ms”), you can collect traces, monitor their duration and reason about that data.
  3. Distributed tracing also helps you build a service dependency diagram! How many times have you struggled to remember which services exist and how they communicate? Most probably there are hand-drawn dependency diagrams, likely poorly maintained and missing some components. Tracing gives you the whole picture automatically.
  4. Don’t spend too much time up front trying to pick the “proper” tracing tool once and for all. Start with the simplest solution, build expertise in the distributed tracing domain, clarify your requirements and only then rethink your choice. For instance, you may find that your whole company produces tons of data, so it’s easier to buy a commercial product than to support something in-house (like Zipkin or Jaeger).
