Distributed tracing — the most wanted and missed tool in the micro-service world.

We, as engineers, always want to simplify and automate things. This is in our nature. That’s how our brains work. That’s how procedural programming was born. After some time, the next level of evolution was object-oriented programming.

The idea was always the same: take something big and split it into isolated abstractions with hidden implementations. It is much easier to reason about a complex system using abstractions, and it is far more efficient to develop isolated blocks.

The way we did system architecture design was evolving too:

  1. Client-server N-tier applications
  2. Service oriented architecture
  3. Micro-service architecture

These approaches are based on the same ideas:

  • decouple the system into smaller units
  • each unit has its own responsibility — it does one thing and it does it well
  • each unit is available via exposed API but has a hidden implementation
  • all units are isolated blocks — you can develop, deploy, scale and monitor them independently
  • units can interface with each other via RPC calls

This sounds reasonable. We can build complex systems just by adding new small parts or reusing existing ones. We build blocks; then we can build anything we want out of those blocks. It is like “Lego”, where you can add new parts (if you need them, of course).

But we always have to keep a balance, and micro-service architectures are no exception. The more you decouple things, the less control you have. All you will know is that you have a cool infrastructure where lots of micro-services talk to each other. At some point you won’t be able to see the big picture at all.

The micro-service world.

This is a typical micro-service architecture. We can see six micro-services with some dependencies (possible RPC calls) between them. Also, after some research, we have found some infrastructure usage patterns. Let’s name them “micro-service routes”:

So, here we see 3 micro-service routes:

  1. “New user signup” — implemented with micro-services A, B, C and E
  2. “Place an order” — implemented with micro-services A, B, C and D
  3. “Make a payment” — implemented with micro-services A, B, C, D and F

As we can see:

  • some services are more popular and important (A, B, C and D)
  • some services are not used a lot (E and F)
  • popular services might become a bottleneck for several routes (C and D)
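The route list above is just data, so we can check these observations mechanically. A small Python sketch (the route and service names are the ones from the example above; nothing else is assumed) that counts how many routes each service participates in:

```python
from collections import Counter

# Route definitions taken from the example above.
routes = {
    "New user signup": ["A", "B", "C", "E"],
    "Place an order":  ["A", "B", "C", "D"],
    "Make a payment":  ["A", "B", "C", "D", "F"],
}

# How many routes does each service participate in?
usage = Counter(s for services in routes.values() for s in services)

# Services shared by more than one route are the popular ones --
# and the candidates for becoming a shared bottleneck.
popular = sorted(s for s, n in usage.items() if n > 1)
print(popular)  # → ['A', 'B', 'C', 'D']
```

E and F each appear in a single route, which is exactly why a slowdown in them hurts one flow, while a slowdown in a shared service hurts several at once.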

Imagine a situation where your “Make a payment” flow becomes slow for some reason. You are collecting payments three times slower than usual. You need to see where it gets stuck. Also, what if it slows down only in some particular cases? Say, only for customers from Australia who have unread messages in their inboxes, so they are able to pay in only 30% of cases. You are losing money and loyal customers. You know that the “Make a payment” route includes A, B, C, D and F. At the same time you have no idea what the hell is going on. Sounds like a nightmare, right?

In the classical software world we can easily find problems using tools like a debugger and a profiler. This works well for monolithic applications where everything runs inside one process. But… what if your application is doing an RPC call? Guess what? All that you will see in your super cool APM/debugger/profiler/etc. is something like “external call to XYZ-system, time taken 16.5 seconds”. Wow, this is really slow, right? Cool, you’ve just found a bottleneck!

Not so fast.

Will this finding help you understand why your app has been experiencing latency issues? The answer is “Yes”. Will it help you fix the problem and make your app fast again? The answer is “No”. What if that “XYZ-system” is calling 3 other micro-services while serving your call? What if those 3 micro-services also call other micro-services? What if you have, say, 20+ micro-services talking to each other in many different ways?

This is the world where all your business logic is decoupled into small, isolated, distributed parts available over the network (aka micro-services). The world where you don’t see anything. Welcome!

So, you need to see the big picture:

  • which micro-service is failing
  • which routes are affected by the failing part
  • under which circumstances it is failing
  • how often it is failing

Distributed tracing system.

We know everything about any particular micro-service. But we want to see how they interact as a group in different cases. So, we need a different view.

Remember the “Network” panel in the Chrome debugger or Firebug? When you want to understand why some page loads so slowly, you open the debugger’s “Network” panel with its timeline chart. There you can see which resource blocks the page loading/rendering process.

Firebug “Net” panel timeline chart

Now, imagine that you have the same thing, but for your backend! In this case your resources are micro-services, and the call tree is your micro-service route. Also, we can see a timeline! Micro-services wait for each other as they make RPC calls and wait for replies.
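To build such a timeline, a tracer needs one record per RPC call. Here is a minimal sketch of such a record in Python; the field names are my own, but they mirror the kind of data Dapper-style tracers collect:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One timed unit of work -- typically one RPC call."""
    trace_id: str             # shared by every span in one micro-service route
    span_id: str              # unique id of this particular call
    parent_id: Optional[str]  # the span that caused this call (None for the root)
    service: str              # which micro-service did the work
    start_ms: float           # when the call started
    end_ms: float             # when the call finished

# The root span of a route, plus one child call (illustrative values):
root  = Span("trace-1", "s1", None, "A", 0.0, 120.0)
child = Span("trace-1", "s2", "s1", "B", 5.0, 30.0)
```

The shared `trace_id` is what lets the tracing system group spans from many different machines back into one route, and the `parent_id` links turn the flat list of spans into the call tree.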

So, instead of this:

You will see this:

So, now we can see:

  1. “A” calls “B” and waits
  2. After “A” receives the reply from “B”, it calls “C”
  3. “C” spends a little time and then calls “D”
  4. “D” spends some time and calls “F”
  5. “F” replies back to “D” pretty fast
  6. “D” receives the reply from “F” and spends looots of time doing something
  7. “D” replies to “C”
  8. “C” replies to “A” without doing anything else; it replies as soon as the reply from “D” arrives
  9. “A” receives the reply from “C”
  10. Our “Make a payment” route is finished and we send the final result to the client

Huh, that’s a lot of insights! We had no idea that it worked like that. So, do you see where most of the time was spent? Right, it is micro-service “D”! “A” is waiting for “C” and “C” is waiting for “D”. Micro-service “D” takes about half of the time spent in this micro-service route. So, now we know that the “Make a payment” route has a bottleneck, and it is micro-service “D”. The next question to ask is: in what other routes does micro-service “D” play a role? Is it a bottleneck there too? Anyway, now we know how to fix the “Make a payment” route right away: just fix micro-service “D”.
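Finding the bad guy can even be automated. A sketch of the idea, with made-up timings (in ms) that match the story above: a service’s “self time” is its own duration minus the time it spent waiting for the services it called.

```python
# Hypothetical timings matching the timeline above:
# (service, parent service, start ms, end ms) for each call.
spans = [
    ("A", None, 0,  100),  # the whole "Make a payment" route
    ("B", "A",  2,   20),  # A calls B and waits
    ("C", "A",  22,  95),  # then A calls C
    ("D", "C",  25,  90),  # C calls D almost immediately
    ("F", "D",  30,  38),  # D calls F; F replies fast
]

def self_time(service):
    """Own duration minus time spent waiting for child calls."""
    own = sum(e - s for svc, _, s, e in spans if svc == service)
    waiting = sum(e - s for _, parent, s, e in spans if parent == service)
    return own - waiting

bottleneck = max({svc for svc, *_ in spans}, key=self_time)
print(bottleneck, self_time(bottleneck))  # → D 57
```

With these numbers, “D” works for 57 ms of its own out of a 100 ms route, while “C” contributes only 8 ms of its own: exactly the “C is just waiting for D” pattern described above.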

I would like to draw your attention to these key points:

  • we can see a micro-service route as a timeline
  • we see the call stack and the dependencies between micro-services as a tree
  • this tree grows from left to right as time passes during one micro-service route’s life cycle
  • we can see the number of calls to any micro-service and how chatty the interactions are
  • this is the key point: we can see where most of the time is spent. There is always a bad guy everyone else is waiting for (remember lunch time, when you want to go to the burrito place so badly but you just can’t, because you are waiting for some dude?)
  • we can see network latency on this timeline too! So, we will know whether it is a micro-service problem or a network problem
  • this timeline can give you ideas about how to speed up your micro-service route. E.g., how about making the calls from “A” to “B” and “C” in parallel? Do we really need to wait for “B” before calling “C”?
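The parallelization idea from the last point can be sketched in a few lines of Python asyncio. The two calls here are stand-ins, not real RPC clients; the names and timings are made up for illustration:

```python
import asyncio

# Stand-ins for the real RPC calls from "A" (names and timings are made up).
async def call_b():
    await asyncio.sleep(0.02)   # pretend B takes ~20 ms
    return "b-result"

async def call_c():
    await asyncio.sleep(0.07)   # pretend C takes ~70 ms
    return "c-result"

async def make_payment():
    # Sequentially this would take ~90 ms; in parallel it takes ~70 ms,
    # because "A" no longer waits for "B" before calling "C".
    return await asyncio.gather(call_b(), call_c())

results = asyncio.run(make_payment())
print(results)  # → ['b-result', 'c-result']
```

Of course, this only works if the call to “C” does not actually depend on the reply from “B” — which is precisely the kind of dependency the trace timeline makes visible.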

You might ask: how can I get all this? This is what distributed tracing systems, like Google Dapper, give you. The actual implementation of a distributed tracing system is another big topic. I will write a post about it one day for sure, just stay tuned! The most curious readers can find lots of details in the Google Dapper research paper here.
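To give a taste of how such systems work: the core trick is attaching trace context to every outgoing RPC, so spans recorded on different machines can be stitched back together. A minimal sketch; the header names here are illustrative only (real systems use, e.g., the W3C `traceparent` header or Zipkin’s `X-B3-*` headers):

```python
import uuid

def inject_context(headers, trace_id=None, parent_span_id=None):
    """Attach Dapper-style trace context to an outgoing RPC's headers.

    Header names are illustrative, not from any real tracing system.
    """
    headers["X-Trace-Id"] = trace_id or uuid.uuid4().hex  # same for the whole route
    headers["X-Span-Id"] = uuid.uuid4().hex[:16]          # unique per call
    if parent_span_id:
        headers["X-Parent-Span-Id"] = parent_span_id      # links the call tree
    return headers

# Service A starts a trace; service B continues it downstream.
out_a = inject_context({})
out_b = inject_context({}, trace_id=out_a["X-Trace-Id"],
                       parent_span_id=out_a["X-Span-Id"])
assert out_a["X-Trace-Id"] == out_b["X-Trace-Id"]
```

Every service reports its spans (with these ids and timestamps) to a collector, and the collector reassembles them into the timeline views shown above.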

Now, you are in the micro-service world where you can see everything.