Why you need Tracing in your distributed systems
Don’t lose the trees for the forest
Correlation tracing: it’s not just for microservices
Over the past couple of months, I’ve been doing a lot of writing about microservice architectures, clearly the “flavour of the year” architectural pattern. However, I’ve had feedback from readers that not everyone is there yet, and in many cases may never be.
That’s fair. Microservice architectures have many distinct advantages over more traditional, monolithic systems, and I don’t think I need to expound any further on why. Even so, lots of devs and engineers are working on legacy systems, or on systems which just don’t suit a microservices architecture.
So instead of just talking microservices, I’d like to focus some of my posts on things applicable not only to microservices, but to distributed systems as a whole. A given organisation may never choose to go down the microservices path, but it assuredly has distributed systems.
Today, I’d like to talk about correlation / distributed tracing and how it can save you from the nightmare of a broken distributed system.
First off: What’s a distributed system?
Let’s just clarify our terms up front, so we all start on the same page. A distributed system is a pretty common architecture these days.
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another in order to achieve a common goal.
In essence, imagine you have one or more web apps / components / doohickeys which work together through APIs / messaging and events to coordinate their actions, each one playing some role in a common goal. Microservices are a form of distributed systems, but certainly not the only kind.
Second up: What’s the problem we’re trying to solve?
Start with a scenario: You and your team have deployed the most recent changes to your application stack. You have a new feature which lets a user send an invitation to one of your other users to join a group; the other user clicks a link, and presto, the second user has joined the group they were invited to.
Now, this is clearly NOT a single process. You could try to build it as one, but it’s far easier to build it as a series of discrete steps, each one accomplished by some subsystem of your overall platform.
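To make the flow concrete, here’s a toy sketch of those steps as a chain of function calls, where each function stands in for a separate service in the flow. All the names and shapes here are hypothetical, purely for illustration:

```python
# A toy sketch of the invitation flow. Each function stands in for a
# separate service in the distributed system; all names are made up.

def create_invitation(inviter, invitee, group):
    # Invitation service: records the invite, then hands off to email
    invitation = {"inviter": inviter, "invitee": invitee, "group": group}
    return send_invite_email(invitation)

def send_invite_email(invitation):
    # Email service: "sends" the link the invitee will click
    invitation["link"] = "https://example.test/join/" + invitation["group"]
    return invitation

def accept_invitation(invitation):
    # Membership service: runs when the invitee clicks the link
    return {"user": invitation["invitee"], "group": invitation["group"], "joined": True}

result = accept_invitation(create_invitation("alice", "bob", "book-club"))
print(result["joined"])  # True
```

In a real system each of those functions would be a network hop (an API call, a queue message, an event), which is exactly where the trouble starts.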
This is stupidly simple, but it gets the point across
Now, as I’m sure you’ve worked out, each of these coloured boxes is a process which runs as part of your distributed system. As such, any one of them is subject to failure / errors. Like this:
Any time you run a process flow through a distributed system, you are at risk that some joker system in the flow will fail. Things go wrong. That’s part of how it works. Why would your system (or better put, some part of your system) fail? Who can say? This, that, the other thing. It doesn’t actually matter WHY some part fails, the fact is that something always does. So now what?
I’ve heard this one before: I’ll just jump on and read the logs. Sounds simple, right?
Before you answer, let me exacerbate this nastiness for you. Each of those coloured boxes is not a single machine (don’t be silly). Each one represents a process, which could be running on any number of clustered machines. Serverless or not, microservice or not, you could have anywhere from 1 to a couple hundred (thousand?) little clones of your service processing this part of the flow.
Now, in any standard system, those coloured boxes are not only clustered but load-balanced, so that with just three services, you could have hundreds of possible combinations of which ServiceA calls which ServiceB calls which ServiceC.
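The arithmetic here gets ugly fast. With made-up instance counts (say five ServiceA instances, eight ServiceB and six ServiceC), a single request could have traversed any of 5 × 8 × 6 = 240 distinct instance paths:

```python
# Hypothetical instance counts for three load-balanced services
instances = {"ServiceA": 5, "ServiceB": 8, "ServiceC": 6}

# Any A instance can call any B instance, which can call any C instance,
# so the number of possible paths is the product of the instance counts
paths = 1
for count in instances.values():
    paths *= count

print(paths)  # 240
```

And every one of those paths leaves its log lines on a different set of machines.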
I think you’d agree: that’s a LOT of log files to trawl through. Even if you DO find the right log file, how long will it take to know it’s the right one? You can start searching for invitation or user IDs, but what sort of nightmare is that? You have to find the right details on the successful machines, then parse through the logs of countless other machines to find the actual failing service which bombed out. Even using the likes of Splunk or Kibana, you are in for a long ride down log-parse lane, and that is never a fun trip, to be sure.
As a DevOps team, we want to get this thing fixed as soon as possible. Not only do we have unhappy users, but we have a failing process in our system which is making us look like simpletons. We need to be able to work out exactly who in the chain failed, and why. We want to be able to remove the offending service if it is failing regularly, and at the very least, we want to be able to see the chain of events which took us from the first slob sending the invite to the second slob never receiving it.
So what do we do?
Enter Correlation / Distributed Tracing
Distributed Tracing (sometimes referred to as Correlation Tracing) is a means by which you can save yourself a LOT of unnecessary log trawling, and get right down to the root of the problem. A typical setup:
- Assigns each inbound request a unique external correlation id (think of this as a unique identifier for the invitation process taking place between our Slob and our Second Slob)
- Passes the external correlation id to all services that are involved in handling the process flow
- Includes the external correlation id in all log messages
- Records information (e.g. start time, end time) about the requests and operations performed when handling the process flow in a centralized service (CloudWatch, Splunk, DataDog, Kibana, etc.)
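The first three points above can be sketched in a few lines of Python, using only the standard library. The service names, the log shape, and the header convention mentioned in the comments are all illustrative, not any particular tool’s API:

```python
# Minimal sketch of correlation-id propagation. Everything here is
# illustrative; real systems use a tracing library for this.
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("demo")

def handle_inbound_request(payload):
    # Edge service: assign the inbound request a unique correlation id
    correlation_id = str(uuid.uuid4())
    log.info(json.dumps({"cid": correlation_id, "svc": "ServiceA", "msg": "invite received"}))
    call_downstream("ServiceB", payload, correlation_id)
    return correlation_id

def call_downstream(service, payload, correlation_id):
    # Downstream services receive the same id (in practice via an HTTP
    # header such as X-Correlation-ID, or a message attribute) and
    # include it in every log line they write
    log.info(json.dumps({"cid": correlation_id, "svc": service, "msg": "processing"}))

cid = handle_inbound_request({"invitee": "bob"})
```

Because the id is minted once at the edge and passed along every hop, every log line for one invitation shares one value, no matter which machine wrote it.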
Now, with a small amount of effort on our part (I’ll come to some common distributed tracing tools you can use in your code), you can get something along these lines:
Now we’re on to something! You are no longer held hostage by countless log entries: you can use the correlation id to work out where in the chain things went (to borrow a fun idiom) pear-shaped. Because the correlation id follows along in the flow, we have something we can use to find our broken little culprit.
When we bring distributed tracing to bear using log file aggregation, we have ourselves a very fast way of working out where things fell apart.
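Once every log line carries the id, “working out where things fell apart” reduces to a single filter over the aggregated logs, which is exactly what a Splunk or Kibana query gives you. A sketch with hypothetical log entries:

```python
# Aggregated log entries from many machines (all entries are made up).
# Every line carries the correlation id ("cid") of the flow it belongs to.
aggregated_logs = [
    {"cid": "abc-123", "svc": "ServiceA", "level": "INFO",  "msg": "invite received"},
    {"cid": "xyz-789", "svc": "ServiceA", "level": "INFO",  "msg": "unrelated request"},
    {"cid": "abc-123", "svc": "ServiceB", "level": "INFO",  "msg": "email queued"},
    {"cid": "abc-123", "svc": "ServiceC", "level": "ERROR", "msg": "membership write failed"},
]

# One filter recovers the whole trace for the failing invitation...
trace = [e for e in aggregated_logs if e["cid"] == "abc-123"]

# ...and one more points straight at the broken service
failures = [e for e in trace if e["level"] == "ERROR"]
print(failures[0]["svc"])  # ServiceC
```

No trawling hundreds of machines; one query, one culprit.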
Distributed Tracing Libraries
There are any number of distributed tracing libraries at your disposal.
- AWS X-Ray: For those on AWS, this beauty works in conjunction with CloudWatch and a custom visualisation tool to make distributed tracing easy.
- Azure Application Insights: On Azure, you can use the Application Insights library to build distributed tracing into your apps.
- OpenCensus: An open-source telemetry / distributed tracing platform for a wide variety of languages.
- OpenTracing: Another open-source initiative for creating telemetry / distributed tracing in your application. It’s pretty rough and ready, but they seem to be headed in the right direction.
Distributed Tracing Aggregators
Once you’ve got your apps creating distributed traces, you need a tool which can bring them together for you. This is a big topic, as what you’re now doing is bringing together all this trace information into a meaningful, centralised source. I’d highly recommend you review these tools in-depth, to ensure you are using the right one for your organisation. More than likely, if you have a decent-sized platform already, your organisation has something which can probably help.
- Splunk: Splunk has a variety of tools to help you with distributed tracing. With that said, like all things Splunk, you have to do a good amount of technical work to get back what you’re after. Very powerful, but not my personal preference (no offence intended; Splunk has some amazing power behind it).
- AWS X-Ray: I’ve used X-Ray at Coactive for our API and SignalR hub, and it does a nice job. If you are using native AWS serverless tech (like Lambda), you just call their X-Ray library for your traces, and it takes care of a lot of the boilerplate for you. If you build your own containers or EC2 instances, you will need to run the X-Ray daemon (either on the machine itself or as a sidecar container), which passes your trace data along to the X-Ray service.
- DataDog: Out of the box, DataDog has tracing capabilities ready to go. You just need to wire it up and you are ready. I’ve never used DataDog myself, but I have heard good things from people who have. I’m actually considering it for my next project. Or perhaps…
- Zipkin: This is a new one to me, but again, I’ve talked with people who love it. Out of Twitter, and based on Google’s internal distributed tracing system (Dapper), Zipkin has a good-sized following. Here’s a great article on the topic by Joab Jackson.
If you’re new to the world of distributed tracing, this is a lot to take in. In a world where lots of devs still kill themselves digging through log files and wishing someone would drive a bus through their cubicle and take the pain away, distributed tracing can feel like a massive effort to get up and running.
And yes, it is a decent investment to make this a reality. With that said, start small: pick one small flow in your app and try it out. I think you will be very happy with the results. While everyone else around you is struggling to make any sense of what’s gone wrong, you’ll know at a glance exactly where your one small flow broke.
Way to make yourself a superstar.
Hope this helps.