Explained: An Introduction to Distributed Tracing
When running a distributed system that spans multiple containers and data centers, we need to understand how user requests flow through our services. This lets us see the larger picture: where a request spends most of its time, and which service is not working as intended.
Most services communicate with each other through a fixed set of protocols such as HTTP (as REST calls), RPC, or a queuing-based mechanism where one server acts as a producer and the other as a consumer. In this article of the Explained series, we will explain how to implement distributed tracing when services communicate with each other through REST calls.
Let us assume we have two services: an order service to provision orders and an invoice service to generate invoices.
Just a simple header!
One easy way to achieve this is by adding a header to every request from our order service to the invoice service. Let us call it trace; it can be a simple UUID, which makes it effectively unique. We can add a request interceptor or filter to put the trace ID into the log context. Even though we are now logging the same unique ID across all the services a request passes through, we still can't see the combined logs for an entire API call between our two services. We need a way to collect the logs of both applications, and a Filebeat log shipper can help us here. It is deployed as a separate service that collects logs from multiple containers and applications, indexes them in an Elasticsearch cluster, and lets us visualize them through a Kibana dashboard. This setup allows us to query the logs of both applications with a single search.
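The trace-to-log-context idea above can be sketched in Python using the standard library's logging filters and a context variable. The names `current_trace`, `TraceFilter`, and `new_trace_id` are illustrative assumptions, not part of any framework:

```python
import logging
import uuid
from contextvars import ContextVar

# Hypothetical context variable holding the current request's trace ID.
# A request interceptor would set it on the way in.
current_trace: ContextVar[str] = ContextVar("current_trace", default="-")

class TraceFilter(logging.Filter):
    """Inject the current trace ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace = current_trace.get()
        return True

def new_trace_id() -> str:
    """A simple UUID keeps the trace ID effectively unique."""
    return uuid.uuid4().hex

# In an interceptor: reuse the caller's trace header if present,
# otherwise start a new trace, e.g.
#   current_trace.set(headers.get("trace") or new_trace_id())
```

With the filter attached to a handler, a log format such as `%(asctime)s [%(trace)s] %(message)s` stamps every line with the trace ID, which is what makes the cross-service Kibana query possible.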
While the above system is a good start, it is just the beginning. What if the order service calls the invoice service multiple times within the same API call? The trace ID alone cannot distinguish those calls. We need another ID that represents a single span of operation within the entire trace. It can be another UUID, distinct from the trace ID. Let us call this the span ID.
With the above modification, the order service now creates an overarching parent trace ID and sets it in its log context. For every API call to the invoice service, the order service sends that parent trace ID through a header named traceparent. The invoice service then creates a span ID, sets it in its log context, and performs any necessary operations under that span ID. Once the invoice service has completed its job, it returns its response along with the span ID and parent trace ID in a response header named traceresponse. The order service audits the response and marks the end of that span. Finally, the trace is marked as complete when the order service returns the response to the client.
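The traceparent/traceresponse exchange can be sketched as two small functions, one per service. The function names and the traceresponse payload layout (parent trace ID and span ID joined by a hyphen) are illustrative assumptions for this article's scheme, not a standard format:

```python
import uuid

def outgoing_headers(trace_id: str) -> dict:
    """Order service: propagate the parent trace ID on each call."""
    return {"traceparent": trace_id}

def handle_invoice_request(headers: dict) -> tuple[dict, dict]:
    """Invoice service: open a span under the parent trace,
    do the work, and report the span back in traceresponse."""
    parent = headers["traceparent"]
    span_id = uuid.uuid4().hex
    # ... perform the invoicing work, logging under (parent, span_id) ...
    response_headers = {"traceresponse": f"{parent}-{span_id}"}
    return response_headers, {"status": "invoiced"}
```

On receipt of the response, the order service would parse traceresponse, confirm the parent trace ID matches the one it sent, and mark that span as ended.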
The above mechanism lets us search for a particular trace and span ID combination, with a distinct end-of-span marker provided by the response header. It also lets us verify whether the invoice service truly participated in the tracing.
Not just a simple header anymore!
Up until now, we have only looked at functional use cases involving multiple API calls among the services. But what about the non-functional ones? What if an existing system already uses a different (non-UUID) span or trace identifier? We need to version our trace format and include meta-information, such as a version number, in the trace header. This gives services the context to handle traces produced under different versions.
Secondly, what if a third service calls the invoice service but does not participate in distributed tracing? We need a way to know whether the caller is sampling the tracing information. Hence, we need some flag bits to represent various predicates to consider during the trace.
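Both the version number and the sampling flag appear in the W3C Trace Context traceparent header, which has the layout `version-traceid-parentid-flags` with hex-encoded fields; bit 0 of the flags byte is the "sampled" flag. A minimal parser, assuming well-formed input for brevity:

```python
from typing import NamedTuple

class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

def parse_traceparent(value: str) -> TraceParent:
    """Split a W3C traceparent header: version-traceid-parentid-flags.
    Assumes a well-formed value; a real parser would validate lengths."""
    version, trace_id, parent_id, flags = value.split("-")
    return TraceParent(version, trace_id, parent_id, int(flags, 16))

def is_sampled(tp: TraceParent) -> bool:
    """Bit 0 of the flags field marks whether the caller sampled this trace."""
    return bool(tp.flags & 0x01)
```

A caller that does not participate in tracing simply omits the header, and the receiving service can start a fresh trace instead.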
Lastly, we might run into situations where we have to pass more than just the tracing version: a complete state map. This means we should accommodate another header, tracestate, to hold such data. In essence, the traceparent, traceresponse, and tracestate headers round out our basic RESTful distributed tracing.
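In the W3C Trace Context standard, tracestate carries that vendor state map as a comma-separated list of key=value entries. A minimal parser, again assuming well-formed input:

```python
def parse_tracestate(value: str) -> dict:
    """Parse a W3C tracestate header: comma-separated key=value entries.
    Assumes well-formed input; entry order conveys precedence in the spec,
    which a plain dict does not preserve beyond insertion order."""
    state = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        key, _, val = entry.partition("=")
        state[key] = val
    return state
```

Each participating system reads the whole map but updates only its own key, so state survives hops through services owned by different vendors.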
We have only described the possibilities in a synchronous RESTful system. Asynchronous systems pose their own unique challenges: chronologically, the end of the invoice service's span might not coincide with the end of the trace, which opens our model up to a whole array of modifications. Luckily, the W3C has already outlined rules, described solutions, and developed standards along the same lines as what we have described above.
Several existing tracing systems adhere to the W3C standard, including the open-source Jaeger and Zipkin projects and AWS X-Ray. The Cloud Native Computing Foundation hosts projects focused on this same problem, as more and more companies adopt microservices. In our upcoming Explained articles, we will look at Jaeger and OpenTracing to see how distributed tracing is implemented in practice.