Embracing context propagation

This post illustrates some practical examples of distributed context propagation. More detailed examples are presented in Chapter 10 of my book, Mastering Distributed Tracing.

Introduction

One of the first thing to go as we make the software systems distributed is our ability to observe and understand what the application as a whole is doing. We are usually well-set with tools telling us what each individual component is doing, while leaving us in the dark about the big picture. Distributed or end-to-end tracing is one of the tools that can shine the light on the big picture and provide both the macro-view of the system, as well as the micro-view of each individual request executed by the distributed application.

There are different approaches used to implement tracing infrastructures. I cover a number of them in Chapter 3 of my book, Mastering Distributed Tracing. The most popular approach, used by almost all production-grade tracing systems today, is to pass certain metadata (aka trace context) along the path of the request execution, which can be used to correlate performance data collected from multiple components of the system and reassemble them into a coherent trace of the whole request. The underlying mechanism used to pass that trace context is called distributed context propagation.

Distributed context propagation is a generic mechanism that can be used for purposes completely unrelated to end-to-end tracing. In the OpenTracing API it is called “baggage”, a term coined by Prof. Rodrigo Fonseca from Brown University, because it allows us to attach arbitrary data to a request and have this data automatically propagated by the framework to all downstream network calls made by the current microservice or component. The baggage is carried alongside the business requests and can be read at any point in the request execution.

In this article I want to describe some of the practical examples of using distributed context propagation.

Labeling synthetic traffic, or “test tenancy”

In large systems built with microservices it is often impractical to support staging environments that have the same composition of components as real production and anything evenly remotely resembling the real production traffic. As a result, many organizations are turning towards testing in production, which involves deploying code under test in the midst of your production microservices mesh and pushing real traffic through it. In other cases we want to send test or synthetic traffic to our production instances, perhaps as a form of monitoring, i.e. to ensure that core business functions are being correctly performed, or to perform stress test to assess capacity.

Tagging this synthetic traffic with a certain label via distributed context propagation provides a number of benefits. Imaging you own a microservice that is being bombarded with such synthetic traffic and causing errors. If you follow good DevOps practices, you may have monitoring in place that measures the rate of errors produced by the service and alerts you when the rate exceeds a certain threshold. But you are only interested in getting this alert if production is being affected, not because some upstream service runs a stress test. By using the “synthetic traffic” tag from distributed context you can partition your error rate metrics into two time series, one for prod another for tests. Then you can set different thresholds on those time series.

We can also use the test tenancy (another name for the synthetic traffic label) to make our applications react differently to the requests. For example, a specially instrumented Kafka client may redirect messages with test tenancy to another dedicated cluster, to avoid messing up production topics. Similarly, a storage tier can redirect the requests with test tenancy to a read-only cluster.

Using distributed context for observability purposes (partitioning of metrics) is commonly accepted as a useful practice, since in this case the context is not used to affect the behavior of the application itself. The last example may cause discomfort to some people, since now we are using the context as a control function. However, if we restrict distributed context propagation from being used in this way, then people would find other ways of passing this information, only in a more ad-hoc, less reliable manner. Implementing context propagation correctly is not a trivial thing, especially given the many different threading and async programming models available to developers. I think it is better to have a single propagation framework that is well tested and maintained, than to leave it to ad-hoc implementations.

This illustrates the cross-cutting nature of distributed context propagation. We can always implement the same functionality without context propagation, by extending the APIs of our components to accept the desired parameters. But try to imagine how this would look in practice. Take the stress test example again and imaging you are a storage tier 5–6 levels down from the service running the test. For your storage service to know that it’s getting a test traffic, you would need to go through all those layers and negotiate with those developers to extend their APIs to capture and pass the tenancy parameter. Imaging having to do it in an organization that runs thousands of microservices! Such approach is infeasible in practice, as well as very rigid, since as soon as you need to pass another parameter you need to repeat the whole process again.

The power of distributed context propagation is that it achieves the propagation of metadata without any changes to the services that pass this metadata.

Attributing resource usage to lines of business

If your company runs multiple lines of business (think Gmail vs. Google Docs vs. Google Calendar, etc.), how can you tell how much of your hardware and infrastructure spend is attributed to each LOB? Higher in the stack it’s probably not difficult, since those LOB might have dedicated sets of microservices responsible for their distinct business functions, and you can directly measure how much compute resources those services require. However, as we move lower in the stack, to shared systems like storage or messaging platform, it becomes much more difficult to partition the spend on those systems by LOBs.

Context propagation to the rescue! When the request enters our system, we typically already know which LOB it represent, either from the API endpoint or even directly from the client apps. We can use context (baggage) to store the LOB tag and use it anywhere in the call graph to attribute measurements of resource usage to specific LOB, such as number of reads/writes in the storage, or number of messages processed by the messaging platform.

Traffic prioritization / QoS

Since the LOB traffic labeling was, again, mostly used in the “observe” function (measurements and metrics), let us consider another application of context propagation in the “control” function. Modern applications have many workflows that users can follow through the app. Not all of those workflows are of equal value and criticality. In a ride-sharing app, we could say that a request to take a trip is more important than a request to add a location to your favorites. However, when these requests eventually reach a shared infrastructure layer like storage, these criticality distinctions would generally be lost already. So if the storage tier is suddenly experiencing a traffic spike and needs to prioritize the handling of the request, it would be good if it could do that based on their criticality to the business. We can use context propagation to achieve that, for example, by assigning different tier numbers to all API endpoints and propagating those numbers in the context through the call graph. These tier labels can be used by different infrastructure layers to provide different quality of service based on the request criticality.

Implementing context propagation

The OpenTracing API provides a standard mechanism called “baggage” to capture, propagate, and retrieve distributed context metadata. Distributed tracing platform like Jaeger fully support baggage and any application instrumented with OpenTracing and using Jaeger client libraries can benefit from it.

I am also working with W3C Distributed Tracing working group to develop a standard format for propagating distributed context via HTTP and other transports (see https://github.com/w3c/correlation-context).

Summary

Distributed context propagation is a very powerful technique, especially in the new age architectures built with many microservices. It allows is to associate an arbitrary metadata with each request and make it available at every point in the distributed call graph transparently to the services involved in processing that request. In this post I discussed a few of the examples of using context propagation to solve real problems. There are many other examples, such as:

  • Pivot tracing
  • Passing fault injection instructions for Chaos Engineering
  • Testing in production
  • Debugging in production
  • Developing in production

To see detailed descriptions of these use cases, check out Chapter 10 of my book, Mastering Distributed Tracing.