Myth: service mesh can do distributed tracing of your application

Yuri Shkuro
6 min read · Aug 31, 2022

This is one of those myths that just wouldn’t die. It goes like this:

A service mesh product can enable distributed tracing of your application, without you having to change any code.

In the early days, service mesh products used this message as a marketing ploy. These days they paint a more honest and nuanced picture. For example, the Linkerd documentation (for v2.12) says (emphasis is mine):

Linkerd’s role in distributed tracing is actually quite simple: when a Linkerd data plane proxy sees a tracing header in a proxied HTTP request, Linkerd will emit a trace span for that request. This span will include information about the exact amount of time spent in the Linkerd proxy.

In other words, if your application is already participating in a distributed trace, Linkerd will log information about the request. Logging is a part of tracing, but it’s not the whole picture.

Envoy’s documentation (for v1.24) is less straightforward, but also says at the end (emphasis is mine):

Envoy provides the capability for reporting tracing information regarding communications between services in the mesh. However, to be able to correlate the pieces of tracing information generated by the various proxies within a call flow, the services must propagate certain trace context between the inbound and outbound requests.

That still feels like fine print to me, so it’s not surprising that the myth lives on: someone just perusing the documentation, or even reading the introduction, will find statements like “Envoy also supports distributed tracing via third-party providers” and can walk away with the wrong impression.

Can service mesh provide distributed tracing?

The answer is simple: they cannot, as a matter of principle. I am not aware of any technology that can implement distributed tracing by treating your application as a black box (which is what a service mesh does). All modern distributed tracing platforms are based on metadata propagation: as a request (or, more generally, a workflow) executes across multiple services, a piece of metadata (usually called trace context) is passed from service to service. It is this propagation of trace context that allows us to correlate different tracepoints along the request path and assemble their logged data into a single trace (learn more about the superpower of context propagation as a mandatory feature of distributed systems: Embracing Context Propagation).

Context propagation in an instrumented application.

This is a diagram from my book, Mastering Distributed Tracing, that illustrates how tracing works in an instrumented application. If we assume the caller has already started the trace and passed the trace context via request headers, then step (1) is a service handler wrapped in tracepoint instrumentation that (a) reads the header and converts it into an in-memory Context object containing the trace metadata, and (b) logs a tracing span to the tracing backend. (2) As the service does its job, the Context object is passed between functions, sometimes explicitly as a function argument (e.g., in Go apps) and sometimes implicitly as thread-local state. (3) If the service needs to make a downstream call to another service or a database, the call site is again wrapped in tracepoint instrumentation that serializes the Context into a wire representation, such as a request header, and logs another tracing span to the tracing backend.
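To make this concrete, here is a minimal sketch of those three steps in Go using the OpenTelemetry API. The service names, URLs, and span names are made up for illustration, and a real application would also configure a tracer provider, an exporter, and a propagator at startup:

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// handler corresponds to step (1): extract the incoming trace context from the
// request headers and start a server span that gets reported to the backend.
func handler(w http.ResponseWriter, r *http.Request) {
	propagator := otel.GetTextMapPropagator()
	ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))

	ctx, span := otel.Tracer("orders-service").Start(ctx, "GET /orders",
		trace.WithSpanKind(trace.SpanKindServer))
	defer span.End()

	// Step (2): the ctx value carrying the trace metadata is passed explicitly
	// through the application code to wherever downstream calls are made.
	callInventory(ctx)
}

// callInventory corresponds to step (3): start a client span and inject the
// trace context into the outbound request headers.
func callInventory(ctx context.Context) {
	ctx, span := otel.Tracer("orders-service").Start(ctx, "GET inventory",
		trace.WithSpanKind(trace.SpanKindClient))
	defer span.End()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://inventory.internal/items", nil)
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	if resp, err := http.DefaultClient.Do(req); err == nil {
		resp.Body.Close()
	}
}

func main() {
	// A real setup would also call otel.SetTracerProvider(...) and
	// otel.SetTextMapPropagator(...) here.
	http.HandleFunc("/orders", handler)
	http.ListenAndServe(":8080", nil)
}
```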

Note that if you want to rely on the service mesh to generate the tracing data, the instrumentation tracepoints are not required to log the spans, but they are still mandatory to ensure that the trace context is passed from inbound to outbound requests. In practice, ensuring that the application correctly propagates the context is the harder task, so once you have solved that, you might as well log the spans at the tracepoints too.
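For teams that want to rely only on the mesh-generated spans, the bare minimum looks something like the sketch below: forward the trace-context headers from the inbound request to every outbound one. The header list here is illustrative; the exact set depends on which propagation format your particular mesh and tracing backend are configured for.

```go
package tracing

import "net/http"

// Illustrative list of trace-context headers; the exact set depends on the
// propagation format your mesh and tracing backend are configured for
// (W3C Trace Context, Zipkin B3, vendor-specific headers, etc.).
var tracingHeaders = []string{
	"traceparent", "tracestate", // W3C Trace Context
	"b3",           // Zipkin B3 single-header format
	"x-request-id", // request id commonly used by Envoy-based meshes
}

// copyTraceContext forwards the trace context from an inbound request to an
// outbound one without logging any spans: the bare minimum an application
// must do for mesh-generated spans to be stitched into a single trace.
func copyTraceContext(inbound, outbound *http.Request) {
	for _, h := range tracingHeaders {
		if v := inbound.Header.Get(h); v != "" {
			outbound.Header.Set(h, v)
		}
	}
}
```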

So why can’t this be done as a black box? Why can’t the service mesh propagate the context for us?

The simple answer is: concurrency. If your service is only capable of handling one request at a time (which is still a popular approach, for example with Django apps), then yes, we can easily correlate an inbound request to the process with all of its outbound requests and generate the trace data externally. But most microservices today are built to handle many concurrent requests, which makes it nearly impossible for an external observer to tell which of the many outbound/downstream requests corresponds to which inbound request, and therefore to attach the correct trace id to the outbound requests.
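Here is a tiny sketch of what the proxy is up against (the downstream services are made up): the handler below runs concurrently for many requests, and the only thing connecting its outbound calls to the inbound request that caused them is the ctx value passed around inside the process, which a sidecar proxy cannot see.

```go
package main

import (
	"context"
	"net/http"
)

// ordersHandler runs concurrently for many inbound requests. From outside the
// process, where the mesh proxy sits, the outbound calls below are just an
// interleaved stream of HTTP requests with no observable link to the inbound
// request that triggered them. The only reliable link is the ctx value handed
// from the handler to the downstream calls inside the process.
func ordersHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context() // would carry the trace context if instrumentation extracted it on the way in
	fetchInventory(ctx)
	fetchPricing(ctx)
}

func fetchInventory(ctx context.Context) {
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://inventory.internal/items", nil)
	if resp, err := http.DefaultClient.Do(req); err == nil {
		resp.Body.Close()
	}
}

func fetchPricing(ctx context.Context) {
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://pricing.internal/quote", nil)
	if resp, err := http.DefaultClient.Do(req); err == nil {
		resp.Body.Close()
	}
}

func main() {
	http.HandleFunc("/orders", ordersHandler)
	http.ListenAndServe(":8080", nil)
}
```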

Without in-process context propagation, distributed tracing does not work.

Is there value in service mesh’s tracing?

Value is in the eye of the beholder. A service mesh is a single piece of software deployed to wrap the communications of many heterogeneous services. This provides a great point of leverage to produce consistent telemetry that can be easily correlated: for example, metrics tagged with an endpoint name are guaranteed to use the same representation of the endpoint as the traces, if both are produced by the service mesh. The same is much harder to achieve when the telemetry is produced by different libraries in different languages.

Another benefit comes from the fact that the service mesh itself is a piece of distributed infrastructure that can introduce its own delays and issues, so having tracing spans that provide visibility into this layer can be useful when troubleshooting.

Example of a trace that is a mix of spans produced by the application itself and by Istio service mesh.

However, there are downsides as well. The first is increased data volume, since service meshes tend to add new spans to the trace. Instead of seeing parent and child spans representing an A → B call pattern, we see A → service_mesh → B, which makes the traces noisier and harder to read.

The other downside is again a consequence of the service mesh treating your app as a black box: it simply cannot generate tracing data as rich as what the application can log itself. Imagine your service is sending a GraphQL request in an HTTP POST message. Inside your service, you are probably using instrumentation that has a much better understanding of the structure of this request and can capture intricate details in the spans, such as using the GraphQL query name as the client span name. But all the service mesh sees is an HTTP POST request (and for performance reasons it is unlikely to parse the request body to extract the GraphQL query name), so the span it can produce will contain much less interesting data. Even the endpoint example I mentioned earlier gets tricky when you call a REST service, because the URL may contain various IDs that are not suitable as an endpoint name. That is why many service meshes have conventions suggesting that your application add an HTTP header with a concise endpoint name, purely for the mesh’s internal purposes, since the header has no value to the called REST service.
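As a small illustration of that richness gap (the operation-name extraction below is deliberately naive, just for show), application-level instrumentation can name the client span after the GraphQL operation and attach query-level attributes, while the proxy only ever sees an opaque POST /graphql:

```go
package graphqlclient

import (
	"context"
	"regexp"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// opNameRe is a deliberately naive way to pull an operation name out of a
// GraphQL document, purely for illustration; real instrumentation would work
// from the parsed query instead.
var opNameRe = regexp.MustCompile(`(?:query|mutation)\s+(\w+)`)

// executeGraphQL names the client span after the GraphQL operation and attaches
// query-level attributes: details that the mesh proxy, which only sees an
// opaque "POST /graphql", has no access to.
func executeGraphQL(ctx context.Context, query string) {
	opName := "graphql.request"
	if m := opNameRe.FindStringSubmatch(query); m != nil {
		opName = m[1] // e.g. "GetUser"
	}

	ctx, span := otel.Tracer("graphql-client").Start(ctx, opName)
	defer span.End()
	span.SetAttributes(attribute.String("graphql.operation.name", opName))

	// ... send the HTTP POST to /graphql using ctx, injecting headers as in the
	// earlier sketch ...
	_ = ctx
}
```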

Takeaways

  1. Service meshes cannot magically generate traces for your applications. They can only add data to the traces if your application is already propagating the context (and with 1% extra effort it could be logging the spans itself).
  2. The benefits of the spans produced by the service mesh are somewhat mixed. I’ve met many practitioners who tried it both ways and would recommend forgoing the service mesh tracing and keeping only the application-level tracing data.

Shameless plug: my book, Mastering Distributed Tracing, has a chapter on service meshes with a working example of enabling tracing via Istio. The code and configuration are on GitHub.

