Towards Turnkey Distributed Tracing

Antique hardware, pre-standardization: where we are with software instrumentation today.

OpenTracing is a new, open distributed tracing standard for applications and OSS packages. Developers with experience building microservices at scale understand the role and importance of distributed tracing: per-process logging and metric monitoring have their place, but neither can reconstruct the elaborate journeys that transactions take as they propagate across a distributed system. Distributed traces are these journeys.

It’s been decades since the first academic papers about distributed tracing were published. It’s been 12 years since Google started using Dapper internally. It’s been 6 years since the Dapper paper was put online. Zipkin was open-sourced over 4 years ago. This stuff is not new! Yet if you operate a complex services architecture, deploying a distributed tracing system requires person-years of engineering effort, monkey-patched communication packages, and countless inconsistencies across platforms.

If distributed tracing is so valuable, why doesn’t everyone do it already?

Because tracing instrumentation has been broken until now.

Distributed tracing is challenging because the instrumentation must propagate the tracing context both within and between processes. Accomplishing this touches almost every part of an application. In particular, tracing context must be passed through:

  • Self-contained OSS services (e.g., NGINX, Cassandra, Redis)
  • OSS packages linked into custom services (e.g., gRPC, ORMs)
  • Arbitrary application glue and business logic built around the above

And there’s the rub: It is not reasonable to ask all OSS services and all OSS packages and all application-specific code to use a single tracing vendor; yet, if they don’t share a mechanism for trace description and propagation, the causal chain is broken and the traces are truncated, often severely. We need a single, standard mechanism to describe the behavior of our systems.
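To make the "truncated trace" failure concrete, here is a minimal sketch in Python. It is not OpenTracing code: the header name `x-trace-id`, the three service names, and the helper functions are all hypothetical and exist only for illustration. Each hop reuses the caller's trace id when it receives one and otherwise starts a new trace, which is roughly what happens when one component in the chain does not propagate context.

```python
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name, for illustration only


def handle(service, headers, recorded):
    """Record one hop and return the trace id that hop belongs to."""
    # Reuse the caller's trace id if present; otherwise a new trace begins here.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    recorded.append((service, trace_id))
    return trace_id


def run_request(middle_hop_propagates):
    """Simulate frontend -> proxy -> database and return the recorded hops."""
    recorded = []
    trace_id = handle("frontend", {}, recorded)
    trace_id = handle("proxy", {TRACE_HEADER: trace_id}, recorded)
    # An uninstrumented proxy still forwards the request but drops the context.
    outgoing = {TRACE_HEADER: trace_id} if middle_hop_propagates else {}
    handle("database", outgoing, recorded)
    return recorded


def distinct_traces(recorded):
    """Count how many separate traces the hops were split into."""
    return len({trace_id for _, trace_id in recorded})
```

With `run_request(True)` all three hops share one trace id. With `run_request(False)` the database hop lands in a trace of its own, and the causal link back to the frontend is gone.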

Enter OpenTracing

OpenTracing is that “single, standard mechanism.” OpenTracing allows developers of application code, OSS packages, and OSS services to instrument their own code without binding to any particular tracing vendor. Every component of a distributed system can be instrumented in isolation, and the distributed application maintainer can choose (or switch, or multiplex) downstream tracing technologies with a configuration change.
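As a rough illustration of what "instrument without binding to a vendor" can look like, here is a self-contained Python sketch. Every name in it (`Tracer`, `Span`, `NoopTracer`, `InMemoryTracer`, `MultiplexTracer`, `build_tracer`, `fetch_user`, and the `"tracers"` config key) is an assumption made up for this example and should not be read as the actual OpenTracing API. The library function only talks to an abstract tracer; the application decides at startup which backends receive the spans.

```python
import time
from abc import ABC, abstractmethod


class Span:
    """A timed operation with key/value tags (hypothetical minimal shape)."""

    def __init__(self, tracer, operation_name):
        self._tracer = tracer
        self.operation_name = operation_name
        self.tags = {}
        self.start = time.monotonic()
        self.duration = None

    def set_tag(self, key, value):
        self.tags[key] = value
        return self

    def finish(self):
        self.duration = time.monotonic() - self.start
        self._tracer.report(self)


class Tracer(ABC):
    """The only thing instrumented code depends on."""

    def start_span(self, operation_name):
        return Span(self, operation_name)

    @abstractmethod
    def report(self, span):
        """Hand a finished span to whatever backend this tracer represents."""


class NoopTracer(Tracer):
    def report(self, span):
        pass  # tracing disabled: instrumentation stays in place, nothing is sent


class InMemoryTracer(Tracer):
    """Stand-in for a vendor backend; it just keeps finished spans in a list."""

    def __init__(self):
        self.finished = []

    def report(self, span):
        self.finished.append(span)


class MultiplexTracer(Tracer):
    """Fan every finished span out to several backends."""

    def __init__(self, tracers):
        self.tracers = list(tracers)

    def report(self, span):
        for tracer in self.tracers:
            tracer.report(span)


def build_tracer(config, registry):
    """Pick the tracer(s) named in the config; default to no tracing."""
    names = config.get("tracers", [])
    if not names:
        return NoopTracer()
    tracers = [registry[name] for name in names]
    return tracers[0] if len(tracers) == 1 else MultiplexTracer(tracers)


def fetch_user(tracer, user_id):
    """Library code: instrumented once, unaware of any vendor."""
    span = tracer.start_span("fetch_user")
    span.set_tag("user.id", user_id)
    try:
        return {"id": user_id}
    finally:
        span.finish()
```

In this sketch, changing `{"tracers": ["vendor_a"]}` to `{"tracers": ["vendor_a", "vendor_b"]}` sends every span to both backends, and `fetch_user` is not touched.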

So we know we need standardization, but standardization where? All of the following are possibilities:

  1. Standardized span management: programmatic APIs to start, finish, and decorate timed operations (“spans” in Dapper and Zipkin’s terminology)
  2. Standardized inter-process propagation: programmatic APIs to aid in the transfer of tracing context across process boundaries
  3. Standardized active span management: programmatic APIs to store and retrieve the active span across package boundaries in a single process
  4. Standardized in-band context encoding: specification of the precise wire-encoding format for tracing context passed alongside application data between processes
  5. Standardized out-of-band trace data encoding: specification of how decorated trace and span data is encoded as it heads towards the distributed tracing vendor

Previous efforts to standardize distributed tracing have focused on the encoding and representation of trace and context data, both in-band and out-of-band (numbers 4 and 5 above). Ironically, per the table below, standardization of encoding formats has few benefits for instrumentation API consistency, tracing vendor lock-in, or the tidiness of dependencies for OSS projects: the very things that stand in the way of turnkey tracing today.

What’s truly needed — and what OpenTracing provides — is standardization of span management APIs, inter-process propagation APIs, and ideally active span management APIs. The table below shows the five key benefits of standardization around distributed tracing, and the importance of each of the five kinds of standardization for each of those benefits.
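Active span management is probably the least familiar of those three, so here is a small Python sketch of the idea. It uses the standard library's `contextvars` module; the names `Span`, `start_active_span`, `active_span`, `orm_query`, and `handle_request` are hypothetical and are not taken from any OpenTracing library. The point is that a package deep in the call stack can find its parent span without the caller passing it explicitly.

```python
import contextvars
from contextlib import contextmanager

# Holds the span that is "current" for this thread or async task.
_active_span = contextvars.ContextVar("active_span", default=None)


class Span:
    """Minimal span: a name plus a link to its parent (hypothetical shape)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def active_span():
    """Return the span currently in scope, or None."""
    return _active_span.get()


@contextmanager
def start_active_span(name):
    """Start a child of the active span and make it active for the block."""
    span = Span(name, parent=_active_span.get())
    token = _active_span.set(span)
    try:
        yield span
    finally:
        # Restore whatever was active before this block.
        _active_span.reset(token)


def orm_query(sql):
    """Stands in for an OSS package: it never receives a span argument."""
    with start_active_span("orm_query") as span:
        return span


def handle_request():
    """Application code: starts the root span, then calls into the package."""
    with start_active_span("handle_request") as root:
        child = orm_query("SELECT 1")
        return root, child
```

Calling `handle_request()` yields a child span whose `parent` is the request's root span, even though `orm_query` and `handle_request` share nothing but the active-span store.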

OpenTracing picks its battles: by focusing on APIs rather than implementation encodings, it provides the standardization benefits that developers actually need.

OpenTracing gives us the standards we need to make accurate, turnkey distributed tracing a reality across modern software systems, including OSS packages and other third-party code. It does this while allowing OpenTracing implementations to take control over both in-band and out-of-band encoding formats; this in turn gives the application owner the flexibility to switch (or add) tracing vendors with an O(1) declarative configuration change.

The OpenTracing software architecture. Application code and OSS packages program against the abstract OpenTracing APIs, describing the path that requests take within each process as well as the propagation between processes. OpenTracing implementations control the buffering and encoding of trace span data, and they also control the semantics of process-to-process trace context information. As a result, application code can describe and propagate traces without making any assumptions about the OpenTracing implementation.
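The split described above can be sketched in a few lines of Python. This is an assumption-laden toy, not the real OpenTracing interface: `HeaderTracer`, `BlobTracer`, the header names, and the dict-based context are all invented for the example. Both implementations expose the same `inject` and `extract` calls, yet each chooses its own in-band wire format, and the application code never looks inside the carrier.

```python
import base64
import json


class HeaderTracer:
    """Hypothetical implementation: one plain-text header per context field."""

    def inject(self, context, carrier):
        carrier["x-trace-id"] = context["trace_id"]
        carrier["x-span-id"] = context["span_id"]

    def extract(self, carrier):
        return {
            "trace_id": carrier["x-trace-id"],
            "span_id": carrier["x-span-id"],
        }


class BlobTracer:
    """Hypothetical implementation: a single base64-encoded JSON header."""

    def inject(self, context, carrier):
        raw = json.dumps(context, sort_keys=True).encode("utf-8")
        carrier["x-trace-blob"] = base64.b64encode(raw).decode("ascii")

    def extract(self, carrier):
        return json.loads(base64.b64decode(carrier["x-trace-blob"]))


def call_downstream(tracer, context):
    """Client side: ask the tracer to write the context into the headers."""
    headers = {}
    tracer.inject(context, headers)
    return headers


def serve(tracer, headers):
    """Server side: ask the tracer to read the context back out."""
    return tracer.extract(headers)
```

`call_downstream` and `serve` are identical for both tracers. Only the bytes on the wire differ, and that choice stays with the implementation.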

This summer we will be announcing integrations with important pieces of the application-level microservices stack. As OpenTracing proliferates into other popular packages and subsystems, we will be able to have our cake and eat it, too: high-quality distributed traces with little to no instrumentation effort by the application programmer.

Thanks for reading. Feel free to recommend this post if you found it valuable. If you’d like to learn more about OpenTracing,