A CTO’s strategy towards OpenTracing

tl;dr Moving towards standards for adding tracing to code can have a massively positive effect on the observability of software. Right now we are at the very beginning of this journey. Unless you are building a framework or middleware, you should not actively work with OpenTracing, and even then the goal should be to expose trace data via APIs to pre-existing agents and/or backend collection tools.

The problem(s) we need to solve as an industry

Historically, tracing has been implemented by different tools, both commercial and open-source, independently of each other. Tracing tools were, and often still are, tied to a specific application, so interoperability has never really been a concern.

However, with the rise of cloud computing, this is starting to change. Today’s applications rely on cloud services provided by third parties. The only way to get end-to-end visibility is to combine trace data from different tracing providers. So interoperability is suddenly becoming important and as an industry we need to solve two very important problems:

  • The ability to create an end-to-end trace across multiple tool boundaries;
  • The ability to access partial trace data in a semantically well-defined way and link those data together for end-to-end visibility.

Unfortunately, OpenTracing doesn’t address these challenges. OpenTracing is an API definition that enables you to manually add tracing functionality to your code. Before diving deeper into this topic, let’s look at how we can solve the tracing-tool interoperability problem.

Making tracing end-to-end (again)

The biggest challenge we face today in terms of delivering end-to-end tracing visibility is the heterogeneity of applications and environments. Such complexities are an even bigger problem in highly distributed microservices environments.

APM vendors and cloud providers are aware of this problem. They’re working together to solve this issue by agreeing on two points:

  • A standardised means of propagating the trace context information of multiple vendors end-to-end;
  • A way to ingest trace fragment data from each other.

The first problem is on its way to being resolved within the next year. A W3C working group is forming that will define a standardised way to deal with trace information, referred to as Trace-Context, which basically defines two new HTTP headers that can store and propagate trace information. Today every vendor uses its own headers, which means they will very likely get dropped by intermediaries that do not understand them.
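To make the idea of header-based propagation concrete, here is a minimal sketch in Python. It assumes the header names (traceparent and tracestate) and the version-traceid-parentid-flags layout from the Trace Context specification as the W3C later published it; neither is spelled out in this article, and the helper names parse_traceparent and outgoing_headers are made up for illustration. It only handles the exact four-field layout and lowercase header keys.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional

# Assumed layout (W3C Trace Context): version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # Version "ff" and all-zero ids are treated as invalid.
    if (
        parsed.version == "ff"
        or set(parsed.trace_id) == {"0"}
        or set(parsed.parent_id) == {"0"}
    ):
        return None
    return parsed


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build headers for a downstream call, continuing the incoming trace if any.

    Hypothetical helper: assumes header names were already lowercased.
    """
    parent = parse_traceparent(incoming.get("traceparent", ""))
    if parent is None:
        # No usable context: start a new trace and mark it as sampled.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        # Keep the trace id so every tool along the path sees the same trace.
        trace_id, flags = parent.trace_id, parent.flags
    span_id = secrets.token_hex(8)  # id of the span making the outgoing call
    headers = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
    if parent is not None and "tracestate" in incoming:
        # Vendor-specific state is passed along untouched.
        headers["tracestate"] = incoming["tracestate"]
    return headers
```

The point of the sketch is that an intermediary only has to understand one well-known header to keep a trace intact, instead of knowing every vendor's proprietary one.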

Action Item: If you’re building messaging middleware or a library for inter-service communication, take a look at Trace-Context to understand whether it will have an impact on your implementation.

Now let us move on to data formats. Unfortunately, a unified data format for trace data is further away from becoming reality. Today there are practically as many formats available as there are tools. There isn’t even conceptual agreement on whether the data format itself should be standardised, or whether there should be a standardised API for which everyone can build an exporter that fits their specific needs. There are pros and cons for both approaches, and the future will reveal what implementors consider the best approach. The only thing that cannot be debated is that eventually we will need a means to easily collect trace fragments and link them together.
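What "collecting trace fragments and linking them together" could look like is sketched below in Python. The Span fields (trace_id, span_id, parent_id, name, source) and the link_fragments function are purely hypothetical; no agreed format exists yet, which is exactly the problem described above. The sketch assumes every tool has already been able to export its spans with a shared trace id.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Span:
    """Hypothetical minimal span record; not any vendor's real format."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    source: str  # which tool reported this span
    children: List["Span"] = field(default_factory=list)


def link_fragments(fragments: Iterable[Iterable[Span]]) -> Dict[str, List[Span]]:
    """Merge span fragments from several tools into per-trace trees.

    Returns a mapping from trace id to the root spans of that trace.
    """
    by_trace: Dict[str, Dict[str, Span]] = defaultdict(dict)
    for fragment in fragments:
        for span in fragment:
            by_trace[span.trace_id][span.span_id] = span

    roots: Dict[str, List[Span]] = {}
    for trace_id, spans in by_trace.items():
        roots[trace_id] = []
        for span in spans.values():
            parent = spans.get(span.parent_id) if span.parent_id else None
            if parent is None:
                # Either a true root or a gap where a fragment is missing.
                roots[trace_id].append(span)
            else:
                parent.children.append(span)
    return roots


# Usage: one fragment from an APM tool, one from a cloud provider.
apm = [
    Span("t1", "a", None, "GET /checkout", "apm-tool"),
    Span("t1", "b", "a", "publish to queue", "apm-tool"),
]
cloud = [Span("t1", "c", "b", "queue.deliver", "cloud-provider")]
trees = link_fragments([apm, cloud])
```

Whether such a record is standardised as a wire format or hidden behind a standardised API with per-tool exporters is the open question; the linking step itself is the same in both cases.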

Action Item: If you are building middleware like a service bus or a (cloud) service and want to provide insight into its inner workings for optimisation and debugging purposes, you’ll need to define how this data can be accessed by tracing tools.

Once these two problems are solved, end-to-end tracing in highly distributed environments will become a reality. OpenTracing does not play any role in this.

So what do we need OpenTracing for then?

Over the course of many conversations, I’ve learned that people have many misconceptions about what OpenTracing is. Before we dive into the details, let’s examine what OpenTracing is not. I’ll explore each of these points in detail:

  • OpenTracing isn’t a standard.
  • OpenTracing isn’t a cross-vendor implementation of tracing.
  • OpenTracing isn’t a tracing system.
  • OpenTracing isn’t APM.

Why OpenTracing is not a standard

OpenTracing is an open source project governed by the Cloud Native Computing Foundation (CNCF). As the CNCF itself states, it curates open source projects, most prominently Kubernetes. The CNCF makes it very explicit that it is not a standardisation body, so OpenTracing cannot be a standard. According to opentracing.io, OpenTracing is a vendor-neutral open standard for distributed tracing. This is misleading for two reasons. First, in order to be a standard there would need to be a standards body behind it, which the CNCF simply isn’t. Second, a standard must be vendor-supported, not vendor-neutral. The whole idea of a standard is that a lot of vendors, often competing ones, agree on something to give the rest of the world the security to adopt and apply technology without being handcuffed to a proprietary technology.

HTTP, for example, is a standard. It is defined by a standards body (the IETF in this case), and all relevant vendors have agreed to support it.

The process of reaching agreement is obviously hard, and that’s why standards development takes a lot of time, blood, sweat and tears. As of today most APM vendors don’t support OpenTracing-based instrumentation, which means it does not serve the main purpose of a standard, at least not yet.

I was recently asked why most APM vendors (a.k.a. the industry) do not support OpenTracing as the industry standard. By now it should be obvious that this question does not make sense: something most of the industry does not support cannot be the industry standard.

In the future, however, there will be a standard for defining tracing, and the APM and cloud vendors have taken the first step in this direction. Realistically though, many moons will pass until this manifests. Again, standards are hard to define and they take time.

Why OpenTracing is not a cross-vendor implementation

This should now be answered as well. The simple reason is that the big three in the space (Dynatrace, New Relic and AppDynamics) do not fully support it. So if your main motivation for OpenTracing is to switch between APM vendors, OpenTracing won’t help.

In reality, OpenTracing does not even matter in this context. As Jonah Kowall stated in a recent write-up on OpenTracing, APM vendors are already very easy to swap today, and enterprises do it all the time.

All modern APM tools use automatic instrumentation rather than requiring you to touch any code. There is zero dependency on a vendor in your application, and just by changing a small piece of configuration you can have your application report data to a different tool in literally minutes.

As an application developer there is almost no need to write any tracing code. The only exception is adding some domain-specific data such as user names, and even in these cases APM tools provide means like deep object access to retrieve them automatically.

Why OpenTracing is not a tracing system

This one should be clear by now. OpenTracing is an API definition. If you want to actually collect, analyse and act on data you need agents and a backend to do so.
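The split between an API definition and the system behind it can be illustrated with a small Python sketch. Everything here is hypothetical and deliberately simplified: Tracer, NoopTracer, InMemoryTracer and handle_request are invented names and do not mirror the actual OpenTracing API. The instrumented code only ever talks to the abstract interface; what happens to the data depends entirely on which implementation is plugged in.

```python
import time
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Iterator, List, Tuple


class Tracer(ABC):
    """The API definition: the only thing instrumented code depends on."""

    @abstractmethod
    def report(self, operation: str, duration_s: float) -> None:
        """Hand a finished span to whatever sits behind the API."""

    @contextmanager
    def span(self, operation: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.report(operation, time.perf_counter() - start)


class NoopTracer(Tracer):
    """API present, no tracing system behind it: the data goes nowhere."""

    def report(self, operation: str, duration_s: float) -> None:
        pass


class InMemoryTracer(Tracer):
    """Stand-in for a real agent and backend that collect the data."""

    def __init__(self) -> None:
        self.finished: List[Tuple[str, float]] = []

    def report(self, operation: str, duration_s: float) -> None:
        self.finished.append((operation, duration_s))


def handle_request(tracer: Tracer) -> str:
    """Instrumented application code; unaware of which tracer it was given."""
    with tracer.span("handle_request"):
        with tracer.span("load_user"):
            pass  # real work would happen here
        return "ok"
```

Calling handle_request(NoopTracer()) runs the same instrumented code as handle_request(InMemoryTracer()), but only the second leaves any data behind to analyse.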

So your strategy will need to include the tools that do the actual work as well. The CNCF recently added Jaeger, which is an open-source backend for OpenTracing. You can also use Zipkin and some other providers.

If your goal is a self-contained tracing system that is independent of the big vendors, you might rather want to look at OpenCensus, a Google project that provides all the parts you need to collect the data. You will still need to decide which backend you want to use.

Closing the loop to the beginning of the article, the OpenCensus project will especially help middleware providers to enable end-to-end tracing in modern distributed environments.

Action Item: If you are building middleware like a service bus or a (cloud) service, want to provide insight into its inner workings and have no tracing today, check whether OpenCensus can help you achieve this quickly and easily.

Another very nice thing about OpenCensus is that it supports metrics in addition to traces. In the CNCF ecosystem, metrics are handled by a separate project, Prometheus.

Why isn’t OpenTracing APM?

As I started to point out in the last section, there is more to observability than tracing alone. Metrics and logs are key to getting a full picture of what is going on in your environment.

Even with all this data there is much more to an APM system. Data collection, storage and visualisation no longer define an APM system. Today’s systems are differentiated by the analysis they provide on top of the data and how the analysis results can be used to better run a customer’s application fabric.

At Dynatrace, for example, we started to use AI-based algorithms four years ago to be able to analyse performance problems and automatically identify the root cause. Today this is deeply woven into runbook automation tools and PaaS platforms to enable entire application infrastructures to run automatically.

If you look at other key industry players, you will notice that everyone is focussed on what to do with the data rather than how to collect it. This, by the way, is also the reason why vendors are now open to collaborating on standardising data collection. It has — at least to a certain extent — become a cost factor more so than a differentiator.

So, OpenTracing does not make any sense?

Wait, I did not say that the idea of a standardised approach to adding tracing (and metrics, and logging) instrumentation to code does not make sense. Quite the contrary. I do think that it makes a lot of sense.

Today APM providers reverse engineer frameworks so that they can add instrumentation support to them. This is costly and time consuming. It is also done in parallel by all the vendors, so a lot of people put a lot of energy into adding the same observability. If we can find an easier way to do this, that would be amazing. APM vendors could focus more on the analytics on top of the data, and the expertise of the actual framework implementors would be used to get proper instrumentation. This is, by the way, also the long-term goal of OpenCensus.

I envision a future where frameworks come with observability built in, just as they come with logging built in today. This, however, will only happen when there is a standardised and easy way for framework implementers to do this.

Action Item: If you are implementing a framework, look at OpenTracing and OpenCensus to understand how they can help to provide better insight into the inner workings of your code.

So, is it OpenTracing vs. OpenCensus now?

It is, kind of, for now. Hopefully, there will eventually be a “together” rather than an “against” approach and the two projects will converge. If we as an industry really want to achieve the goal of a unified way of tracing, there needs to be one single approach. How we get there will depend on the people working on this future. I would love to see Google submit OpenCensus to the CNCF and then see the CNCF follow its mission of fostering collaboration between the industry’s top developers, end users and vendors to integrate open source technologies.

Action Item: Familiarize yourself with the entire observability ecosystem and consult your current vendors on their strategy to define your own path forward.

Can we get a bit more technical?

Yes, I will dive into more technical details in future posts. But first, I want to wait for the outcomes of a soon-to-happen workshop on tracing APIs.

Concluding

I might have made your world a bit more complicated than it was before. For some of you, I have taken away a silver bullet for your observability problems and given you a bit of homework instead.