Experiment: Migrating OpenTracing-based application in Go to use the OpenTelemetry SDK

Yuri Shkuro · Published in JaegerTracing · 8 min read · Feb 9, 2023

TL;DR: This post explains how Jaeger’s 🚗 HotROD 🚗 app was migrated to the OpenTelemetry SDK.

Jaeger’s HotROD demo has been around for a few years. It was written with OpenTracing-based instrumentation, including a couple of OSS libraries for HTTP and gRPC middleware, and used Jaeger’s native SDK for Go, jaeger-client-go. The latter was deprecated in 2022, so we had a choice: either convert all of the HotROD app’s instrumentation to OpenTelemetry, or try the OpenTracing bridge, which is a required part of every OpenTelemetry API / SDK. The bridge is an adapter layer that wraps an OpenTelemetry Tracer in a facade to make it look like an OpenTracing Tracer. This way we can use the OpenTelemetry SDK in an application like HotROD that only understands the OpenTracing API.

I wanted to try the bridge solution, to minimize the code changes in the application. It is not the most efficient way, since an adapter layer incurs some performance overhead, but for a demo app it seemed like a reasonable trade-off.

The code can be found in the Jaeger repository (at specific commit hash).

Setup

First, we need to initialize the OpenTelemetry SDK and create an OpenTracing Bridge. Fortunately, I did not have to start from scratch, because there was an earlier pull request #3390 by @rbroggi, which I picked up and improved in a few places. Initialization happens in pkg/tracing/init.go:
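A minimal sketch of that initialization, assuming an exporter helper like the one discussed below (import versions and exact names are illustrative, not the verbatim HotROD code):

```go
package tracing

import (
	"log"

	"github.com/opentracing/opentracing-go"
	"go.opentelemetry.io/otel"
	otelBridge "go.opentelemetry.io/otel/bridge/opentracing"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// Init returns an OpenTracing-compatible tracer backed by the OpenTelemetry SDK.
func Init(serviceName string, exporterType string) opentracing.Tracer {
	exporter, err := createOtelExporter(exporterType) // helper shown below
	if err != nil {
		log.Fatalf("cannot create exporter: %v", err)
	}

	// Vanilla OpenTelemetry SDK setup: batching exporter + service name resource.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(serviceName),
		)),
	)

	// OpenTracing has no notion of named tracers, so create a Tracer with a
	// blank name and wrap it with the bridge to obtain an opentracing.Tracer.
	otelTracer := tp.Tracer("")
	bridgeTracer, wrapperProvider := otelBridge.NewTracerPair(otelTracer)
	otel.SetTracerProvider(wrapperProvider)
	return bridgeTracer
}
```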

In the beginning this is a pretty vanilla OpenTelemetry SDK initialization: we create an exporter using a helper function (more on it below) and build a TracerProvider. Compliant OpenTelemetry instrumentation would use this provider to create named Tracer objects as needed, usually with distinct names reflecting the instrumentation library or the application component. However, the OpenTracing API did not have the concept of named tracers (its Tracer is effectively a singleton), so here we create a Tracer with a blank name and pass it to the bridge factory, which wraps it and returns an OpenTracing Tracer.

Side note: in better-organized code there would also be some sort of close/shutdown function returned, so that the caller of tracing.Init could gracefully shut down the tracer, e.g. flush the span buffers when stopping the application.

The original PR used the Jaeger exporter, which lets the SDK export data in Jaeger’s native data format. However, last year we extended Jaeger to accept OpenTelemetry’s OTLP format directly, so I decided to add a bit of flexibility and make the choice of the exporter configurable:
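A sketch of such a helper, assuming the OTLP-over-HTTP and Jaeger exporter packages from the OpenTelemetry Go modules (the real HotROD helper may differ in names and options):

```go
import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// createOtelExporter returns a span exporter based on a config flag.
func createOtelExporter(exporterType string) (sdktrace.SpanExporter, error) {
	switch exporterType {
	case "jaeger":
		// Jaeger's native Thrift-over-HTTP format, sent to the collector endpoint.
		return jaeger.New(jaeger.WithCollectorEndpoint())
	case "otlp":
		// OTLP over HTTP; Jaeger accepts OTLP directly since v1.35.
		return otlptracehttp.New(context.Background(), otlptracehttp.WithInsecure())
	default:
		return nil, fmt.Errorf("unsupported exporter type %q", exporterType)
	}
}
```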

Broken Traces

At this point things should have started to work. However, the resulting traces looked like this:

Trace with many spans, but all coming from a single service `frontend`.
Another part of the workflow captured as a different trace. It looks like there are two services here, but in fact the HotROD app simulates the `sql` and `redis` database services; it’s not actually making RPC calls to them.

Instead of one trace per request we were getting several disjoint traces. This is where @rbroggi’s PR got stuck. After some debugging I came to realize that the SDK defaults to a no-op propagator, so no trace context was sent in RPC requests between services, resulting in multiple disjoint traces for the same workflow. It was easy to fix, but it felt like unnecessary friction in using the OpenTelemetry SDK. I also added the Baggage propagator, which we will discuss later.

Since the Init() function is called many times by different services in the HotROD app, I only set the propagator once using sync.Once.
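A minimal sketch of that one-time registration, using the W3C TraceContext and Baggage propagators:

```go
import (
	"sync"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

var initPropagatorsOnce sync.Once

// Init() runs once per simulated service in the same process, so the global
// propagator is registered only the first time.
func initGlobalPropagators() {
	initPropagatorsOnce.Do(func() {
		otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{}, // W3C trace-context headers
			propagation.Baggage{},      // W3C baggage header, needed later in this post
		))
	})
}
```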

After this change, the traces looked better, more colorful, so I committed the change.

A better-looking trace after “fixing” the propagation.

Traces Still Broken

However, I should’ve paid better attention. Notice the lone span in the middle called /driver.DriverService/FindNearest. Let’s take a closer look:

A client span trying to make a gRPC request to service `driver`.

This span is the client side of a gRPC request from the frontend service to the driver service. The latter is missing from the trace! This was a different issue with context propagation: an error was returned when trying to inject the context into the request headers. The instrumentation actually logged the error back into the client span, which we can see in the Logs section: Invalid Inject/Extract carrier. Unfortunately, it was difficult to spot this error without opening up the span, because the RPC itself was successful, and the instrumentation was correct in not setting the error=true span tag, which would’ve shown in the Jaeger UI as a red icon.

After a bit more digging I found the issue, which was due to a bug in the OpenTelemetry SDK’s bridge implementation. You can read about it in the following GitHub issue.

As of this writing, the fix is still waiting to be merged, so as a workaround I made a branch of opentracing-contrib/go-grpc and changed it to use the TextMap propagation format instead of HTTPHeaders, which by chance happened to work with the bridge code.
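Conceptually, the workaround is a one-line change to the Inject call in the gRPC middleware; the sketch below uses a plain map carrier for illustration, while the real middleware wraps gRPC metadata in its own carrier type:

```go
// A plain map carrier is used here for illustration only.
carrier := opentracing.TextMapCarrier(map[string]string{})

// Before: the HTTPHeaders format failed through the bridge with
// "Invalid Inject/Extract carrier".
//   err := tracer.Inject(span.Context(), opentracing.HTTPHeaders, carrier)

// After: the TextMap format happens to be handled correctly by the bridge.
err := tracer.Inject(span.Context(), opentracing.TextMap, carrier)
```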

With these fixes, we were back to the “classic” HotROD traces.

Full HotROD trace, as the AI overlords intended.

RPC Metrics

I was ready to call it a day, but there was one piece missing. The original Jaeger SDK initialization code had one extra feature: it enabled the collection of RPC metrics from spans (supported only by the Go SDK in Jaeger). My original blog post, Take OpenTracing for a HotROD ride, discussed this feature, so it would have been a shame to lose it during the upgrade. If I were upgrading to OpenTelemetry instrumentation as well, it might have included metrics-oriented instrumentation, although that would somewhat miss the point of the blog post, which is that tracing instrumentation is already sufficient in this case. Another possibility is to generate metrics from spans using a special processor in the OpenTelemetry Collector, but the Collector is not part of the HotROD demo setup.

The OpenTelemetry SDK has the notion of span processors, an abstract API invoked on all finished spans. It is similar to how the RPCMetricsObserver was implemented in jaeger-client-go, so I did what any scrappy engineer would do: copy & paste the code from jaeger-client-go directly into the HotROD code and adapt it to implement the OpenTelemetry SDK’s SpanProcessor interface.
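Below is a rough sketch of such a processor; metricsByEndpoint and recordRequest stand in for the code adapted from jaeger-client-go and are illustrative names:

```go
import (
	"context"

	"go.opentelemetry.io/otel/codes"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// rpcMetricsProcessor feeds finished spans into the RPC-metrics code
// adapted from jaeger-client-go.
type rpcMetricsProcessor struct {
	metrics *metricsByEndpoint // the copied/adapted jaeger-client-go code
}

var _ sdktrace.SpanProcessor = (*rpcMetricsProcessor)(nil)

func (p *rpcMetricsProcessor) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {}

func (p *rpcMetricsProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
	// Only server spans contribute to the requests.endpoint_* counters
	// (assuming the bridge maps the OpenTracing span.kind tag to the OTel span kind).
	if s.SpanKind() != trace.SpanKindServer {
		return
	}
	failed := s.Status().Code == codes.Error
	p.metrics.recordRequest(s.Name(), s.EndTime().Sub(s.StartTime()), failed)
}

func (p *rpcMetricsProcessor) Shutdown(ctx context.Context) error   { return nil }
func (p *rpcMetricsProcessor) ForceFlush(ctx context.Context) error { return nil }
```

Registering it with sdktrace.WithSpanProcessor when building the TracerProvider brought the familiar endpoint counters back in expvar. And voilà: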

$ curl http://127.0.0.1:8083/debug/vars | grep '"requests.endpoint_HTTP'
"requests.endpoint_HTTP_GET_/.error_false": 3,
"requests.endpoint_HTTP_GET_/.error_true": 0,
"requests.endpoint_HTTP_GET_/config.error_false": 4,
"requests.endpoint_HTTP_GET_/config.error_true": 0,
"requests.endpoint_HTTP_GET_/customer.error_false": 4,
"requests.endpoint_HTTP_GET_/customer.error_true": 0,
"requests.endpoint_HTTP_GET_/debug/vars.error_false": 5,
"requests.endpoint_HTTP_GET_/debug/vars.error_true": 0,
"requests.endpoint_HTTP_GET_/dispatch.error_false": 4,
"requests.endpoint_HTTP_GET_/dispatch.error_true": 0,
"requests.endpoint_HTTP_GET_/route.error_false": 40,
"requests.endpoint_HTTP_GET_/route.error_true": 0,

Baggage

As I was looking through the metrics in HotROD, I realized there was another area I neglected. These sections in the expvar output were not supposed to be empty:

$ curl http://127.0.0.1:8083/debug/vars | grep route.calc.by
"route.calc.by.customer.sec": {},
"route.calc.by.session.sec": {}

These measures require baggage to work. The term “baggage” was introduced in academia (Jonathan Mace et al., SOSP 2015 Best Paper Award). It refers to a general-purpose context propagation mechanism that can carry both the tracing context and any other contextual metadata across a distributed workflow execution. The HotROD app demonstrates a number of capabilities that require baggage propagation, and they were all completely broken after upgrading to the OpenTelemetry SDK 😭.

The first thing that broke was propagation of baggage from the web UI. HotROD does not start the trace in the browser, only in the backend. The Jaeger SDK had a feature that allowed it to accept baggage from the incoming request even when there was no incoming tracing context. Internally the Jaeger SDK achieved this by returning an “invalid” SpanContext from the Extract method, where the trace ID / span ID were blank but the baggage was present. Digging through the OpenTracing Bridge code I found that it returns an error in this case. This could probably be fixed there, but I decided to add a workaround directly to HotROD, where I used OpenTelemetry’s Baggage propagator to extract the baggage from the request manually and then copy it into the span:
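A condensed sketch of that workaround, assuming the go-stdlib nethttp middleware and OpenTelemetry’s baggage/propagation packages (copyBaggage is an illustrative name):

```go
import (
	"net/http"

	"github.com/opentracing-contrib/go-stdlib/nethttp"
	"github.com/opentracing/opentracing-go"
	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/propagation"
)

// otelBaggageExtractor wraps an http.Handler and uses the OpenTelemetry
// Baggage propagator to pull the W3C `baggage` header into the request
// Context, even when there is no incoming trace context.
func otelBaggageExtractor(next http.Handler) http.Handler {
	propagator := propagation.Baggage{}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// copyBaggage is passed to nethttp.MWSpanObserver: once the server span is
// created, it copies the baggage extracted above from the Context onto the span.
func copyBaggage(span opentracing.Span, r *http.Request) {
	for _, member := range baggage.FromContext(r.Context()).Members() {
		span.SetBaggageItem(member.Key(), member.Value())
	}
}
```

The extractor is installed outside the nethttp.Middleware handler, e.g. otelBaggageExtractor(nethttp.Middleware(tracer, mux, nethttp.MWSpanObserver(copyBaggage))), so the baggage is already in the Context when the server span is created and the observer runs.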

I trimmed down the code example above a bit to only show the relevant parts. The otelBaggageExtractor function creates a middleware that manually extracts the baggage into the current Context. Then the instrumentation library nethttp is given a span observer (invoked after the server span is created) that copies the baggage from the context into the span. This functionality is only needed at the root span, because once the trace context is propagated through the workflow, the Bridge correctly propagates the baggage as well (remember that I registered the Baggage propagator as a global propagator in the Init function, as shown in the earlier code snippet). I was actually pleasantly surprised that the maintainers were able to achieve that, because the OpenTracing API operates purely on Span objects, not on the Context, while in OpenTelemetry the baggage is carried in the Context, at a lower logical level.

One other small change I had to make was to switch the web UI to the baggage header (per the W3C standard), instead of the jaeger-baggage header recognized by the Jaeger SDK.
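For illustration (the values are made up), the only differences are the header name and the W3C value encoding, which requires characters such as spaces to be percent-encoded:

```
# Jaeger-specific header, understood only by jaeger-client-go:
jaeger-baggage: session=1234, customer=Japanese Desserts

# W3C Baggage header, understood by the OpenTelemetry Baggage propagator:
baggage: session=1234,customer=Japanese%20Desserts
```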

Strictly speaking, these were all the changes I had to make to the HotROD code to make the baggage work. Yet it didn’t work: some baggage values were correctly propagated, but others were missing. After more digging I found several places where baggage was silently dropped on the floor because of some (misplaced, in my opinion) validations in the baggage and bridge/opentracing packages in the OpenTelemetry SDK. The ticket below explains the issue in more detail.

Running against a patched version of the OpenTelemetry SDK yielded the desired behavior, and the baggage-reliant functionality was restored. I was getting performance metrics grouped by baggage values:

$ curl http://127.0.0.1:8083/debug/vars | grep route.calc.by
"route.calc.by.customer.sec": {
"Amazing Coffee Roasters": 0.9080000000000004,
"Japanese Desserts": 1.0490000000000002,
"Rachel's Floral Designs": 1.0090000000000003,
"Trom Chocolatier": 1.0000000000000004
},
"route.calc.by.session.sec": {
"2885": 1.4760000000000002,
"5161": 2.4899999999999993
}

And the mutex instrumentation was able to capture IDs of multiple transactions in the queue (see the original blog post for an explanation of this one):

Logs show a transaction blocked on three other transactions.

Summary

Overall, the migration required a fairly minimal amount of changes to the code, mostly because I chose to reuse the existing OpenTracing instrumentation and only swap the SDK from Jaeger to OpenTelemetry. Most of the friction in the migration was due to a couple of bugs in the OpenTelemetry Bridge code (and likely one place in the baggage package). This leads me to believe that the baggage functionality is not yet widely used, especially when someone uses OpenTracing instrumentation with a bridge to OpenTelemetry, so it is likely I just ran into a bunch of early-adopter issues.

At this point I am interested in taking the next step and doing a full migration of HotROD to OpenTelemetry (or helping review it if someone wants to volunteer!). It could make a complementary Part 2 to this post, describing how that goes.

There is also a possible Part 3 involving a no-less interesting migration to the OpenTelemetry Metrics. Right now all of the Jaeger code base is using an internal abstraction for metrics backed by the Prometheus SDK.

Stay tuned.
