Context Management in OpenTelemetry NodeJS

Amir B
Aspecto
Published in
6 min readSep 21, 2020

This post shares some of our experience gained by writing nodejs plugins and debugging context issues at Aspecto.

You might find it interesting if:

  • You are developing an instrumentation plugin for opentelemetry in node.
  • You get broken-traces or the structure of the trace tree is not as expected.
  • You would like to have a deeper understanding of how the magic works under the hood.

So you started using OpenTelemetry in your javascript application to collect traces. Since you are reading an article about context management, I assume that you already read and understand what OpenTelemetry is all about and the basics of how to setup tracing for your application. You probably already run some example application like the basic-tracer-node under the “examples” directory in opentelemetry-js repository, and got a trace in Jaeger:

In the trace above we have one main span, with 10 doWork child spans. This child-parent connection between main and doWork is what we want to talk about - how doWork knows to link to its parent?

What is Span Context

A trace is a collection of spans, with some relation between the spans (parent <> children). In the example above we have 11 different spans which all belong to the same trace.

In OpenTelemetry, each span is handled autonomously — it starts, ends, and processed orthogonally to other spans. Jaeger received each of the above spans individually and might receive many other spans from other traces in between. Since those traces share the same traceId, Jaeger can fuse them into one trace. It draws the spans as a tree because each span record the span id of it’s parent. We’ll call these two properties SpanContext. They are captured each time a span is created via the tracer.startSpan() function.

Span context is the glue that connects those separate spans into a trace. The SpanContext interface is defined in @opentelemetry/api:

export interface SpanContext {
traceId: string;
spanId: string;
// and few other context attributes
}

The Goal

The goal is to have a complete, well-structured trace tree.

If everything works properly, all you need to do is to install open telemetry plugins for the libraries you are using and want to auto instrument, setup and configure OpenTelemetry tracing, and should not bother yourself with context.

When things don’t work well (for various reasons), you might encounter:

  • Broken traces — where you expect to see one trace for a single execution, but receive multiple traces where each one is a sub-part of the full trace. For example: If your application receives an http call and writes something to the database, and you get two traces: one with http span, and another with database span.
  • Wrong tree hierarchy — when you logically expect a span A to be the child of some span B, but instead it resides under another span C, giving the impression that the flow of the code is different from what it really is.

How a Span Gets its Context?

The most straightforward way to link a span to its parent is to set the parent option on SpanOptionswhen starting the span:

const span = tracer.startSpan('doWork', { parent });

The parentspan stores its own id and the trace id it belongs to, which becomes the parentSpanId and traceId of the new span. This method is sometimes useful, but it doesn’t fit for general instrumentation plugins which patch module functions and cannot usually pass around the parent span object to other plugins or places where it is needed.

Let's illustrate with an example - we’ll take the basic-trace-node example that we used earlier and change it a bit:

  • Replace the doWork function with a more realistic example that uses async-await semantic and axios to GETsome data from an external HTTP endpoint.
  • Install the https plugin, so outgoing HTTP traffic from thedoWork function will create a span per request.
  • Reduce the work spans from 10 to 2, so Jaeger will be less noisy, and await for there completion with `Promise.all()`

The code after those modifications becomes:

If we run the above example and check Jaeger, we’ll see broken-traces — there are 3 traces instead of one! Our old friend “main” span as one-span trace, and two additional traces with single-span each, for the two HTTP operations. The issue is that the two HTTP spans were created internally in the @opentelemetry/https plugin and can’t tell their parent is “main” span. To fix the issue, we need to set a scope for the “main” span with the `tracer.withSpan` function:

ThewithSpan function executes its callback parameter in such a way that when a new span is started inside that function or one of the cascading sync or async calls, the new span context is set correctly to parentSpan. Since we called doWork() from within the scope of withSpan , and it calls axios.get() which calls the instrumented https module - the HTTP spans are now created with the right traceId and parentSpan:

We can see one complete trace, as well as that the two requests are executed simultaneously since they run with Promise.all and async function. We fixed the broken-traces issue by setting the scope for the span.

ContextManager

ContextManager is a global, singleton service of the OpenTelemetry js framework which tracks the currently active context at any time. It is used by the tracer to query for the current active Context using the active()function, and to set scoped Context for following function calls using the with() and bind() functions.

AsyncHooksContextManager

OpenTelemetry will take care of setting the correct context for any function called from inside the with() scope, even for async function and promises which start new spans after being awaited.

For example, the following code will produce a correct trace where the three HTTP spans are children of the main span:

The context will propagate correctly to the timer callback being executed after 1 second, to the awaited `axios.get` call, and to the non awaited, second axios.get call. This is done by subscribing to various lifecycle events of asynchronous resources using the async_hooks node module.

Debugging

If you are writing a plugin, or need to debug an issue with span context not being set correctly, here are some tips from our experience at Aspecto:

  • To see the current active context, you can examine the return value from opentelemetry.context.active() imported as const opentelemetry = require("@opentelemetry/api") . This function will return a Context object, which is an empty object if there is no active context, and object with key such as “Symbol(OpenTelemetry Context Key ACTIVE_SPAN)” if a context is set.
  • If the code is using `thenables` objects (objects that has a then()method, and behave like a promise, but are not a real promise), then due to a bug in V8, context will not propagate across the thenable call. The following example code will create child_span as a new trace, and will not link it correctly under the main parent, even though it is awaited from inside the withSpan call.
  • The withSpan function is a common practice in js plugins. It scopes the part of code that should run under the context of a specific span. Another common practice is to use the bind() function, which binds a context to a later run of an arbitrary function or EventEmitter. It can sometimes be surprising to run a function and receive a context that is different from what’s stated in withSpan. So keep that in mind in case you see some voodoo context being set and not sure where it is coming from.
  • Make sure your spans are scoped correctly. In many modules, there are complicated callbacks logic and interactions with other modules. If your trace is not structured correctly, check which parts of the code resides under the withSpan() and bind() functions.

if you have any questions on how to leverage OpenTelemetry on your environment feel free to reach me or the Aspecto team.

--

--