Cloud Trace spanning PubSub
Tracing across PubSub (or other asynchronous action)
Imagine we have an application which publishes a message to Google Cloud PubSub. The message is then, in turn, picked up by another application and processed. From a logical perspective, what we have is:
Our goal is to configure Cloud Trace to generate a trace record that will show us the total duration for the activity. Cloud Trace can be used to instrument a single application to show us the duration for that one app, but needs tweaking to show us the duration for the end-to-end activity.
We will illustrate with a couple of apps written in Java.
When App1
starts, it creates a Cloud Trace context that records the duration of App1
from start to end. During the execution of App1
, it publishes a message to PubSub.
App2
is assumed to be continuously running. It listens for messages arriving at PubSub via a subscription. When a new message is received, the trace context that originated from App1
is extracted from the PubSub message attributes. App2
now creates a new Cloud Trace span that is associated with the trace originally created by App1
. App2
then records its own processing time through Cloud Trace that is used to augment the original App1
trace record. The result is that we end up with one trace record for the complete sequence of steps encapsulating both the execution of App1
and App2
. We will also see the duration that the message languished in the topic on PubSub. This will be shown as the lag (if any) between the end of App1
and the start of App2
.
Before we go into more detail, here is an example of a trace record that will be produced:
What we see is a single trace record that contains two spans. One is called “App1: The publisher
” and the other is called “App2: The subscriber
”. From this, we see:
- The duration of the application’s overall processing (4.246s)
- When the publish started (2.053s from start)
- When the publish completed (2.776s from start)
- When App2 started
As you can see, lots of great information. Now let’s see the details.
To use Cloud Trace, we have to instrument our code. Cloud Trace provides an API that leverages the OpenTelemetry trace specification and implementation. We start by initializing our Cloud Trace environment:
private static OpenTelemetrySdk setupTraceExporter() throws Exception {
// Using default project ID and Credentials
TraceConfiguration configuration = TraceConfiguration.builder().setDeadline(Duration.ofMillis(30000)).build();
SpanExporter traceExporter = TraceExporter.createWithConfiguration(configuration);
// Register the TraceExporter with OpenTelemetry
return OpenTelemetrySdk.builder()
.setTracerProvider(
SdkTracerProvider.builder()
.addSpanProcessor(BatchSpanProcessor.builder(traceExporter).build())
.build())
.buildAndRegisterGlobal();
} // setupTraceExporter
This is boiler plate.
The real key is the following:
OpenTelemetrySdk openTelemetrySdk = setupTraceExporter();
Span span =
openTelemetrySdk.getTracer(App1.class.getName())
.spanBuilder("App1: The publisher")
.startSpan();
try (Scope scope = span.makeCurrent()) {
// Do some work
} finally {
span.end();
}
Our goal is to create an instance of a “span”. The span is the block that measures a section of time. A span has a string (“App1: The publisher
”) that shows prominently on the trace explorer. We can also add attributes and events within the span. Within the span we do some work (your logic) and then declare that the span has ended.
So far, we have merely described some of the core notions of OpenTelemetry … the focus of this article is to describe how we can illustrate continuing a trace across asynchronous components (i.e. PubSub). Now we have to expose the notion of two ideas. These are the traceId and the spanId. When work is recorded by Cloud Trace, it is recorded under the notion of a single trace unit. This is the start of the whole work to the end of the whole work. Such a trace is identified by a globally unique value called the traceId. Within the trace, we have one or more spans where each span represents a specific task within the solution as a whole. Similar to the overall trace, each span has a globally unique span identifier (spanId). Spans have a parent/child relationship with each other. One span has no parent and is called the root span. All other spans have a relationship to another span.
To achieve our distributed story, we extract the traceId and the spanId from our current application and pass those as contextual properties to the downstream asynchronous component. When that component starts, it extracts the context (traceId and spanId of the original component) and then creates its own span but forces that new span to have the same parent trace and be a child of the previous span. The data now recorded by the asynchronously started component is now logically associated with the original component and when visualized, we see one trace with two spans contained within.
Within our code, we can extract the current traceId and spanId from the current span using:
String traceId = span.getSpanContext().getTraceId();
String spanId = span.getSpanContext().getSpanId();
We can then propagate these values to the asynchronous component. For example, with PubSub, we can set these as attributes in a published message.
When the asynchronously started component begins, it will retrieve and use the values.
Here is a code fragment illustrating this:
SpanContext spanContext = SpanContext.createFromRemoteParent(
traceId,
spanId,
TraceFlags.getSampled(), TraceState.getDefault());
Context context = Context.root().with(Span.wrap(spanContext));
Span span =
openTelemetrySdk.getTracer(App2.class.getName())
.spanBuilder("App2: The subscriber")
.setParent(context)
.startSpan();
try (Scope scope = span.makeCurrent()) {
// Do work
The core notion in the above is the explicit creation of the context that is passed to the span when it is created.
And finally … a short video illustrating the story described here.
See also:
- GitHub repository for the sample code.
- Cloud Trace overview
- OpenTelemetry