Distributed Tracing with OpenTelemetry

Yuri Grinshteyn
Jan 9 · 5 min read

Toward the end of last year, I had the good fortune of publishing a reference guide on using OpenCensus for distributed tracing. In it, I covered distributed tracing fundamentals, like traces, spans, and context propagation, and demonstrated using OpenCensus to instrument a simple pair of frontend/backend services written in Go. Since then, the OpenCensus and OpenTracing projects have merged into OpenTelemetry, a “single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application.” I wanted to attempt to reproduce the work I did in OpenCensus using the new project and see how much has changed.

Objective

For this exercise, I built a simple demo. It consists of two services. The frontend service receives an incoming request and makes a request to the backend. The backend receives the request and returns a response. Our objective is to trace this interaction to determine the overall response latency and understand how the two services and the network connectivity between them contribute to the overall latency.

In the original guide, the two services were deployed in two separate GKE clusters, but that is actually not necessary to demonstrate distributed tracing. For this exercise, we’ll simply run both services locally.

While the basic concepts are covered in the reference guide, and in much greater detail in the Google Dapper research paper, it’s still worth briefly reviewing them here so that we can understand how they’re implemented in the code.

From the reference guide:

A trace is the total of information that describes how a distributed system responds to a user request. Traces are composed of spans, where each span represents a specific request and response pair involved in serving the user request. The parent span describes the latency as observed by the end user. Each of the child spans describes how a particular service in the distributed system was called and responded to, with latency information captured for each.

This is well illustrated in the aforementioned research paper using this diagram:

Let’s take a look at how we can implement distributed tracing in our frontend/backend service pair using OpenTelemetry.

Note that most of this is adapted from the samples published by OpenTelemetry in their GitHub repo. I made relatively minor changes to add custom spans and use the Mux router, rather than just basic HTTP handling.

Frontend code

We’ll start by reviewing the frontend code.

Imports

First, the imports:

Mostly, we’re using a variety of OpenTelemetry libraries at this point. We’ll also use the Mux router to handle HTTP requests (I mostly use it because it seems to be similar to Express in Node.js).

Main function

Next, let’s have a look at the main() function for our service:

As you can tell, this is pretty straightforward. We initialize tracing right at the start and use a Mux router to handle a single route for requests to /. We then start the server on port 8080. I also added a check of an environment variable that tells the service whether it’s running locally; if so, it bypasses the macOS prompt to allow inbound network connections, as per these instructions.
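The shape of such a main() function can be sketched roughly as follows. This is not the original code: it uses the standard library’s http.ServeMux as a stand-in for the Mux router, it omits tracing setup, the RUN_LOCAL variable name is hypothetical, and it assumes the macOS prompt is avoided by binding to the loopback address only.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// listenAddr binds to loopback only when running locally. The assumption
// here is that this is what keeps macOS from asking about inbound
// connections; otherwise the server listens on all interfaces.
func listenAddr(local bool) string {
	if local {
		return "localhost:8080"
	}
	return ":8080"
}

// mainHandler is a placeholder for the traced handler described later.
func mainHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "OK")
}

// newServer wires the single "/" route to the handler.
func newServer(local bool) *http.Server {
	mux := http.NewServeMux() // stand-in for the Mux router
	mux.HandleFunc("/", mainHandler)
	return &http.Server{Addr: listenAddr(local), Handler: mux}
}

func main() {
	// RUN_LOCAL is a hypothetical variable name chosen for this sketch.
	local := os.Getenv("RUN_LOCAL") != ""
	srv := newServer(local)
	fmt.Println("would listen on", srv.Addr)
	// To actually serve requests: log.Fatal(srv.ListenAndServe())
}
```

The sketch stops short of calling ListenAndServe so that it exits immediately when run; a real service would make that call as its last step.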

Initialize tracing

Next, let’s take a look at the initTracer() function:

Here, we’re simply instantiating the Stackdriver exporter and setting the sampling parameter to capture every trace.
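To illustrate what a sampling parameter controls, here is a small standalone sketch of two sampling policies. The Sampler interface and both implementations are hypothetical and are not the OpenTelemetry or Stackdriver exporter API; they only show the difference between capturing every trace and capturing a subset.

```go
package main

import "fmt"

// Sampler decides whether the trace with the given ID is recorded.
// This interface is hypothetical, for illustration only.
type Sampler interface {
	ShouldSample(traceID uint64) bool
}

// alwaysSample records every trace, which is handy for a demo
// but usually too expensive for a busy production service.
type alwaysSample struct{}

func (alwaysSample) ShouldSample(uint64) bool { return true }

// everyNth records one trace out of every n, keyed on the trace ID.
type everyNth struct{ n uint64 }

func (s everyNth) ShouldSample(traceID uint64) bool { return traceID%s.n == 0 }

// countSampled reports how many of the IDs 0..total-1 a sampler keeps.
func countSampled(s Sampler, total uint64) int {
	kept := 0
	for id := uint64(0); id < total; id++ {
		if s.ShouldSample(id) {
			kept++
		}
	}
	return kept
}

func main() {
	fmt.Println("always:", countSampled(alwaysSample{}, 1000))
	fmt.Println("every 10th:", countSampled(everyNth{n: 10}, 1000))
}
```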

Handle requests

Finally, let’s look at the mainHandler() function that is called to handle requests to /.

Here, we’re setting the name of the tracer to “OT-tracing-demo” and starting a root span labeled “incoming call”. We then create a child span of that labeled “backend call” and pass the context to it. We then create a request to our backend server, whose location is defined in an env variable and inject our context into that request — we’ll see how that context is used in the backend in a bit. Finally, we make the request, get the status code, and output a confirmation message. Pretty straightforward!

A couple of things to note further:

  • I am explicitly closing the child span, rather than using defer, for more control over exactly when the timer is stopped.
  • I am adding events to spans for clearer labeling.

Now, let’s look at our backend.

Backend code

Much of the code here is very similar to the frontend: we use exactly the same main() and initTracer() functions to run the server and initialize tracing.

mainHandler

The mainHandler() function does look quite different. Here, we extract the span context from the incoming request, create a new request object using that context, and create a new span using that request context. We also add an event to our span for explicit labeling. Finally, we return "OK" to the caller and close our span. Again, I could have used defer span.End() instead of doing it explicitly.

Note the difference between span context and request context. This is specifically relevant when accepting incoming context and using it to create child spans. For further exploration of these two, take a look at the relevant documentation from OpenTracing.

Now that we’ve seen how to implement tracing instrumentation in our code, let’s take a look at what this instrumentation creates. We can run both frontend and backend locally after setting the relevant environment variables for each and using gcloud auth login to log in to Google Cloud. Once we do that, we can hit the frontend on http://localhost:8080 and issue a few requests. This should immediately result in traces being written to Stackdriver:

You can see the span names we specified in our code and the events we added for clearer labeling. One additional thing I was pleasantly surprised by is that OpenTelemetry explicitly adds steps for the HTTP/networking stack, including DNS, connecting, and sending and receiving data.

I greatly enjoyed attempting to reproduce the work I did in OpenCensus with OpenTelemetry and eventually found it understandable and clear, especially once I was pointed to the tracer.Start() method to create child spans. Come back next time when I attempt to use the stats features of OpenTelemetry to create custom metrics. Until then!

Originally published at http://github.com.

Google Cloud - Community

A collection of technical articles published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Written by Yuri Grinshteyn

CRE at Google Cloud; sporadic coder; Stackdriver superfan
