Trace Your Distributed Application with OpenTelemetry

Rachmat Priambudi · Tunaiku Tech · Mar 31, 2021

In a pandemic situation like the one we are facing now, we often hear that contact tracing is one of the most critical parts of the response. By comprehensively tracing a positive case, such as finding out who they interacted with and which places they visited, we can determine the best possible approach to control the situation and the spread of the virus. Did you know that there is a similar concept of tracing in modern software applications?

A Modern Software Application

A modern application usually no longer consists of a single big, rigid monolithic architecture. Instead, it consists of several smaller, domain-specific services that interact with each other. This is what we call a microservice architecture. One of the best selling points of this architecture is that when we make changes to, or scale up, one of our business domains, it will not greatly affect the other domains. Neat, right?

Unfortunately, like all good things in life, this simplicity eventually comes to an end. Let’s say we create an application for buying game wallet vouchers. The simplest implementation diagram looks like the one below.

Diagram of example application

From the diagram above, we can see that when a user submits an order to the payment service (buyGameCode), the service calls the Game Vendor service to get the wallet code (getGameCode) and then sends a notification about the purchase (sendUserNotification). These service calls are invisible to the user, but they contribute to the total latency of the buying process.

Let’s say our application has a typical response time of ~50ms. Then, one day, our users report that the buying process takes much longer than usual. Some of them also did not receive their notification, which includes the redeemable code, so they cannot redeem their vouchers. With our reputation on the line, we search through all of our server logs to identify any abnormalities in our app. We can see some errors occurring when our payment service calls other services, but then we realize: how can we match any particular customer’s problem to these logs? We take so much time identifying these errors that our users start to flee to our competitors, and our business starts to dwindle.

Tracing in Software Application

Like tracing a patient in the pandemic, tracing in an application also has the goal of identifying any problem as quickly as possible across many interconnected services, especially in a microservice architecture. It gives us a clearer picture of one whole process, including all of its sub-processes: the latency, the context, and any error that happened during a single execution. This is called distributed tracing.

Today, there is a rising standard for distributed tracing called OpenTelemetry. It is a merger of OpenTracing and OpenCensus, which makes it a unified standard for telemetry signals such as traces, metrics, and logs.

Please note that, for now, only the tracing specification is in a stable phase. The other signals are not yet stable, so there is a likelihood of breaking changes in the future. You can visit https://opentelemetry.io for more information.

This is an example of what a trace looks like in OpenTelemetry.

Example of a trace in New Relic One

To give us a little more understanding of this graph, I will try to break it down like this.

Trace

A trace is the object that tracks the whole progression of a single request. We have the freedom to define what a single request is: it can be a single HTTP request handled by only one server, a request spanning multiple servers, or even the whole journey starting from the front-facing side, as long as we always put the trace context inside the request headers. The trace context, per the W3C definition, is:

The traceparent HTTP header field identifies the incoming request in a tracing system. It has four fields: version, trace-id, parent-id, and trace-flags. The fields are joined with a dash (-) to form the header value, like this: <version>-<trace-id>-<parent-id>-<trace-flags>

We can view the full specification here. It is also recommended to use a client library instrumentation, like otelhttp in Go, so we do not need to deal with this header manually. All of the information in the example above is considered one trace.

Span

A span is the object that contains all the metadata about a process, such as its name, start and end timestamps, attributes, events, and status. A trace has exactly one root span, represented as the longest bar covering the entire process. If we create a span from a context that already holds a traceparent value, it is considered a child span, represented as each bar under the root entity. And if we create a span from a broken context (for example, one with a valid trace-id but an invalid parent-id), it is considered an orphaned span.
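The root/child/orphan distinction can be sketched with a deliberately simplified span model. This is not the OpenTelemetry SDK type (real spans also carry a name, timestamps, attributes, events, and status); it only models the parent links that determine how a span is classified within its trace.

```go
package main

import "fmt"

// Span is a simplified model of a span, for illustration only:
// just the identifiers that link spans together inside one trace.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for a root span
}

// classify reports how a span relates to the other spans of its trace:
// a root span has no parent, a child span's parent is present in the
// trace, and an orphaned span references a parent that cannot be found.
func classify(s Span, known map[string]Span) string {
	if s.ParentID == "" {
		return "root"
	}
	if _, ok := known[s.ParentID]; ok {
		return "child"
	}
	return "orphaned"
}

func main() {
	root := Span{TraceID: "t1", SpanID: "a"}
	child := Span{TraceID: "t1", SpanID: "b", ParentID: "a"}
	orphan := Span{TraceID: "t1", SpanID: "c", ParentID: "missing"}
	known := map[string]Span{"a": root, "b": child}
	fmt.Println(classify(root, known), classify(child, known), classify(orphan, known))
	// prints: root child orphaned
}
```

Tracing backends apply the same kind of logic when they draw the waterfall graph: the root span becomes the top bar, children are nested under their parents, and orphaned spans are flagged separately.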

We can also add custom attributes to our spans, which is useful for giving additional context about the process each span represents.

So, there it is! Distributed tracing can help us identify any unexpected events in our application, and also improve the responsiveness of our app by giving us clearer context about what happened in a single request, especially when that request makes further requests through API calls. I hope this gives a little explanation of what distributed tracing is and why we should start implementing it in our apps. I will give more details about how to implement it yourself in a future article. Cheers!
