Distributed tracing with OpenTelemetry in your Go/Python microservices

Aliaksandr Damaronak · Adevinta Tech Blog · Jan 25, 2023

Learn how to troubleshoot problems with microservices systems and walk through a real-life example

What is Distributed tracing?

Imagine you have a system made of 100+ microservices, communicating with different data sources and external systems. Everything works great for a year after the service launch. But suddenly, one day, the system becomes unreliable — it stops sending responses for some requests, or starts getting slower. What would you do to address it and understand what’s going wrong in the system? Distributed tracing deals with this kind of problem very well. We use it to troubleshoot systems at Adevinta and keep things running smoothly.

A brief introduction

Distributed tracing is a method of tracking how requests flow inside your system. This can help you to highlight every suboptimal or broken part of the system by visualising the full path of requests.

Main benefits

  • Helps to spot system reliability issues by highlighting degraded parts of the system
  • Helps to define and maintain SLAs

Distributed tracing lets us know exactly how long each step in the system takes to execute. This enables us to evaluate whether we are in compliance with our SLAs, and it can also help to define or adjust SLAs for new features.

Main issues

  • Human factors during manual instrumentation of services in a system

As services have to be instrumented manually, there is a chance that a developer will miss some parts of the system or won’t instrument them properly. This can lead to missing traces and a mess of spans in the visualisation tool.

OpenTelemetry allows us to instrument code automatically, but it doesn’t always bring the benefits you expect. In most cases, the flexibility provided by automatic instrumentation libraries is not enough for real applications.

Example of tracing diagram

Trace diagrams show how requests flow through the system, which parts of the system are involved and how long they take to complete their work.

In the picture above we can see a trace.

Trace — a representation of a series of causally related events that shows how a request flows through a system

It shows that the whole request took 500 ms to complete and that the request visited two services — A and B — inside the A₀ span.

Span — the basic unit of work, represents an event in our system

Three other events were executed — A₁, A₂ and B₀ (service B was called from unit A₂). Then unit A₃ was executed and took 250 ms to complete.

Now let’s look at something that helps integrate distributed tracing into our systems.

OpenTelemetry

OpenTelemetry is a collection of tools, APIs, and SDKs that you can use to instrument, generate, collect and export telemetry data (metrics, logs, and traces, known as the three pillars of observability) in a vendor-agnostic way.

OpenTelemetry provides you with:

  • A single, vendor-agnostic instrumentation library per language with support for both automatic and manual instrumentation
  • A single, vendor-neutral collector binary that can be deployed in a variety of ways
  • An end-to-end implementation to generate, emit, collect, process, and export telemetry data
  • Full control of your data with the ability to send data to multiple destinations in parallel through configuration
  • Open-standard semantic conventions to ensure vendor-agnostic data collection
  • The ability to support multiple context propagation formats in parallel to assist with migrating as standards evolve

How to start using distributed tracing with OpenTelemetry in your system

There are three steps you should complete to adopt distributed tracing in your system:

  1. Instrument your code
  2. Collect and export tracing data
  3. Analyse and visualise it

The first two points are covered by OpenTelemetry, and the third one is solved by observability backends.

In this article, we will use two different observability backends — Grafana Tempo and Jaeger. There is a big difference between them, specifically, in the guarantees they offer. We won’t get into discussing observability backends in depth in this article, but if you are curious, I can recommend these resources about head-based (Jaeger) sampling and tail-based (Tempo) sampling.

Real-life example

To give you a better understanding of distributed tracing and OpenTelemetry concepts, I created a system of two microservices and instrumented them. Now I will show you, step by step, how to instrument your Python/Go applications.

The system walkthrough

Imagine you have a Medium-like application, which consists of two microservices — the first service is responsible for showing articles to users (articles-microservice), and the second recommends articles similar to the ones users are currently reading (suggestions-microservice).

The system diagram

The source code and startup guide can be found in the repo README.

How data flows in the system

When you make a request to articles-microservice, it retrieves the requested data from memory (the data was loaded at startup from mocks/articles.json) and returns a list of articles.

It’s also possible to add the with_suggested parameter to the request URL query. In that case, for every article in the list, articles-microservice will query suggestions-microservice, which will retrieve a list of suggested article IDs from its memory (the data was loaded at startup from mocks/suggestions.json). Then articles-microservice will enrich these IDs with article data and return the combined result to the user.

The project structure

├── README.md
├── apps // contains microservices applications
│ ├── articles // contains the main microservice, written in Go
│ ...
│ └── suggestions // contains the auxiliary microservice, written in Python
│ ...
├── grafana // preconfigured dockerized Grafana stack for running the Tempo tracing backend
│ ├── docker-compose.yaml
│ ├── grafana-datasources.yaml
│ ├── prometheus.yaml
│ └── tempo.yaml
├── jaeger //preconfigured dockerized Jaeger tracing backend
│ └── docker-compose.yaml
├── requirements // project-level python requirements, mainly used by generate_mocks.py script
│ └── requirements.txt
└── scripts // helper scripts for the project; for now, just the mock data generator plus a wrapper that simplifies its usage
    ├── generate_mocks
    └── generate_mocks.py

How to run the project

The first thing you should do is generate mock data using the following command:
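Assuming the generate_mocks wrapper shown in the project tree above is the intended entry point (the exact invocation may differ in the repo), the command looks like this:

./scripts/generate_mocks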

Generated files are our data sources for both microservices. I decided not to use a database in order to keep the project simple.

The next step is to run the microservices and tracing backend. As everything is dockerized for your convenience, you can run the whole stack with this command:

docker compose up -d && docker compose -f jaeger/docker-compose.yaml up -d

That’s it — the project is ready to work with.

Article microservice teardown (Golang application)

For our main microservice, I chose Golang and the Go kit framework. The project follows the standard structure encouraged by Go kit, so it will be familiar to most readers who have experience building microservices with Go. This framework is widely used in the community.

The microservice folders structure

apps/articles
├── Dockerfile
├── articles // all internal app code resides here
│ ├── endpoint.go
│ ├── errors.go
│ ├── models.go
│ ├── service.go
│ └── transport.go
├── go.mod
├── go.sum
├── main.go // entry file
├── mocks
│ └── articles.json // generated mock file, our datasource
└── tracing // main tracing code is in this folder
    ├── middleware.go
    └── provider.go

The service can do several things:

  1. Show us a list of articles
  2. Show detailed info for an article by id
  3. If we ask, it will enrich article objects with suggested articles from suggestions-microservice

How the service is instrumented

A huge part of the logic resides inside the /tracing folder.

provider.go - responsible for initialisation of tracing provider

middleware.go - middleware which registers the first span in the microservice on every request

Things worth mentioning from provider.go
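The full file is in the repo; as a rough sketch, an initialiser of this kind built on the OpenTelemetry Go SDK with an OTLP/gRPC exporter (the function name, endpoint handling and attribute values here are assumptions, not the repo's exact code) looks like this:

package tracing

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// InitProvider registers a global tracer provider that batches spans and
// ships them to the backend configured via TRACING_BACKEND_URL.
func InitProvider(ctx context.Context, backendURL string) (func(context.Context) error, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(backendURL),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("articles-microservice"),
        )),
    )
    otel.SetTracerProvider(tp)

    // The returned shutdown function flushes any remaining spans on exit.
    return tp.Shutdown, nil
}

main.go can then call this initialiser at startup and defer the returned shutdown function so pending spans are flushed when the service stops.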

middleware.go teardown
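Again as a sketch rather than the repo's exact code, an HTTP middleware that opens the root span for every incoming request and stores it in the request context looks roughly like this:

package tracing

import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
)

// Middleware opens the first span of every incoming request and puts it into
// the request context so that inner functions can create child spans from it.
func Middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx, span := otel.Tracer("articles-microservice").Start(
            r.Context(),
            r.Method+" "+r.URL.Path,
            trace.WithSpanKind(trace.SpanKindServer),
        )
        defer span.End()

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}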

Then we have to manually create spans inside every function that we want to record in a final trace — a code sample follows.
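For illustration, a traced service method could look like the sketch below (the receiver, method and store names are made up):

import (
    "context"

    "go.opentelemetry.io/otel"
)

// getArticles records its own span as a child of whatever span is already
// stored in ctx — the one opened by the middleware or by the calling function.
func (s *service) getArticles(ctx context.Context) ([]Article, error) {
    ctx, span := otel.Tracer("articles-microservice").Start(ctx, "service.getArticles")
    defer span.End()

    // Illustrative data access; its duration is now covered by the span.
    return s.store.AllArticles(ctx)
}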

How it works in practice

To see how it works, you should follow these steps.

  1. Make a request
curl http://localhost:8001/articles/v1/ | jq

You will get a similar response to the one below.

Content might be different on your computer, so keep an eye only on the response structure.

{
  "articles": [
    {
      "id": "d18602d5a5554e5f98734b39dc56b176",
      "author": "Adrian Perez",
      "title": "Oil activity pay.",
      "text": "Happen education ever. Majority over teacher boy on arm. Somebody both opportunity us per."
    },
    {
      "id": "80c056b351334da5b3c95f4b38bbb90f",
      "author": "Melissa Nichols",
      "title": "Fear room federal.",
      "text": "Building result difference hold. Imagine professional media both.\nReal building gas into. Example world skill treat. Return time raise decade choose."
    },
    ...
  ]
}

2. Open Jaeger UI and find a recently created trace. It should look similar to what you see in the screenshot below.

On this page, we can spot several things:

  1. Trace was created for /articles/v1/ endpoint
  2. The whole request took 159 ms to complete
  3. Spans were created for each functional unit that the request visited
  4. The trace graph follows the project structure: the request decoder, the endpoint and the response encoder are on the same level in the Go kit model, and the getArticles service method is called by the endpoint, which shows that they are causally dependent
  5. Trace shows how much time each unit took to complete

Suggestions microservice teardown (Python application)

To show you how tracing helps in tracking request flows across a system made of several microservices, I wrote a service that acts as an external component for articles-microservice. This service shows suggestions for a given article. The suggestions-microservice is written in Python. As a framework I chose Flask — it is well-known in the community, and there is a good chance most readers are familiar with it.

There is a library for auto-instrumenting Flask-based projects, but the goal of this article is to show readers how to instrument services manually, as it gives you more flexibility. But if you’re interested in auto-instrumentation, please read this page of the OpenTelemetry documentation.

The microservice folders structure

apps/suggestions
├── Dockerfile
├── app.py // entry file
├── mocks
│ └── suggestions.json // generated mock file, our datasource
├── requirements
│ └── requirements.txt // requirements specific to the microservice
├── suggestions
│ ├── __init__.py
│ ├── containers.py
│ ├── service.py
│ └── views.py
└── tracing.py // all tracing code resides here

How the service is instrumented

The whole tracing logic resides in the tracing.py file. There are two main parts:

  1. The provider initialiser; the code is pretty similar to what we have in the articles microservice, but written in Python (a sketch follows after this list)
  2. Tracing decorator
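For the first part, a minimal sketch using the Python SDK with an OTLP exporter (the endpoint value and service name are assumptions, not the repo's exact code):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register a global tracer provider that batches spans and ships them to the
# backend configured via TRACING_BACKEND_URL (e.g. jaeger:4317 or tempo:4317).
provider = TracerProvider(
    resource=Resource.create({"service.name": "suggestions-microservice"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="jaeger:4317", insecure=True))
)
trace.set_tracer_provider(provider)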

The most interesting piece of code here is the implementation of the tracing decorator. It’s well documented right in the code, but I want to say more about what it can do.
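The exact implementation lives in tracing.py; a stripped-down sketch of how such a decorator can be built (the real one also handles classmethods, staticmethods and propagating disable_tracing into nested calls) could look like this:

import functools
import inspect

from opentelemetry import trace

tracer = trace.get_tracer("suggestions-microservice")


class Trace:
    def __init__(self, disable_tracing: bool = False):
        self.disable_tracing = disable_tracing

    def __call__(self, target):
        # Class target: wrap every plain method once, at import time, so no
        # introspection is left for the request path.
        if inspect.isclass(target):
            for name, member in list(vars(target).items()):
                if inspect.isfunction(member) and not name.startswith("__"):
                    setattr(target, name, Trace(self.disable_tracing)(member))
            return target

        # Function/method target: either skip tracing or wrap the call in a span.
        if self.disable_tracing:
            return target

        @functools.wraps(target)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(target.__qualname__):
                return target(*args, **kwargs)

        return wrapper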

How to use the decorator

We can decorate classes or functions.

When we decorate a class, all of its methods are automatically decorated with @Trace() and the provided arguments; this also works for classmethods and staticmethods.
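For example, decorating the whole service class (the method names here are made up for illustration):

@Trace()
class SuggestionsService:
    def get_by_article_id(self, article_id):
        ...

    def load_from_mocks(self):
        ...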

The code above means that all methods of SuggestionsService will be instrumented. This is equivalent to:
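Sticking with the same illustrative methods:

class SuggestionsService:
    @Trace()
    def get_by_article_id(self, article_id):
        ...

    @Trace()
    def load_from_mocks(self):
        ...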

If we want to disable tracing for some methods of a class, we can decorate these methods with @Trace(disable_tracing=True).

If we want to disable tracing on selected endpoints, we have to decorate their views with @Trace(disable_tracing=True).

This will automatically disable tracing for the whole request, even on methods used inside the get_by_article_id view.

Lastly, it’s worth mentioning that this decorator doesn’t add any overhead at runtime, as all the heavy introspection happens during the module’s initialisation step, at application startup.

How it works in practice

To see how it works you should follow these steps:

1. Make a request

curl http://localhost:8002/suggestions/v1/<article_id> | jq

You will get a list of suggested article IDs for the given article_id.

As before, content might be different on your computer, so keep an eye only on the response structure.

[
  "fd0236aca4474cfa8d4e2b3718dd6fe0",
  "9fd6a8b6557147c396e0f42046de7f80",
  "c2317f1138324945a836cceb902e25d1"
]

2. Open Jaeger UI and find a recently created trace. It should look similar to what you see in the screenshot below.

The content of the screenshot is similar to what we’ve already seen in articles-microservice, but now we see there are spans from suggestions-microservice.

Wiring services together

To wire tracing data of all services together, we should pass span context from a caller to a target service. For this purpose, we should use propagators.

Propagators — objects used to read and write span context data to and from messages exchanged by the applications

Types of Propagator

The Propagators API currently defines one Propagator type:

  • TextMapPropagator is a type that injects values into and extracts values from carriers as string key/value pairs.

Carrier — A carrier is a medium used by Propagators to read values from and write values to. Each specific Propagator type defines its expected carrier type, such as a string map or a byte array.

For more details, I recommend reading the official OpenTelemetry doc.

How it’s done in the code

I added logic in several places:

  1. articles-microservice — on provider initialisation, because we want to carry span context over through HTTP headers when we make requests to suggestions-microservice (full source code)
otel.SetTextMapPropagator(propagation.TraceContext{})

2. articles-microservice — tracing middleware, where I also extract context from HTTP headers using HeaderCarrier (full source code)

It’s a good practice to extract context at the top input layer, even if a service is not meant to be called from other services, because there is a chance it will be needed in the future, and this extra instrumentation can also help spot service misuse.
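In sketch form, the span creation from the middleware above gains an Extract call (propagation is go.opentelemetry.io/otel/propagation; HeaderCarrier simply adapts http.Header to the propagator's carrier interface):

// Continue the caller's trace if its span context travels in the headers;
// otherwise Extract returns the unchanged context and a new trace is started.
parentCtx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

ctx, span := otel.Tracer("articles-microservice").Start(
    parentCtx,
    r.Method+" "+r.URL.Path,
    trace.WithSpanKind(trace.SpanKindServer),
)
defer span.End()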

3. articles-microservice — we should also instrument the HTTP client that calls suggestions-microservice

Now, on every call, this client will inject the parent span context into the request headers semi-automatically — semi-automatically because we still have to provide the parent context to the request manually, like this (full source code)
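One common way to get this behaviour is the otelhttp transport wrapper from go.opentelemetry.io/contrib; the repo may differ in details, and the function name and URL below are illustrative:

import (
    "context"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// callSuggestions shows the pattern: an instrumented client plus a request
// built with the caller's context.
func callSuggestions(ctx context.Context, articleID string) (*http.Response, error) {
    client := http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

    // The transport reads the span context from req.Context() and injects it
    // into the outgoing HTTP headers before the request is sent.
    req, err := http.NewRequestWithContext(ctx, http.MethodGet,
        "http://suggestions:8002/suggestions/v1/"+articleID, nil)
    if err != nil {
        return nil, err
    }
    return client.Do(req)
}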

4. The last thing we have to do is to extract context from HTTP headers in suggestions-microservice (full source code)

and use this context as a parent context for the first span (full source code)
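A sketch of both steps with the opentelemetry Python API (the span name is illustrative):

from flask import request
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("suggestions-microservice")

# Rebuild the caller's span context from the incoming HTTP headers...
parent_context = extract(request.headers)

# ...and start the first span of this service as its child.
with tracer.start_as_current_span("get_by_article_id", context=parent_context):
    ...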

To test it, we should query the articles microservice API with the additional parameter with_suggested=1:

curl http://localhost:8001/articles/v1/?with_suggested=1 | jq

The response should look similar to this one. The new suggested key is now present in the response.

{
  "articles": [
    {
      "id": "fd0236aca4474cfa8d4e2b3718dd6fe0",
      "author": "Timothy Foster",
      "title": "Detail assume wonder.",
      "text": "Clear alone wonder movement difference consumer. Bring lose short.\nTeach charge cup service. Fine yeah per wrong public. Until bit page real later third old focus.",
      "suggested": [
        {
          "id": "2119cd10997741cb81c1dcc058e2abd4",
          "author": "Christopher Huff",
          "title": "Gas company discussion whatever.",
          "text": "Either within behavior may. Color purpose reality wonder. Better during party.\nLittle ago measure themselves meet. Do their perform among."
        },
        {
          "id": "2119cd10997741cb81c1dcc058e2abd4",
          "author": "Christopher Huff",
          "title": "Gas company discussion whatever.",
          "text": "Either within behavior may. Color purpose reality wonder. Better during party.\nLittle ago measure themselves meet. Do their perform among."
        },
        {
          "id": "d4dc9ba64ee14f2591472a2d0fe17148",
          "author": "Nathan Wright",
          "title": "While at on.",
          "text": "Senior collection conference process become. Follow product process phone special.\nType hotel role smile although unit. Condition reveal generation like."
        }
      ]
    },
    ...
  ]
}

Now open Jaeger UI and let’s see how it shows the request flow between these two microservices.

On the search page, we see a new trace entity. Notice that now it shows that the request visited two services.

The graph shows us exactly which function of articles-microservice called which methods of suggestions-microservice, and how requests visited both microservices.

From articles-microservice, we call suggestions-microservice for every article in the list. This is suboptimal for a production system but fine for the demo project, as it clearly shows how requests flow between the services.

Benefits of using vendor-agnostic instrumentation

As was mentioned above, OpenTelemetry offers a vendor-agnostic way to instrument applications. Previously, we were using Jaeger as a tracing backend, but what if we want to switch to something else?

The only thing you need to do is replace the tracing backend URL (and, in some cases, the auth method) with one that points to the new tracing backend.

To make it work on your machine, follow these simple steps:

  1. Stop Jaeger service
docker compose -f jaeger/docker-compose.yaml stop

2. Run Grafana stack docker-compose

docker compose -f grafana/docker-compose.yaml up

3. Replace TRACING_BACKEND_URL with the relevant value in both microservices' .env.docker files

# TRACING_BACKEND_URL=jaeger:4317
TRACING_BACKEND_URL=tempo:4317

4. Restart the main docker-compose

5. Query articles microservice with with_suggested=1 flag

curl http://localhost:8001/articles/v1/?with_suggested=1 | jq

6. Open Grafana UI

7. In the left menu, open the Explore tab

8. Open the Search tab and press the Run query button; you should find a new trace there

9. Open the trace

On the right side of the screen, a graph opens — it is the same graph you saw in Jaeger.

The services’ colours are pretty similar because Grafana calculates colours based on the service name, and as both our services have a *-service postfix, their hashes look almost the same, which leads to this annoying behaviour.

Conclusions

We explored distributed tracing and OpenTelemetry. We reviewed how to add distributed tracing with OpenTelemetry to Go and Python applications, and how to track requests across a microservice environment. We briefly explored how to visualise tracing data with Jaeger and Grafana Tempo.

I hope this article has helped you to build your expertise in distributed tracing and OpenTelemetry and helps to improve the observability of your services.
