Distributed tracing with OpenTelemetry — Part 1
Running distributed systems in production is not an easy task. Access is limited, so issues are hard to track down, understand and reproduce in other environments. No other environment is like production, no matter how similar it may seem. Usually you rely on different pieces of information to troubleshoot and understand what is happening to your applications in production, most commonly logs and custom metrics. These two things are really powerful when combined: with logs you can understand and troubleshoot exactly what happened with a specific request or scenario, while with metrics you can see the overall behaviour and performance of the system. It looks like a really good solution, and it really is; the problem is tuning it to reflect exactly what you need. These are a couple of challenges I see when adding observability and monitoring to distributed systems:
- Correlating logs with metrics — Depending on the platform you use to persist the data, correlation may be very difficult or very manual, especially if you use one platform for logs and another for metrics. Very often you will end up manually correlating things by timestamp, unfortunately.
- Correlating logs and metrics from different systems — If you are not careful when implementing logging across different applications, the information you export may differ; with structured logging the template (or even the format) may differ. This makes querying and correlating data very tricky, not to mention when the systems are implemented in different technologies with different levels of support for logs and metrics.
- Vendor proprietary logic (lock-in) — Each vendor (platform, backend system) you store your data with may have a different protocol or set of libraries to interact with. This can cause vendor lock-in: you can’t switch to another vendor because of the burden and cost of refactoring your code base to replace all the logic that interacts with their solution.
These points are usually not easy to solve and are a real pain for those who work with distributed systems in production. OpenTelemetry offers a viable way of tackling them.
OpenTelemetry is a standard for implementing telemetry in your applications. It provides a specification containing the requirements all implementations should follow, as well as implementations for major languages, including an API and an SDK to interact with it. OpenTelemetry was born from the merger of two other standards whose projects decided to join forces instead of competing with each other: OpenTracing and OpenCensus. This brings maturity to the standard, as both of the previous projects were already mature and production tested. Another important fact is that OpenTelemetry is part of the Cloud Native Computing Foundation (CNCF). At the time of writing this article, the project is still a Sandbox Project. To give a sense of its importance, other very famous projects started the same way as OpenTelemetry, including Kubernetes, Prometheus and Jaeger (which we will be using later on).
Probably the most interesting thing about OpenTelemetry is its architecture. It was designed to be easily extended, which addresses the last point in our list: OpenTelemetry allows you to migrate from one vendor to another with very little effort. It uses the concept of exporters, which are the vendor-specific implementations of how to send the captured data to a specific backend. So if you want to migrate from Jaeger to Zipkin, you only need to change the exporter you are using, nothing else. This decouples the rest of your code base from the vendor. It also helps to implement observability from the very beginning of development, even before choosing a backend for logs, metrics and traces. According to their website:
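To make the exporter idea concrete, here is a simplified, self-contained sketch of the pattern in Go. It deliberately does not use the real OpenTelemetry SDK (whose exporter contract is the `SpanExporter` interface); the `SpanData`, `JaegerExporter` and `ZipkinExporter` types below are illustrative stand-ins showing why swapping backends only touches one line:

```go
package main

import "fmt"

// SpanData is a simplified stand-in for the span batches the real
// OpenTelemetry SDK hands to an exporter.
type SpanData struct {
	Name string
}

// Exporter mirrors the idea behind the SDK's exporter contract:
// the instrumentation only knows this interface, never the backend.
type Exporter interface {
	Export(spans []SpanData) error
}

// JaegerExporter and ZipkinExporter stand in for vendor-specific
// implementations; only their construction differs.
type JaegerExporter struct{ endpoint string }

func (e JaegerExporter) Export(spans []SpanData) error {
	fmt.Printf("sending %d span(s) to Jaeger at %s\n", len(spans), e.endpoint)
	return nil
}

type ZipkinExporter struct{ endpoint string }

func (e ZipkinExporter) Export(spans []SpanData) error {
	fmt.Printf("sending %d span(s) to Zipkin at %s\n", len(spans), e.endpoint)
	return nil
}

func main() {
	spans := []SpanData{{Name: "GET /weather"}}

	// Migrating from Jaeger to Zipkin is a one-line change:
	// everything else depends only on the Exporter interface.
	var exp Exporter = JaegerExporter{endpoint: "http://localhost:14268/api/traces"}
	// var exp Exporter = ZipkinExporter{endpoint: "http://localhost:9411/api/v2/spans"}

	exp.Export(spans)
}
```

The real SDKs follow the same shape: the tracer pipeline is configured with an exporter once, at startup, and the rest of the application never mentions the backend again.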
OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software.
Logs and metrics are very useful and easy enough to use when we have a single system. However, when working with distributed systems things start to get complicated, and correlating data is not easy. Sometimes a symptom appears in one system while the real root cause is in a different one; without proper correlation you may never find out what happened to a given request or transaction. That’s where distributed tracing comes in. Distributed tracing, as the name suggests, is a technique that allows us to trace requests across multiple systems by correlating them. Usually the traces (or spans) carry useful information, like timestamps and even log messages. This helps a lot when troubleshooting performance issues, or when understanding why a given request failed and where the problem happened.
In our example we are going to see how to implement the concept of Context Propagation with OpenTelemetry. This will make all information we need to troubleshoot a specific request very easy to find and work with. That’s because all requests to all different systems involved in our demo application will be tied together.
Although OpenTelemetry is capable of capturing and exporting metrics (and, more recently, logs), we will not go any deeper into those subjects in these articles.
Next article and demo application
In the next article we will focus on implementing some services in different technologies, using the OpenTelemetry SDK and sending (exporting) data to Jaeger. Here’s an overview diagram of the next article’s solution architecture:
The demo application will be composed of a client application (client) written in Go, which will communicate over gRPC with a web application (weather), also written in Go. The weather app will communicate via HTTP with another web application (temperature) written in C#. With this setup we can explore a couple of interesting things in OpenTelemetry. The first is seeing the implementation in different languages, which is nice. But the most interesting part, in my opinion, is context propagation across different languages (Go and C#) and different protocols (gRPC and HTTP). All applications will record events on their spans and send everything to our distributed tracing backend, Jaeger.
To start configuring the environment where this demo is going to run, let’s install (run) and configure Jaeger.
Installing and configuring Jaeger
The easiest way to install and run Jaeger locally is through Docker. You can use the following command to run a new container for Jaeger. Bear in mind that we are not persisting the data to any volume, meaning that if the container is stopped, all data will be lost. That’s not a big deal for demo or testing purposes.
docker run -d --name jaeger \
-e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
-p 14250:14250 \
-p 9411:9411 \
jaegertracing/all-in-one:latest
To test if it is working, you can use the URL http://localhost:16686/ to access the Jaeger Frontend (query service).
Now that we have Jaeger set up we can start developing and running the services that compose the demo.
In the next article we will go through the implementation details of each service and how to tie all requests together so they are visible under the same trace in Jaeger.