Distributed tracing with OpenTelemetry — Part 1

Understanding OpenTelemetry and distributed tracing

Ricardo Linck
The Startup


Photo by Luke Chesser on Unsplash

Running distributed systems in production is not an easy task. Access is limited, so issues are hard to track down, understand, and reproduce in other environments. No other environment is like production, no matter how similar it may seem. Usually you rely on different pieces of information to troubleshoot and understand what is happening to your applications in production, most commonly logs and custom metrics. These two are really powerful when used together: with logs you can understand and troubleshoot exactly what happened with a specific request or scenario, while with metrics you can see the overall behaviour and performance of the system. It looks like a really good solution, and it really is; the problem is tuning it to reflect exactly what you need. These are a couple of the challenges I see when implementing observability and monitoring for distributed systems:

  • Correlating logs with metrics: depending on the platform you use to persist the data, the correlation may be very difficult or very manual, especially if you use one platform for logs and another for metrics. Unfortunately, you will very often end up correlating things manually by timestamp.
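
To make the manual timestamp correlation concrete, here is a minimal Python sketch of what it usually amounts to: taking the moment a metric spiked and searching the logs for entries close to it in time. Everything in it is hypothetical and only for illustration: the `LogEntry` type, the `logs_near` helper, the five-second window, and the sample data are assumptions, not part of any logging or metrics platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    timestamp: datetime
    message: str


def logs_near(logs, spike_at, window=timedelta(seconds=5)):
    """Return the log entries whose timestamp is within `window` of `spike_at`."""
    return [entry for entry in logs if abs(entry.timestamp - spike_at) <= window]


# Hypothetical sample data: log entries exported from one platform ...
logs = [
    LogEntry(datetime(2021, 1, 1, 12, 0, 1), "GET /orders started"),
    LogEntry(datetime(2021, 1, 1, 12, 0, 4), "timeout calling payment service"),
    LogEntry(datetime(2021, 1, 1, 12, 5, 0), "GET /health ok"),
]
# ... and a latency spike read off a dashboard in another platform.
spike_at = datetime(2021, 1, 1, 12, 0, 3)

# The only link between the two data sets is how close they are in time.
candidates = logs_near(logs, spike_at)
for entry in candidates:
    print(entry.timestamp.isoformat(), entry.message)
```

Note that the match is based purely on time proximity, so any unrelated request that happened to log in the same window is returned as well. Nothing in the data ties a given log line to the metric that spiked.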


Ricardo Linck

Software Engineer. Distributed systems lover, golang and .net enthusiast. Curious by nature. https://github.com/ricardolinck