Demystifying OpenTelemetry: A Beginner’s Guide to Modern Observability

Pradyuman
Published in Debugging Diaries
8 min read · Apr 30, 2024

Introduction to Observability

I have been asked “What do you mean by observability?” quite a few times now. I have settled on three levels of answer, depending on who is asking:

  1. Non-Technical Person: It is a way to check how an application or server is performing.
  2. Technical (but not DevOps): An addition to the monitoring system.
  3. DevOps People: A combination of 4 pillars: Metrics, Alerts, Logs, and Traces, also known as MALT.

So is it really important for a system to have all these pillars at full strength? Short answer: definitely not.

It all depends on your requirements. If you can afford downtime on your website, I guess you don’t need any of these systems. If the system goes down, your customers can let you know about it, and then you can start hunting for a bug that could be anywhere in your code base.

Doesn’t sound good. So now you know why there is a need for observability.

The Evolution of Observability Tools

Observability has come a long way from where it started. In the beginning, we relied entirely on users complaining that the website was down, and only then did we check the system. Since waiting for users to report your system’s status was clearly not a good idea, developers started using separate servers to check the status (up/down) of the server on which the application was running.

It was a good start, but it didn’t reveal the root cause of why the system went down. To solve this problem, two major things were introduced: logs and metrics. Logs gave information about what went wrong in the application, while metrics gave information about the performance of the hardware.

As time went by, alerts came into the picture so that nobody had to watch the metrics continuously and inform the concerned person. If your system starts behaving weirdly, it notifies the concerned person automatically, reducing human effort.

To reduce the resolution time for an error, traces were also introduced. They provided a deeper understanding of how a request flows through the system, helping us pinpoint the line of code that might be causing the issue.

Understanding OpenTelemetry

So now you might be wondering: if all this evolution had already happened, what was the need for OpenTelemetry?

The problem with all these pillars was that everyone was doing it in their own way. There was no standard. OpenTelemetry came into the picture and provided a standard to follow for observability. It follows 2 key principles:

  1. You own the data that you generate. There’s no vendor lock-in.
  2. You only have to learn a single set of APIs and conventions.

Although it started with a bare minimum number of platforms you could export your data to, it has now grown to a mind-boggling set of over 45 export destinations.

How OpenTelemetry Works

Figure: OTEL Collector

There are three main parts of OpenTelemetry:

  1. Data Collection (Receivers in OTEL Collector)
  2. Data Processing (Processors in OTEL Collector)
  3. Data Exporting (Exporters in OTEL Collector)
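The three parts listed above can be pictured as a tiny pipeline. What follows is only a toy sketch in plain Python, not the real Collector: `ToyCollector`, `stamp`, and the record shape are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A telemetry record reduced to the bare minimum for illustration.
Record = Dict[str, str]


@dataclass
class ToyCollector:
    """Hypothetical stand-in for a collector pipeline: receive -> process -> export."""
    processors: List[Callable[[List[Record]], List[Record]]] = field(default_factory=list)
    exporters: List[Callable[[List[Record]], None]] = field(default_factory=list)

    def receive(self, records: List[Record]) -> None:
        # 1. Data collection: the receiver accepts incoming telemetry.
        for process in self.processors:
            # 2. Data processing: each processor returns a new list of records.
            records = process(records)
        for export in self.exporters:
            # 3. Data exporting: every configured exporter gets the processed data.
            export(records)


def stamp(records: List[Record]) -> List[Record]:
    """Example processor: mark each record as having passed through the pipeline."""
    return [{**r, "pipeline": "toy"} for r in records]


exported: List[Record] = []
collector = ToyCollector(processors=[stamp], exporters=[exported.extend])
collector.receive([{"msg": "payment failed"}])
print(exported)
```

The point of the shape is that receivers, processors, and exporters are independent pieces, so any one of them can be swapped without touching the others.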

In a broader sense, the OTEL Collector works much like what we were taught in school about how computers work: it takes in input, processes it, and gives the desired output. Now let’s go deeper into each of these parts:

Data Collection

Even though the OTEL Collector can accept data from many sources, OpenTelemetry also has its own protocol, known as OTLP (OpenTelemetry Protocol). Alongside the protocol, OpenTelemetry provides SDKs, APIs, and tools to instrument, generate, collect, and export telemetry data (metrics, traces, and logs).

This data collection is done by instrumenting an app. Major languages such as Java and Python have auto-instrumentation, which means the user just has to run a command and it will automatically detect the libraries in use and start generating telemetry data for them. OpenTelemetry also supports other languages, where you can add a few lines of code to your application to generate telemetry data.

This generated data is then forwarded to the Receiver of the OTEL Collector, which passes it on to the processors.
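To get a feel for what "adding a few lines of code" to instrument an app means, here is a minimal sketch in plain Python. It does not use the OpenTelemetry SDK; the `traced` decorator and the `recorded_spans` buffer are hypothetical names, and a real SDK would record far more (IDs, attributes, parent spans) and ship the data to a receiver rather than keep it in a list.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory buffer standing in for telemetry that would be sent to a receiver.
recorded_spans: List[Dict[str, Any]] = []


def traced(name: str) -> Callable:
    """Hypothetical decorator: wrap a function and record one span per call."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both success and failure, so every call leaves a span.
                recorded_spans.append({
                    "name": name,
                    "status": status,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("checkout")
def checkout(cart_total: float) -> str:
    return f"charged {cart_total:.2f}"


result = checkout(19.99)
print(result, recorded_spans[0]["name"], recorded_spans[0]["status"])
```

Auto-instrumentation does roughly this wrapping for you, for the libraries it recognises, without you editing the function.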

Data Processing

Once data is received here, it is polished and refined based on configuration. We can do multiple things with the received data, for example, adding some more metadata, filtering out useless data, batching data, etc. Once these operations are done, data is ready to be exported.
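The three operations mentioned above (adding metadata, filtering, batching) can be sketched as small functions. This is an illustration in plain Python, assuming records are simple dictionaries; the function names and the `/healthz` route are made up for the example and are not Collector components.

```python
from typing import Dict, Iterator, List

Record = Dict[str, str]


def add_metadata(records: List[Record], extra: Dict[str, str]) -> List[Record]:
    """Attach extra attributes (for example the deployment environment)."""
    return [{**r, **extra} for r in records]


def drop_health_checks(records: List[Record]) -> List[Record]:
    """Filter out records that carry no diagnostic value in this example."""
    return [r for r in records if r.get("route") != "/healthz"]


def batch(records: List[Record], size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


incoming = [
    {"route": "/cart"}, {"route": "/healthz"}, {"route": "/pay"}, {"route": "/cart"},
]
processed = add_metadata(drop_health_checks(incoming), {"env": "staging"})
batches = list(batch(processed, size=2))
print(batches)
```

Filtering before enriching and batching keeps the later steps from doing work on data that will be thrown away anyway.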

Data Exporting

Since the OTEL Collector supports multiple destinations for sending data, exporting is a really important part of it. Based on the configuration, we export data to the desired platform, such as SaaS platforms (New Relic, Datadog, etc.), databases (ClickHouse, Cassandra, etc.), or streaming platforms (Kafka, RabbitMQ, etc.).
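A rough way to picture "based on the configuration" is a lookup from exporter names to functions. The sketch below is plain Python with hypothetical names (`EXPORTERS`, `console_exporter`, `summary_exporter`); it does not talk to any real backend and is not how the Collector itself is configured.

```python
import json
from typing import Callable, Dict, List

Record = Dict[str, str]
Exporter = Callable[[List[Record]], str]


def console_exporter(records: List[Record]) -> str:
    """Render records as JSON lines, as a debugging destination might."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def summary_exporter(records: List[Record]) -> str:
    """Stand-in for a remote backend: only reports how many records it got."""
    return f"sent {len(records)} record(s)"


# Hypothetical registry; the names are illustrative, not real components.
EXPORTERS: Dict[str, Exporter] = {
    "console": console_exporter,
    "summary": summary_exporter,
}


def export(records: List[Record], config: Dict[str, List[str]]) -> Dict[str, str]:
    """Send the same records to every exporter named in the configuration."""
    return {name: EXPORTERS[name](records) for name in config["exporters"]}


records = [{"service": "cart", "status": "ok"}]
print(export(records, {"exporters": ["console"]}))
print(export(records, {"exporters": ["console", "summary"]}))
```

Changing or adding a destination here means editing the configuration dictionary only; the code that produced `records` stays untouched.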

Benefits of OpenTelemetry

Now you might have this big question in mind: “Why would I want to increase my work by adding another layer to my current observability solution?”

It is a fair question: anything we add to our system should provide us with some value. Here’s what OpenTelemetry has to offer us:

Standardization

OpenTelemetry has become an industry-wide standard for observability solutions, and major players like New Relic, Datadog, and Splunk have started adopting it, which flattens the learning curve when adopting a new tool.

Interoperability

Since OpenTelemetry is based on the principle of “from any source to any destination”, it gives developers the flexibility to choose from various data collection methods and the DevOps team the flexibility to choose from various visualization tools.

Avoid Vendor Lock-in

The biggest reason organizations cannot move to the newer, more advanced tools available in the market is vendor lock-in. Their existing monitoring solutions are so deeply embedded in their systems that exiting them is a complete nightmare.
However, implementing OpenTelemetry helps you get rid of this problem. Since you don’t have to change the source code to change your vendor, switching becomes a matter of a few configuration changes.

Community and Ecosystem

OpenTelemetry is a CNCF-backed project and has active community support. They have monthly end-user meetings for resolving issues faced by users, very prompt replies to user queries on their Slack channels, and well-documented code. All of these combined make OpenTelemetry easy to adopt.

Getting Started with OpenTelemetry

I could have started a project from scratch and shown you how to implement OpenTelemetry in it, but the maintainers of the OpenTelemetry project have already done this job well for almost all the languages. Here’s how they have done it:

Figure: Architectural Diagram of Demo

In this demo, they have created various microservices in different languages, just like in any organization. In each of these microservices, they have instrumented the code to send telemetry data to the OpenTelemetry Collector.

Figure: Data Flow Diagram

These services send the data to the OpenTelemetry Collector using OTLP; the Collector processes it and sends it on to the next stage. For the detailed implementation, you can refer to their documentation.
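To make "sending data using OTLP" a little more concrete, here is a sketch that builds a single-span request body in the OTLP JSON encoding without actually sending it. It assumes a Collector listening on the default OTLP/HTTP port on localhost; the service name, span name, scope name, and `build_trace_payload` helper are made up for illustration, and the real SDKs build and send these payloads for you.

```python
import json
import secrets
import time
from typing import Any, Dict

# Default OTLP/HTTP traces endpoint, assuming a collector running locally.
OTLP_HTTP_TRACES_URL = "http://localhost:4318/v1/traces"


def build_trace_payload(service_name: str, span_name: str) -> Dict[str, Any]:
    """Build a single-span request body in the OTLP JSON encoding."""
    end_ns = time.time_ns()
    start_ns = end_ns - 5_000_000  # pretend the operation took 5 ms
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                ],
            },
            "scopeSpans": [{
                "scope": {"name": "manual-example"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 2,  # SPAN_KIND_SERVER
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }


payload = build_trace_payload("cart-service", "GET /cart")
body = json.dumps(payload)
print(f"would POST {len(body)} bytes to {OTLP_HTTP_TRACES_URL}")
```

The nesting mirrors the data flow: a resource (the service) owns scopes (the instrumenting library), and each scope owns the spans it produced.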

Challenges and Considerations

Implementing OpenTelemetry is not a cakewalk. It has its own set of challenges for implementation and adoption.

Learning Curve

Even though OpenTelemetry reduces the learning curve for future changes, the curve is really steep when we are just getting started. You have to learn a lot of terminology, language-specific instrumentation, and so on, which makes it not that easy to adopt.

Performance Overhead

If not implemented correctly, instead of helping us improve the performance of the application, it will degrade it. This could become a bottleneck for a lot of systems.
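One lever for keeping that overhead in check is sampling, that is, recording only a fraction of traces instead of all of them. Below is a minimal sketch in plain Python, assuming a ratio-based decision derived from the trace ID. The `should_sample` function is a hypothetical name and this is not the SDK's own sampler, just the general idea.

```python
import secrets


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deciding from the trace ID alone.

    Because the decision depends only on the trace ID, every service that
    sees the same trace reaches the same verdict.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the ID as a uniformly distributed number.
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < int(ratio * 2**64)


trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

With a ratio of 0.1 the application does roughly a tenth of the export work, at the cost of not having every trace available when something goes wrong.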

Migrating from Legacy Systems

Updating legacy systems can be tricky. Bringing in new tech like OpenTelemetry might hit compatibility bumps. However, tackling these challenges head-on sets the stage for better system insights and performance.

It’s really important to follow some industry-wide best practices while adding OpenTelemetry to your system, so that you avoid pitfalls. Some of them are:

  • Choose automated context propagation through OpenTelemetry’s instrumentation libraries.
  • Find a sampling strategy that fits your use case.
  • Batch and compress telemetry based on size or time so that it is cheaper to ship and faster to query.
  • Ensure you can correlate this data seamlessly so that you can jump to the correct data no matter what backend it’s stored in.

And many more that you can find on the internet.
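As an illustration of the first practice in the list above, context propagation boils down to passing the trace ID along with each outgoing request so the next service can continue the same trace. The sketch below hand-rolls the W3C `traceparent` header in plain Python purely to show the idea; `inject` and `extract` are hypothetical helpers, and the validation is simplified (for instance, all-zero IDs are not rejected). In practice the instrumentation libraries do this for you, which is exactly why the list recommends them.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Read (trace_id, parent_span_id, sampled) from incoming headers, if valid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Service A starts a trace and calls service B.
trace_id, span_a = secrets.token_hex(16), secrets.token_hex(8)
outgoing: Dict[str, str] = {}
inject(outgoing, trace_id, span_a)

# Service B reads the header and can continue the same trace.
context = extract(outgoing)
print(context is not None and context[0] == trace_id)
```

Because both services end up with the same trace ID, their spans can be stitched together later no matter which backend stores them.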

Future of Observability with OpenTelemetry

OpenTelemetry has come a long way from where it started, but it still has a long way to go. It has many interesting initiatives on its roadmap; here are the ones I am personally interested in:

Client Instrumentation (RUM)

RUM (Real User Monitoring) is really important in helping developers get better insights into the application. Seeing this on the OpenTelemetry roadmap makes me excited.

Profiling

Profiling comes in handy when working on the performance of an application. It has been a topic of discussion among OpenTelemetry contributors for quite some time, and seeing it on the roadmap makes me more bullish about it.

Innovation isn’t just about introducing a shiny new feature; it’s about orchestrating a symphony of elements that harmonize to drive adoption. OpenTelemetry’s meteoric rise isn’t merely a solo act of development prowess; it’s a testament to the ecosystem rallying around it. Organizations are embracing it with open arms, recognizing its transformative potential in reshaping our perception of observability.

The tide seems to be turning in OpenTelemetry’s favor, as its adoption surges across industries at a breathtaking pace. It’s not just another tool; it’s the cornerstone of a new observability paradigm. OpenTelemetry isn’t just a contender; it’s swiftly becoming the undisputed standard bearer of the observability stack.

Conclusion

If you’ve journeyed this far, your curiosity about OpenTelemetry must be piqued. I’ve endeavored to provide a balanced explanation of OpenTelemetry — not delving too deep into the technical minutiae yet not skimming the surface too lightly either. My aim is for you to glean some fresh insights from this discourse, perhaps even sparking a newfound interest or understanding in the realm of OpenTelemetry.

Hey there! I’m excited to share with you the groundbreaking product we’re crafting at CTRLB. We’re dedicated to helping organizations slash their observability costs while maximizing efficiency. Take a moment to explore what we’ve been working on — it could be a game-changer for your team. Check it out here!


Architect of Insight: A Devotee to the Art of Observability | Building Next-Gen Observability Solutions @ CtrlB 🔍