OpenTelemetry Journey

The Highs, Lows, and Lessons Learned

Magsther
Sep 25, 2024

Introduction

When I first heard about OpenTelemetry, I was impressed by its promise as a new “observability framework” that aimed to simplify data collection from various sources and formats (for example, OTLP, Jaeger, and Prometheus, as well as many commercial/proprietary tools) and send it to one or more backend systems where the data is stored, analyzed, and visualized.

As the second most active CNCF project after Kubernetes, OpenTelemetry has gained significant credibility. In many organizations, monitoring and logging are handled by different tools for different data types. For instance, you might use Grafana Cloud for metrics, Honeycomb or Lightstep for traces, and Splunk or Elastic for logs. This fragmentation can lead to inefficiencies and make it difficult to obtain a complete view of your application’s performance.

OpenTelemetry aims to solve this by providing a unified standard for logs, metrics, and traces, enabling smoother collaboration across tools. Rather than juggling multiple tools for different signals, you can manage them all in one place, reducing the hassle of dealing with multiple systems and helping you avoid vendor lock-in.

My journey with OpenTelemetry began with excitement and curiosity. As someone who has spent countless hours troubleshooting and monitoring services both as an Incident Commander and a Subject Matter Expert, the idea of a unified framework for logs, metrics and traces sounded promising. So, I dove in headfirst, eager to learn, experiment, and, of course, face the challenges that come with any new technology.

In this post, I’ll share my journey with OpenTelemetry, covering both its strengths and the hurdles encountered. Whether you’re new to OpenTelemetry or have some experience, I hope my reflections on the good, the bad, and everything in between will give you a deeper understanding of what it’s like to work with this powerful tool.

Understanding Observability

You’ve probably heard the phrase before: Observability is about gaining a comprehensive understanding of a system’s internal state by monitoring its external outputs. The three primary “signals” or pillars used to achieve this are traces, metrics, and logs.

Traces give a step-by-step view of a request’s journey through your system.

Metrics provide quantifiable data about system performance and resource usage.

Logs record events and details that help with understanding system state and debugging.

Understanding OpenTelemetry

OpenTelemetry, or “OTel,” is an open-source observability framework. It collects data from your application and infrastructure through various means, including instrumentation libraries, exporters, and receivers such as the filelog and prometheus receivers, and sends it to a backend system of your choice for storage and analysis, either directly or through an OpenTelemetry Collector.
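To make this concrete, and to tie it back to the three signals above, here is a minimal sketch using the OpenTelemetry Python API. The service name, span, counter, and order ID are invented for illustration, and without an SDK configured these calls simply fall back to no-op implementations.

```python
import logging

from opentelemetry import trace, metrics

# Hypothetical instrumentation scope for an order-processing service.
tracer = trace.get_tracer("checkout")
meter = metrics.get_meter("checkout")
orders_counter = meter.create_counter("orders_processed")
logger = logging.getLogger("checkout")


def process_order(order_id: str) -> None:
    # Trace: one span describing this step of the request's journey.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # Metric: a counter quantifying how many orders were handled.
        orders_counter.add(1, {"payment.method": "card"})
        # Log: an event recorded for later debugging.
        logger.info("processed order %s", order_id)


process_order("1234")
```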

To better understand the instrumentation journey, I recommend checking out Honeycomb’s blog post titled “The Real Instrumentation Journey”. This post outlines the essential steps in the instrumentation process, which can be incredibly helpful as you begin working with OpenTelemetry.

What really sets OpenTelemetry apart is that it’s vendor-agnostic and provides a standardized way to collect observability data. Whether the application is written in Java, Python, or Go, OpenTelemetry can handle it.

I like to think of it as the Swiss army knife of observability.

Why We Chose OpenTelemetry

Using and managing multiple tools during incident troubleshooting can be frustrating. Logging into several observability platforms — sometimes for the same type of signal — was inefficient. Each tool had its own login, and there were times when, just as things were getting critical, I realized I didn’t have the right access.

The flexibility of being vendor-agnostic, meaning we can use any tool or service for collecting and analyzing data without being tied to a specific provider, is extremely valuable. This is especially true in large organizations, where switching tools is nearly impossible without a huge effort because it is both complex and costly. OpenTelemetry changes the game by letting you replace vendor-specific agents with OTel agents and collectors, giving you the flexibility to choose different observability platforms without the fear of having to rebuild everything from scratch. I realized that if we did this right, we wouldn’t need to redo the work the next time the platform changed.

The Highs: Wins Along the Way

My first impression of OpenTelemetry was overwhelmingly positive. Getting started was surprisingly straightforward: deploying the OpenTelemetry agent, collecting data from applications and clusters, and sending it through the collector to a backend all went smoothly. The getting-started examples in the documentation made it easy to get things up and running.

The auto-instrumentation for languages like Python and Java worked exceptionally well. It was impressive to see how much valuable data I could gather with minimal effort. Since I mostly use Java and Python, it was great to see that these languages are well-supported by OpenTelemetry. There are many other supported languages, but I haven’t explored them all yet.
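For Python, the zero-code path wraps your process with the opentelemetry-instrument command; the sketch below shows the programmatic equivalent using instrumentation libraries, for a hypothetical Flask service (the app and route are invented for the example).

```python
from flask import Flask

# Instrumentation libraries are installed separately, e.g.
# opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests.
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Create a span for every incoming HTTP request handled by Flask.
FlaskInstrumentor().instrument_app(app)
# Create spans (and propagate trace context) for outgoing calls made with `requests`.
RequestsInstrumentor().instrument()


@app.route("/health")
def health():
    return "ok"


if __name__ == "__main__":
    app.run(port=8080)
```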

Most vendors support OTLP (the OpenTelemetry Protocol), the native protocol for transmitting OpenTelemetry data. The vendors I tried had excellent guides to help get your data visible in their backends. Kudos to them for their support and for making the integration process user-friendly.
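As a reference point, this is roughly what pointing the Python SDK at an OTLP endpoint looks like. The endpoint assumes a collector (or an OTLP-capable vendor endpoint) listening locally on the default gRPC port 4317, and the service name is invented.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe this service; backends group telemetry by these resource attributes.
resource = Resource.create({"service.name": "payment-service"})

provider = TracerProvider(resource=resource)
# Batch spans and export them over OTLP/gRPC to a local collector,
# which then forwards them to whichever backend(s) you have configured.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```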

One of my favourite components of OpenTelemetry is the OpenTelemetry Collector. It gathers data from all your sources in one central place, allowing you to manage everything from a single spot. The collector also enables you to enrich and manipulate your data before sending it to your backend (e.g., Grafana Cloud, Honeycomb, Datadog, Splunk, Lightstep).

For instance, you can add labels like service_name, team_name, and environment, which provide clearer insights and make it easier to track down issues.

The processors add even more useful capabilities. The k8sattributes processor can automatically insert Kubernetes-specific information, like pod names and namespaces, giving better context to the data. The transform processor allows you to modify or redact sensitive data, such as masking PII or adjusting formats. The resource processor adds extra metadata or tags, while the filter processor helps reduce unnecessary logs, cutting down on data volume and, in turn, costs.
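These processors live in the collector and are configured declaratively there. Purely to illustrate the enrichment idea in code, here is an SDK-side analogue (not the collector mechanism itself): a hypothetical span processor in Python that stamps similar attributes onto every span at creation time.

```python
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider


class EnrichingSpanProcessor(SpanProcessor):
    """Adds team/environment attributes to every span (illustrative only)."""

    def on_start(self, span, parent_context=None):
        # The span is still mutable at start time, so attributes can be added here.
        span.set_attribute("team.name", "payments")
        span.set_attribute("deployment.environment", "staging")

    def on_end(self, span):
        # Nothing to do once the span has ended.
        pass


provider = TracerProvider()
provider.add_span_processor(EnrichingSpanProcessor())
```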

Overall, I found OpenTelemetry to work really well, and it has truly sparked my enthusiasm for what’s possible in observability.

The Lows: Challenges Faced

However, there were challenges. OpenTelemetry’s complexity can be overwhelming, especially for newcomers, and the learning curve steepens as you move from basic setups to real-world, large-scale scenarios. Some people compare it to the “Kubernetes of observability”.

Documentation: The documentation for OpenTelemetry could be more user-friendly. Compared with some proprietary tools, it can be less clear and harder to navigate, and it often assumes prior knowledge of OpenTelemetry, which raises the barrier for newcomers.

Migration Challenges: Moving to OpenTelemetry involves more than just switching tools. It requires reconfiguring dashboards, alerts, and legacy data. Adapting existing workflows and dealing with historical data can be tricky.

Limited Knowledge and Support: Finding expert help for OpenTelemetry can be difficult. While some cloud vendors offer support, they tend to focus on their own customized versions, which can detract from OpenTelemetry’s vendor-agnostic advantage.

Finding expert support for the standard version of OpenTelemetry remains a challenge, and vendor-specific solutions often don’t fully address this gap. These challenges are common in evolving technologies, but they present opportunities for growth and learning within the community.

Key Lessons Learned

Migrating to OpenTelemetry isn’t just a switch — it’s a commitment that requires thorough planning and careful execution. This is especially true if you have custom setups or specific needs. It’s important to start thinking ahead about how to manage the migration and where to get help if needed. The open-source nature of OpenTelemetry offers a cost-effective alternative to proprietary solutions but comes with its own set of challenges.

OpenTelemetry Best Practices (Dos and Don’ts)

Dos:

  • Start small: Start small by testing on a limited set of services to minimize risk.
  • Use auto-instrumentation: Save time with supported languages.
  • Use the community: Don’t hesitate to ask for help from the OpenTelemetry community, for example via Slack.

Don’ts:

  • Don’t rush into full-scale deployment: Understand components before scaling.
  • Don’t get locked into a single backend: Use OpenTelemetry’s vendor-agnostic capabilities.

The Future of OpenTelemetry

As OpenTelemetry continues to mature, it’s poised to become an even more essential tool for organizations looking to standardize observability.

Enhanced Documentation: More user-friendly and comprehensive documentation will help both new and experienced users. Clearer guides, more examples, and better walkthroughs will lower the barrier to entry.

Streamlined Onboarding: Simplifying the setup process will help new users get started more quickly and with less confusion. Right now, OpenTelemetry offers many configuration options, which can be overwhelming for beginners. A more guided and intuitive onboarding process — perhaps with more out-of-the-box configurations — could help users set up their observability stack faster, similar to some proprietary tools.

Broader Expertise: As the OpenTelemetry community grows, so will the pool of professionals with deep expertise in the tool. Increased training resources, certifications, and more widespread adoption in the tech industry will ensure that finding OpenTelemetry experts becomes easier. This will improve support for organizations implementing OpenTelemetry and drive more innovation in the ecosystem.

OpenTelemetry as a Service: Another exciting opportunity would be for vendors to offer OpenTelemetry as a Service. This would be extremely useful for organizations that lack the resources or expertise to implement and manage OpenTelemetry themselves. Having vendors provide this as a managed service would lower the barrier to adoption and enable more companies to benefit from OpenTelemetry’s power without the heavy operational lift.

Community and Vendor Collaboration: Collaboration between the open-source community and vendors will be key to OpenTelemetry’s continued success. Encouraging cloud vendors to align with OpenTelemetry’s standards while still offering extended functionality can help maintain its vendor-agnostic appeal. I think this will be essential for success.

Conclusion

My journey with OpenTelemetry has been a mix of highs and lows — ranging from the elegance of its vendor-agnostic approach to the power of the OpenTelemetry collector, alongside challenges like complexity, documentation gaps, and migration hurdles.

Yet, despite these difficulties, OpenTelemetry shines through for its ambitious promise: to unify logs, metrics, and traces under a single standard, addressing one of the biggest issues in observability — the fragmentation of data across disparate tools.

While OpenTelemetry isn’t without its flaws, it offers key advantages, especially the freedom to choose and switch between tools without the need to rebuild your entire setup. This flexibility is one reason why many large companies have adopted it, despite its limitations. As the ecosystem continues to mature, with improvements in onboarding, support, and performance, I’m confident that OpenTelemetry will remain a game-changer for organizations looking to enhance their monitoring and troubleshooting capabilities.

I also believe that the OpenTelemetry community is doing its best to support users, but covering every possible use case is a huge challenge. Rather than spreading too thin, it’s essential for the community to maintain focus on OpenTelemetry’s core principles.

Whether you’re just starting out or considering migrating to OpenTelemetry, be prepared for some bumps along the way. However, the benefits of embracing a unified, vendor-agnostic observability framework far outweigh the challenges. I encourage everyone to dive into the OpenTelemetry community and explore the wealth of resources available. Together, we can shape the future of observability.

If you find this helpful, please click the clap 👏 button and follow me to get more articles on your feed.

Note: The information provided here represents my own viewpoints and not those of my employer or any organization I’m affiliated with.
