OpenTelemetry as a Service

A migration guide to OpenTelemetry

Magsther
Sep 14, 2023 · 7 min read

Introduction

In this post, I’ll talk about OpenTelemetry as a Service (OaaS).

Introducing a new technology at an organisational level is rarely an easy task, which is why it's not uncommon for companies to end up using multiple tools that essentially do the same thing. Managing all these tools makes governance hard and prevents collaboration (with teams using different tools). On top of that, these tools usually cost a small fortune (especially for small and midsize companies).

Until recently, there hasn't been a good alternative to overcome these problems.

OpenTelemetry to the rescue

To help solve these problems, we can use this “new” phenomenon called OpenTelemetry.

Wait… ANOTHER tool, you may think? Well… yes, but this one is different (I promise).

The OpenTelemetry project aims to provide a vendor-agnostic way to instrument and collect your data. This means we no longer need a proprietary agent running on our machines; this is now handled by OpenTelemetry's SDKs/APIs and a collector.

By using OpenTelemetry, we decouple the instrumentation from the storage backends, which is great because it means we are not tied to any tool, thus avoiding lock-in to a commercial vendor.

We ask developers to instrument their code ONCE (without having to know where the data will be stored). The telemetry data (logs, metrics and traces) is sent to a collector that you OWN (more about that later), and from there you can send it to any vendor you like.
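
As a rough sketch of what this decoupling looks like in practice, an instrumented application typically only needs to be pointed at the collector, for example via the standard OTLP environment variables. The service name, namespace and endpoint below are assumptions for illustration, not values from this post:

```yaml
# Hypothetical snippet from a team's Kubernetes Deployment: the app only knows
# about the collector it sends to, never about the vendor behind it.
env:
  - name: OTEL_SERVICE_NAME
    value: checkout-service                          # hypothetical service name
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-agent.observability:4317      # the collector you own (assumed address)
```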

OpenTelemetry allows you to own the data that you generate rather than being stuck with a proprietary data format or tool.

This is a big reason why OpenTelemetry is beginning to change the observability landscape.

You can even use more than one vendor and compare them, without asking developers to change anything in their code.

If you want to know more about OpenTelemetry, I've written a hands-on guide to get you started.

More resources can be found at the bottom of this post.

First steps

Let’s say you have decided to give OpenTelemetry a chance. You and your (SRE) team have been tasked with investigating it. The team decides to take a systematic approach by following the Software Development Life Cycle (SDLC).

They start by talking to teams to gather requirements.

Usually, the biggest challenge is finding a way to make the transition as smooth as possible, which often means not bothering teams unless absolutely necessary in order to ensure a seamless user experience. The value that the teams deliver every day cannot be disrupted, so it is of the utmost importance that your team does its due diligence.

Collecting all the requirements and drawing up an architecture are some good first steps.

After some brainstorming sessions, the SRE team comes up with the idea of a self-service tool, where teams can onboard themselves in their own time with as low a barrier to entry as possible.

The SRE team also feels that this is a perfect opportunity to finally get the chance to do proper governance and get control of the data.

They begin designing the service, which will provide two things:

  1. An OpenTelemetry agent configuration
  2. A central OpenTelemetry Collector Service (gateways)

The agent (which is responsible for collecting the data) will be pre-configured with an endpoint, authentication and everything else teams need to send telemetry data to the gateway. On top of the pre-configuration, teams can tweak the configuration themselves (e.g. adding labels, processors, etc.).
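
A minimal sketch of what such a pre-configured agent configuration could look like follows. The gateway endpoint, header name and token variable are assumptions for illustration:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc:4317   # the central OaaS gateway (assumed address)
    headers:
      x-team-token: "${env:TEAM_TOKEN}"             # hypothetical per-team authentication
    tls:
      insecure: true                                # assuming in-cluster plaintext; enable TLS as needed
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/gateway]
```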

The other part, the crème de la crème, is the gateway, which we decide to call OpenTelemetry as a Service, or OaaS for short.

Once the design is done and reviewed by the stakeholders, the team starts with the implementation.

OpenTelemetry as a Service — OaaS

This central service will provide all teams with their own OpenTelemetry collector instance. These instances contain configuration that is managed and owned centrally (for example, by an SRE team) and serve as a gateway between the teams' agents and the storage backend (e.g. Grafana Cloud).

By running these instances in deployment mode, we get advantages like rollouts/rollbacks and autoscaling.
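
One possible way to run such a gateway in deployment mode is via the OpenTelemetry Operator (not covered in this post, so treat this as a sketch under that assumption); a custom resource along these lines gives you the rollout, rollback and scaling behaviour of a regular Kubernetes Deployment:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: team-a-gateway            # hypothetical per-team gateway instance
  namespace: observability
spec:
  mode: deployment                # run the collector as a Kubernetes Deployment
  replicas: 2
  config:                         # the centrally managed gateway configuration goes here
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      debug: {}                   # placeholder; real vendor exporters are shown later
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
```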

In each instance, we have receivers, processors and exporters.

A receiver is how data gets into the OpenTelemetry Collector. A receiver accepts data in a specified format, translates it into the internal format and passes it to the processors and exporters defined in the applicable pipelines. For example, you can have a Prometheus receiver configured to receive data in Prometheus format, or a filelog receiver that tails and parses logs from files.
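
For illustration, a receivers section with those two receivers might look like this (the scrape job and file path are placeholders):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: team-app              # hypothetical scrape job
          scrape_interval: 30s
          static_configs:
            - targets: ["app:8080"]       # hypothetical metrics endpoint
  filelog:
    include: [/var/log/pods/*/*/*.log]    # tail container log files
```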

A full list of supported receivers can be found here.

With the help of processors, we can enrich and manipulate the data that crosses the collector. For example, we can use the filter processor to exclude logs from certain pods, significantly reducing resource usage and helping with cardinality issues.
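
A hedged sketch of a filter processor that drops logs from a noisy namespace (the namespace and condition are made up for illustration):

```yaml
processors:
  filter/drop-noisy-logs:
    error_mode: ignore
    logs:
      log_record:
        # drop every log record coming from the (hypothetical) noisy namespace
        - 'resource.attributes["k8s.namespace.name"] == "kube-system"'
```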

Does your application send sensitive data that you need to protect? No problem, the transform processor can help with that.
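
For example, a transform processor could mask sensitive values before they ever leave the gateway; the pattern and replacement below are assumptions, not a recommendation:

```yaml
processors:
  transform/redact:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          # mask anything that looks like an e-mail address in the log body
          - replace_pattern(body, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+", "<redacted>")
```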

Is your application running on Kubernetes? The k8sattributes processor automatically enriches the resource attributes of spans, metrics and logs with metadata from Kubernetes.
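
An illustrative k8sattributes configuration that adds a few common Kubernetes attributes:

```yaml
processors:
  k8sattributes:
    extract:
      metadata:                   # attach these Kubernetes fields as resource attributes
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.node.name
```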

Processors like the batch processor and memory limiter will help you with the performance of the collector.

Remember that when you configure processors, the order matters!
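
Putting it together, a commonly recommended order is the memory limiter first and the batch processor last. A sketch, reusing the example processor names from above and a placeholder exporter:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    send_batch_size: 8192
    timeout: 5s

service:
  pipelines:
    logs:
      receivers: [otlp]
      # memory_limiter first, batch last, enrichment and filtering in between
      processors: [memory_limiter, k8sattributes, filter/drop-noisy-logs, transform/redact, batch]
      exporters: [otlphttp/backend]   # placeholder; exporters are covered next
```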

A full list of supported processors can be found here.

The exporters make sure that the data you want to visualise and analyse is available in the storage backend. You can easily add exporters to the configuration to support more use cases, all without asking teams to change anything in their application or configuration.
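
As an illustration only (the endpoints, header names and credential variables below are assumptions; check each vendor's documentation), exporters are defined once in the gateway configuration:

```yaml
exporters:
  otlphttp/grafana-cloud:
    endpoint: https://otlp-gateway-prod-eu-west-0.grafana.net/otlp   # example OTLP gateway URL
    headers:
      Authorization: "Basic ${env:GRAFANA_CLOUD_TOKEN}"              # assumed base64 credentials in an env var
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: "${env:HONEYCOMB_API_KEY}"                   # assumed API key in an env var
```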

A full list of supported exporters can be found here.

Here, Team A sends metrics to Grafana Cloud, traces to Lightstep and logs to Dataset, while Team B sends traces to Honeycomb and Team C sends traces to both SigNoz and Jaeger.
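
For Team A, such routing could be expressed as three pipelines in its gateway instance; the exporter names below are placeholders for the vendor-specific exporter configurations:

```yaml
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/grafana-cloud]   # metrics to Grafana Cloud
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/lightstep]           # traces to Lightstep
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/dataset]             # logs to Dataset
```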

Since each team runs its own isolated instance of the collector, the central service can accommodate all of these different routing needs.

All of this together ensures that the SRE team keeps control over the data stream, since senders never connect directly to a vendor.

Can’t we just send the telemetry data directly to the storage backend from the agents?

Yes, you can do that. BUT then you would have less control over the data, and migrating from vendor A to vendor B would be more difficult, as you would be reliant on the teams to make changes to their code.

Monitoring

The availability of the OpenTelemetry Service (and its instances) is crucial, since it's a central piece of the architecture. At the very least, you should monitor the collector(s) for data loss. Key recommendations for alerting and monitoring are listed here.
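
The collector can expose its own internal metrics, which is a natural place to start when watching for data loss. A sketch (the exact telemetry configuration keys vary between collector versions, so treat this as an assumption to verify against your version):

```yaml
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888   # Prometheus endpoint exposing the collector's own metrics
# Useful series to alert on include otelcol_exporter_send_failed_spans,
# otelcol_receiver_refused_spans and otelcol_exporter_queue_size.
```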

Obviously, you also need to carefully monitor your Kubernetes cluster(s). I like to use the Kube-Prometheus stack for this. You deploy it on your Kubernetes cluster and it will automatically collect metrics from all Kubernetes components. It comes pre-configured with a default set of alerting rules and Grafana dashboards.
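
With the Kube-Prometheus stack in place, a ServiceMonitor can scrape the collector's metrics endpoint; the labels and port name below are assumptions that must match your own setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-gateway
  namespace: observability
  labels:
    release: kube-prometheus-stack          # assumption: matches the Prometheus operator's selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: otel-gateway  # assumption: label on the collector's Service
  endpoints:
    - port: metrics                         # the port exposing the collector's own metrics (8888 above)
      interval: 30s
```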

A dashboard for the OpenTelemetry collectors can be found by going to Grafana's official and community-built dashboards page and searching for OpenTelemetry.

At this point, we have deployed:

  • An OpenTelemetry agent that is configured with endpoints and authentication ✅
  • OpenTelemetry as a Service up and running, ready to be consumed by the agents ✅
  • Monitoring of our cluster(s) and collector(s) ✅

Onboarding

Time to onboard the first team(s) to test the service.

During the first week(s), it can be practical to ship the data in parallel, that is, both to the existing solution and to the new one. Once you are comfortable that everything works, you can “switch off” the existing solution and use only the new one.
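
Shipping in parallel can be as simple as listing both backends as exporters in the same pipeline; the exporter names here are placeholders for your existing and new backends:

```yaml
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      # fan out to both the existing backend and the new one during the transition
      exporters: [prometheusremotewrite/existing, otlphttp/grafana-cloud]
```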

Regardless of how you do it, I always recommend starting small and testing how it works. At all times, stay in close contact with the team(s) during the onboarding, and make sure to provide tutorials, workshop sessions and URLs where they can visualise and analyse their data, and so forth.

Maintenance

Keep the (Grafana) dashboards that monitor all the different OpenTelemetry components readily available. The system we are monitoring has a single point of failure: the Kubernetes hosts that every instance is deployed on.

Look out for gaps in the data, CPU usage, queue length, receiver failures and total uptime (see Monitoring above).

Conclusion

In this post, we introduced OpenTelemetry as a Service, or OaaS for short.

If you find this helpful, please click the clap 👏 button and follow me to get more articles on your feed.
