Sicredi’s Path to OpenTelemetry Adoption

Igorestevanjasinski · Published in Sicredi Tech · 5 min read · Jun 24, 2024

Prefer to read this in Portuguese? Click here. 🇧🇷

Written by Gilberto Lupatini, Henrique Luis Schmidt, and Igorestevanjasinski

In the last few years, Sicredi has undergone a digital transformation and experienced exponential growth. As a result, migrating from legacy servers to a microservices architecture was a natural step. However, this transition made observability indispensable.

When we started our observability journey, our goals were to standardize our telemetry signals, increase context and correlation, and produce quality telemetry. Choosing OpenTelemetry to meet those goals was more than a no-brainer; it was strategic.

At Sicredi, we had already established a good culture around metrics, primarily using Prometheus and Influx, so the first signal we decided to standardize was traces. Still, along the way, we saw that doing distributed tracing without proper standardization of libraries, backends, and semantic conventions would be difficult. Aligning with one specific observability framework like OpenTelemetry was a must.

Choosing the hard path

Our first challenge was tackling instrumentation. Instead of starting with the typical approach of automatic instrumentation, we deliberately chose manual instrumentation. This decision allowed us not only to provide a solution but also to shift our observability practices left and promote a sense of ownership over observability.
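To make this concrete, here is a minimal sketch of what manual instrumentation with the OpenTelemetry API looks like in Java. The class, span, and attribute names are illustrative and not taken from our actual services:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class PaymentService {

    // Tracer obtained from the globally configured OpenTelemetry instance
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("com.example.payments");

    public void processPayment(String paymentId) {
        Span span = tracer.spanBuilder("process-payment").startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Business-specific attributes give the span useful context
            span.setAttribute("payment.id", paymentId);
            // ... business logic goes here ...
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
```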

One of our challenges was unifying existing metric and trace solutions, which had accumulated three versions of libraries and protocols, including Spring Cloud Sleuth, Zipkin, and OpenTracing. Given the deprecation of OpenTracing and Sleuth’s migration to Micrometer Tracing, we believed it was time to consider a replacement that could encompass all other libraries and configurations.

Since we primarily operate with Java and Spring Boot, we created an internal observability library. This library helps developers instrument their applications by configuring the OpenTelemetry API and SDK, simplifying interactions with telemetry signals and attributes. Considering the heterogeneous landscape of solutions and the different JDK and Spring Boot versions in use, creating a library that streamlines replacement seemed like the best approach. Indeed, it has proven effective in standardizing configurations and abstracting infrastructure details.

The first version of our library replaced all existing dependencies and the necessary settings in Spring’s configuration files, while maintaining compatibility with the Zipkin protocol to facilitate a smoother migration of projects and infrastructure. In the second version, we also adopted the OTLP protocol over HTTP for signal transmission.
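For illustration only, the snippet below sketches the kind of SDK bootstrapping such a library hides from developers, wiring a tracer provider to an OTLP/HTTP exporter. The classes and methods are from the public OpenTelemetry Java SDK; the service name and endpoint are placeholders, and our internal library’s actual API differs:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class ObservabilityBootstrap {

    private ObservabilityBootstrap() {}

    // Placeholder service name and endpoint; real values would come from Spring configuration.
    public static OpenTelemetry initialize(String serviceName, String otlpEndpoint) {
        Resource resource = Resource.getDefault()
            .merge(Resource.builder().put("service.name", serviceName).build());

        // Export spans in batches over OTLP/HTTP, e.g. to http://collector:4318/v1/traces
        OtlpHttpSpanExporter exporter = OtlpHttpSpanExporter.builder()
            .setEndpoint(otlpEndpoint)
            .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .setResource(resource)
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build();

        // Register globally so application code can use GlobalOpenTelemetry.getTracer(...)
        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .buildAndRegisterGlobal();
    }
}
```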

As more teams adopted our observability library, we expanded our support to other instrumentation libraries and frameworks, including Kafka, MongoDB, and JDBC. While not all projects have migrated yet, we are actively working on it, including the possibility of automatic codebase updates, which we’ll address in the future.

Swiss Army knife

OpenTelemetry allows us to send telemetry data to our backends directly from our microservices using the SDK, but the OpenTelemetry Collector lets us control volume, transform data, and enrich it in a centralized and scalable manner.

In the first phase, our goal with the Collector was to control the volume of data exported from our applications to the backend using techniques like probabilistic sampling, and to add attributes to our spans to differentiate telemetry coming from different locations, for example, an application emitting telemetry from our on-premises environment versus one running at a cloud provider.
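A rough sketch of what such a first-phase Collector pipeline can look like is shown below, combining the probabilistic sampler with the attributes processor. The endpoint and attribute values are placeholders, not our production configuration:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  probabilistic_sampler:
    sampling_percentage: 10          # keep roughly 10% of traces
  attributes:
    actions:
      - key: deployment.location     # illustrative attribute name
        value: on-premises
        action: insert

exporters:
  otlphttp:
    endpoint: https://tracing-backend.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, attributes, batch]
      exporters: [otlphttp]
```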

After that, the culture of using distributed traces to troubleshoot and understand system behavior began to grow at Sicredi. Then, in the second phase, our focus shifted from having a large amount of data available to having the right data available. We realized that more data doesn’t necessarily make our systems more observable; instead, the right data can get lost in a sea of thousands of spans generated per second. With this in mind, we decided to change our sampling strategy from head-based sampling to tail-based sampling.

OpenTelemetry Collector: two-layer deployment diagram

Tail-based sampling lets us sample traces based on policies by waiting for a trace to complete before deciding whether to export it to the backend. It was a huge game changer for us because it allowed us to set policies based on latency, errors, or specific span attributes. Additionally, we were able to keep using the probabilistic sampler to retain a small percentage of overall traffic. This approach made the right data available to developers while still controlling the volume of data exported.
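Below is a simplified sketch of a tail-sampling configuration in that spirit, extending the pipeline from the earlier sketch. The policies, thresholds, and percentages are illustrative, not the values we run in production:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s              # hold spans until the trace is likely complete
    num_traces: 50000               # in-memory budget for pending traces
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5    # retain a small share of everything else

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, attributes, batch]
      exporters: [otlphttp]
```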

Next Steps

Scaling Instrumentation

While we initially implemented manual instrumentation, its scalability challenges and the varying priorities across teams prompted us to consider automatic instrumentation. By using the OpenTelemetry operator and agent injection, we can automate this process, which aligns well with the programming languages in our tech stack. Although there’s a risk of instrumentation overload, the operator provides effective mitigation techniques.
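As an illustration (resource names, endpoint, and image are placeholders), agent injection with the OpenTelemetry Operator essentially comes down to an Instrumentation resource plus a pod annotation on the workload:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317   # placeholder Collector endpoint
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-service                   # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: payments-service
  template:
    metadata:
      labels:
        app: payments-service
      annotations:
        # asks the operator to inject the Java auto-instrumentation agent
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: app
          image: registry.example.com/payments-service:1.0.0   # placeholder image
```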

This dual approach offers significant advantages: manual instrumentation remains valuable for its flexibility in customizing observability signals, while teams lacking the time, or those satisfied with the baseline provided by automatic instrumentation, can forgo the manual setup and simplify their development process.

Correlation

A key development for us is to generate all major telemetry signals in OTLP format. One of our next objectives is to produce traces, metrics, and logs with uniform metadata, facilitating correlation across these signals. The OpenTelemetry operator can significantly aid us in achieving this.

We’ve already observed the benefits of correlating traces and logs. Now, we aim to enhance this by incorporating exemplars and other correlation techniques, adding richer context to our telemetry data. The focus is on empowering our teams to identify and resolve issues within their systems more effectively.
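As one small example of what this correlation looks like in code, the sketch below (a hypothetical helper, not our library’s API) stamps the active trace and span IDs onto log records through SLF4J’s MDC so logs can be joined with traces in the backend:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import org.slf4j.MDC;

// Illustrative helper (hypothetical): copies the active trace context into the
// logging MDC so log lines carry trace_id/span_id and can be joined with traces.
public final class TraceLogCorrelation {

    private TraceLogCorrelation() {}

    public static void enrichMdcFromCurrentSpan() {
        SpanContext ctx = Span.current().getSpanContext();
        if (ctx.isValid()) {
            MDC.put("trace_id", ctx.getTraceId());
            MDC.put("span_id", ctx.getSpanId());
        }
    }
}
```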

Game Day

OpenTelemetry has significantly contributed to cultivating an observability culture within our company, yet there’s still room for growth. A promising idea is to organize ‘Game Days’ to better understand team dynamics during incident response. These exercises can reveal how teams utilize telemetry signals to pinpoint issues. A notable example of a successful Game Day is highlighted in a Skyscanner blog post on the OpenTelemetry page. Insights from these sessions could be invaluable in tailoring training for developers, DevOps, and other team members based on their performance and strategies employed during these simulations.

It is clear that OpenTelemetry has been an extremely important factor in our observability journey, and as the project evolves, we will continue to evolve with it.
