Insights from building a scalable distributed tracing platform for adidas

Juan Luis Gonzalez
Published in adidoescode
5 min read · Nov 17, 2023
Photo by Robynne Hu on Unsplash

Why Tracing, why now.

Imagine you are a developer, or an Ops or DevOps specialist. You have a ticket assigned to you for a strange performance problem. Where would you start looking? Where is it happening? Is it within your service, or somewhere else? The issue might not even reside in your service.

Microservices architecture arrived like a wrecking ball about a decade ago, bringing many benefits and also significant challenges. Some of the changes it brought are here to stay: modern applications are now a complex mix of languages, locations, cloud services and more, all working together. It is crucial that we understand what’s happening within this increasingly complex ecosystem, and logs alone are not enough for that.

Photo by Sašo Tušar on Unsplash

The Third signal

Traditionally there are three pillars in observability theory: logs, metrics, and traces.

While metrics and logs already had some standardisation, tracing was the least mature of the three.

The community recognized the challenge and converged on a standard: OpenTelemetry, a project hosted by the Cloud Native Computing Foundation (CNCF). This initiative integrates logs, metrics, and traces under a unified open-source umbrella.

What we had already

Here at adidas we had already implemented a platform based on OpenSearch and Prometheus (Victoria Metrics) for our logs and metrics: centralised storage, with the teams empowered to manage their own data. The solution has been highly reliable for the last few years, and we wanted to add traces to it.

Even though we already have a third-party tracing solution that provides an auto-discovery observability system, we wanted to build an open-source solution that could leverage the centralised logs and metrics and give teams greater power and flexibility in managing their data.

Our solution

Photo by Scott Graham on Unsplash

OpenTelemetry

While building the platform, we found some limitations in the online documentation. It is complete, but not very specific about the breaking changes between versions. Additionally, some use cases were not ready or supported, even though they seemed obvious based on the configuration and documentation.

Identifying the main features, workflows and transformations that your traces need, and building POCs to confirm that they are actually viable, was a key part of the process.

Grafana Tempo

But how do you view your traces? You need a tool for that. After researching various options for trace visualisation, we chose Grafana Tempo. It’s one of the most popular tracing visualization tools, fully integrated into Grafana dashboards. Having it within an application we all know and use is a significant advantage.

Because Grafana Tempo integrates seamlessly with Grafana, you can search your traces by different parameters to find the results you need. It is the backbone of our trace tracking system and can be used to create dashboards and set up alerts.

After doing some research on the technology, the standards, and our own needs, we designed the following architecture:

What have we built here

1. The teams should be responsible for instrumenting and configuring their traces and the agents that transport them. They need to own their data. Their agents should do as little data processing as possible; try to delegate that processing to the central pipelines or to the code instrumentation.

3. Many post-processing activities, such as filtering or sampling, can have some impact on performance. As a platform, we aim to provide a stable entry point for all the teams’ traces. Teams can manage the configuration of their own pipelines.

4. The traces travel safely to the backend, Grafana Tempo in this case.
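To make the sampling step a little more concrete, here is a minimal Python sketch of one common approach: deciding from the trace ID alone whether to keep a trace, so that every span of the same trace gets the same verdict no matter which team emitted it. This is an illustration, not our pipeline configuration; `should_sample`, `filter_spans` and the span dictionaries are hypothetical names invented for the example.

```python
# Sketch only: deterministic, trace-ID-based probabilistic sampling.
# should_sample, filter_spans and the span dicts are hypothetical names.

LOW_BITS_MASK = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the trace ID alone whether to keep a trace."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    low_bits = int(trace_id_hex, 16) & LOW_BITS_MASK
    # Keep the trace if its low bits fall below ratio * 2^64.
    return low_bits < round(ratio * (1 << 64))


def filter_spans(spans: list, ratio: float) -> list:
    """Keep only the spans that belong to a sampled trace."""
    return [span for span in spans if should_sample(span["trace_id"], ratio)]


spans = [
    {"trace_id": "000000000000000a0000000000000001", "team": "checkout"},
    {"trace_id": "000000000000000a0000000000000001", "team": "payments"},
    {"trace_id": "000000000000000bffffffffffffffff", "team": "checkout"},
]
kept = filter_spans(spans, ratio=0.5)
print([span["team"] for span in kept])
```

Because the decision depends only on the trace ID, the two spans of the first trace are kept or dropped together, which is what keeps sampled traces complete when the work is spread over several pipeline instances.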

What we have seen so far

Photo by Kelly Sikkema on Unsplash

Early adopters to play along with

We quickly released a staging version of our platform and started the ‘Early adopters’ program.

This program lets the teams around us, the potential users of the platform, try it at no cost. They accept the limitations of the platform’s current state, but they also benefit from it and learn how to use tracing properly.

In return, we gain insight into their needs and requirements, which helps us adjust the infrastructure and the operational model.

The problem with the version of Victoria Metrics.

A big benefit of having the three observability signals in one place is being able to navigate from one to another. For example, you can jump to a trace that was being executed at the same time as that unusual latency problem.

This relies on a newer addition to the metrics format: exemplars. An exemplar is a metric sample with a trace associated with it. Prometheus supports exemplars, but not all Prometheus-based systems do.

In our case, Victoria Metrics does not support exemplars yet, and adding them is not on its immediate roadmap.
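For readers who have not seen one, the sketch below shows roughly what an exemplar looks like in the OpenMetrics text format: the usual sample, then a `#`, then a small label set and the observed value. The metric name, the `trace_id` label name and both helper functions are assumptions made for this example, and label escaping is left out to keep it short.

```python
import re

# Sketch only: an exemplar rendered in the OpenMetrics text format.
# The trace_id label name and both helper functions are assumptions.


def _render_labels(labels: dict) -> str:
    """Render {'le': '0.5'} as {le="0.5"} (no escaping, sketch only)."""
    return "{" + ",".join(f'{key}="{value}"' for key, value in labels.items()) + "}"


def sample_with_exemplar(name, labels, value, trace_id, exemplar_value):
    """Build one metric sample line followed by its exemplar."""
    exemplar_labels = _render_labels({"trace_id": trace_id})
    return f"{name}{_render_labels(labels)} {value} # {exemplar_labels} {exemplar_value}"


_TRACE_ID_RE = re.compile(r'#\s*\{[^}]*trace_id="([^"]+)"')


def trace_id_from_sample(line):
    """Return the trace ID carried by the exemplar, or None if there is none."""
    match = _TRACE_ID_RE.search(line)
    return match.group(1) if match else None


line = sample_with_exemplar(
    "http_request_duration_seconds_bucket", {"le": "0.5"}, 12,
    "4bf92f3577b34da6a3ce929d0e0e4736", 0.43,
)
print(line)
print(trace_id_from_sample(line))
```

The trace ID recovered from the sample is what lets a dashboard link a slow bucket directly to one concrete trace, provided the metrics backend stores the exemplar in the first place.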

The recharge.

We internally recharge for the usage of our platforms. This is good practice: it encourages teams to use resources efficiently and makes costs transparent. To achieve this, we needed a straightforward way to charge users and teams for their platform usage. We implemented a quota-based, or ‘sizing’, system that allowed us to achieve several objectives:

  • Hold users responsible for the load they generate.
  • Size the infrastructure in advance for peak loads.
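As a rough illustration of how a sizing system can serve both objectives, consider the Python sketch below. It is not our actual recharge model: the tier names, the spans-per-second limits, the cost units and all function names are hypothetical and chosen only to make the idea tangible.

```python
from dataclasses import dataclass

# Sketch only: every tier name, limit and cost unit below is hypothetical.


@dataclass(frozen=True)
class Sizing:
    name: str
    max_spans_per_second: int  # quota the team commits to
    cost_units: int            # abstract internal units, not real prices


SIZINGS = (
    Sizing("S", 500, 1),
    Sizing("M", 2_000, 3),
    Sizing("L", 10_000, 10),
)


def pick_sizing(expected_peak: int) -> Sizing:
    """Return the smallest sizing whose quota covers the expected peak."""
    for sizing in SIZINGS:
        if expected_peak <= sizing.max_spans_per_second:
            return sizing
    raise ValueError(f"no sizing covers {expected_peak} spans per second")


def provisioned_capacity(team_sizings: dict) -> int:
    """Total quota: the peak load the platform has to be ready for."""
    return sum(s.max_spans_per_second for s in team_sizings.values())


def over_quota(team_sizings: dict, observed: dict) -> list:
    """Names of teams whose observed load exceeds their quota."""
    return sorted(
        team for team, load in observed.items()
        if load > team_sizings[team].max_spans_per_second
    )


teams = {"checkout": pick_sizing(300), "payments": pick_sizing(1_500)}
print(provisioned_capacity(teams))
print(over_quota(teams, {"checkout": 650, "payments": 1_200}))
```

The sum of the quotas gives the platform a peak figure to provision for ahead of time, and comparing observed load against each quota shows which team is responsible when that figure is exceeded.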

The retention time

Quota-based recharges worked for our logs and metrics platforms. The teams decide the parameters to fit their needs. One of these parameters is the retention time. How long do you want your data to be stored?

Some teams want logs for just a couple of months; others need them for at least 12 months to produce annual reports.

We found out that this does not work for traces. Once a trace is stored on the backend, its ownership is shared: it can be made up of spans from totally different teams, and deleting some spans before others would leave incomplete traces.

Our recommendation: maintain a common retention time for everyone, and listen to your teams’ needs to determine the right magic number.
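The problem with per-team retention is easy to reproduce on paper. The Python sketch below uses a made-up trace of three spans owned by three teams, each with its own retention period; the team names, retention values and span fields are all hypothetical.

```python
# Sketch only: team names, retention values and span fields are hypothetical.

trace = [
    {"span_id": "a1", "parent_id": None, "team": "frontend"},
    {"span_id": "b2", "parent_id": "a1", "team": "checkout"},
    {"span_id": "c3", "parent_id": "b2", "team": "payments"},
]

retention_days = {"frontend": 7, "checkout": 30, "payments": 30}


def apply_per_team_retention(spans, retention, age_days):
    """Keep only spans whose owning team still retains data of this age."""
    return [s for s in spans if age_days <= retention[s["team"]]]


def orphaned_spans(spans):
    """Spans that point to a parent span which is no longer stored."""
    stored_ids = {s["span_id"] for s in spans}
    return [
        s for s in spans
        if s["parent_id"] is not None and s["parent_id"] not in stored_ids
    ]


remaining = apply_per_team_retention(trace, retention_days, age_days=10)
print([s["span_id"] for s in remaining])
print([s["span_id"] for s in orphaned_spans(remaining)])
```

After ten days the root span is gone while its children are still stored, so the trace no longer has an entry point and the remaining spans point to a parent that does not exist. A single retention time for all spans avoids this by construction.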

Conclusion

Each team and department has its own peculiarities. We did not intend to provide many solutions with this entry, but we hope it helps you ask the right questions, which is the best way to reach the right answers.

The views, thoughts, and opinions expressed in the text belong solely to the author, and do not represent the opinion, strategy or goals of the author’s employer, organization, committee or any other group or individual.
