Enhancing Developer Efficiency: Our Journey with OpenTelemetry Tracing

Published in Agoda Engineering & Design, Jun 24, 2024

by Oliver Belfitt-Nash

At Agoda, we operate thousands of systems, some significantly larger than others, making code deployment particularly challenging for the biggest ones. Monolith services contain different business domains within a single repository. As infrastructure standards evolved, these monolithic services were left behind. When a new contributor had to commit code, they would often multiply their time estimates due to the complexity. Once a change was finally completed, pipelines took hours to process, and deployments occurred only on a weekly schedule. Even a simple task like adding a log message could result in days of waiting before reaching production.

This was a significant issue, and our solution was to break the monolithic code into smaller, manageable services, aiming to reduce the Mean Lead Time for Change (MLTC) in line with Google’s DORA metrics. Fortunately, most systems now have lead times measured in minutes rather than days.

https://services.google.com/fh/files/misc/2022_state_of_devops_report.pdf

The original problem has been largely resolved. We now manage multiple bite-sized services, each with a modern CI flow and up-to-date dependencies that can be changed and deployed on demand. Developer satisfaction with these services has risen significantly. Even if a developer is unfamiliar with a new domain, they can clone and run the service within minutes. Working with a fresh, small codebase allows quick and efficient changes without the burden of navigating through thousands of lines of legacy code.

The Challenges of Distributed Systems

However, with one solution comes another set of challenges. Where there used to be one monolith, we now have a distributed set of services managed by many teams. While these services are smaller and allow for agile, swift changes, it can be difficult to identify ownership.

When something goes wrong, how do we identify the root cause? Where there was once one system to investigate, there are now ten. How can we determine which piece is causing the issue?

To gain a comprehensive view of a distributed system with many smaller components, different tools are required. We can no longer consider a single repository our sole responsibility. We need to understand what our dependencies are doing, and what their dependencies are doing, to see how everything fits together. A single user's HTTP request might pass through ten systems written in four different languages before a response is returned.

Example

For example, my team manages a service that renders a page allowing hotels to view a report on their performance. To render this page, the browser calls its reporting BFF (Backend for Frontend), which in turn calls three services:

IAM (Identity Access Management) Service: Responsible for authorising the user to access the page.
Hotel Performance Service: Responsible for retrieving aggregated data on the hotel’s recent performance.
Geography Service: Responsible for checking recent trends in a specific geography.

These three services exist separately, each with dependencies that the page owner does not have visibility into. For example, the IAM service calls the IAM database to check encrypted authentication records. The Hotel Performance service calls a Booking API to check the hotel’s recent bookings, and the Geography service reads a specialized data cluster to obtain its data. This process looks like this:

1. IAM (Identity Access Management) service
   Calls IAM DB

2. Hotel Performance service
   Calls Booking API

3. Geography service
   Calls geo cluster

The request journey does not end here. The Booking API itself calls a set of services responsible for calculating the financial details of a booking, which in turn call a config service.

1. IAM (Identity Access Management) service
   Calls IAM DB

2. Hotel Performance service
   Calls Booking API
      Calls Booking Finance Calculator
         Calls config service
            Calls config DB

3. Geography service
   Calls geo cluster

And so on. This is a relatively simple example of a single flow, but in a complex environment with thousands of services across Agoda, it can take a significant amount of time to understand what is happening when the page loads.
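To make the fan-out concrete, the flow above can be written down as a small call graph and walked programmatically. This is only a sketch: the service names are hypothetical identifiers transcribed from the lists above, not real Agoda service names.

```python
# Hypothetical call graph for the report page, transcribed from the lists above.
CALLS = {
    "reporting-bff": ["iam-service", "hotel-performance-service", "geography-service"],
    "iam-service": ["iam-db"],
    "hotel-performance-service": ["booking-api"],
    "booking-api": ["booking-finance-calculator"],
    "booking-finance-calculator": ["config-service"],
    "config-service": ["config-db"],
    "geography-service": ["geo-cluster"],
}


def downstream(service, depth=0):
    """Yield (depth, name) for every system reached from `service`, depth-first."""
    for dependency in CALLS.get(service, []):
        yield depth + 1, dependency
        yield from downstream(dependency, depth + 1)


if __name__ == "__main__":
    reached = list(downstream("reporting-bff"))
    for depth, name in reached:
        print("  " * depth + name)
    deepest = max(depth for depth, _ in reached)
    print(f"{len(reached)} downstream systems, deepest hop: {deepest}")
```

Even in this small example, one page load touches nine downstream systems, and the deepest one sits five hops away from the BFF.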

If the page stops responding one day and users encounter an error message, it creates a negative user experience and could lead to dissatisfaction with Agoda. This is a war room situation, which means any team related to the issue must stop what they’re doing and immediately focus on resolving the problem. War rooms can happen outside working hours and take precious time away from other product work.

When we review the code we are familiar with, everything may seem correct, which suggests the problem lies within a dependency. However, with numerous smaller dependencies and many potential causes, pinpointing the true source of the issue can be challenging.

What is Tracing?

Tracing offers a solution to this problem. The core concept of tracing is to consistently add metadata to every request passing through a distributed system and maintain a record of each service a single HTTP request interacts with. Similar to how a stack trace provides a record of every function call in the stack, an HTTP trace provides a comprehensive view of every service the request encounters.

“Traces give us the big picture of what happens when a request is made to an application. Whether your application is a monolith with a single database or a sophisticated mesh of services, traces are essential to understanding the full ‘path’ a request takes in your application.”

Tracing is not a new concept, but for years there was disagreement on the best way to implement it. This disagreement led to the development of several different standards. When multiple systems implement tracing differently, much of the value is lost. If the tracing specification of one system differs from that of its dependency, it becomes impossible to see into that dependency. The real value of tracing emerges when everyone agrees on a standardized method for tagging the necessary data across services. After much debate, a single unified specification for tracing has recently gained widespread acceptance.

Introducing OpenTelemetry

For us, OpenTelemetry has emerged as the clear choice. Adopting it eliminates debates over which header to use, how to name trace spans, or how to structure the data sent. The specification can be read in the OpenTelemetry documentation or, even more conveniently, adopted through the available OTel libraries and agents. What previously required meticulous alignment across many teams now often comes down to adding a single line of code to a service.

Behind the scenes, this applies the OpenTelemetry default trace headers to inbound HTTP requests into the service and propagates them through it, attaching them to outbound dependency calls. When a downstream system receives the request, it recognizes the existing trace header and forwards it to the next dependency, thereby maintaining a single trace ID across multiple applications. This trace information is then sent to a central server that stores it for viewing.
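The propagation step can be illustrated without any SDK. The sketch below assumes the W3C Trace Context `traceparent` header, which is the default propagation format in OpenTelemetry. The function names are hypothetical, the validation is simplified, and header keys are assumed to be lowercase. It illustrates the mechanism and is not the code the OTel libraries run.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-([0-9a-f]{2})$")


def read_trace(headers):
    """Return (trace_id, flags) from an inbound traceparent header, or None."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None or match.group(1) == "0" * 32:  # all-zero IDs are invalid
        return None
    return match.group(1), match.group(2)


def outbound_headers(inbound_headers):
    """Headers for a dependency call: same trace ID, fresh span ID."""
    # Continue the caller's trace when one exists, otherwise start a new one.
    trace_id, flags = read_trace(inbound_headers) or (secrets.token_hex(16), "01")
    span_id = secrets.token_hex(8)  # identifies this service's own span
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


if __name__ == "__main__":
    inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    print(outbound_headers(inbound))  # trace ID preserved, span ID replaced
    print(outbound_headers({}))       # no inbound trace: a new trace ID is minted
```

Because every hop keeps the 32-character trace ID and only swaps in its own span ID, the central server can later stitch all spans from all services back into one trace.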

Benefits of OpenTelemetry Tracing at Agoda

Now, when an error occurs on a page, we can locate the traces related to that page and identify the exact dependency that caused the error, even if it sits in a previously unknown service three dependencies away. A waterfall diagram displays all related calls across all microservices.
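As a rough illustration of what a trace viewer does with this data, the sketch below rebuilds a text waterfall from a flat list of spans and picks out the failing span that has no failing child, which is the likely origin of the error. The `Span` fields, service names, and timings are invented for the example and do not reflect a real OpenTelemetry data model or Agoda's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    duration_ms: int
    error: bool = False


def waterfall(spans):
    """Return indented text lines, one per span, children under their parent."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            flag = " ERROR" if span.error else ""
            lines.append(f"{'  ' * depth}{span.service} +{span.start_ms}ms {span.duration_ms}ms{flag}")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def deepest_error(spans):
    """Return a failing span with no failing child: the likely origin."""
    failing_parents = {s.parent_id for s in spans if s.error}
    origins = [s for s in spans if s.error and s.span_id not in failing_parents]
    return origins[0] if origins else None


# Invented sample trace: the error starts in the config service and bubbles up.
SPANS = [
    Span("a", None, "reporting-bff", 0, 900, error=True),
    Span("b", "a", "iam-service", 5, 40),
    Span("c", "a", "hotel-performance-service", 50, 800, error=True),
    Span("d", "c", "booking-api", 60, 780, error=True),
    Span("e", "d", "booking-finance-calculator", 70, 760, error=True),
    Span("f", "e", "config-service", 80, 740, error=True),
    Span("g", "a", "geography-service", 50, 120),
]

if __name__ == "__main__":
    print("\n".join(waterfall(SPANS)))
    print("likely origin:", deepest_error(SPANS).service)
```

In this sample the page owner only sees the BFF fail, yet the trace points at the config service four hops down.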

Additionally, for the internal subnet, the trace ID is included in the browser response, allowing customer service agents to see on their screen the exact trace ID associated with the error. Developers can then look up the specific trace for the error the customer service agents observed.
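One way to expose the trace ID only to internal callers is to gate a response header on the client's address. The following is a minimal sketch under stated assumptions: the `10.0.0.0/8` range and the `X-Trace-Id` header name are placeholders chosen for illustration, and the article does not say how Agoda actually implements this.

```python
import ipaddress

# Assumed values for illustration only.
INTERNAL_NETWORK = ipaddress.ip_network("10.0.0.0/8")
TRACE_HEADER = "X-Trace-Id"


def response_headers(client_ip, trace_id):
    """Expose the trace ID only to callers on the internal network."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if ipaddress.ip_address(client_ip) in INTERNAL_NETWORK:
        headers[TRACE_HEADER] = trace_id
    return headers


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(response_headers("10.1.2.3", trace_id))     # internal: header present
    print(response_headers("203.0.113.7", trace_id))  # external: header omitted
```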

And this is just the beginning of what tracing offers out of the box. The libraries are extensible and can accommodate the various ways an application may need this data. For more complex systems, spans can be added for individual method calls within a single service, with attributes carrying crucial data that can save time during investigations.
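The idea of method-level spans with extra data can be sketched with a plain context manager. In the OpenTelemetry Python API the corresponding calls are `tracer.start_as_current_span` and `Span.set_attribute`. The stand-in below uses only the standard library so it runs anywhere, and `RECORDED`, `span`, and `load_report` are hypothetical names for this example.

```python
import time
from contextlib import contextmanager

RECORDED = []  # stand-in for an exporter; a real SDK ships spans to a collector


@contextmanager
def span(name, **attributes):
    """Record timing, attributes, and any exception for the wrapped block."""
    record = {"name": name, "attributes": dict(attributes), "error": None}
    started = time.perf_counter()
    try:
        yield record["attributes"]  # callers can add attributes while running
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - started) * 1000
        RECORDED.append(record)


def load_report(hotel_id):
    """Hypothetical method whose internals we want visible in the trace."""
    with span("load_report", hotel_id=hotel_id) as attrs:
        rows = [{"night": n} for n in range(3)]  # placeholder for a real query
        attrs["row_count"] = len(rows)
        return rows


if __name__ == "__main__":
    load_report(42)
    print(RECORDED[-1])
```

Attributes such as the hotel ID or row count are the kind of detail that lets an investigator tell a slow query from an empty result without reading the code.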

Log Links

Using a single source of truth for telemetry unlocks a new level of observability at scale. OpenTelemetry enables the integration of logs, metrics, and traces, as they all adhere to the same specification. At Agoda, we leverage this to build a custom link from log messages to their corresponding traces. Each log message is displayed with a convenient button that links directly to its trace.
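A log-to-trace link needs only one thing: the trace ID stamped on every log record. The sketch below shows one possible way to do that with Python's standard `logging` module. The viewer URL is a made-up placeholder, the names are hypothetical, and Agoda's actual integration is not described in this article.

```python
import logging
from contextvars import ContextVar
from urllib.parse import quote

# Hypothetical trace viewer address; substitute the real one for your setup.
TRACE_VIEWER = "https://tracing.example.internal/trace/"

# Holds the trace ID of the request currently being handled.
current_trace_id = ContextVar("current_trace_id", default="")


class TraceLinkFilter(logging.Filter):
    """Attach the active trace ID and a viewer link to every log record."""

    def filter(self, record):
        trace_id = current_trace_id.get()
        record.trace_id = trace_id
        record.trace_url = (TRACE_VIEWER + quote(trace_id)) if trace_id else ""
        return True  # never drop the record, only enrich it


def build_logger(stream=None):
    """Return a logger whose output lines end with a link to the trace."""
    logger = logging.getLogger("report-page")
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s trace=%(trace_url)s"))
    handler.addFilter(TraceLinkFilter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    log.error("report failed")
```

A log viewer can then render the `trace_url` field as a button, which is the kind of link described above.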

Conclusion

Our journey with OpenTelemetry is just beginning, but the tools we have implemented have significantly enhanced our ability to monitor services and address issues. Observing and managing services at scale requires consistent standards across the board to realize their full potential. With OpenTelemetry, we now have that standard and can start building upon it.

Tracing offers the most significant value from these telemetry standards, with logs following suit and metrics providing valuable insights when linked together. If every customer issue includes logs, metrics, and traces bundled with the support ticket, we can spend less time fixing bugs and more time developing new features. This level of observability enables developers to optimize their software efficiently without creating custom solutions for each potential issue.

However, it’s important to note that implementing OpenTelemetry does come with some challenges. It requires a significant investment in terms of time and resources to set up and maintain the necessary infrastructure. Additionally, there may be a learning curve for developers unfamiliar with the concepts and tools involved in distributed tracing.

Despite these challenges, the benefits of OpenTelemetry far outweigh the costs. By adopting this standard, we can improve the reliability, performance, and user experience of our services, ultimately leading to better outcomes for our customers and our business.