Building a large-scale Observability Ecosystem

Juan Pi
Mercado Libre Tech

--

The health of our platform and operational excellence are essential for ensuring a great experience for our customers.

When one of our services is affected and it becomes impossible to complete a checkout in our marketplace cart or to pay with Mercado Pago, our on-call engineers work swiftly to resolve the issue. To tackle these problems, we often ask ourselves, “What is failing, and why?”

In our payment processing services alone, there can be over 100 microservices and more than 1,000 different operations involved. These scenarios present a significant challenge because the complexity of the system increases the number of potential sources for the issue. Complex systems require robust observability (o11y)!

Answering that question can be surprisingly hard when troubleshooting distributed systems.

In this article, we will share our approach to observability, the main tasks of the platform team responsible for it, what observability is in the context of Fury, and some lessons we have learned along the way.

What do we talk about when we talk about observability?

In recent years, the term “observability” has become a highly relevant topic and has garnered significant interest within the tech community.

But what exactly do we mean by observability?

Originally introduced in control theory, observability is defined as the measure of how well the internal states of a system can be inferred from its external outputs. This term also applies to software. When we say an application is observable, it means we can:

  • Understand its inner workings through observation and inquiry using external tools.
  • Understand its state, including unpredictable states.

A core aspect of observability is the ability to explore applications in open-ended ways. Explorability means asking questions iteratively to understand the state of an application or system, without having to predict that state in advance.

Observability is crucial for tackling challenging problems, especially in large-scale distributed systems, and it is key to scaling operations sustainably across Mercado Libre. Effective exploration and observability-based debugging also depend on a thorough understanding of context: preserving as much context as possible around each request helps reconstruct the environment and conditions that led to a failure.

It’s important to note that observability begins once telemetry is collected, and code instrumentation is how we gather that telemetry. Instrumentation captures telemetry data from any source that produces it, whether open-source or vendor-specific.

There are different categories of telemetry data such as logs, events, metrics, traces, and profiles.
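
To make these categories concrete, here is a minimal, purely illustrative sketch in Go using the OpenTelemetry API (more on that choice later in this article). The tracer, meter, and instrument names are invented for the example and are not Fury’s.

```go
package payments

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
)

// ProcessPayment shows how a single operation can emit several telemetry
// signals at once. The tracer, meter, and instrument names are invented
// for this example.
func ProcessPayment(ctx context.Context, orderID string) error {
	tracer := otel.Tracer("payments")
	meter := otel.Meter("payments")

	// Trace: a span records the duration and context of this operation
	// and links it to the rest of the request.
	ctx, span := tracer.Start(ctx, "process-payment")
	defer span.End()

	// Metric: a counter is a cheap, pre-aggregated view over time.
	processed, err := meter.Int64Counter("payments.processed")
	if err != nil {
		return err
	}
	processed.Add(ctx, 1)

	// Log: a discrete event carrying request-level detail for later inspection.
	log.Printf("payment processed order_id=%s", orderID)
	return nil
}
```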

Image 1: Observability is not logs, metrics, and traces! Observability is about asking questions.

Enough theory, let’s dive into Mercado Libre’s observability journey!

Starting out: A crash intro to our first steps

Back in our early days with Fury, the observability initiative was not yet formalized within the organization. We had a set of tools that met our initial needs and helped us address the problems we faced.

However, as Fury adoption expanded and our technology stack evolved, we encountered new challenges.

Firstly, the rapid growth of our microservices architecture brought inherent complexities, scale, and volume, leading to the need for new telemetry signals and new tools.

Secondly, a wide array of observability tools emerged, including third-party providers, open-source solutions, and proprietary in-house tools.

These factors made exploration more complex by introducing challenges in correlating different data types. They also increased operational complexity, requiring us to manage multiple telemetry agents and a variety of code instrumentation libraries. Additionally, the proliferation of telemetry sources added complexity to control and management tasks.

This fragmentation also prevented the use of common semantics across different signals and tools: there was no shared specification of names for different kinds of operations and data. In short, telemetry data lacked standardization.

Another critical aspect was cost control and governance of the solutions. As application volume, traffic, and scale increased, so did the volume of telemetry data produced.

This growth directly impacted costs and made the data harder to use. High volumes of telemetry data often result in noise, significantly affecting incident detection and resolution times.

Welcome aboard, Observability team!

To address these needs, we established a dedicated platform team for observability. The primary mission of this new team is to provide a reliable, sustainable, and well-managed observability ecosystem that enables all teams at Mercado Libre to understand and learn about their production environments, no matter how complex they may be, in order to solve problems faster.

Essentially, we built the Observability team with two main focuses in mind: enablement and efficiency.

It is important to highlight that the Observability team focuses on laying the foundations needed to support a strong observability culture, rather than acting as a replacement for it. In other words, we encourage observability to be integrated into the daily practices of every software engineer as part of the development cycle.

Within its enablement tasks, the team is responsible for the telemetry instrumentation components, SDKs, and agents, as well as ingestion, collection, transport, and storage. It is also accountable for providing exploration, analysis, and visualization tools for telemetry data.

To ensure these components contribute to a strong observability culture, the team also focuses on training tasks, knowledge sharing, consulting, and collaboration with other teams.

The team is also responsible for defining the necessary tools and collaborating in the “build vs. buy” decision-making process. This is where the team’s second focus, efficiency, comes into play. The Observability team looks after the efficiency of the ecosystem by ensuring the healthy growth of solutions, implementing cost control mechanisms, and optimizing the use of resources.

Fasten your seatbelts: The O11y journey begins!

With the team in place, our first step was a comprehensive assessment to understand our current state. We developed an observability maturity model with clear, realistic goals and evaluation criteria aligned with Mercado Libre and Fury’s vision. This approach gave us a clear starting point to visualize our situation and align expectations.

In other words, we treated this as a framework that would give us a roadmap to understand our current capabilities, identify areas for improvement, and define our strategy to achieve optimal observability.

To develop the maturity model, we started with general capabilities, such as our levels of reliability and resilience, our MTTx metrics (Mean Time to Detect, Recover, and so on), and how we manage complex systems, and then moved to more specific attributes, such as code instrumentation, availability of telemetry data, correlation, and data visualization.

This allowed us to understand, among other things, how teams used different tools and how they ran their analysis and troubleshooting processes, and to identify gaps in visibility or functionality. Conducting this assessment required connecting with diverse teams, backgrounds, and seniority levels.

A fundamental aspect was the collaboration with the SRE teams and the active participation in incident resolution, which allowed us to understand and define the operational requirements much better. Likewise, collaboration with the FinOps team offered a clearer perspective on governance and efficiency requirements.

In short, interaction and collaboration with different teams were key to building a shared vision and developing a roadmap for our journey.

Paved road: Platform teams’ shared vision

To give additional context on the requirement criteria and challenges that formed the basis of the previous process: many of these requirements and challenges are not exclusive to the Observability team. When abstracted from the specific business domain and viewed from a more macro perspective, they apply to many of the problems that Platform teams within Fury face.

We have a shared vision for Fury: provide a consistent and unified development experience, simplify complexity, build solid, reusable components, and make development faster and more efficient. You can read more about our DevEx in this article.

Below, we outline some of the requirements and challenges to build a paved road on observability.

Out-of-the-box experience within Fury

  • Provide automatic foundational observability features from the platform.
  • Provide automatic instrumentation, along with pre-configured views for data exploration and pre-configured alerts.
  • Offer abstraction and automation in managing infrastructure components such as telemetry agents.

Simplifying complexity and reducing effort for Fury users

  • Provide capabilities with minimal adoption effort.
  • Minimize the complexity of technological components and tool migrations, ensuring transparency for developers.

Telemetry signals, correlation, and availability

  • Provide necessary telemetry signals on the platform tailored to various use cases.
  • Offer correlation capabilities to enhance analysis processes.
  • Ensure compliance with targeted objectives regarding data availability.

Service governance and efficiency

  • Establish service governance criteria to enable sustainable scaling over time.
  • Reduce complexity and noise.
  • Implement mechanisms to enhance efficiency levels.

Risk and dependency mitigation

  • Mitigate risks associated with vendor lock-in from suppliers and tools.

Journey stopover spots

Tooling overlap decoupling

When multiple tools are provided within a single ecosystem, there is a risk of overlapping use cases among different tools. This overlap can confuse users about which tool to use for each specific use case, potentially causing delays in telemetry data exploration and impacting incident analysis times. Moreover, it can result in governance issues and additional costs.

To solve this problem, we conducted an analysis to choose the most suitable tool for each use case based on its specific strengths. Subsequently, we established clear guidelines within a framework and implemented governance mechanisms to mitigate the risk of tool overlap.

Tooling and services scaling

As previously mentioned, in our initial stages of developing Fury, we already had a suite of observability tools available. However, the exponential growth of new microservices and traffic on the platform directly impacted some of these tools.

As a result, it became necessary to focus on improving the scalability of our tools and services. One of the most significant tooling efforts was the redesign of our logging solution. Ensuring clear visibility into logs is crucial due to the contextual information they provide, especially during error occurrences.

To achieve this, we evolved our ingestion and collection components and redesigned our storage layers. We built management tools to automate the control of the write, read, and storage layers.

Embracing standardization

One of the most important decisions we made was to focus our efforts on standardizing telemetry data. To achieve this, we decided to adopt OpenTelemetry.

OpenTelemetry is an observability framework and toolkit designed to create and manage telemetry data. It is vendor-neutral, open-source, and tool-agnostic, meaning it can be used with a broad variety of observability backends.

There are numerous motivations and benefits for adopting OpenTelemetry. The most significant ones include taking ownership of the data we generate to mitigate vendor lock-in risks, gaining flexibility and extensibility for data manipulation and correlation, and establishing a unified set of APIs and conventions. This simplifies operation and defines a common naming scheme that can be standardized across codebases, libraries, and platforms.
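
As a rough sketch of what vendor neutrality looks like in practice (and not a description of Fury’s actual setup), the example below wires the OpenTelemetry Go SDK to an OTLP exporter. The endpoint and service name are placeholders; the key point is that swapping the observability backend only touches this setup code, never the instrumentation itself.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP is vendor-neutral: the collector endpoint below is a placeholder
	// and can point at any OTLP-compatible backend.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	// Resource attributes follow OpenTelemetry's semantic conventions,
	// giving every signal the same service identity.
	res := resource.NewSchemaless(
		attribute.String("service.name", "checkout-api"), // illustrative name
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Application code only ever talks to the vendor-neutral API, so the
	// backend can change without touching instrumentation.
	_, span := otel.Tracer("checkout").Start(ctx, "startup-check")
	span.End()
}
```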

Filling the gaps

Another key focus of the team was to address the visibility gaps detected. To do this, we worked on making two new telemetry signals available within Fury: Distributed Tracing and Profiling.

Distributed tracing refers to methods for observing requests as they propagate through distributed systems. It is a diagnostic technique that reveals how a set of services coordinate to handle individual user requests. In microservice-oriented architectures, distributed tracing plays a crucial role in pinpointing where failures occur and identifying the causes of poor performance.
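
As an illustration of how trace context travels between services (a hypothetical sketch, not Fury’s internal implementation), the example below propagates W3C Trace Context headers from a client span in one service to a child span in another; the service and span names are invented.

```go
package tracingdemo

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Use W3C Trace Context headers so trace identifiers survive the hop
	// between services.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// CallPayments starts a client span and injects the trace context into the
// outgoing request, so the downstream service can continue the same trace.
func CallPayments(ctx context.Context, url string) (*http.Response, error) {
	ctx, span := otel.Tracer("cart").Start(ctx, "call-payments")
	defer span.End()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}

// HandlePayment extracts the incoming trace context and records its own span
// as a child of the caller's span, linking both services in a single trace.
func HandlePayment(w http.ResponseWriter, r *http.Request) {
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	_, span := otel.Tracer("payments").Start(ctx, "handle-payment")
	defer span.End()

	w.WriteHeader(http.StatusOK)
}
```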

Profiling involves analyzing the performance characteristics of a software application or system to identify performance bottlenecks and areas of inefficiency. This allows developers and operations staff to make informed decisions about improving performance. Within Fury, both ad-hoc profiling and continuous profiling are possible.
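
For ad-hoc profiling of a Go service, one common approach, shown here as a sketch rather than a description of our internal tooling, is to expose the standard library’s pprof endpoints and capture profiles on demand.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiling endpoints on a local port; CPU, heap, goroutine,
	// and other profiles can then be captured while the service is running.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the service's real work would run here ...
	select {}
}
```

With the endpoint exposed, a 30-second CPU profile can be taken by running go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 against the running process.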

Image 2: Telemetry signals

Offering these capabilities within Fury required development effort: making toolkits and SDKs available, building the collection and storage layers, and ensuring seamless, low-effort adoption for users.

We also focused on creating a “First Pane of Glass” (also known as Single Pane of Glass), which serves as a platform providing centralized, cross-company visibility into various sources of telemetry data.

This solution democratizes information by reducing the analysis overhead for teams and providing the information needed to kickstart investigations into potential problems.

Governance and efficiency

One of our initial actions in this track was to establish a service framework for each component and tool. This involved setting clear criteria for guarantees, requirements, ingestion, storage, and retention quotas.

To establish these guidelines, we had to study the distributions of data generation and consumption across the different applications in Fury, identify use cases to exclude, and define a strategy for dealing with noisy neighbors.

Some questions that served as triggers in this process include: Is all data equally important? Is it necessary to offer the same retention policies in all cases? How often is data consumed? Should traffic spikes be managed in a particular way?

With the framework defined for each tool, we set the desired consumption targets and annual growth rates. Finally, we developed the sampling mechanisms, quota management, and other levers required to achieve these objectives.
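
As one example of such a lever, the sketch below configures head-based probabilistic sampling with the OpenTelemetry Go SDK. The 10% ratio is purely illustrative; in practice, ratios and strategies are tuned per service, quota, and use case.

```go
package sampling

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// NewSampledProvider keeps roughly 10% of new traces while always honoring
// the decision already made upstream, so a sampled request stays sampled
// end to end. The 10% ratio is illustrative, not our production value.
func NewSampledProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))
	return sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sampler),
		sdktrace.WithBatcher(exporter),
	)
}
```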

Conclusions

With over 40TB of daily logs, over 200M spans per minute, and over 250M metrics per minute, our observability ecosystem is a crucial component within Fury. It empowers teams within Mercado Libre to better understand our customers’ experiences and effectively troubleshoot any issues that may arise.

In future blog posts, we will dive into technical architecture and share the results and lessons learned from this journey.

--

Juan Pi
Mercado Libre Tech

Senior Engineering Manager | O11y & Performance Engineering