Internal Traffic: What a crazy airport!

Fernando Navarro
Mercado Libre Tech

--

In our previous article, we introduced Fury, our internal development platform. In this article, we’ll explore how we manage traffic inside our platform, how we navigated the chaos, and how we completely revamped the way our microservices communicate.

Our internal traffic: A chaotic airport

Our internal traffic in Fury was much like a chaotic airport, with millions of passengers traveling to thousands of destinations. Everyone travels without passports, immigration controls, or customs checks. Can you imagine?

Millions of passengers and thousands of destinations: scale challenges

At Mercado Libre, we have a huge microservices ecosystem. At peak times, our passengers (requests) exceed 400M per minute for internal traffic alone. With over 30k applications running, divided into more than 75k groups of instances, we have hundreds of thousands of instances and pods handling that demand. Therefore, any solution we implement or change we make has to account for this scale.

No passports: authentication challenges

Many of our apps used to operate without proper authentication, which means we couldn’t verify the identity of the source. Anyone inside the platform could use any of our APIs without any form of control. Incorporating authentication right from the beginning is crucial, as it enhances security, observability, and traceability.

No immigration controls: authorization challenges

In this chaotic airport scenario, the lack of immigration controls poses yet another challenge. Without passports, we cannot create access rules based on specific customers since we are unable to uniquely identify them. This raises key questions: Can anyone enter freely? What items are permitted for entry? Are there any limits or quotas on access frequency? Answering these types of questions in a wild airport like this makes no sense.

In any airport that prioritizes security, customs checks typically follow immigration controls. These checks verify that the transported information aligns with destination policies. Can you enter with this information from this origin? Can you take this information to this destination? Does the destination have adequate security controls to handle sensitive information? Is the transfer of this information truly necessary? Unfortunately, without passports, access controls, or any form of checks, such questions lose their meaning.

How to add a new destination: centralized routes specification

What if you wanted to add a new destination or modify one of the routes of a flight? Well, in this airport every route and destination that exists is written in a single place. Every change, whether it’s adding, deleting, or modifying a destination, must be made in this central notebook. We have defined more than 100K routes. Can you imagine the size of this piece of information and the challenges involved in maintaining it?

All roads lead to Rome: centralized communication

Another singularity of this crazy airport: all flights seem to have the same destination. Even when traveling from one airport to a nearby one, passengers need to make a stopover at the central airport before finally reaching their desired destination.

This central airport serves as the convergence point for not just your flight, but for all other flights as well. From the perspective of the destination, it appears that all flights originate from the same source. This is a characteristic of having centralized traffic: all flights seem to have the same destination and origin.

Imagine the bustling atmosphere of that central airport, always busy and noisy, with runways saturated with planes. The control tower is constantly overwhelmed with routing all the necessary traffic.

In the midst of all this chaos, adding specific security controls for each application would only contribute to the confusion. The airport urgently needed an upgrade and modernization. The process of implementing these changes is an adventure in itself, and we are excited to share that journey with you.

Initial Fury Traffic Architecture

The high-level architecture depicted in the image below was quite simple. From a request perspective:

1) Requests access the platform via the centralized Traffic-Layer.

2) Requests are routed from Traffic-Layer to the destination application.

3 & 4) When one application needs to communicate with another, it sends a new outbound request through Traffic-Layer, which then routes it to the intended destination.

Image — Initial Fury Traffic Architecture

Security controls improvements

To tackle this, our first step was to relieve the burden on this bustling airport by distributing the new security controls to local airports.

Our local airport serves as a traffic sidecar that runs alongside the application. It functions as a reverse proxy, capturing incoming traffic, and as a transparent proxy, capturing outgoing traffic.

The traffic sidecar enabled us to have a presence across all applications in our Fury ecosystem. Moreover, this was the kick-off that made it possible to have better control, observability, and governance of the traffic in our ecosystem, and provide new features to our developers.

Identify our passengers

Before implementing controls, we need to issue passports to uniquely identify our customers. To address this, we implemented a Certificate Authority (CA) that generates certificates or passports during the creation of the services or scopes and renews them with each deployment.

Once the passports are issued, this control system stamps each passport upon flight departure. Then, upon arrival at the destination, the same control validates the stamp’s validity. In this way, any flight without a passport or with an invalid passport can be intercepted before reaching the application. Essentially, all traffic undergoes stamping at its origin and validation upon reaching its destination.

In the first version, our passport was plain text in a custom format that carried a signature. The second version was a JSON Web Token (JWT). In both cases, the CA signed the passport with a private key, and all the apps in the ecosystem knew the public key needed to validate it. In the third version, we took advantage of Mutual TLS (mTLS). The traffic sidecar handles it in the communication between the apps, securing communication in both directions. It also avoids the need to transport the JWT and allows us to validate the identity of the origin from the certificate.

Traffic Authorization: Add immigration controls

Once the validity of a passport from a trusted source is confirmed, the subsequent critical phase involves implementing passenger controls. This entails ensuring that each passenger meets the necessary criteria for travel and possesses all requisite documentation to facilitate entry into their intended destination.

In this context, the first step is to ensure that immigration regulations are organized effectively somewhere. This involves delineating the specific locations passengers are authorized to enter and understanding the immigration prerequisites of each destination. To achieve this, we developed a tool known as Traffic Authorization that allows our developers to specify the endpoints, methods, and domains accessible within an application, as well as those required for interaction with other ones.

Traffic Authorization takes that information, provided by our developers, and acts like an immigration governance entity, defining immigration policies and disseminating this information to various components. We have effectively cataloged the internal traffic of Mercado Libre, allowing us to trace the origin and destination of each flight. Looking ahead, we intend to enhance our capabilities by monitoring the specific data exchanged within these flights as well.

The second step involves enforcing the immigration controls themselves. Initially, custom validations were developed to operate at the destination using Nginx with Lua and Golang customizations. Upon migrating the Nginx sidecar to the Envoy sidecar, we capitalized on the opportunity to standardize this process and employed Envoy RBAC filters for these validations. They act as immigration police.
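The kind of check performed at the destination can be sketched as a default-deny allow-list keyed by the authenticated caller. This is not Envoy RBAC configuration or the Traffic Authorization data model; the `Rule` shape, the `POLICIES` table, and the application names are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Rule:
    source: str               # authenticated caller (taken from the passport)
    methods: FrozenSet[str]   # HTTP methods the caller may use
    path_prefix: str          # endpoint prefix the caller may reach


# ASSUMPTION: hypothetical policies, keyed by destination application.
POLICIES: Dict[str, List[Rule]] = {
    "payments-api": [
        Rule("checkout-api", frozenset({"GET", "POST"}), "/payments"),
        Rule("reports-api", frozenset({"GET"}), "/payments/summary"),
    ],
}


def _path_matches(path: str, prefix: str) -> bool:
    """Match on whole path segments so '/payments' does not allow '/payments-x'."""
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


def is_allowed(destination: str, source: str, method: str, path: str) -> bool:
    """Default deny: a request passes only if some rule explicitly allows it."""
    for rule in POLICIES.get(destination, []):
        if (rule.source == source
                and method.upper() in rule.methods
                and _path_matches(path, rule.path_prefix)):
            return True
    return False
```

For example, `is_allowed("payments-api", "checkout-api", "POST", "/payments/123")` passes, while the same call from an application with no rule is rejected. Because the decision depends on the caller's identity, it only makes sense once passports are in place.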

Flight plan improvements

Next, our attention turned to efficiency. As we observed, all roads lead to Rome, so the flights had more stopovers than necessary. Therefore, we decentralized route configuration and communication processes.

Traffic Routes: our decentralized routes configuration interface

To ensure seamless operation, we developed Traffic Routes, a robust routing solution. Within Traffic Routes, each development team is empowered to create and manage their own domain within our system. This domain represents their space within the airport, where they have autonomy over routing decisions. Routes are defined based on paths, and teams can designate specific paths within their domain and map them to various destinations.

Teams can set conditions based on request headers, query parameters, or HTTP methods to ensure that each request is directed to the appropriate destination within their domain, and they can also test their routes before going to production.
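As a rough illustration of condition-based routing, the sketch below resolves a request to a destination using a path prefix plus optional method, header, and query-parameter conditions. The `Route` fields, the first-match-wins ordering, and the destination names are assumptions made for the example; they do not describe the actual Traffic Routes interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set
from urllib.parse import parse_qs, urlsplit


@dataclass
class Route:
    path_prefix: str
    destination: str
    methods: Optional[Set[str]] = None                      # None means any method
    headers: Dict[str, str] = field(default_factory=dict)   # required header values
    query: Dict[str, str] = field(default_factory=dict)     # required query values


def resolve(routes: List[Route], method: str, url: str,
            headers: Dict[str, str]) -> Optional[str]:
    """Return the destination of the first route whose conditions all hold."""
    parts = urlsplit(url)
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    for route in routes:
        if not parts.path.startswith(route.path_prefix):
            continue
        if route.methods and method.upper() not in route.methods:
            continue
        if any(lowered.get(k.lower()) != v for k, v in route.headers.items()):
            continue
        if any(params.get(k) != v for k, v in route.query.items()):
            continue
        return route.destination
    return None


# ASSUMPTION: a hypothetical domain owned by one team, most specific routes first.
ROUTES = [
    Route("/items", "items-beta", headers={"X-Beta": "true"}),
    Route("/items", "items-write", methods={"POST", "PUT"}),
    Route("/items", "items-read"),
    Route("/search", "search-v2", query={"version": "2"}),
    Route("/search", "search-v1"),
]
```

Running a resolver like this against sample requests is one simple way to check a route table before it reaches production, for instance confirming that `resolve(ROUTES, "GET", "/search?q=tv&version=2", {})` lands on the newer search destination.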

With this system in place, we’ve transformed our microservices infrastructure from a crowded, single-file configuration nightmare to a well-orchestrated symphony of autonomous domains, ensuring efficient routing and smoother operations for all teams involved.

Service Mesh: our decentralized communication

The service mesh comes to the rescue by reducing traffic jams, for example by removing the load balancer and the centralized traffic layer from the equation. Without a solution like the service mesh, we wouldn’t have been able to achieve our goal of decentralizing all traffic.

In the context of building a local airport, the mesh serves as the control tower. It ensures the successful arrival and departure of flights, directing them toward secure or less congested routes, selecting the best available route for each flight’s departure.

To implement the service mesh, after analyzing several options (including in-house development), we chose the Istio Control Plane since, among several factors, it was a mature, open-source solution that provided the features we required.

For this task, our focus extended beyond determining the best way to install the component so that it met Mercado Libre’s requirements and scale. Integration into the entire Fury ecosystem was also crucial to ensure comprehensive support for all existing traffic features. This integration involved several tasks: incorporating the mesh into the creation and deployment processes of applications and instance groups, implementing the aforementioned security controls, integrating it with Traffic Routes to exchange routing information, and configuring it to maintain parity with existing metrics and traffic logs, among others.

Current Fury Traffic Architecture

The current high-level architecture, as depicted in the image below, has evolved quite a bit. From a request perspective:

1) Access to the platform is initiated through the centralized Traffic-Layer.

2) Requests are then routed from the Traffic-Layer to the destination application.

3) The request is intercepted by the traffic sidecar, which then forwards it to the application.

4 & 5) When one application needs to communicate with another, the new outbound request is intercepted by the traffic sidecar and then routed directly to the next destination.

6) Once again, the request goes through the traffic sidecar before reaching the intended application.

Image — Current Fury Traffic Architecture

What’s next?

Our next step is to build upon the progress we’ve achieved so far. With the establishment of decentralized local airports, strengthened security controls, and enhanced information handling capabilities, we’ve markedly improved our internal traffic management. This advancement enables us to move away from maintaining a central airport with insufficient security measures and generalized controls for each application. This transition empowers us to focus on optimizing our operations, improving efficiency, and providing a more secure and seamless experience for our users.

Stay tuned for our upcoming articles!

Thank you for flying with Mercado Libre airlines!
