Data Mesh @ MELI: Building Highways for Thousands of Data Producers

Ignacio Weinberg
Mercado Libre Tech
Feb 23, 2024


Imagine yourself in a dream where you are stuck in traffic. You are commuting to the city, and to reach your destination, you and the other commuters must all travel along a single road. This road has been expanding over the years, but the number of cars trying to reach the city has been growing at an even faster rate. As a result, you are stuck in slow-moving traffic. Yet you are also privileged; many cars couldn’t even make it onto the road.

As you wake up, you realize that your Data Engineering team has become that vast and expansive road. The cars traveling on it represent the data products, while the destination they aim to reach symbolizes the official data warehouse, lakehouse, or data repository of your organization: the hub for data-driven actions and decisions. The road vividly portrays a single, saturated prioritization queue, along with the frustrations that come with it.

The scenario depicted above was indeed happening at Mercado Libre. Centralization is not inherently bad, and it had been a winning strategy for our data management and productivity for years. However, the landscape was changing. Let us explore the context and the main factors that led us to that collapsed road.

Mercado Libre (MELI) is a digital native company that encompasses multiple companies within one. It is not only the largest and leading marketplace in Latin America (with more than 54 million buyers making 52 purchases per second), but also an expanding ecosystem of products and services. It includes a Fintech with a digital wallet and payment system serving 53.1 million customers and processing 371 transactions per second. Additionally, MELI has a vast shipping and logistics network unmatched in size and performance across Latin America. Approximately 52% of our shipments are delivered within 24 hours.

The number of people working at MELI has been growing exponentially. In December 2019, we had almost 10,000 employees, and by December 2023, we had reached 58,000. To put it simply, MELI is like many large and continuously expanding companies combined into one entity.

MELI is a highly agile, innovative, and dynamic organization. Our data models, which aim to describe aspects of our reality, are very dynamic too. In an exponentially growing company, with exponentially growing data and needs, a single-team data strategy would sooner or later struggle to keep up with the pace and need to adapt and reinvent itself.

As I mentioned earlier, we had reached a point where our centralized team had grown rapidly, yet a team of over 50 data engineers still wasn’t enough to meet the large number of requests. Unsurprisingly, they were not having a good time; frustration was regularly present both among our team members and among the teams requesting new data products. These symptoms should sound familiar to anyone who has read Zhamak Dehghani’s book “Data Mesh” (Dehghani, 2022)¹ or her article “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh” (Dehghani, 2019)². If you haven’t read them, I highly recommend doing so.

In the next sections, we will explore how we addressed these challenges and transformed our data engineering practices to adapt to the evolving needs of our organization.

Starting out

Around the end of 2021, with a growing number of people joining MELI with the skills and capacity to produce data on their own, we began seriously considering the adoption of a Data Mesh approach. This decision was driven not by a passing trend but by pure necessity. We were aware of our users’ willingness to become independent and produce their own data. We also realized there were two other critical aspects of MELI that encouraged us to take the next step and made us feel ready for it: culture and technology.

It is no coincidence to mention culture first. Generating a culture shift is usually much harder and more challenging than implementing a technological one, as the former provides the fertile ground needed for the latter. MELI had been developing and cultivating a self-service, data-driven culture for years. During that time, several teams had matured and evolved, along with their practices and skills. These teams were no longer only consuming data; they were also becoming informal data creators. They were not creating or publishing official single-source-of-truth data, and each may have had its own good practices and standards. On the other hand, they were already skilled enough for the challenge, and some of them were already asking for a way to formalize their data products.

On the technological side, we were in a favorable position. We had developed our own modern data platform, called Data Suite, which had 4,000 monthly users and 35,000 monthly sessions at the time (now 11,000 monthly users and 140,000 monthly sessions). The platform had been designed for distributed teams, with most of its tools built in-house. The key components of this platform for this particular project were a data pipeline orchestration tool (with 100,000 daily job executions then, and 270,000 now) and a data catalog or metadata service (with 300,000 data artifacts cataloged then, and 1 million today).

These were complemented by our machine learning platform, called Fury Data Apps (FDA), with 10,000 tasks running daily at the time and 20,000 today, and by the Monte Carlo data observability tool, which monitors every productive table in the data lakehouse. Furthermore, we had recently migrated our data lakehouse to Google BigQuery. We were highly satisfied with these tools, as they also provided us with features such as data lineage, SLAs, and data uptime metrics. Nonetheless, we knew we would have to make some changes to fit our new strategy. The advantage of having developed most of these tools in-house was that we could change and adapt them as needed.

In designing our own decentralized implementation, we always prioritized pragmatism over strict adherence to theory, in order to minimize risks and take good care of MELI’s data. Data Mesh, as you may know, is not an “installable”. We had to come up with our own interpretation and adapt everything to MELI, which was a really big bet, but we were very confident that our team could pull it off. I’d like to share with you some key technical design decisions that may be different or particular to our implementation.

Differentiation between Domains and Data Mesh Environments (DMEs)

As you start designing your Data Mesh approach, driven by the first principle of “Domain-driven data ownership”, you will find the word “domain” everywhere. Defining the aspects and boundaries of a business domain so it could produce its own data products at MELI was challenging enough. Domains may not be discrete items on a list; they may be hierarchical and organic entities that can be born any day, grow, split, or disappear over time. Our task was to take all of those traits into consideration and define the technical characteristics of the nodes on the mesh.

So at one point, there was ambiguity and confusion whenever we used the word “domain”. Were we referring to the business domains themselves? Or to the environment and rules that we needed to set up for them? To address this, the team came up with the term Data Mesh Environment (DME), which consists of infrastructure as a service, separate access within our data platform tools, SDKs for customized data streams, incident management, training, a set of rules and recommendations, and, of course, a business domain to inhabit it. We focused on enabling easy and automated deployment of DMEs.
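To make the concept more concrete, here is a minimal sketch of how a DME could be described as a single provisioning unit. Everything below is hypothetical and illustrative: the field names, roles, and project naming are assumptions, not our actual internal schema or APIs.

```python
from dataclasses import dataclass, field

@dataclass
class DataMeshEnvironment:
    """Hypothetical DME descriptor; fields are illustrative, not the real schema."""
    name: str                 # e.g. "shipping-analytics"
    business_domain: str      # the business domain that inhabits the DME
    owners: list[str]         # the accountable team
    gcp_project: str          # isolated storage and compute per DME
    platform_access: dict[str, str] = field(default_factory=dict)  # tool -> role

def provision_dme(dme: DataMeshEnvironment) -> None:
    # Each step below stands in for a call to a platform API:
    # infrastructure, tool access, catalog registration, incident management.
    print(f"creating project {dme.gcp_project} for domain {dme.business_domain}")
    print(f"granting {dme.owners} access in platform tools: {dme.platform_access}")
    print(f"registering '{dme.name}' in the data catalog and incident tooling")

provision_dme(DataMeshEnvironment(
    name="shipping-analytics",
    business_domain="shipping",
    owners=["team-shipping-data"],
    gcp_project="dme-shipping-prod",
    platform_access={"orchestrator": "editor", "catalog": "owner"},
))
```

The point of treating the DME as one declarative unit is that deploying a new node on the mesh becomes a repeatable, automated operation rather than a bespoke project.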

Homogeneous tooling and technology

It is said that responsibility cannot be delegated. Our main goal and priority has always been to take good care of our data. While allowing each domain to work with freely chosen technologies or tools sounds great, we were not ready to face the risks and challenges involved in such an approach. To give you some examples of these risks: How could we guarantee that we would not end up with siloed, non-interoperable technologies? How would we manage and oversee governance and good practices within the technical production itself? What would happen if someone left the company, leaving behind custom tooling different from everything else? By setting a technical framework for the DMEs and addressing these risks homogeneously, we were able to boost our capabilities for governance, observability, DME update deployments, mentoring, and upskilling. We are aware that these risks and challenges will eventually have to be dealt with, and we may not keep working with homogeneous technology forever. Yet, as I mentioned, our main priority is to take the best possible care of our data, and this approach provided a safe starting point.

A single entry point for final users

We should never forget that the purpose of this entire framework is to benefit the final users, the ones who need to consume data products in an easy and agile way. These users should not have to care about who publishes the data, or about where to find and consume it depending on who the producers are. The decisions to separate computing, storage, projects, teams, and tool privileges across domains should not impact the user experience. To that end, we created a single entry point for the vast majority of tables: a single dataset of views. These views reference every single productive table from the DMEs, and each DME can configure different privileges and audiences for them depending on the case. This feature also allows us to migrate data products within and across DMEs when necessary. What the users see is a layer of “pointers” or “mirrors” to the actual data; we can move that data without affecting already existing queries.
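As an illustration of this pointer layer, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, and table names are made-up assumptions, and real provisioning is of course automated by the platform rather than hand-written like this.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# Hypothetical names: a physical table owned by one DME, and the single
# shared dataset that final users actually query.
source_table = "dme-shipping-prod.curated.shipments"
entry_point = "meli-warehouse.analytics.shipments"

# The view is the "pointer": users query `entry_point` and never need to
# know which DME project currently holds the physical table.
view = bigquery.Table(entry_point)
view.view_query = f"SELECT * FROM `{source_table}`"
client.create_table(view, exists_ok=True)
```

If a data product later migrates to another DME, only the view definition changes; every existing query against the entry point keeps working.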

Simplification of data products development

In order to scale our existing knowledge and data engineering good practices from a single team to a wide and ready-to-scale audience, we made changes to some of our tools. One of our goals was to lower the technical entry barrier for less experienced users. For instance, in our data pipeline orchestration tool we simplified and combined a number of different steps that could work together in user-created jobs, focusing on usability and minimizing the intermediate steps and time required to configure them. This simplification makes the tool easier to use, reduces human errors, and improves the overall user experience.
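Purely as an illustration of what “combining steps” can look like, here is a hypothetical declarative job spec. Our in-house orchestrator’s actual interface is internal and differs, so every key below is an assumption.

```python
# Purely illustrative: a single declarative spec bundling what used to be
# several separate configuration steps (extract, transform, load, checks).
job_spec = {
    "name": "daily_shipments_snapshot",
    "schedule": "0 6 * * *",                       # cron: every day at 06:00
    "source": {"type": "query", "sql": "sql/shipments.sql"},
    "destination": {"table": "dme-shipping-prod.curated.shipments"},
    "quality_checks": ["not_null:shipment_id", "freshness:24h"],
    "sla_hours": 8,                                # feeds data uptime metrics
}
```

One spec, one screen, one review: the user declares intent, and the platform handles the plumbing between steps.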

Another very important step was holding several meetings with our data engineers and data architects to define and design a release process for every data product going to production. The work was interesting and dynamic, drawing on various design methodologies. Our aim was to automate rules and controls within our platform, essentially emulating what data engineers were doing on a daily basis. The outcome not only let us sleep at night, but also laid the groundwork for continuous improvement from then on.
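To give a flavor of what “automating what engineers did by hand” means, here is a hedged sketch of a release gate. The specific rules and field names are hypothetical examples, not our actual checklist.

```python
# Hedged sketch of an automated release gate: each rule emulates a check
# data engineers used to perform by hand before a product went productive.
def validate_release(product: dict) -> list[str]:
    """Return a list of blocking errors; an empty list means 'releasable'."""
    errors = []
    if not product.get("owner"):
        errors.append("the data product must declare an owning team")
    if not product.get("description"):
        errors.append("the table must be documented in the catalog")
    if not product.get("sla_hours"):
        errors.append("an SLA must be defined before going productive")
    name = product.get("name", "")
    if name != name.lower():
        errors.append("table names must follow lowercase naming conventions")
    return errors

# Example: this candidate would be rejected for the missing SLA.
print(validate_release({"name": "shipments", "owner": "team-shipping-data",
                        "description": "Daily shipment snapshots"}))
```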

Observability to boost Governance

As you already know or may imagine, governance plays a crucial role in this new strategy. True to our data-driven culture, we created a team focused on observability. To boost our governance capabilities, we need to be data-driven ourselves and have abundant data about our DMEs, the lifecycle of our data products, and a comprehensive set of associated metrics.
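As an illustrative sketch of what such observability data might look like when aggregated per DME, consider the following; the metric names are assumptions rather than our real governance model.

```python
# Illustrative only: aggregate simple governance metrics for one DME.
def dme_health(dme: str, jobs: list[dict], tables: list[dict]) -> dict:
    """Summarize pipeline reliability and catalog hygiene for a single DME."""
    failed = sum(1 for j in jobs if j["status"] == "failed")
    documented = sum(1 for t in tables if t.get("description"))
    return {
        "dme": dme,
        "job_failure_rate": failed / len(jobs) if jobs else 0.0,
        "documented_tables_pct": 100 * documented / len(tables) if tables else 0.0,
        "tables_with_sla": sum(1 for t in tables if t.get("sla_hours")),
    }

print(dme_health("shipping",
                 jobs=[{"status": "ok"}, {"status": "failed"}],
                 tables=[{"description": "shipments", "sla_hours": 8}, {}]))
```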

While I won’t go into detail about our developments in governance, I highly recommend and encourage you to read Part 2, where my teammate and colleague Vanina Bertello provides valuable insights. Just to drop a few teasers: we have developed automated criticality levels for data artifacts, identified and classified master data, and built many other very helpful metrics and tools.

We finally implemented Data Mesh @ MELI with the first four DMEs for business domains in July 2022. We no longer have a single road into the city; we have multiple MELI roads, and we are definitely not stuck on them. In just 18 months, we have scaled up to 90 working DMEs with more than 6,000 tables published through them, just to give you a hint of the substantial impact this strategy is achieving. Our team did pull it off, and it’s not just us who are delighted about it. Our data engineers and the different business domains no longer feel frustrated, but satisfied and encouraged. The road ahead of us is now loaded with new challenges… but this is just another beginning.

References

[1]: Dehghani, Z. (2022). Data Mesh: Delivering Data-Driven Value at Scale. O’Reilly.

[2]: Dehghani, Z. (2019). How to move beyond a monolithic data lake to a distributed data mesh. MartinFowler.com. https://martinfowler.com/articles/data-monolith-to-mesh.html
