How Data Artifacts Criticality is Enhancing Data Governance at Meli

Marielle Côrtes
Published in Mercado Libre Tech
Jul 2, 2024 · 5 min read

Data Artifacts Criticality is one of the most recent and highly valued data products created by the Data Governance Team at Meli.

Ever heard of Data Mesh? If your answer is no, let me explain. Data Mesh is a concept that tackles decentralized data management, treating data as a domain-oriented product. It’s about providing autonomy, knowledge, and tools to individuals within business domains so they can produce and manage their data in a decentralized environment.

Sounds promising, doesn’t it? Indeed, and it’s a reality at Meli! Currently (June 2024), we have around 200 domains across business units generating relevant data products, even without the conventional roles or technical backgrounds typically associated with traditional data careers. Most often, it’s business professionals crafting data products to address their specific needs, which is quite impressive!

For more details on the Data Mesh developed and implemented at Meli, I recommend these enlightening articles authored by the brilliant minds behind the project:
1 — Data Mesh @ MELI: Building Highways for Thousands of Data Producers by Ignacio Weinberg
2 — Data Mesh @ Meli: Empowering data owners by Vanina Bertello

But what about the quality and integrity of these data products? Who is responsible for providing observability over them? We are! 😊 As the Data Governance Team, we gather, organize, analyze, suggest, and provide visibility into everything involved in data production.

This encompasses Integrity, Quality, Relevance, Availability, Costs, Criticality, Lineage and more.

The Criticality initiative was designed to revolutionize how we manage the criticality of data products within our Data & Analytics area. Its primary aim is to prioritize these products properly by precisely classifying their criticality levels, using consistent evaluation criteria and continuous assessment over time. Understanding the criticality of data products also helps us define future obligations, responsibilities, and guarantees associated with them, ensuring a more structured and reliable data management system.

Previously, the Data Engineering team managed job criticality in a centralized manner, relying on each analyst’s judgment for the assignment. This led to inconsistencies: not all jobs had an associated criticality, and the rationale behind a given assignment was often known only to the analyst who made it. Different analysts applied different criteria, causing further disparities. Additionally, criticality was set when a job was created and remained unchanged throughout its lifecycle, so outdated jobs could keep a high criticality while important jobs could be assigned a low one.

With the large-scale decentralization brought by Data Mesh, rethinking the preexisting criticality model and converting it into an automatic multi-point evaluation became fundamental to support one of Data Mesh’s most important pillars: observability. Today, with over 10,000 data products generated by more than 200 different teams, a centralized or subjective assessment by an individual would be impossible.

With this understanding, we developed Data Artifacts Criticality, an automated process based on a Machine Learning model that assigns one of five criticality levels to data products such as dashboards, tables, and jobs developed within the Data Mesh environment. Data Artifacts Criticality helps Data & Analytics teams by providing prioritization, scalability, and governance from the most to the least critical products.
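To make the output concrete, here is a toy sketch of what a five-level scale and a per-artifact record could look like. The names, levels, and fields are illustrative assumptions, not the actual schema.

```python
# Illustrative only: a hypothetical five-level scale and per-artifact record.
from dataclasses import dataclass
from enum import IntEnum

class Criticality(IntEnum):
    VERY_LOW = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    VERY_HIGH = 5

@dataclass
class DataArtifact:
    artifact_id: str
    artifact_type: str          # e.g. "table", "dashboard", "job"
    criticality: Criticality

example = DataArtifact("tbl_sales_agg", "table", Criticality.HIGH)
print(example)
```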

Deliverables of Data Artifacts Criticality

When we began developing Data Artifacts Criticality, our initial step involved creating numerous Key Performance Indicators (KPIs) to analyze the usage and relevance of our data products. These KPIs encompass metrics such as Lineage, Availability, Integrity, Relevance, and other categorizations. They provide insights into the behavior of our data products, including the frequency of access, the diversity of users accessing them, and even the number of their successor artifacts.
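As a rough illustration of this kind of KPI, the sketch below derives access frequency, user diversity, and successor count per artifact from hypothetical access-log and lineage tables; the column names and data are invented for the example.

```python
# A minimal sketch (not the actual pipeline) of deriving usage KPIs per artifact.
import pandas as pd

# Hypothetical access log: one row per read of an artifact.
access_log = pd.DataFrame({
    "artifact_id": ["tbl_sales", "tbl_sales", "dash_rev", "tbl_sales"],
    "user_id":     ["u1",        "u2",        "u1",       "u1"],
})

# Hypothetical lineage edges: artifact -> successor that consumes it.
lineage = pd.DataFrame({
    "artifact_id":  ["tbl_sales", "tbl_sales", "dash_rev"],
    "successor_id": ["dash_rev",  "job_agg",   "report_x"],
})

kpis = (
    access_log.groupby("artifact_id")
    .agg(access_count=("user_id", "size"), distinct_users=("user_id", "nunique"))
    .join(lineage.groupby("artifact_id").size().rename("successor_count"), how="left")
    .fillna({"successor_count": 0})
)
print(kpis)
```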

Initially, we experimented with developing a model using clustering algorithms, examining how products behaved in different clusters and how the values of the metrics considered essential for criticality categorization were distributed. However, we soon realized that clustering was not yielding the desired results. As a consequence, we shifted our focus to building criticality targets for training with predictive algorithms. After extensive testing, we opted for a well-known and established algorithm: Random Forest, an ensemble learning method that builds many decision trees and aggregates their predictions, for example by majority vote, to produce a single result. This approach enhances the accuracy and reliability of the predictive model.

Random Forest Decision Trees
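For readers who want to see the general technique, here is a minimal, self-contained sketch of training a Random Forest classifier on synthetic KPI-style features to predict one of five criticality levels. It is not the production model, and the features and labels are fabricated for the example.

```python
# A minimal sketch, not the production model: Random Forest on synthetic features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Hypothetical features: access count, distinct users, successor count.
X = rng.integers(0, 100, size=(n, 3))
# Hypothetical labels 1..5, loosely tied to usage so the model has signal.
y = np.clip(X.sum(axis=1) // 60 + 1, 1, 5)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```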

After implementing the machine learning model, we established specific rules to ensure consistency in product criticality throughout its lineage. This includes inheriting the criticality of successor products, thereby standardizing criticality based on the entire lineage across the data ecosystem. This approach guarantees that all predecessor products of a critical process are also deemed critical, ensuring comprehensive criticality management.

Example of Data Artifacts Criticality Lineage
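A simplified way to picture the inheritance rule: each predecessor takes on at least the criticality of its most critical successor, propagated transitively up the lineage. The sketch below uses an invented lineage graph and levels, not real artifacts.

```python
# Simplified sketch of the lineage rule: predecessors inherit the criticality
# of their most critical successor, propagated transitively upstream.
from collections import defaultdict

# Hypothetical lineage: artifact -> list of direct predecessors feeding it.
predecessors = {
    "dash_exec": ["tbl_sales_agg"],
    "tbl_sales_agg": ["tbl_sales_raw", "tbl_fx_rates"],
}
# Model-assigned criticality levels (1 = lowest, 5 = highest), hypothetical.
criticality = defaultdict(int, {"dash_exec": 5, "tbl_sales_agg": 2,
                                "tbl_sales_raw": 1, "tbl_fx_rates": 1})

def propagate(artifact: str) -> None:
    """Push an artifact's criticality onto all of its predecessors."""
    for pred in predecessors.get(artifact, []):
        if criticality[artifact] > criticality[pred]:
            criticality[pred] = criticality[artifact]
        propagate(pred)

propagate("dash_exec")
print(dict(criticality))  # upstream tables are now level 5 as well
```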

Grasping the criticality of artifacts empowers individuals to refine their analyses and deepen their understanding through access to comprehensive documentation of pertinent products. It is essential to emphasize that the Data Artifacts Criticality acts as a potent catalyst for accelerating processes and enhancing intelligence at Meli, thereby directly driving business advancement.

Regarding the delivery and distribution of Data Artifacts Criticality, we have developed an API (Application Programming Interface) that houses artifact details. This API seamlessly integrates with various applications, enabling the publication of criticality information in the Data Catalog. This ensures a comprehensive repository and catalog for documenting all our data products.
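From a consumer’s point of view, reading criticality might look like the call below. The endpoint path and response shape are assumptions for illustration, not the real API contract.

```python
# Illustrative consumer call; the URL and payload shape are hypothetical.
import requests

resp = requests.get(
    "https://data-governance.internal/api/v1/artifacts/tbl_sales_agg/criticality",
    timeout=10,
)
resp.raise_for_status()
info = resp.json()
# Assumed shape: {"artifact_id": "...", "criticality": 4, "updated_at": "..."}
print(info["criticality"])
```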

We actively employ this feature across a wide range of products within our domain, treating it as a core capability. The integration includes a retry model that automatically retries processes upon failure, alongside new methods for calculating uptime. For monitoring, we have built dashboards that track the evolution of product criticality over time, providing valuable insight into their performance and reliability.
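The retry idea can be pictured as a small wrapper like the one below, which re-runs a failing step with exponential backoff. This is a generic sketch, not the actual integration code.

```python
# Generic sketch of retry-on-failure with exponential backoff.
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Run fn, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Usage: wrap a publish step so transient failures are retried automatically.
with_retries(lambda: print("publishing criticality to the catalog"))
```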

All of this enables us to make informed decisions about these products and effectively communicate with the responsible individuals and departments, sharing best practices and controls. These controls encompass alert configurations and other measures to ensure the availability and quality of data products.

We’ve gathered a significant amount of feedback and realized that Data Artifacts Criticality is contributing in unforeseen ways. It assists various areas, particularly in aiding domain owners with the table migration processes, strategies to maintain uptime goals, and other aspects of data governance.

The development of Data Artifacts Criticality has empowered everyone by enabling access to and interaction with the most relevant data products. This has transformed criticality into a key feature within the Data & Analytics ecosystem at Meli, helping us visualize, prioritize, and focus resources on what is most relevant for our business.
