Moving from Data Lakes to Data Mesh: Why Companies Will Continue to Decentralize Their Data

Jonas Pfefferle
Data Reply
Oct 18, 2022

Introduction
Data is everywhere. Every single person on the planet produces more than 1 MB of data per second, every single day. Given the sheer volume of new data created, organizations long ago started to tackle the problem of storing and analyzing this pile of potentially valuable information. Starting with Data Warehouses in the 1980s, built to transform data from operational systems into decision-support systems, via Data Lakes, which promised to be the Swiss Army knife of data storage, to Lakehouses, which aim to combine the best of both worlds: organizations are still searching for the holy grail of data storage and access. According to the 2021 “Executive Study on Big Data and AI”, the number of companies spending more than $50M on their data initiatives has increased by more than 22% since 2018. Yet only one out of four companies interviewed claims to have successfully built a data-driven organization.

Number of companies spending more than $50M on data initiatives

It seems another paradigm shift is needed: moving from monolithic data platforms to a distributed Data Mesh. Zhamak Dehghani first introduced the concept in 2019 and has since received widespread recognition for it. So, what is a Data Mesh, really?

To put it in a nutshell: Data Mesh is a strategic approach rather than a precise architecture. It allocates data ownership to domain-specific groups that serve, own, and manage data as a product. Data Mesh emphasizes organizational agility by empowering data producers and data consumers to access and manage data without the trouble of delegating to a central Data Lake or Data Warehouse team.

The best way to grasp the holistic concept behind a Data Mesh is to understand its core ideas:

Core principles of Data Mesh architectures
  1. Domain-oriented ownership & decentralization:
    A domain is the aggregation of people in an organization who work around a common functional business purpose, for example after-sales. Following the Data Mesh principles, the domain should own the whole lifecycle of its data, from creation to provisioning.
  2. Data as a product:
    A data product is a collection of domain-oriented data for a specific business or analytical use case. Domains own and produce data products, which are consumed by downstream domains or users to create business value. Data products allow for a clear line of ownership and responsibility and can be consumed by other data products or end consumers.
  3. Self-serve data infrastructure as a platform:
    The core idea is to separate data and technology. A central infrastructure engineering team provides a self-serve data platform that members of the domains can easily use to create and manage their data products. The better the platform abstracts away the technology, the more autonomous the domains can be.
  4. Federated governance:
    Secure and regulated access to data products is a must. In a decentralized world, fine-grained access control can be defined globally, but applied at the time of access to each individual data product (see the sketch below).

One of the biggest shifts from monolithic data platforms towards a distributed Data Mesh is the convergence of data and product thinking. Teams forming around data domains own the whole process of ingesting, cleaning, and provisioning their data product. This requires cross-functional teams that take responsibility for their data domains.

Representation of a domain data team

Domain data teams are supported by a self-serve data infrastructure platform that is run by a central team. This avoids duplicating the effort and skills required to operate the data pipeline technology stack and infrastructure in each domain. The key to building this central service is to keep it domain-agnostic, without any domain-specific concepts or business logic. Moreover, it is crucial that the platform hides the underlying complexity and provides the data infrastructure components in a self-service manner. A hedged sketch of what such an interface could look like follows below.
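
As an illustration of the self-serve idea, here is a hedged Python sketch of a domain-agnostic platform interface. The class and method names are assumptions made for illustration; a real platform would back them with managed cloud services.

```python
# Hypothetical sketch of a self-serve platform interface: the central team
# exposes domain-agnostic building blocks; domain teams compose them without
# touching the underlying infrastructure.
class SelfServePlatform:
    def provision_storage(self, domain: str, product: str) -> str:
        # In practice this might create an object store bucket or a schema.
        return f"s3://{domain}/{product}"

    def provision_pipeline(self, source: str, sink: str) -> dict:
        # In practice this might deploy a managed ingestion job.
        return {"source": source, "sink": sink, "status": "running"}

platform = SelfServePlatform()
sink = platform.provision_storage("after-sales", "warranty-claims")
pipeline = platform.provision_pipeline("crm-events", sink)
```

Note that the interface itself knows nothing about after-sales or warranty claims; domain concepts appear only as opaque names, which is exactly what keeps the platform domain-agnostic.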

Combining these elements leads us to the Data Mesh architecture from a bird’s-eye view.

Data Mesh architecture from a bird’s-eye view

Pros & Cons
Data Mesh provides more autonomy and flexibility to the actual owners of the data by giving them a standardized self-serve platform. As a result, the load on data engineers is reduced: they no longer need to serve every single data consumer requirement. This approach gives data producers more control over their own data, paving the way for more innovation and faster iteration. It also avoids single points of failure thanks to its distributed nature, where each service owns its subset of data. This distribution of ownership also creates more trust in the data and reduces complexity compared to a centralized solution.

On the other hand, Data Mesh might not be the right choice for you. Firstly, data engineering support is needed to delegate the ownership of data (pipelines, testing, etc.) when a company adopts Data Mesh on top of its existing data platform. This is an overhead that should be considered before undertaking the implementation. It is also possible that you are not operating at a scale that justifies distributing data domains to independent teams and introducing coordination overhead. If you are a small to medium-sized company, your data analytics needs might be easier to meet with a centralized data team, whereas for a large company with thousands of employees at multiple geographical locations, scaling bottlenecks will arise naturally with a centralized data team.

The decentralized nature of Data Mesh inherently works against the ability to make top-down decisions in every distributed domain. It requires trusting the distributed teams to find the best solutions by themselves. This might prove difficult for some companies, especially regarding topics like security, privacy, or compliance. An environment of mutual trust and autonomy is needed, since topics like these become everyone’s responsibility. The culture of an organization therefore also plays a key role in whether Data Mesh is the right option.

Use Cases
Data Mesh can be considered a major turning point in data platform engineering: this change of paradigm can be seen as the architectural equivalent of the shift from monolithic applications to microservices. It is an especially useful architectural paradigm for use cases where connecting distributed data sets in a way that enables data analytics at scale is important.

If you have massive amounts of raw, structured, and unstructured data; if you need to archive this raw data; if you are looking for an affordable big data storage solution; and if your company is not big enough to require the collaboration of multiple teams across locations or time zones, then a Data Lake might be the right option for you.

On the other hand, if your company is big enough to require a distributed data solution; if you need real-time reporting; if you need a way to gather real-time data from various disconnected systems and to process it; and if you need all of this to happen without a centralized team of data engineers, then a Data Mesh might be the better solution for you.

Automotive companies are fairly unique from a data perspective, as their data landscape combines varied data sources, ranging from R&D to CRM, with complex regulatory and security requirements. These impose strict boundaries and separation on the data while also necessitating its sharing. When designing the next generation of data infrastructure for an automotive company, we were presented with the challenge of providing scalable and flexible self-service infrastructure while respecting the privacy requirements and data boundaries. The principles and architectural patterns of the Data Mesh allowed us to provide a shared data infrastructure while separating the different domains on both the data and the tool level. We implemented a federated access control and data access layer on top of a share-only-compute architecture. This allowed different domains to onboard their data and workflows onto the platform in a short amount of time and to meet regulatory requirements. All this paved the way for data sharing inside the organization.

Needed Competencies
Building a Data Mesh can be a challenge, since many pieces must fall into place: a microservices-based data platform that is ready to scale and resilient to different business use cases, data users, and applications. Years of implementations have taught us best practices, and we have illustrated how innovative design patterns from different IT sectors converge in this new concept of a flexible data platform. To achieve such a goal, and to successfully overcome the difficulties of running a business in the world of data, you need competencies from diverse backgrounds:

  • Cloud experience: You can try to build the Data Mesh in your own data center, but there are so many software-as-a-service solutions ready to be used on AWS, Azure, and GCP that it seems unnecessary to spend time (and money) on a local deployment.
  • Modern scalable platform: Kubernetes, serverless, and service mesh are just a few buzzwords you might be remarkably familiar with. A microservices-based distributed architecture is the backbone of the Data Mesh concept, so experience in those fields will surely be helpful.
  • Batch and streaming world: Technologies like Kafka and Pulsar are a hard requirement when processing data in real time. The dualism between tables and streams is an interesting topic to deepen your knowledge on if you are not yet familiar with it; a toy illustration follows this list.
  • DevOps: Infrastructure as code, automated pipelines, jobs, error reporting, dashboards, and more. A strong DevOps culture will make your Data Mesh journey much smoother.
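
To illustrate the table/stream dualism mentioned above, here is a small, self-contained Python toy. No Kafka is required for the illustration, although Kafka’s KStream/KTable abstractions embody the same idea: a table is what you get by folding a stream of change events, and a table’s changelog is itself a stream.

```python
# Toy illustration of the table/stream dualism.
events = [  # a stream: an append-only sequence of (key, value) updates
    ("customer-1", "bronze"),
    ("customer-2", "silver"),
    ("customer-1", "gold"),   # a later event overrides earlier state
]

table = {}                    # a table: the current state per key
for key, value in events:     # replaying the stream materializes the table
    table[key] = value

print(table)  # {'customer-1': 'gold', 'customer-2': 'silver'}
# Conversely, emitting every table update as an event yields the changelog
# stream back: the two representations are interchangeable.
```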

Beyond the technicalities, there are also certain soft prerequisites on the company side that are worth mentioning:

  • Organizational: A strong separation between the product and engineering departments is a drawback, because a Data Mesh works better when all business processes are involved in the implementation. Better usage of data is key for the longevity of the company, and that is why every department must take part.
  • Previous experience in managing Data Lakes or Data Warehouses: Some of the solutions Data Mesh proposes are based on years of struggling with frequent problems like data usage, pipeline reusability, data ownership, bottlenecks in the databases, etc. Experience in resolving those technical conflicts will be useful when choosing the architecture tailored to your business needs.

As reiterated throughout this article, Data Mesh is still a young concept, and it may not be obvious how to reach your goals straight away. Killing the data monolith cannot be done in one day, and as you might already know, the best way to eat a whale is one bite at a time.

Migrating to Data Mesh
At this point, you may be thinking: “Okay, this makes sense, but how do I implement a Data Mesh?” There are various ways of migrating to a Data Mesh depending on your company’s needs; let’s take a look at the high-level implementation steps.

  • Step 1: Create addressable data products so that data can be quickly found. Move your Online Transaction Processing (OLTP) workloads to microservices and add metadata, a data catalog, lineage, etc. to the data. Use standardized bucket and resource names for data products, e.g. s3://my-domain/data-service-a/resource/date (a sketch of this addressing scheme follows this list). This is already a big step towards standardized and easy data consumption. To ensure that the data is always accessible, add Service Level Agreements (SLAs) to the endpoints and monitor them. Redirect query engines and business intelligence tools to use the new, independent, and addressable data products. Use the same strategy for creating standard schemas and views for data warehouses, making sure you adhere to established naming conventions. Your data platform team will oversee this phase and will continue to use a centralized strategy; centralized ownership is addressed in the later steps.
  • Step 2: Data catalog (discoverability) and metadata. Enhance the metadata and the data catalog to make your organization’s data products more discoverable, so that anybody can find them. Within your organization, you need a place where you can search for and discover data products. You also need a process that allows data owners and consumers to request and grant access to data products without involving a central team. Work on enhancing the data product features by adding data quality tests, lineage, monitoring, and other features to both data at rest and data in transit.
  • Step 3: Implement Domain-Driven Design by breaking up the data monolith. This is an essential phase. In order to move toward a decentralized design, you should aim to give ownership to the domain team that creates the data. Therefore, start transferring accountability and ownership closer to the source. Data assets, ETL pipelines, quality assurance, testing, etc. must all be owned by each team. Remember that federated governance is still required for data standardization, security, and interoperability; you simply need to make sure that these capabilities are built as services in order to develop a self-service platform. You can incorporate DataOps practices during this phase, as well as enhance observability and self-service capabilities. You may also integrate your OLTP and OLAP processes and technologies in this phase: by integrating batch and real-time data under the same data governance, quality, and discoverability, you are essentially integrating the operational and data planes on the same tech stack. Consider ideas you can bring to fruition, such as observability, automation, and streaming platforms. For example, if your team is already using Kafka, leverage it to build ETL pipelines for your data products. If your team is using microservices, encourage them to own the data as well by defining and managing the schema and its changes. The goal is to gradually transition to a decentralized design, leveraging stream engines as a backbone, in order to eventually achieve consistency between the old and new systems.
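
As referenced in Step 1, here is a hedged Python sketch of the addressing convention together with a crude availability probe. The bucket and key names mirror the article’s example and are purely illustrative; a real setup would wire the probe into proper SLA monitoring.

```python
# Sketch of Step 1 under assumed names: build a standardized, addressable
# path for a data product and probe that it is reachable (a crude SLA check).
import boto3                                   # assumes AWS credentials are configured
from botocore.exceptions import ClientError

def product_address(domain: str, service: str, resource: str, date: str):
    # Mirrors the article's convention: s3://my-domain/data-service-a/resource/date
    return domain, f"{service}/{resource}/{date}"

bucket, key = product_address("my-domain", "data-service-a", "resource", "2022-10-18")

s3 = boto3.client("s3")
try:
    s3.head_object(Bucket=bucket, Key=key)     # cheap existence/availability check
    print(f"s3://{bucket}/{key} is addressable")
except ClientError:
    # In a real setup, feed this into monitoring/alerting for SLA tracking.
    print(f"SLA check failed: s3://{bucket}/{key} not reachable")
```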

Repeat the approach outlined above to further decentralize the legacy data monolith once you have created your first “data microservice.”

Again, this is only one of the ways of migrating to a Data Mesh. You may decide to adopt some of the features and omit others when migrating. Choose the model that best suits your company and its needs. Do not forget to design automation, observability, APIs, SLAs, security, audits, etc. for data products just as you would for any other service in your company.

Outlook
In this first part we covered an introduction to Data Mesh, discussed use cases, pros and cons, and outlined the needed competencies and how to begin a migration. In the second part of our blog post we will address further specific topics, ranging from the cost of implementing and running a Data Mesh to quality and performance issues, data governance, and the organizational changes required.

At Data Reply we support our customers in architecting and implementing modern Data Mesh solutions. Feel free to talk to us about any related topic.

Authors: Antonio Di Turi, Ayhun Tekat, Komal Lalwani, Jonas Pfefferle
