Why Data Mesh?

Key reasons for data decentralization.

Dr. Marian Siwiak
Between Data & Risk
4 min read · Nov 21, 2022


This article discusses how Data Mesh relates to the big-data hangover and gives a couple of reasons for decentralizing data ownership and usage. It is an extract from “Data Mesh in Action” by Manning Publications, the first book on the implementation of the Data Mesh paradigm, which I co-authored with Jacek Majchrzak, Sven Balnojan, and Mariusz Sieraczkiewicz.

Cover of “Data Mesh in Action” by Majchrzak, Balnojan, Siwiak and Sieraczkiewicz (Manning).

Data Mesh heritage

Over the past thirty years, most data architectures were designed to integrate multiple data sources: central data teams merged data from all kinds of source systems and provided harmonized data sets to users, who in turn tried to use them to drive business value.

Yet, for over a decade now, the big-data hangover has plagued companies of all sizes. Data environments struggle with:

  • the scalability of the solutions,
  • completeness of the data,
  • accessibility issues, etc.

This might be familiar to some of you. Some things simply do not seem to work out. Dozens of reports and dashboards deliver little value compared to the cost of creating and maintaining them. Data science projects stay stuck in the “prototype” phase, and teams running data-intensive applications face a steady stream of data-related problems. Working with data feels disproportionately hard compared to the effort it takes to get a software component running.

Where does the scalability problem come from?

One of the reasons for the scalability problem is the proliferation of data sources and data consumers. An obvious bottleneck emerges when one central team manages and owns data along its whole journey: from ingestion through transformation and harmonization to serving it to all potential users. Splitting the team along the data pipeline does not help much either. When engineers working on data ingestion change anything, they need to inform the group responsible for transformation; otherwise, the downstream systems may fail or process the data incorrectly. This required close collaboration between the engineers leads to tight coupling of all data-related systems.
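A minimal sketch of this coupling, using hypothetical field names: a downstream transformation hard-codes the schema produced by the ingestion team, so any upstream rename breaks it unless the teams coordinate.

```python
# Hypothetical example of pipeline coupling between two teams.
# The ingestion team emits events with this (assumed) schema:
raw_event = {"customer_id": 7, "amount_eur": 19.99}

def to_report_row(event: dict) -> dict:
    # The transformation team depends directly on the ingestion team's
    # field names; if ingestion renames "amount_eur", this raises KeyError.
    return {"customer": event["customer_id"], "revenue": event["amount_eur"]}

print(to_report_row(raw_event))  # works only while both teams stay in sync
```

The two functions live in different teams' codebases, yet neither can change independently; that is the tight coupling the central-pipeline setup creates.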

The other problem arises from the monolithic nature of data platforms such as warehouses and lakes. These platforms often lack the structural diversity needed to reflect the reality encoded in domain-specific source data. Moreover, the enforced flattening of data structures reduces the ability to generate valuable insights, as crucial domain-specific knowledge gets lost in these centralized platforms.

Example

A car-parts manufacturing company was buying data on the failures of different parts. Even though the provider had information on part provenance, i.e. the car model a part was installed in, the buyer had no data models capable of storing this information. As a result, components were analyzed in isolation, hampering R&D’s attempts to understand the bigger picture.
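The example above can be sketched as two schemas; all field names here are hypothetical. The centrally harmonized record has no slot for provenance, so that information is silently dropped at ingestion, while a domain-owned schema can keep it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical centrally harmonized schema: fields that don't fit the
# shared model (like the car model a part was installed in) are dropped.
@dataclass
class FlattenedFailureRecord:
    part_id: str
    failure_date: str
    failure_mode: str
    # no provenance field -- the provider's information is lost

# Hypothetical domain-owned schema: the domain keeps the field it cares about.
@dataclass
class DomainFailureRecord:
    part_id: str
    failure_date: str
    failure_mode: str
    installed_in_model: Optional[str] = None  # part provenance preserved

record = DomainFailureRecord(
    "brake-pad-42", "2022-03-01", "wear", installed_in_model="Sedan X"
)
print(record.installed_in_model)
```

With the flattened schema, R&D can only group failures by part; with the domain schema, it can also correlate failures across the car models the parts were installed in.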

Two more interwoven factors exacerbate the problems described above:

  • unclear data ownership structure,
  • blurred responsibility for data quality.

As data travels through different specialized teams, it loses its connection to its business meaning, which means that developers of centralized data processing systems and applications can’t, and won’t, fully understand its content. Yet data quality cannot be assessed in isolation from that meaning.

Similar problems have been recognized in other areas of software engineering and have resulted in the emergence (and success!) of Domain-Driven Design and microservices. Applying similar thinking, i.e. a focus on data ownership and shared tooling, to data engineering led to the idea of the Data Mesh.

Just to remind you:

Data Mesh Definition

The Data Mesh is a decentralization paradigm. It decentralizes the ownership of data, the transformation of data into information, and data serving.

It aims to increase the value extraction from data by removing bottlenecks in the data value stream.

The Data Mesh paradigm is guided by four principles, which help make data operations efficient at scale: domain ownership, domain data as a product, federated computational governance, and a self-serve data platform. Data Mesh implementations may differ in scope and in the degree to which they apply these principles.

To read more about the Data Mesh paradigm, check my previous article.

Why decentralize your data?

To sum up, we see three main reasons why the data world is in need of decentralization in the form of the Data Mesh:

  • With the proliferation of data sources and data consumers, a central team in the middle creates an organizational bottleneck.
  • With multiple data-emitting and data-consuming technologies, central monolithic data storage creates a technological bottleneck, and much information is lost because of it.
  • Both data quality and data ownership are only implicitly assigned, which causes confusion and a lack of control in both cases.

This was an extract from “Data Mesh in Action” by Manning Publications.

To learn more about Data Mesh and the book, check out this episode of the “Between Data & Risk” podcast, which I host together with Artur Guja.
