Data Mesh — Definition.

Inclusive and thought-through.

Dr. Marian Siwiak
Between Data & Risk
6 min read · Nov 14, 2022


This article introduces the concept of Data Mesh and discusses the core ideas of the paradigm. It is an extract from “Data Mesh in Action” by Manning Publications, the first book on the implementation of the Data Mesh paradigm, which I co-authored with Jacek Majchrzak, Sven Balnojan, and Mariusz Sieraczkiewicz.

Preface

Creating a satisfactory definition of Data Mesh was, in my opinion, the most difficult part of writing the above-mentioned book. It involved countless hours of lively discussions, challenging and refuting arguments, and developing a hard-won but fragile consensus. Many times a plot-twisting counterexample appeared out of the blue and sent us all back to square one. Nevertheless, it was time well spent. Thanks, co-authors.

Data Mesh as a paradigm

The Data Mesh is a decentralization paradigm. It decentralizes the ownership of data, its transformation into information, and its serving. It aims to increase the value extraction from data by removing bottlenecks in the data value stream.

The Data Mesh paradigm is disrupting the data space. Large and small companies are racing to showcase “their Data Mesh-like journey” all over the internet. It’s becoming the new “thing” to try out for any company that wants to extract more value from its data. I regard the Data Mesh paradigm as a socio-technical architecture with an emphasis on the socio. The main focus is on people, processes, and organization, not on technology. Data Meshes can, but don’t have to, be implemented using the same technologies most current data systems run on.

But because the Data Mesh is still a topic of ongoing debate, with best practices and standards emerging only slowly, we saw the need for an in-depth book that covers both the key principles that make Data Meshes work and the examples and variations needed to adapt the paradigm to any company.

To start off, we will look at the core ideas of the Data Mesh as well as the benefits and the challenges associated with it.

Data Mesh 101

The Data Mesh paradigm is all about decentralizing responsibility.

For instance, the development team for the “Customer Registration Component” of a company also creates a dataset of “registered customers” for analytical purposes. They ensure it is in an easy-to-digest format by transforming the data, e.g. into a CSV file, and serve it the way the consumers would like it, e.g. on a central file-sharing system.

But this deceptively simple definition has a lot of implications because, in most companies, data is handled as a “by-product”. It is usually turned into value only after being output as a by-product into some form of storage and then pulled into a central technology by a centralized data team. Only then do decentralized actors pick the data up again, be that an analyst in the marketing department, a recommendation system used in a marketing campaign, or a frontend that displays the data.

Figure 1 depicts a common form of data architecture, both organizational and technical. It hopefully also shows its pitfalls.

Figure 1 Decentralized data emission and central transformation cause problems for the users due to unclear ownership and responsibility for data and its quality, among other things.

We can see here two levels of centralization:

  • The centralized technology in the form of storage and the usual data engineering and data science machinery.
  • The organizational centralization of the data team.

Since the development team considers the data a “by-product”, the ownership is implicitly assigned to the data team. But such central teams usually cannot keep up with the business domain knowledge spread across multiple data source domains. The developer responsible for Customer Registrations only needs to know the language and the updates inside that component and the associated business. The central data team, however, needs the same depth of understanding for every domain it serves, so its knowledge burden grows with the number of domains. Such overload makes it unlikely that the central team will understand even a single domain to the degree the responsible development team does. As a result, the data team cannot tell whether the data is correct, what it actually means, or what specific metrics might mean.

The Data Mesh paradigm shift calls for decentralizing the responsibility for data, that is, for treating data as an actual product. The situation depicted in figure 1 can turn into a Data Mesh if the development team provides the data product straight to the analysts through some standardized data port. It could be something as simple as a plain CSV file hosted in an appropriate cloud storage location that is easy for the analyst to access. Take a look at figure 2 to see this shift in action.

Figure 2 Decentralized data transformation makes data consumers happy by offering simple access to well-described data.
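To make the shift more tangible, here is a minimal Python sketch of a development team turning its operational records into a “registered customers” CSV. It is an illustration under stated assumptions, not the book's implementation: the RegisteredCustomer record, its fields, and the to_data_product_csv function are hypothetical names, and the step of uploading the file to shared storage is left out.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass
class RegisteredCustomer:
    """Hypothetical operational record owned by the registration team."""
    customer_id: int
    email: str
    registered_on: date
    internal_flags: int  # operational detail that analysts do not need


def to_data_product_csv(customers):
    """Transform operational records into an easy-to-digest CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # A stable, documented header is part of the product's contract.
    writer.writerow(["customer_id", "email", "registered_on"])
    for c in sorted(customers, key=lambda c: c.customer_id):
        # Internal fields are dropped; dates are serialized as ISO 8601.
        writer.writerow([c.customer_id, c.email, c.registered_on.isoformat()])
    return buffer.getvalue()


customers = [
    RegisteredCustomer(2, "bob@example.com", date(2022, 11, 2), 7),
    RegisteredCustomer(1, "alice@example.com", date(2022, 11, 1), 0),
]
product_csv = to_data_product_csv(customers)
print(product_csv)
```

The point of the sketch is where the transformation lives: the team that knows the domain decides which fields are meaningful and how they are formatted, so the consumer receives data that is already well described.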

A platform team could help provide a simple technology as a service to be used by development teams to deploy such data products, including the data ports, quickly.

Data producers focus on developing data products, which, together with data consumers, start to form connections and compose a network. We call such a network the mesh, where the individual nodes are data products and consumers.
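The network described above can be pictured as a small graph. The following Python sketch is only an illustration: the Mesh class, its methods, and the team and product names are assumptions made up for this example, not part of any Data Mesh tooling.

```python
from collections import defaultdict


class Mesh:
    """Toy model of a mesh: data products, their owners, and consumers."""

    def __init__(self):
        self.owners = {}                    # data product -> owning team
        self.consumers = defaultdict(set)   # data product -> its consumers

    def publish(self, product, owner):
        """Register a data product together with the team that owns it."""
        self.owners[product] = owner

    def subscribe(self, consumer, product):
        """Connect a consumer to an existing data product."""
        if product not in self.owners:
            raise KeyError(f"unknown data product: {product}")
        self.consumers[product].add(consumer)

    def edges(self):
        """Return all (product, consumer) connections in sorted order."""
        return sorted(
            (product, consumer)
            for product, subscribed in self.consumers.items()
            for consumer in subscribed
        )


mesh = Mesh()
mesh.publish("registered_customers", owner="customer-registration-team")
mesh.subscribe("marketing-analyst", "registered_customers")
mesh.subscribe("recommendation-system", "registered_customers")
print(mesh.edges())
```

Every product in this model has exactly one owning team, which mirrors the idea that ownership stays with the producing domain, while any number of consumers can connect to it.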

Even in our small example, we observe a significant operational paradigm change. It encompasses both a shift in the ownership responsibility (from a central data team to the development teams) and the technical challenge of making the new setup work.

Introducing changes in the operational paradigm will result in ripples affecting many areas of your business. To stop them from turning into chaos, we need guiding principles.

Before that, you must understand our definition of the “Data Mesh” and its non-technical aspects.

Definition of a Data Mesh

Zhamak Dehghani made an incredible effort to curate the idea of the Data Mesh starting in 2019. She provided us with all its critical elements and introduced a structured approach to the previously discussed paradigm shift.

Since Zhamak first introduced the “Data Mesh” approach, many “Data Mesh”-inspired examples, both business-derived and theoretical, have appeared. A lot of this content might not perfectly fit into the initial description of the Data Mesh framework, as presented in this article. Many businesses seem somewhat unsure about what exactly conforms to the definition of a Data Mesh and what doesn’t.

For this reason, we opt for solutions that are first and foremost practical. Therefore, the Data Mesh definition we coin below aims to be broad and functional, and it emphasizes decentralized efforts to maximize the value derived from data:

Data Mesh Definition

The Data Mesh is a decentralization paradigm. It decentralizes the ownership of data, the transformation of data into information, and data serving.

It aims to increase the value extraction from data by removing bottlenecks in the data value stream.

The Data Mesh paradigm is guided by four principles, helping to make data operations efficient at scale: domain ownership, domain data as a product, federated computational governance, and self-serve data platform. Data Mesh implementations may differ in scope and degree of utilization of these principles.

The goal of implementing a Data Mesh is to extract more value from the company’s data assets. That is also the reason we keep this definition so lightweight and inclusive in relation to the level at which each of the principles is followed. The following non-technical use case of a Data Mesh will hopefully explain what we mean by that.

This was an extract from “Data Mesh in Action” by Manning Publications.

To learn more about Data Mesh, check this episode of the “Between Data & Risk” podcast, which I host together with Artur Guja.


Dr. Marian Siwiak
Between Data & Risk

Your friendly neighborhood Data Guy. Co-author of "Data Mesh in Action" by Manning. Co-host of "Between Data & Risk" podcast.