Data Mesh: when to mesh and when not to

Prag Tyagi
towardsdataanalytics
7 min read · May 11, 2021

So this is a hot topic in the way integrated data platforms are designed these days: Data Mesh. Everyone wants to get underneath it and extract the maximum benefit from this paradigm shift.

So let's get into the details of this concept and some real-life instances where it has been implemented.

Data Mesh: a concept first articulated by Zhamak Dehghani. Its main highlight is how we can move from a centralized, monolithic data warehouse/data lake mindset to a more distributed mindset. Note that I have used the word mindset. What does it mean?

Data Mesh is not a technology. It is a practice.

Don't be put off by that heavy-sounding sentence. It means that in the original article Zhamak published (https://martinfowler.com/articles/data-monolith-to-mesh.html), there was no mention of which technologies or technical tools we need to implement this. What was described was a bouquet of practices and an ideology that data engineers and people interested in the data world should follow. The field is left open for people to interpret and implement it using the tools they feel are sufficient. Now let's dive deep into it.

Problems before Data Mesh

Standard Enterprise Data flow

So the above picture depicts a standard data flow. We have three main participants in it.

i) Source systems: systems where data is generated; the starting point of any data pipeline.

ii) Data/ML engineers: people who work on data processing and give the data meaningful form using various processing techniques.

iii) Consumers: this group comprises enterprise management, data scientists and analysts. They consume the processed data to take decisions or formulate strategies.

Now, what problem do you see here? It looks like a happy marriage, doesn't it? Well, not quite. You can notice the following main issues:

i) Domain knowledge: this information sits with the source team only. The engineering team does not possess domain knowledge, so if anything changes on the source side they have to rely on input from the source team.

ii) Consumers: they are totally dependent on the engineering team. This dependency makes them vulnerable in terms of latency in data arrival, data quality and governance.

iii) Engineering team: these poor folks become sandwiched between the source team and the consumers. If there are data quality issues, they have to give up their weekend. If something changes on the source side, they have to sit with the source team and reinvent the wheel.

Apart from all these issues, one major problem is the centralized, monolithic mindset itself. With more and more data platforms moving towards a centralized system, the clock is ticking. At some point these systems will hit a peak, and then they will have to scale out, scale up or overhaul their landscape. No doubt the emergence of the cloud has somewhat soothed this scenario, but there as well you get charged for what you use: with the passage of time, as your usage increases, your cost will increase no matter how much optimization you have at your disposal.

Let's look at the fundamental principles of Data Mesh:

Product thinking: think of data as a product, not a by-product.

Domain-driven ownership and architecture: ownership of data should be domain-driven, and the corresponding architecture should reflect that.

Infrastructure as a platform: instead of a dedicated data-housing platform, provide the underlying infrastructure as a self-serve platform. This creates more value.
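To make "data as a product" a little more concrete, here is a minimal illustrative sketch, not something from Zhamak's article, of what a domain team might publish alongside its data: a versioned, self-describing event plus product metadata such as the owning domain, output port and freshness SLA. All class, field and topic names here are hypothetical.

```java
// Illustrative only: a hypothetical "data product" contract a domain team could publish.
// The class, field and topic names are assumptions, not part of any standard Data Mesh API.
import java.time.Duration;
import java.time.Instant;

public class OrderPlacedDataProduct {

    // The data itself: a versioned, self-describing event emitted by the orders domain.
    public record OrderPlaced(String orderId, String customerId, double amount, Instant placedAt) {}

    // Product metadata: who owns the data, where it is served from, and what consumers can expect.
    public record ProductDescriptor(
            String owningDomain,       // e.g. "orders"
            String outputPort,         // e.g. a Kafka topic or bucket path
            int schemaVersion,         // bumped on breaking changes
            Duration freshnessSla) {}  // how stale the data is allowed to become

    public static ProductDescriptor descriptor() {
        return new ProductDescriptor("orders", "kafka://orders.order-placed.v1", 1, Duration.ofMinutes(5));
    }

    public static void main(String[] args) {
        // Consumers discover the product's owner, output port and SLA from the descriptor.
        System.out.println(descriptor());
    }
}
```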

Now, how do we implement it technically? Let's see it in the flesh.

Netflix: the Netflix data architecture comprises three sections.

i) Relational business data

ii) Media content

iii) Event-based streaming

Now, this event-based streaming area is the one where they have implemented Data Mesh.

Problems faced by Netflix team

High level Data Mesh implementation

More Magnified view

Kafka + Flink form the backbone of this architecture. Stream processing is handled by Flink and streaming of data is handled by Kafka, so the underlying source/sink connectors and the stream processor are a combination of Flink and Kafka.

Now you might be wondering how Data Mesh is actually implemented here. Well, look again. Each source is owned by a domain, and each domain has its own source connector, so the source team is domain oriented and has full control. The domain emits its data into a stream. Once data is in a stream, consumers have their sink connectors and processing jobs built on top of Flink; they process the data in these streams and consume what they need. They also publish data back into a stream, where it can be used by other domains (a rough sketch of such a consumer job follows after the list below). So what differences are we seeing here?

i) Data is treated as a product.

ii) Source teams are domain oriented, with control over the data they emit and where it is emitted.

iii) A decentralized approach: consumers can consume data from any mesh and can also beam data back into a stream.

iv) A consolidated group of meshes rather than a centralized, monolithic architecture.
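As a rough illustration of the Kafka + Flink backbone (not Netflix's actual code), here is a minimal Flink job that a consuming domain might run, assuming Flink's Kafka source/sink connectors: it reads the stream published by one domain, derives something useful, and publishes the result back onto another topic as a new data product. The broker address, topic names and the transformation are all hypothetical.

```java
// A minimal, hypothetical sketch of a domain team's Flink job on top of Kafka.
// Topic names, broker address and the transformation are illustrative assumptions.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OrdersEnrichmentJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source connector: consume the data product published by the "orders" domain.
        KafkaSource<String> ordersSource = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("orders.order-placed.v1")
                .setGroupId("recommendations-domain")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> orders =
                env.fromSource(ordersSource, WatermarkStrategy.noWatermarks(), "orders-source");

        // Stream processing: derive something this domain cares about (placeholder transformation).
        DataStream<String> enriched = orders.map(e -> "{\"enriched\":" + e + "}");

        // Sink connector: publish the derived stream back as a new data product for other domains.
        KafkaSink<String> enrichedSink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("recommendations.enriched-orders.v1")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        enriched.sinkTo(enrichedSink);
        env.execute("orders-enrichment-data-product");
    }
}
```

The point is that the consuming team owns this job end to end; the central platform only provides Kafka, Flink and the connectors as self-serve infrastructure.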

Another Case Study

Zalando: a leading European online fashion retailer. With the help of Databricks, they were able to get the maximum benefit from a Data Mesh architecture.

The architecture is simplified: sources beam data into the shared infrastructure, and consumers use it as per their requirements in a distributed manner.

Simplified architecture

Bring your own bucket: beam the data in and let end users utilize it. A rough sketch of what that could look like follows below.
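Here is a minimal sketch of the "bring your own bucket" idea (not Zalando's actual implementation): each domain publishes curated data into a bucket it owns, and a consuming team reads it directly with Spark and writes its own derived product into its own bucket. The bucket paths, column names and Parquet format are assumptions.

```java
// Hypothetical "bring your own bucket" consumer: bucket paths and columns are illustrative.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DailyRevenueFromOrdersBucket {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-revenue-from-orders-bucket")
                .getOrCreate();

        // Read the data product the orders domain publishes into its own bucket.
        Dataset<Row> orders = spark.read()
                .parquet("s3://orders-domain-bucket/order-placed/v1/");

        // Derive this team's own product: revenue per day.
        orders.createOrReplaceTempView("orders");
        Dataset<Row> dailyRevenue = spark.sql(
                "SELECT date(placed_at) AS day, SUM(amount) AS revenue FROM orders GROUP BY date(placed_at)");

        // Publish the derived product into this team's bucket for others to reuse.
        dailyRevenue.write()
                .mode("overwrite")
                .parquet("s3://analytics-domain-bucket/daily-revenue/v1/");

        spark.stop();
    }
}
```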

Now the question comes: can we use Data Mesh anywhere and everywhere? Well, no. Although this concept sounds very cool, it should not be applied everywhere. Here is a simple calculation to determine whether it makes sense for your organization to invest in a data mesh. Answer each question below with a number and add them all together for a total; in other words, your data mesh score.

  • Quantity of data sources. How many data sources does your company have?
  • Size of your data team. How many data analysts, data engineers, and product managers (if any) do you have on your data team?
  • Number of data domains. How many functional teams (marketing, sales, operations, etc.) rely on your data sources to drive decision making, how many products does your company have, and how many data-driven features are being built? Add the total.
  • Data engineering bottlenecks. How frequently is the data engineering team a bottleneck to the implementation of new data products on a scale of 1 to 10, with 1 being “never” and 10 being “always” ?
  • Data governance. How much of a priority is data governance for your organization on a scale of 1 to 10, with 1 being “I couldn’t care less” and 10 being “it keeps me up all night”?

Data mesh score

In general, the higher your score, the more complex and demanding your company’s data infrastructure requirements are, and in turn, the more likely your organization is to benefit from a data mesh. If you scored above a 10, then implementing some data mesh best practices probably makes sense for your company. If you scored above a 30, then your organization is in the data mesh sweet spot, and you would be wise to join the data revolution.

Here’s how to break down your score:

  • 1–15: Given the size of your data ecosystem, you may not need a data mesh.
  • 15–30: Your organization is maturing rapidly, and may even be at a crossroads in terms of really being able to lean into data. We strongly suggest incorporating some data mesh best practices and concepts so that a later migration might be easier.
  • 30 or above: Your data organization is an innovation driver for your company, and a data mesh will support any ongoing or future initiatives to democratize data and provide self-service analytics across the enterprise.
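For completeness, here is a tiny sketch of the tally, assuming the five answers above are simply summed and then mapped onto the bands just described; the sample answers are made up.

```java
// Minimal sketch of the data mesh score: sum the five answers, then map to the bands above.
// The sample answers are made-up illustrations, not recommendations.
public class DataMeshScore {

    static int score(int dataSources, int dataTeamSize, int dataDomains,
                     int bottleneckRating, int governanceRating) {
        return dataSources + dataTeamSize + dataDomains + bottleneckRating + governanceRating;
    }

    static String interpret(int score) {
        if (score < 15) return "You may not need a data mesh yet.";
        if (score < 30) return "Start adopting data mesh best practices to ease a later migration.";
        return "You are in the data mesh sweet spot.";
    }

    public static void main(String[] args) {
        int total = score(12, 6, 4, 7, 8); // hypothetical answers to the five questions
        System.out.println(total + ": " + interpret(total)); // prints "37: You are in the data mesh sweet spot."
    }
}
```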

With this, let's wind up here. Keep meshing and keep enjoying.
