Data Mess to Data Mesh

Chaim Turkel
Published in Israeli Tech Radar · 6 min read · Jun 10, 2023

(Image: Self-service platform)

Zhamak Dehghani introduced the concept of Data Mesh in 2018, a paradigm shift in big data management toward data decentralization, and has since been evangelizing the concept across the wider industry (excerpt from Thoughtworks).

So what does all this mean? What are the problems that Data Mesh sets out to solve?

Data Mess

So if we go back a few years: I built a data platform on Google BigQuery and Airflow. We ran hundreds of queries on different schedules. One day, toward the end of the day, the BI team came to me with an issue: the monthly report was giving results that were not logical.

So now we had a problem. I knew the data platform from the technological side but did not understand the actual data. The BI team understood part of the data, but none of the data platform technologies. So over the next day, I had to reverse engineer the queries and ask each team if its data made sense. This process took a full day and needed multiple teams to be in the loop -> Data Mess.

Displaced responsibility

So what causes the data mess?

This is due to the fact that a data platform is a very complex beast. To tame it you need a large team that fully understands the technologies and the dependencies between them. So obviously a team that needs to create data on this platform cannot also understand and handle the platform; this is deferred to the platform team.

The data platform team, on the other hand, understands the technology but does not understand the data of the whole company. Hence the data mesh concept.

Let me get ahead of some issues that I have heard. Data mesh is not a silver bullet. Data mesh will not slay the data werewolf, but it might tame it.

Data Platform

What we are trying to achieve is “data infrastructure as a platform”.

What is a data platform? From the following picture you can see that there are a lot of moving parts:

Since creating a data platform is very difficult, companies migrated to the model of a central team that builds the data platform and gives services to the whole company.

What changes occurred in the industry at the same time that we created the data platforms?

We converted monolithic applications into distributed systems. Software design has moved from the waterfall style to the agile style. Teams have been broken down from application teams into software guilds.

All of this brings us to the concept of decentralization. The data is already being created in a decentralized manner, so instead of bringing the data knowledge to the platform team, we want to bring the platform to the data. This is very similar to the early concept behind Spark: bring the compute to the data and not the other way around.

So what are we trying to achieve?

We want to empower the team that knows the data, but with great power comes great responsibility.

In the data mesh concept, each team needs to own its data and serve it as a product. That means each team needs to define an "API" describing what data it will expose in the data lake/warehouse. The "API" needs to be defined by talking with the other teams that will ingest the data.
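
In DBT terms (more on DBT below), such a contract could live in a schema.yml file. The following is a minimal sketch; the model and column names are hypothetical:

```yaml
# models/payments/schema.yml -- hypothetical model and column names
version: 2

models:
  - name: payments_orders
    description: >
      The payments team's public "API": one row per order, agreed on
      with the BI and finance teams that consume it.
    columns:
      - name: order_id
        description: Unique identifier of the order.
        tests:
          - unique
          - not_null
      - name: amount_usd
        description: Order amount converted to USD.
        tests:
          - not_null
      - name: paid_at
        description: Timestamp at which the payment was confirmed.
```

The tests double as a guarantee to downstream consumers that the contract actually holds.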

Of course, there might be layers of data; each team will be the owner of its layer and will need to define the "API" that it will generate. So, for example, a BI team that creates metrics over multiple microservices will need to create the BI "API" layer, based on each microservice's "API" layer that was brought into the lake.
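
As a sketch of what such a BI layer could look like as a DBT model (the referenced models and columns are hypothetical):

```sql
-- models/bi/monthly_revenue.sql -- hypothetical BI-layer model built on the
-- "API" models exposed by the payments and refunds teams
-- (date_trunc syntax varies slightly between warehouses)
select
    date_trunc('month', o.paid_at)                       as revenue_month,
    sum(o.amount_usd)                                    as gross_revenue_usd,
    sum(o.amount_usd) - coalesce(sum(r.refund_usd), 0)   as net_revenue_usd
from {{ ref('payments_orders') }} as o
left join {{ ref('refunds') }} as r
    on o.order_id = r.order_id
group by 1
```

The BI team only depends on the published models, not on how each microservice stores its data internally.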

Data mesh is a concept and not an implementation. So why has it been so hard to implement?

In my opinion, the reason is tooling, and the tide is changing.

To bring the data mesh to an organization you need proper tools. You need an infrastructure that enables all teams, on all layers, to create, define, and populate their data. The teams must have tools that not only make creating the data easy, but also give visibility into the full pipeline status, so that when there is an issue with the data, the team that generated it is notified.

DBT as an example

DBT is an abstraction layer for generating SQL ETLs on top of multiple warehouse vendors. It brings ETL to a new level by applying engineering best practices to data ETLs.

Behind DBT are multiple SQL and YAML files. With these files, you can create silos of data and define layers of data visibility.
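
For example, the project layout itself can express the layers. A fragment of a dbt_project.yml (project, folder, and schema names are hypothetical) might look like this:

```yaml
# dbt_project.yml fragment -- hypothetical project and folder names
name: acme_analytics

models:
  acme_analytics:
    staging:                 # one folder per producing team, source-aligned models
      +schema: staging
      +materialized: view
    marts:                   # consumer-facing layers (BI, finance, ...)
      +schema: marts
      +materialized: table
```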

DBT comes with a nice set of features and can be fairly easily enhanced by creating your own CLI tool.

You can envision that in the same place where you define your ELT, you also define your "API" layer. In the same place, you can also define alerts on layers of the data for different teams. All ETLs are deployed without any knowledge of orchestration tools or pipeline tools.
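
One way to sketch this is with DBT's free-form meta block. The keys below (owner_team, alert_channel, consumers) are a hypothetical convention that your own CLI or adapter would interpret, not built-in DBT behavior:

```yaml
# schema.yml fragment -- the keys under `meta` are a hypothetical convention,
# read by your own tooling rather than by DBT itself
version: 2

models:
  - name: payments_orders
    meta:
      owner_team: payments
      alert_channel: "#payments-data-alerts"   # where pipeline failures are routed
      consumers: [bi, finance]                 # teams that ingest this "API"
```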

End-to-end example

So what is a data pipeline? I am in charge of the payment service. I have a microservice that accepts orders and sets up payments. All the information is saved to the database. As part of the data mesh concept, it is my job to understand what information from my service the management layer and the finance layer need.

Let me envision a tool that asks a few questions, such as where you want to extract the data from (MySQL) and which warehouse you want to bring it to (with or without CDC), and then builds the pipeline for me. Not only would it build the pipeline, it would also configure monitoring and dashboards that are already tailored to my team, so we always know what is happening.
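
The answers to such a questionnaire might be captured in a small spec file. This is purely hypothetical; none of the keys below belong to an existing tool:

```yaml
# pipeline.yml -- hypothetical spec generated from the questionnaire
pipeline: payments_ingest
source:
  type: mysql
  host: payments-db.internal          # hypothetical host
  tables: [orders, payments]
load:
  warehouse: bigquery
  dataset: raw_payments
  mode: cdc                           # or: batch
monitoring:
  dashboard: payments-data-health     # created automatically for the owning team
  notify: "#payments-data-alerts"
```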

In addition, the same tool would be able to leverage a DBT-like tool and let me write a few SQL and YAML files to create my transformations, add RBAC definitions, and schedule the data.
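
Continuing the sketch, the transformation side might add RBAC and scheduling hints next to the model definitions; again, the format is hypothetical:

```yaml
# transform.yml -- hypothetical; assumes a DBT-like tool that also accepts
# access-control and scheduling hints alongside the models
models:
  - name: payments_orders
    sql: models/payments_orders.sql
    schedule: "0 2 * * *"             # daily at 02:00
    rbac:
      read: [bi, finance]             # consuming teams
      write: [payments]               # owning team
```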

A tool like this could easily enable a company of any size to have its data created and managed by each team that generates it.

Of course, there will be more layers on top of each block of data, and each team would use the tool to define its layer according to its needs.

All the definitions would be stored in a single repository that knows how to export everything to different data catalogs or to work with different orchestrators. Once you have defined your data and the flow of the data, you can create adapters that will export to and configure any external tool to do the rest.
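
The adapter configuration could be as simple as a list of export targets; the adapter names and endpoints below are examples only:

```yaml
# exports.yml -- hypothetical adapter configuration
exports:
  - adapter: airflow                          # generate one DAG per scheduled pipeline
    target: gs://acme-dags/payments           # hypothetical DAG bucket
  - adapter: datahub                          # push descriptions and owners to the catalog
    endpoint: https://datahub.internal.example
```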

Summary

Data mesh is a very powerful concept, and it brings clarity to the world of data. It is not a silver bullet and will not solve all issues. But with the right tooling, you can implement the concepts of the data mesh in any company, with a team of any size.
