Data mesh is not a magic fix for building data products

Nicolas Claudon
Data & AI Masters

--


Data mesh promises to scale data easily from a modeling perspective, make data available for every need, and deal with increasing volumes of data. But does it really help you build data products?

I have been designing data platforms and solutions for the past decade for many companies, sometimes with the business, sometimes with IT, and for the most successful programs with both IT and the business. There is always a trade-off between IT, which usually wants centralized, generic, reusable services, and the business, which needs custom specifics tailored to its own problems and, when those needs are not met, will do shadow IT (i.e., decentralize on its own). To simplify, let's consider data mesh as a decentralized network of data platforms separated by data domains, where data products are designed, built, and run.

Building a data product is hard, whether in a centralized or decentralized architecture. Data is never static; its models are constantly and rapidly changing, which brings a lot of challenges to tackle.

Today, I am going to look at different design problems that you can encounter when designing a data product within a data mesh architecture, across these 3 major activities:

  • Ingesting data
  • Exposing your product
  • Monitoring your service

Collecting data

Collecting data to be used within the data product can be done in two ways, by consuming another data product or by managing raw data.

Consuming another data product brings strong benefits: you build your data product on top of an API you consume, a file you download, or events you read. The schema is enforced, and data quality is controlled. It is usually a good approach, unless the data is part of your core business. For example, weather data used within your product can easily be consumed from another data product, unless you are a weather company.

One drawback is that you need to track what you have already processed. It also means that, when designing your product, you depend on another team: data freshness might not match what you require, you have no control over data quality or the exhaustiveness of the dataset, and you lose the possibilities offered by ingesting raw data. You should also build for redundancy: when you are unable to consume the upstream data, your data product must remain accessible and meet its SLAs.
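As a minimal sketch of that tracking burden, here is a hypothetical watermark-based consumer that only processes upstream events newer than the last checkpoint. The file name and event shape are illustrative assumptions, not part of any specific platform:

```python
import json
from pathlib import Path

# Hypothetical local state file; in production this would live in a
# shared store (object storage, a metadata table, ...).
CHECKPOINT = Path("consumed_checkpoint.json")

def load_watermark() -> str:
    """Return the last event timestamp we processed, or a sentinel."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["watermark"]
    return "1970-01-01T00:00:00"

def save_watermark(ts: str) -> None:
    CHECKPOINT.write_text(json.dumps({"watermark": ts}))

def consume(events):
    """Process only events strictly newer than the stored watermark."""
    watermark = load_watermark()
    new = [e for e in events if e["ts"] > watermark]
    for e in new:
        ...  # feed the event into the data product
    if new:
        save_watermark(max(e["ts"] for e in new))
    return new
```

Re-running `consume` on the same batch returns nothing, which is exactly the bookkeeping you inherit by consuming someone else's product.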

Managing raw data, on the contrary, is hard: you get fine-grained control, but at the cost of dealing with data quality and data governance, building and maintaining data pipelines, handling errors, and ensuring compliance. Furthermore, in a data mesh architecture, where you work in a decentralized way, you lose centralization benefits such as common data ingestion, shared pipelines, quality controls, …
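To make that quality burden concrete, here is a small validation sketch that splits raw records into clean rows and a rejection report. The required fields and rules are hypothetical; a real pipeline would typically rely on a dedicated framework (Great Expectations, dbt tests, …) rather than hand-rolled checks:

```python
from dataclasses import dataclass

@dataclass
class QualityReport:
    total: int
    rejected: int
    errors: list

# Hypothetical schema for illustration only.
REQUIRED_FIELDS = ("id", "temperature", "observed_at")

def validate(records):
    """Split raw records into clean rows and a rejection report."""
    clean, errors = [], []
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) is None]
        if missing:
            errors.append((i, f"missing fields: {missing}"))
        elif not isinstance(rec["temperature"], (int, float)):
            errors.append((i, "temperature is not numeric"))
        else:
            clean.append(rec)
    return clean, QualityReport(len(records), len(errors), errors)
```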

Building the core of your data product

Building the core is no different whether you are in a centralized or decentralized architecture. In a data mesh architecture you can decide what the best tool for the job is, but at some point you will probably miss the high level of industrialization and reusability of components that you can find within an ETL suite, or common libraries for common chores. At the same time, I have seen a fair share of artifacts designed for reusability, at great time and effort, that were never reused more than twice; the ROI was certainly not met.

Exposing your product

Exposing a data product means, first and most importantly, giving access to a service under an SLA. Any exposed data product must be designed and built for redundancy, scalability, and responsiveness. Specific challenges exist depending on what you are building.

Exposing an ML system with a prediction service

You need to build an API that serves the predictions, but you also need to build a mechanism that computes the online features (through an API, or asynchronously).
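A toy sketch of the serving side, assuming the online features have already been materialized into a low-latency store. The in-memory store, entity keys, and model coefficients below are all made up for illustration; in production the store would be something like Redis or DynamoDB kept fresh by an asynchronous feature pipeline:

```python
import math

# Hypothetical in-memory online feature store.
FEATURE_STORE = {"user-42": {"avg_basket": 35.0, "visits_7d": 4}}

# Toy logistic-model coefficients, for illustration only.
WEIGHTS = {"avg_basket": 0.02, "visits_7d": 0.3}
BIAS = -1.0

def predict(entity_id: str) -> float:
    """Look up online features and return a probability-like score."""
    feats = FEATURE_STORE.get(entity_id)
    if feats is None:
        raise KeyError(f"no online features for {entity_id}")
    z = BIAS + sum(WEIGHTS[k] * v for k, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic link
```

The point is the split: the API call itself stays cheap because the expensive feature computation happens elsewhere, before the request arrives.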

Exposing a dataset, exporting file

It is by far the easiest: set up an API with long-lived sessions that will not disconnect while a huge dataset is being downloaded, and you are good.
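The core of such an endpoint is usually a chunked reader, so a large file is streamed rather than loaded into memory at once; a minimal sketch:

```python
def stream_file(path, chunk_size=1 << 20):
    """Yield a file in 1 MiB chunks so a long download never
    loads the whole dataset into memory."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

In a real API this generator would back a chunked HTTP response (for example FastAPI's `StreamingResponse`), with keep-alive and timeout settings tuned so long downloads do not get cut off.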

Exposing records

The main challenge is to anticipate how the records will be queried and to find the right database. Do you need fast writes, fast reads, document storage, full-text search? It is always possible to expose the records through multiple databases, but the trade-off is huge: how do you synchronize data between the databases, and how do you update them while keeping them accessible? Deletes and batch imports will become daily routines that need to be set up. The more technologies you use, the more skills are needed to industrialize, upgrade, and maintain them, which erodes flexibility, one of the best benefits of the data mesh approach.
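A minimal sketch of that synchronization burden, in the spirit of an outbox pattern: every change from the source of truth must be fanned out to each specialized store. The store shapes below are illustrative stand-ins (a key-value store for point lookups, an inverted index for full-text search), not any specific product's API:

```python
# Illustrative downstream stores.
fast_read_store = {}  # e.g. a key-value store for point lookups
search_index = {}     # e.g. an inverted index: token -> set of record keys

def apply_change(change):
    """Apply one upsert/delete from the change log to every store."""
    key, doc = change["key"], change.get("doc")
    if change["op"] == "upsert":
        fast_read_store[key] = doc
        for token in doc["text"].lower().split():
            search_index.setdefault(token, set()).add(key)
    elif change["op"] == "delete":
        fast_read_store.pop(key, None)
        for keys in search_index.values():
            keys.discard(key)
```

Even in this toy version, every new store multiplies the update, delete, and re-import logic you have to maintain.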

Monitoring

Monitoring a data product within a data mesh architecture means you either meet IT standards, and may be constrained in the component choices for your data platform or in how you build your monitoring KPIs, or you build your own monitoring platform, where you will be dealing with application health and metrics, log transformation, and server metrics.

In either approach your data product will implement monitoring, and special care needs to be taken when designing the solution.

Ingestion / Collection

Monitor the ingestion/collection mechanisms: detect, and if possible predict, how much impact a problem would have on the following steps and on the success of the data product.

Building the core

Monitor data quality: can we infer or know the rule to correct missing or wrong attribute values? Is the processing taking longer than the previous run, and can we identify whether it is a design problem (e.g., not filtering values from the start, so the database is solicited to read all values) or whether we simply need more processing power? What is the impact of the bill growing over time on the budget?
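The "is this run taking longer than usual" check can be sketched as a simple deviation test against recent run durations; the threshold is an assumption to tune, not a standard value:

```python
from statistics import mean, stdev

def run_is_slow(history, current, z_threshold=3.0):
    """Flag a run whose duration deviates from the recent baseline.

    history  -- durations (seconds) of recent successful runs
    current  -- duration of the run being checked
    """
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold
```

A flagged run is only the trigger; deciding between "design problem" and "needs more compute" still requires looking at the query plans and the data volumes.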

Exposing the data product

The exposition is customer/user/system facing and therefore the most important part to monitor. It is also where all business rules need to be identified, from the moment the first concept and design emerge. Common exposition KPIs fall into these categories:

  • Business Performance
  • Service Performance
  • Data Drifting
  • Accuracy
  • Detection of Biases
  • Fairness
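For the data drifting category, a common KPI is the Population Stability Index (PSI), which compares the distribution a model was trained on against live traffic. A self-contained sketch, with bin count and smoothing chosen arbitrarily for illustration:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample
    (e.g. training data) and live values. Higher = more drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def dist(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[max(idx, 0)] += 1
        # Smooth empty bins to avoid log(0).
        return [(c or 0.5) / len(values) for c in counts]

    e, a = dist(expected), dist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, but the thresholds should be validated per product.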

Data Governance & security

Data governance in a data mesh infrastructure for a data product is an article in itself, but IMHO it should be done as federated data governance, for a simple reason: data governance and security are hard to implement and must be done from the start. It makes a lot of sense to centralize this function, as it will be available to everyone and should be enforced for every data product.

How to

So the question remains: should I implement a data product within a data mesh architecture? There is certainly no answer that fits all scenarios, but it is clear that, to have a chance of success, the best design principles need to be applied. With a data mesh approach you can benefit from accelerators to build a data platform easily and rapidly, with common data pipelines and incorporated monitoring frameworks. But most importantly, invest in good solution design, because data mesh will not solve that for your data product.

Best of luck when implementing, and feel free to reach out.
