Building data products without the Mesh

rbahaguejr
3 min readFeb 26, 2022


Intro to Data Mesh, Zhamak Dehghani.

In 2019, the distributed data mesh was proposed as a way for companies to leverage data at scale. It was presented as the next-generation enterprise data platform architecture: the convergence of distributed domain-driven architecture, self-serve platform design, and product thinking applied to data. One of the central ideas of this paradigm is the treatment of “domain data” as a product, alongside the introduction of the data mesh architecture.

However, for enterprises with a large centralized IT organization, distributing the data platform into domains will be a long process, or may even be unthinkable. For these organizations, building data products without the “Mesh” is more practical and doable.

In the next paragraphs, the concept of a data product is borrowed and adapted from Zhamak Dehghani while leveraging existing data lake and data warehouse architectures.

Central to building a data product is identifying its lifecycle. Taking the software development life cycle (SDLC) as a reference, the data product lifecycle has similar stages: requirements gathering and feasibility study, data pipeline development, testing, deployment, and continuous improvement.

Data product lifecycle is similar to the well-known software development life cycle.
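The stages listed above can be written down as an ordered sequence. The following is only a minimal sketch: the names `Stage` and `next_stage` are hypothetical, and the loop from continuous improvement back to requirements is an assumption made for illustration, not something the lifecycle diagram prescribes.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages of a data product, in the order given in the text."""
    REQUIREMENTS_AND_FEASIBILITY = 1
    PIPELINE_DEVELOPMENT = 2
    TESTING = 3
    DEPLOYMENT = 4
    CONTINUOUS_IMPROVEMENT = 5


def next_stage(current: Stage) -> Stage:
    """Return the stage that follows `current`.

    Assumption: continuous improvement feeds new requirements,
    so the last stage wraps around to the first one.
    """
    stages = list(Stage)  # Enum members iterate in definition order
    return stages[(stages.index(current) + 1) % len(stages)]


if __name__ == "__main__":
    stage = Stage.REQUIREMENTS_AND_FEASIBILITY
    for _ in range(len(Stage)):
        print(stage.name)
        stage = next_stage(stage)
```

Tracking the current stage explicitly is one way to make the status of each data product visible to both the developers and the product owner.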

The data product development process can be further elaborated to support different forms of products: exposed raw or transformed data, operationalized insights, exposed analytics models, and automated decisions.

Detailed Data Products Lifecycle supporting different forms of products.

To take the mystery out of the composition of a data product, its anatomy is illustrated below. It is composed of the input data (which can be in multiple formats), the code or data pipeline that delivers the output data, the output data itself (which can also be in multiple formats), and the environment where these three components reside.

Data Product Anatomy. Governance and infrastructure are an important part of the full data product.

A data product can have different data inputs (some are flat files, some are APIs, some are direct database connections), which are transformed into the needed outputs through business rules or a computational algorithm. In mature use cases, outputs are delivered through different “output ports” to maximize the product's value. An output port can be an API, a file-based extract, an automated trigger or decision, or a visualization.
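To make the anatomy concrete, here is a minimal sketch of a data product as inputs, a transformation, and named output ports. Everything in it is an assumption for illustration: the `DataProduct` class, the order records, the "amount of at least 100" business rule, and the port names `api` and `file_extract` are hypothetical, and the inputs are in-memory strings standing in for real files, APIs, or database connections.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


def read_csv_input(text: str) -> List[Record]:
    """Input in flat-file form: parse CSV text into records."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_input(text: str) -> List[Record]:
    """Input in API-like form: parse a JSON array of records."""
    return json.loads(text)


def to_csv(records: List[Record]) -> str:
    """Output port rendering records as a file-based extract."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


@dataclass
class DataProduct:
    name: str
    inputs: List[Callable[[], List[Record]]]               # input data sources
    transform: Callable[[List[Record]], List[Record]]      # business rule or algorithm
    output_ports: Dict[str, Callable[[List[Record]], str]]  # named output ports

    def run(self) -> Dict[str, str]:
        """Read every input, apply the transformation, render each port."""
        records = [row for source in self.inputs for row in source()]
        result = self.transform(records)
        return {port: render(result) for port, render in self.output_ports.items()}


# Hypothetical business rule: keep only orders of at least 100.
def large_orders(records: List[Record]) -> List[Record]:
    return [r for r in records if float(r["amount"]) >= 100]


product = DataProduct(
    name="large_orders",
    inputs=[
        lambda: read_csv_input("order_id,amount\n1,250\n2,40\n"),
        lambda: read_json_input('[{"order_id": "3", "amount": "120"}]'),
    ],
    transform=large_orders,
    output_ports={"api": json.dumps, "file_extract": to_csv},
)

outputs = product.run()
print(outputs["file_extract"])
```

Keeping the output ports as a dictionary means a new port, such as a visualization, can be added later without touching the inputs or the transformation.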

In a way, this anatomy also surfaces the skills required of data product developers. The development team should have members with expertise in governance, infrastructure, data ingestion, data transformation, machine learning, and end-to-end data pipelines. An effective data products team should combine these skills around an infrastructure, whether in the cloud or on-premises. While there is much hype about moving everything to the cloud, data product teams should be flexible in their development process, deployment, and continuous improvement.

In addition, data product owners, who usually hail from the business side of the enterprise, should keep this anatomy in mind. Unlike in a traditional software development team, where feature requests can be broken down for very agile execution, breaking work down this way may not be straightforward in data product development.

In place of a data mesh infrastructure, the data lake and data warehouse can be organized according to the domains of the data being stored. This makes domain-centric data products easier to build. Cross-domain data products then become a matter of policy and governance.

Data architecture on the data lake and data warehouse to support domain-centric data products.

Without redistributing existing centralized data lakes and data warehouses, domain-centric data products can be delivered through a carefully designed organization of data. Hive can be used for data virtualization to organize data per domain.
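As a rough illustration of how a per-domain organization supports governance, the sketch below assumes that each domain maps to its own database (schema), so that every table has a qualified name of the form `domain.table`. That naming convention, the example table names, and the helper functions are all hypothetical; the sketch does not call Hive itself.

```python
from typing import Iterable, Set


def domain_of(table: str) -> str:
    """Extract the domain from a qualified 'domain.table' name."""
    domain, separator, name = table.partition(".")
    if not separator or not domain or not name:
        raise ValueError(f"expected 'domain.table', got {table!r}")
    return domain


def domains_used(tables: Iterable[str]) -> Set[str]:
    """Return the set of domains a data product reads from."""
    return {domain_of(table) for table in tables}


def needs_cross_domain_approval(tables: Iterable[str]) -> bool:
    """Assumed policy: a product touching more than one domain
    must go through a governance review."""
    return len(domains_used(tables)) > 1


if __name__ == "__main__":
    single = ["sales.orders", "sales.customers"]
    cross = ["sales.orders", "finance.invoices"]
    print(needs_cross_domain_approval(single))
    print(needs_cross_domain_approval(cross))
```

With a convention like this, whether a data product is domain-centric or cross-domain can be decided from its list of inputs alone, before any pipeline is built.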

With data becoming ubiquitous in the enterprise, a proper definition of a data product, its lifecycle, and its development process should now be part of the enterprise process. In doing this, the real value of data can be measured through the value of data products, governance can be put in place (while still supporting individual experimentation), and opportunities for monetization can be discovered.


rbahaguejr

Data Scientist | Free Software Developer and Advocate, Debian & Ubuntu user. Contact: rbahaguejr2 (at) gmail (dot) com