What is and what is NOT a Data Product

Benefits of having a common definition within an organisation of what a Data Product is

Dec 13, 2023

As a data engineer, I started my onboarding journey into Data Mesh some time ago. Through the experiences and insights gained along the way, I have come up with a list of learnings and recommendations that could be useful in similar scenarios. This series starts with an understanding that has been foundational in my journey: knowing not just what a Data Product is but, equally vital, what it is not.

This knowledge helped me truly appreciate two of the main benefits Data Mesh brings: trustworthiness and ownership.

The basics

Data Mesh

Data Mesh is a relatively new way to manage and scale data in big companies. It is a response to the problem that traditional centralised data architectures can become bottlenecks as data volumes grow. Instead of one team managing all the data, Data Mesh spreads the responsibility across different teams from different domains, treating data like a valuable product (data as a product) rather than a side effect. Each team is responsible for the data within its specific domain, and the goal is to make data easier to find, use, and trust while keeping things scalable and efficient.

Data Product

In Data Mesh, data is organised in nodes and connections. The nodes are Data Products. Each one is an architectural quantum that contains not just the data but also everything needed to be independently deployable (code, infrastructure, metadata, data quality indicators), along with a clear owner. This decentralised ownership model ensures that each Data Product operates autonomously, fostering agility and accountability.

Data Platform

The shared Data Platform serves as a robust foundation, providing a spectrum of tools, services, and infrastructure that expedites Data Product development. It includes functionalities for data ingestion, storage, processing, transformation, analysis, and serving.

Data Catalog

The Data Catalog is a capability for data consumption provided by the Data Platform. It serves as a centralised repository within an organisation, making it easy for consumers to explore and access Data Products. Through its user-friendly interface, individuals can search, preview, and evaluate the available Data Products, gaining insights into their content, quality, and relevance.

Now that the foundational elements are established, let’s dig deeper: is every data asset available in a Data Catalog within a Data Mesh a Data Product? To answer this, it is essential to understand what does not fit the criteria of a Data Product.

What is [NOT] a Data Product

The term “Data Product” has become a bit overloaded nowadays. If you look for a definition, you could find this one: a Data Product is “something that helps achieve a specific goal by using data”, as outlined in “Data Jujitsu: The Art of Turning Data into Product” (2012).

However, when implementing a methodology for efficiently managing data at scale within large organisations, this definition can seem too broad.

Consider a navigation app as an example of a product. It is a tool we use to find our way, taking us to our destinations via the most efficient routes. But what if that app leads us to the wrong place half the time? In such a case, we would opt for the old-fashioned method of asking someone for directions directly, as it is hard to trust this navigator anymore. The essence of any product lies in its reliability and trustworthiness for accomplishing specific goals. A product is only valuable if it can be trusted to achieve its intended purpose.

Similarly, a data warehouse that offers raw data in the form of tables and views can also be viewed as a Data Product. We can use it to access raw data for transformation and aggregation, providing valuable insights on a dashboard to business owners. However, how can we determine whether such a Data Product is trustworthy? If the data consumed is inaccurate or incomplete, how can a user recognise it? A Data Product includes more than just the warehouse. It should provide indicators and metadata that reveal the data’s nature and offer transparency to users.

Data Mesh enriches this Data Product definition with the concept of “data as a product”, where the data itself becomes the product. In the modern data landscape, when we use the term Data Product, it is this evolved interpretation that we should be referring to.

If we want to make use of a Data Product, we need more than just the data: we need access to a set of data quality indicators like freshness, completeness, consistency and uniqueness. We want to inspect its lineage and explore its metadata to understand the data’s meaning even before we begin exploring it. And we want to know the person who is accountable for this product so we have a point of contact in case we need alignment or further information. Until a source data asset can offer these capabilities, it can’t truly be considered a product. Trust and transparency are key in the world of data.
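As a rough illustration of what such indicators could look like in practice, here is a minimal Python sketch that computes completeness, uniqueness and freshness over a handful of in-memory records. The function name, field names and sample data are hypothetical and not taken from any specific platform. Consistency is left out because it usually depends on domain-specific rules.

```python
from datetime import datetime, timezone


def quality_indicators(rows, key, required, updated_at, now=None):
    """Compute simple data quality indicators for a list of dict records.

    All parameter names are illustrative assumptions:
    key        - field expected to identify a record uniquely
    required   - fields that must be present and not None
    updated_at - field holding a timezone-aware datetime of the last update
    """
    now = now or datetime.now(timezone.utc)
    total = len(rows)
    if total == 0:
        return {"completeness": 0.0, "uniqueness": 0.0, "freshness_hours": None}

    # Completeness: share of rows where every required field has a value.
    complete = sum(all(r.get(f) is not None for f in required) for r in rows)
    # Uniqueness: share of distinct key values among all rows.
    distinct = len({r.get(key) for r in rows})
    # Freshness: hours elapsed since the most recent update.
    latest = max(r[updated_at] for r in rows)

    return {
        "completeness": complete / total,
        "uniqueness": distinct / total,
        "freshness_hours": (now - latest).total_seconds() / 3600,
    }


# Hypothetical sample: one missing amount and one duplicated order_id.
SAMPLE_ROWS = [
    {"order_id": 1, "amount": 10.0,
     "updated_at": datetime(2023, 12, 13, 8, 0, tzinfo=timezone.utc)},
    {"order_id": 2, "amount": None,
     "updated_at": datetime(2023, 12, 13, 10, 0, tzinfo=timezone.utc)},
    {"order_id": 2, "amount": 5.0,
     "updated_at": datetime(2023, 12, 13, 9, 0, tzinfo=timezone.utc)},
    {"order_id": 3, "amount": 7.5,
     "updated_at": datetime(2023, 12, 13, 6, 0, tzinfo=timezone.utc)},
]
NOW = datetime(2023, 12, 13, 12, 0, tzinfo=timezone.utc)

indicators = quality_indicators(
    SAMPLE_ROWS,
    key="order_id",
    required=("order_id", "amount"),
    updated_at="updated_at",
    now=NOW,
)
print(indicators)
```

Publishing numbers like these next to the data lets a consumer judge whether the asset is fit for their purpose before building on it.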

Ownership and Accountability

When you are encouraged to showcase the quality of your product, you are also driven to raise its standards. This continuous improvement leads teams to take ownership of, and be accountable for, the data they provide.

Other capabilities

Along with the capabilities mentioned above, a Data Product must be discoverable, addressable, accessible, interoperable, valuable, and secure. You can find a deeper explanation in this article from Zhamak Dehghani: https://martinfowler.com/articles/data-monolith-to-mesh.html#Discoverable

Recommendations

Establish a clear definition for Data Products across the organisation

It is critical to formulate a precise definition of what constitutes a Data Product and apply it across your organisation.

A comprehensive Data Product should include:

Owner, so consumers know who to contact if there is an issue or they need further information.
Complete description of the dataset and all its properties, so consumers can understand the semantics of the data.
Data quality indicators, so consumers know how accurate, complete and fresh the data is.
Data lineage, so consumers can know where this data comes from.
Data sampling for quick exploration, so consumers can have a taste of the data before making an access request for the Data Product.

And all these properties should be visible through the Data Catalog.
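To make the list above more concrete, here is a minimal sketch of how those properties might be captured in a descriptor that a catalog could display. The class, its field names and the example values are hypothetical. A real Data Platform would define its own schema.

```python
from dataclasses import dataclass, field


@dataclass
class DataProductDescriptor:
    """Illustrative metadata a Data Product could expose through the catalog."""
    name: str
    owner: str = ""                                          # point of contact
    description: str = ""                                    # semantics of the data
    quality_indicators: dict = field(default_factory=dict)   # e.g. completeness
    lineage: list = field(default_factory=list)              # upstream sources
    sample_rows: list = field(default_factory=list)          # preview for consumers

    def missing_properties(self):
        """Return the names of the listed properties that are still empty."""
        required = ("owner", "description", "quality_indicators",
                    "lineage", "sample_rows")
        return [prop for prop in required if not getattr(self, prop)]


# A fully described asset (all values are made up for the example).
orders = DataProductDescriptor(
    name="orders",
    owner="sales-data-team@example.com",
    description="One row per confirmed customer order.",
    quality_indicators={"completeness": 0.98, "freshness_hours": 2.0},
    lineage=["crm.orders_raw"],
    sample_rows=[{"order_id": 1, "amount": 10.0}],
)

# A raw asset that only has an owner so far.
raw_clicks = DataProductDescriptor(name="clickstream_raw",
                                   owner="web-team@example.com")

print(orders.missing_properties())      # []
print(raw_clicks.missing_properties())  # ['description', 'quality_indicators', 'lineage', 'sample_rows']
```

Keeping the check next to the descriptor gives owning teams a simple way to see which properties an asset still lacks.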

Ensuring that all stakeholders are aligned with this concept and expectations surrounding Data Products is essential for effective collaboration and utilisation.

Enable a Data Product Certification within the Data Catalog

Should datasets lacking these capabilities be allowed in the Data Catalog? The answer can be context-dependent, but excluding them definitely adds friction to data accessibility and limits potential insights. However, the distinction between a simple data asset and an asset that can be treated as a product remains crucial.

One effective strategy is to introduce a certified label within the shared Data Catalog. This certification informs consumers that a data asset found in the catalog comes with all the metadata needed to be treated as a Data Product.

On the other hand, including simple datasets lacking the required capabilities in the Data Catalog could lead to noise and confusion for users seeking the most reliable data. Allowing filtering options based on certification status empowers users to make informed decisions and align their expectations with the available data, promoting clarity and efficient utilisation.

This approach maintains inclusivity by accommodating diverse datasets while guiding users toward certified products that align with their needs.
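The certification label and the filter described in this section could be sketched as follows. This is only an assumption about how a catalog might derive the label. The required metadata keys, function names and catalog entries are hypothetical and do not correspond to any particular catalog tool.

```python
# Metadata an entry needs before it is labelled as certified (assumed set).
REQUIRED_METADATA = ("owner", "description", "quality_indicators",
                     "lineage", "sample")


def is_certified(entry):
    """An entry is certified when every required metadata field is non-empty."""
    return all(entry.get(key) for key in REQUIRED_METADATA)


def search_catalog(entries, certified_only=False):
    """Return entries labelled with a 'certified' flag, optionally filtered."""
    results = []
    for entry in entries:
        labelled = {**entry, "certified": is_certified(entry)}
        if certified_only and not labelled["certified"]:
            continue
        results.append(labelled)
    return results


# Hypothetical catalog content: one complete asset and one raw dataset.
CATALOG = [
    {
        "name": "orders",
        "owner": "sales-data-team@example.com",
        "description": "One row per confirmed customer order.",
        "quality_indicators": {"completeness": 0.98},
        "lineage": ["crm.orders_raw"],
        "sample": [{"order_id": 1, "amount": 10.0}],
    },
    {"name": "clickstream_raw", "owner": "web-team@example.com"},
]

all_entries = search_catalog(CATALOG)
certified_entries = search_catalog(CATALOG, certified_only=True)

print([(e["name"], e["certified"]) for e in all_entries])
print([e["name"] for e in certified_entries])
```

Both assets stay discoverable in the default search, while the certified_only filter narrows the results to assets that can be treated as Data Products.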

Summary

Based on my own experience:

Establishing a clear definition of what a Data Product is encourages teams to follow Data Mesh principles, especially in terms of data quality and ownership.
Having a product mindset on the data exposed leads teams to view their data as valuable products from a consumer perspective.
And enabling consumers to differentiate between data assets and Data Products in a shared Catalog ensures efficient data utilisation.

I hope these three main learnings help you in your Data Mesh implementations. Happy engineering!

References

https://martinfowler.com/articles/data-monolith-to-mesh.html
https://martinfowler.com/articles/data-mesh-principles.html

Thanks to my Thoughtworks colleagues Arne, Pablo, Ayush and Samvardhan for taking the time to review this article.