What’s the big deal about Data Products?

Nov 11, 2022

It is hard to read a data governance article these days without coming across a reference to Data Products or Data Assets, and how they can enable a Data Mesh architecture. In this viewpoint, we’ll unpack what a Data Product is and why it is such an important concept, in this order:

What is a Data Product?
Rationalization of the Systems Landscape
Use Case Enablement
Data Governance Simplification

What is a Data Product?

A Data Product is a set of prepared data or information (and hence specifically not raw data) that is ready to be consumed by a wide set of consumers. Data may come from different sources and in different formats, and is then transformed and consolidated into a new asset that can be consumed by others.

An important feature of a Data Product is that it is trusted. Consumers will only start using it, and continue to do so, if they can rely on it. This implies governance in a number of ways. Content should be labeled and described to ensure proper understanding and interpretation. Data quality should be guaranteed, and where appropriate, concerns should be shared transparently alongside the data. Changes in the data must be communicated in a timely fashion to the consumers.

A Data Product is created with specific purposes in mind. Consumers may have specific demands in terms of the timeliness of the data, ingestion methods, quality standards, level of granularity, and grouping of data. These needs are translated into requirements for the Data Product. New requirements or feedback can come about during the lifetime of a Data Product and may need to be prioritized and incorporated.

Finally, a Data Product should be accessible and consumable. This means that it needs to be discoverable: consumers must be able to find it, and so must the data engineers who are trying to connect to it. It must also be interoperable: consumers and the technology they use must be able to establish a connection, pass any authentication steps, and ingest the data for their respective use cases.
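To make these characteristics a little more tangible, here is a minimal sketch of what a Data Product descriptor could look like. The class, its field names, and the example values are all hypothetical and chosen purely for illustration; a real catalog or metadata tool will have its own model.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor for a Data Product; all field names are illustrative."""

    name: str
    domain: str
    description: str        # content is labeled and described
    owner: str              # who communicates changes to consumers
    refresh_frequency: str  # timeliness requirement, e.g. "daily"
    access_endpoint: str    # where consumers connect to the product
    known_quality_issues: list[str] = field(default_factory=list)
    consumers: set[str] = field(default_factory=set)

    def missing_metadata(self) -> list[str]:
        """Return the names of required text fields that are still empty."""
        required = ["name", "domain", "description", "owner",
                    "refresh_frequency", "access_endpoint"]
        return [f for f in required if not getattr(self, f).strip()]


# Example values are assumptions, not taken from a real system.
customer_contact = DataProduct(
    name="Customer Contact",
    domain="Customer Data",
    description="Validated customer addresses, phone numbers, and email addresses.",
    owner="customer-data-team",
    refresh_frequency="daily",
    access_endpoint="",  # not yet published, so not yet consumable
)
customer_contact.consumers.add("Sales Campaigns")

print(customer_contact.missing_metadata())
```

A check like `missing_metadata` is one simple way to flag a product that is described but not yet discoverable or consumable.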

Rationalization of the Systems Landscape

In most organizations of a certain size, there is no clear and accurate picture of the systems and applications landscape. In fact, many organizations recognize that they don’t know their exact data footprint, that data is duplicated in many places, and that they pay for multiple sets of comparable tooling.

Data scientists and business analysts spend a lot of time locating the right data, and data quality issues rear their ugly head along the way, particularly in AI model building. In many organizations, the exact same data is “cleansed” multiple times by separate downstream users, with the remediations never flowing back to the source.

All this corresponds to the left rectangle in the figure below:

You cannot solve this overnight, for all systems and use cases at once. However, through a step-by-step process of formally recognizing Data Products, you can bring about enhancements and simplification. As you identify and formally label data sources as Data Products, you get a better understanding of the data that is fed into these assets as well as who is consuming it.

Data Products should be (mostly) mutually exclusive and collectively exhaustive, and assigned to functional or data domains. No two Data Products should contain the same collection of data. When you go through the process of identifying them, you will inevitably run into duplicative or overlapping data sources and data technologies. This is where real money can be saved by rationalization, that is, by decommissioning systems and removing technologies that are no longer needed once a Data Product is adopted. In the figure above, you move towards the right.
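Spotting overlapping sources can start very simply. The sketch below assumes you have an inventory of candidate sources and the data elements each one holds; the source names, field names, and the 0.5 threshold are hypothetical. It flags pairs of sources that share a large portion of their data elements, using Jaccard similarity as one possible overlap measure.

```python
from itertools import combinations

# Hypothetical inventory: candidate data source -> data elements it holds.
sources = {
    "crm_customers": {"customer_id", "name", "address", "email"},
    "billing_customers": {"customer_id", "name", "address", "iban"},
    "product_catalog": {"product_id", "product_name", "price"},
}


def overlapping_pairs(sources, threshold=0.5):
    """Return source pairs whose Jaccard similarity meets the threshold."""
    pairs = []
    for (a, fields_a), (b, fields_b) in combinations(sources.items(), 2):
        # Jaccard similarity: shared elements divided by all distinct elements.
        similarity = len(fields_a & fields_b) / len(fields_a | fields_b)
        if similarity >= threshold:
            pairs.append((a, b, round(similarity, 2)))
    return pairs


print(overlapping_pairs(sources))
```

Pairs that come out of such a check are candidates for consolidation into a single Data Product, subject to a closer look at what the data actually means in each system.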

Use Case Enablement

In an organization of at least moderate size, you’ll typically be able to identify at least 25 data-driven use cases. These may relate to Customer Relationship Management, Customer Segmentation, Call Center Operations, Product Development, Accounting, Budgeting, Finance, etc. All of these use cases need data, from domains such as Customer Data, Product Data, Transaction Data, Finance Data, Compliance Data, HR Data, Real Estate Data, Macroeconomic Data, etc.

If you try to imagine and visualize all the possible permutations, you could think of a grid like the one below. In this example, there are 50 use cases and 50 domains, yielding a possible 2,500 intersections. Of course, not every use case needs access to data from every domain, so that’s a maximum number. But many organizations have upwards of 100 identified use cases and upwards of 100 sub-domains (or principal data sets), so you can imagine that coming up with an effective data strategy to serve the use cases with the right data from the right source can be quite a logistical challenge, and possibly a real headache. Where to start?

This is of course meant rhetorically — my suggestion is to start with this very grid. The steps could be as follows:

1. Identify a long-list of data-driven use cases.
2. Prioritize the use cases.
3. For the prioritized use cases, identify the data they need.
4. Identify possible “Data Products” based on the mapping you created between use cases and data domains.

After step 3 above, you have a picture such as the one below. In this example there are 10 prioritized use cases, which have been found to need data from 4 different domains:

If we reshuffle the columns and rows a bit (the visualization is arbitrary anyway, and of no consequence), we might get a picture like this, where all of a sudden a very clear area of intersections appears where the creation or enhancement of Data Products may drive a maximum amount of value:
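The reshuffling itself is mechanical. As a minimal sketch, assuming a hypothetical mapping of five use cases to four domains (the assignments are invented for illustration), the grid can be reordered by sorting domains by how many use cases need them and use cases by how many domains they need:

```python
# Hypothetical mapping from prioritized use cases to the data domains they need.
needs = {
    "Customer Segmentation": {"Customer Data", "Transaction Data"},
    "Sales Campaigns": {"Customer Data", "Product Data", "Transaction Data"},
    "Product Recalls": {"Customer Data", "Product Data"},
    "Budgeting": {"Finance Data"},
    "Call Center Operations": {"Customer Data"},
}

# Count how many use cases depend on each domain.
domain_counts = {}
for domains in needs.values():
    for domain in domains:
        domain_counts[domain] = domain_counts.get(domain, 0) + 1

# Reorder columns (domains) by demand and rows (use cases) by number of
# domains needed, so the densest intersections cluster in the top-left corner.
# Ties are broken alphabetically to keep the ordering deterministic.
columns = sorted(domain_counts, key=lambda d: (-domain_counts[d], d))
rows = sorted(needs, key=lambda u: (-len(needs[u]), u))

# Upper bound on intersections: every use case needing every domain.
max_intersections = len(rows) * len(columns)

for use_case in rows:
    cells = " ".join("X" if d in needs[use_case] else "." for d in columns)
    print(f"{use_case:<24} {cells}")
```

The domains that end up in the leftmost columns are the natural first candidates for a Data Product.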

Empirically, it turns out that not all Data Domains (and as a result, Data Products) are created equal. If you were able to assess the impact or usability of domains and order them accordingly, you would find that with just a handful of domains you could power the majority of use cases. The Pareto chart below illustrates this: the red dots show that with 20% or 40% of the domains you could power ~60% or ~85% of the use cases, respectively. The graph is illustrative, as the picture will differ somewhat for each organization, but in my experience it is very conservative.
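A coverage curve of this kind can be computed from the use case to domain mapping. The sketch below rests on two assumptions of its own: domains are added in descending order of demand, and a use case counts as "powered" only once every domain it needs is available. The mapping is hypothetical and does not reproduce the percentages quoted above.

```python
# Hypothetical needs: use case -> set of data domains it requires.
needs = {
    "Customer Segmentation": {"Customer"},
    "Sales Campaigns": {"Customer", "Product"},
    "Call Center Operations": {"Customer"},
    "Product Development": {"Product"},
    "Budgeting": {"Finance"},
    "Workforce Planning": {"HR", "Finance"},
}


def coverage_curve(needs):
    """Return (domain, share of use cases fully served) after adding each domain.

    Domains are added in descending order of how many use cases need them.
    A use case counts as served once every domain it needs is available.
    """
    demand = {}
    for domains in needs.values():
        for d in domains:
            demand[d] = demand.get(d, 0) + 1
    # Ties are broken alphabetically to keep the ordering deterministic.
    ordered = sorted(demand, key=lambda d: (-demand[d], d))

    available, curve = set(), []
    for d in ordered:
        available.add(d)
        served = sum(1 for required in needs.values() if required <= available)
        curve.append((d, served / len(needs)))
    return curve


curve = coverage_curve(needs)
for domain, share in curve:
    print(f"{domain:<10} {share:.0%}")
```

Other orderings are possible, for example weighting each use case by its expected business value rather than counting all use cases equally.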

It has been a historical struggle for data organizations to articulate the value they add to the enterprise beyond hard-to-measure (although very credible) claims such as the avoidance of regulatory fines. Building an overview of the Data Products and the use cases that depend on them allows for a clear articulation of the impact generated by these assets.

Impact assessments can be executed much more efficiently as well, by gathering the downstream requirements for the data and evaluating how the data can be controlled and enhanced within the trusted distribution point. In one marketplace example, a leading insurance party was able to measure relatively precisely how an enhanced set of customer data enabled them to more easily execute and increase the effectiveness of sales campaigns.

Enabling use cases — the one and only way of creating customer or business value — therefore boils down to managing critical data, which in turn can be done most effectively through a set of Data Products.

Data Governance Simplification

Given that Data Products are used by a large group of consumers, they are a logical place to implement data quality and governance controls. In that governed asset, the content is labeled and data quality is tightly controlled, so that instead of identifying and measuring this data throughout the enterprise, which often results in inconsistent “versions of the truth”, there is a trusted distribution point for a given dataset.

I wrote about this before when I worked for Deloitte in Latin America, so let me borrow from that article. Consider the visualization below: if we identified the source highlighted in green as a Trusted Source or Data Product and we were to enhance the corresponding data quality, you can see how 11 downstream applications would instantaneously benefit from it:

If you think the above visual exaggerates, just think for a moment about data from common domains such as Customer Contact data. Address information, which is just one subset of Customer Contact Data, could conceivably be used for use cases as varied as Customer Segmentation, Sales Campaigns, Product Recalls, Account Verification, Market Planning, and Written Notices.

The return on data governance investments is thus higher, as the impact is experienced throughout the enterprise. But there is an effect on the cost side as well. If you consider the same visual above and assume that all downstream systems are business critical, it follows that without a trusted source, completeness and accuracy checks would need to be implemented for each downstream system. However, with a trusted source, the only thing downstream systems need to do is evidence that their data is consistent with the source they took it from. This can reduce the required number of data quality controls (and the corresponding costs) by a factor of 2 to 5. Indeed, that was the exact conclusion of the article we wrote in Latin America.
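The arithmetic behind such a reduction can be illustrated with a toy counting model. This is a simplification assumed here for illustration, not the calculation behind the factor quoted above: without a trusted source, every downstream system runs its own set of checks; with one, those checks run once at the source and each downstream system adds a single consistency check. The number of checks per system is an assumed parameter.

```python
def control_counts(downstream_systems: int, checks_per_system: int) -> tuple[int, int]:
    """Return (controls without a trusted source, controls with one).

    Assumed model: without a trusted source every downstream system runs its
    own checks; with one, the checks run once at the source and each
    downstream system adds a single consistency check against that source.
    """
    without_source = downstream_systems * checks_per_system
    with_source = checks_per_system + downstream_systems
    return without_source, with_source


# 11 downstream systems as in the visual; 4 checks per system is an assumed
# figure (for example, completeness and accuracy for two critical data elements).
without_source, with_source = control_counts(downstream_systems=11, checks_per_system=4)
reduction_factor = without_source / with_source
print(without_source, with_source, round(reduction_factor, 1))
```

Under this model the reduction grows with the number of checks each system would otherwise have to run, so the actual factor depends heavily on how many data elements are considered critical.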

Summary

To conclude, yes, Data Products are absolutely a big deal. Focusing on Data Products would be a very tactical and easy-to-explain approach to ensure measurable business impact:

By saving costs through the removal of redundant systems and technology.
By enabling the data-driven use cases with the highest expected added value.
By implementing data governance in the most cost-effective way, while ensuring that the entire organization enjoys the benefits.

It could help data organizations avoid the fate of so many of their predecessors that struggled and failed to articulate the value they could add.

Icons Attribution

Icons created through Flaticon, by individual authors including Freepik.