The Data Product Trap

P Platter
Agile Lab Engineering
Oct 11, 2024

Data mesh is, at this point, a hyped topic.

Some have already completed the transition, many are in the process, and a staggering number are contemplating it. Despite the basic principles and the added value being, by now, crystal clear to everyone — and despite there being a mountain of books and endless streams of articles that explain the whole thing very well — there’s still an overwhelming amount of confusion when it comes to actual implementation.

There’s one element that people are still struggling to grasp, something that seems like a minor detail, one that too many are giving themselves a pass on: the data product.

Now, if we want to take the concept of a data product and twist it around in environments where data mesh is not being applied and there is no goal of decentralizing data ownership, that's fine.

But when I hear people say that a data product (in the context of data mesh) is simply a data asset with some business metadata slapped on, an owner assigned, and then thrown into a marketplace, or when I hear people use words like “packaging”, my hair stands on end.

And when publishing a data product consists of renaming a dataset and wrapping it up nicely, without even getting into the details of who produced that data asset and how it was created — I’m already growing gray hairs.

It genuinely pains me because I can only imagine the colossal amounts of money being wasted adapting tools and products, changing business processes, training people, and launching internal marketing campaigns—all hoping to achieve some tangible benefit.

But the brutal truth is that all this effort will be in vain. Everything will change, only for nothing to change, with big promises and zero results.

Let’s not kid ourselves: marketplaces, where data gets packaged, described, and delivered, have existed for ages. There was no need to go through all the trouble of calling them data products; we could have just kept calling them data packages.

The whole point of a data product is to establish a clear boundary of ownership that goes beyond the data itself, extending to the entire lifecycle that generates that data. The data is merely the outcome of a process involving a host of components. All the elements that are part of this process and that fall within a certain business-functional perimeter make up the data product. So, the first colossal misunderstanding is this: data products are not just data; they are much more than that. A data product is a complex entity that generates data and includes metadata, applications, infrastructure, and an entire series of ancillary services that allow it to operate within a marketplace and interoperate with other products.

So, let’s set the record straight and clarify exactly what components make up a data product and what the steps in its lifecycle are.

What are the components of a Data Product?

Data Product components

Data Product — Infrastructure
Every data product must have its own infrastructure (which can be physically or logically segregated) to house its data and run its processes. This infrastructure must be independent of other data products, meaning it can be created, modified, and destroyed at the sole discretion of the data product team. The autonomy of this infrastructure is essential; otherwise, we can forget about true decentralization.

Data Product — External Interfaces
External interfaces refer to all the points of interaction where we need to apply standardization in terms of protocol and the type of information exchanged between the data product and the outside world (be it the platform, users, or other data products). Classic examples of these interfaces include output ports, observability ports, and control ports. Some also mention discovery ports or input ports. However, I don’t see the latter as crucial since a data product should never receive data in a push fashion from external systems. Thus, there’s no need to create a standard for interoperability in that regard. If your data product receives data from outside, you could have a big problem in change management.
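To make the idea of standardized ports concrete, here is a minimal sketch in Python. The `OutputPort` and `ObservabilityPort` contracts, the `CustomerProduct` class, and the `pull_from` helper are all hypothetical names invented for illustration; a real platform would define its own protocols and transports. The point is only that consumers pull through a standard interface, and nothing is pushed into the data product.

```python
from typing import Protocol


class OutputPort(Protocol):
    """Hypothetical standard contract for consuming a data product's data."""

    def read(self) -> list[dict]: ...


class ObservabilityPort(Protocol):
    """Hypothetical standard contract for inspecting a data product's health."""

    def status(self) -> dict: ...


class CustomerProduct:
    """Toy upstream data product that satisfies both port contracts."""

    def __init__(self) -> None:
        self._rows = [{"customer_id": 1, "segment": "retail"}]

    def read(self) -> list[dict]:
        # Return copies so consumers cannot mutate the product's internal state.
        return [dict(row) for row in self._rows]

    def status(self) -> dict:
        return {"row_count": len(self._rows), "healthy": True}


def pull_from(upstream: OutputPort) -> list[dict]:
    """A downstream product pulls when it decides to; nothing is pushed to it."""
    return upstream.read()


upstream = CustomerProduct()
rows = pull_from(upstream)
```

Because the downstream side depends only on the `OutputPort` contract, the upstream team can change its internals without coordinating a release with its consumers.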

Data Product — Data
Every data product must, as much as possible, both contain and expose data to the outside world. The data that a product exposes (in whatever form) essentially defines its value proposition. The more data is persisted within the data product, the stronger the sense of ownership around it, which strengthens the data product. That said, virtual data products, built on data residing outside the data product, are possible. But it’s essential to understand the implications of this model: the presence of business logic becomes non-negotiable. This is where many stumble: don’t go overboard with this pattern, because ownership will become unclear.

Data Product — Metadata
Each data product must be appropriately described in all its aspects. Metadata isn’t just some afterthought; it’s part of the product itself. The metadata serves to “sell” the product by creating a promise to the buyer. And let me emphasize this: the metadata should primarily be business metadata that defines the data product, making it understandable and trustworthy. Metadata should describe the data being exposed and the expected behavior of that data (data behavior). Again, these are not just add-ons but integral parts of the product.

Data Product — Business Logic
A data product without business logic is not a data product. The business logic defines the ownership boundary and the domain to which the data product belongs. If I’m working with customer data from sales and marketing systems but applying business logic specific to risk finance, that data product belongs to the risk finance domain. Only experts within that domain can own those formulas and transformations. The business logic is what gives the data product its value and uniqueness. If your data product team doesn’t own the business logic, you’re not working with a data product — you’re just sharing datasets. This is non-negotiable.

Data Product — Internal Processes
Internal processes manage the data's lifecycle in compliance with governance rules. These include data ingestion, ETL processes, data deletion for compliance (like GDPR), and even data quality processes, which are tightly linked to the business logic. All of these must be built into the data product and not handled externally. If you are trying to manage data quality from the outside, you don’t have a data product.

Data Product — Orchestration
The orchestration component needs to be entirely revisited in a data product environment. If we want each data product to be independent, its team must manage its own operations, including scheduling and data transformations. A centralized scheduling chain would defeat the purpose: who would manage it? Internal scheduling allows synchronization between the data refresh cycle and other internal processes, like data quality.
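As a rough illustration of internal scheduling, the sketch below keeps the refresh, the data quality check, and the publication step in one pipeline owned by the data product. `InternalPipeline`, the step names, and the sample rows are assumptions made up for this example; in practice the team would use whatever orchestrator the platform blueprint provides.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class InternalPipeline:
    """Hypothetical in-product orchestrator: steps run in registration order."""

    steps: list[tuple[str, Callable[[dict], None]]] = field(default_factory=list)

    def step(self, name: str):
        def register(fn: Callable[[dict], None]):
            self.steps.append((name, fn))
            return fn
        return register

    def run(self) -> dict:
        state: dict = {"log": []}
        for name, fn in self.steps:
            fn(state)  # an exception here stops every later step
            state["log"].append(name)
        return state


pipeline = InternalPipeline()


@pipeline.step("refresh")
def refresh(state: dict) -> None:
    # Illustrative rows standing in for a real data refresh.
    state["rows"] = [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 80.0}]


@pipeline.step("quality_check")
def quality_check(state: dict) -> None:
    # The quality rule lives next to the business logic it protects.
    bad = [row for row in state["rows"] if row["amount"] < 0]
    if bad:
        raise ValueError(f"{len(bad)} rows have a negative amount")


@pipeline.step("publish")
def publish(state: dict) -> None:
    state["published"] = len(state["rows"])


state = pipeline.run()
```

Since the quality check sits between refresh and publish in the same chain, a failed check blocks publication without any external scheduler needing to know about it.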

Data Product — Internal Operations
There will always be cases where you need to carry out operations on the data, like a complete refresh or a restart due to a breaking change. These operations must be possible in production, but that doesn’t mean manual interventions. Instead, we need standardized, codified operations that can be triggered via APIs, giving the data product team full autonomy.
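One way to picture codified operations is a small registry that an API handler could dispatch to. This is only a sketch: the `OPERATIONS` registry, the `operation` decorator, the `trigger` function, and the two operation names are hypothetical, and the HTTP layer is left out entirely.

```python
from typing import Callable

# Hypothetical registry of the operations this data product supports.
OPERATIONS: dict[str, Callable[..., dict]] = {}


def operation(name: str):
    """Register a function as a named, codified operation."""
    def register(fn: Callable[..., dict]):
        OPERATIONS[name] = fn
        return fn
    return register


@operation("full_refresh")
def full_refresh(table: str) -> dict:
    # A real implementation would submit a job; here we only describe it.
    return {"operation": "full_refresh", "table": table, "status": "scheduled"}


@operation("restart")
def restart(from_version: str) -> dict:
    return {"operation": "restart", "from_version": from_version, "status": "scheduled"}


def trigger(name: str, **params) -> dict:
    """Entry point an API handler could call; unknown operations are rejected."""
    if name not in OPERATIONS:
        raise KeyError(f"unknown operation: {name}")
    return OPERATIONS[name](**params)


result = trigger("full_refresh", table="orders")
```

Anything not in the registry simply cannot be run, which is what separates a codified operation from an ad hoc manual intervention in production.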

Data Product — Policies

Policies must be part of the product to ensure compliance with governance rules. The platform team implements these policies, but they should be injected into the execution context of the data product so that they work in step with its data lifecycle. The data product team does not own the implementation of the policies, but it does own their lifecycle and execution within the context of the data product.
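The split between who implements a policy and who executes it can be sketched as simple dependency injection. Everything here is assumed for illustration: `mask_email` stands in for a platform-owned policy, and `DataProductRuntime` for the execution context owned by the data product team.

```python
from typing import Callable, Iterable

# A policy takes a row and returns the row as it may be served.
Policy = Callable[[dict], dict]


def mask_email(row: dict) -> dict:
    """Hypothetical platform-owned policy: hide email addresses."""
    masked = dict(row)
    if "email" in masked:
        masked["email"] = "***"
    return masked


class DataProductRuntime:
    """Hypothetical execution context owned by the data product team."""

    def __init__(self, policies: Iterable[Policy]) -> None:
        # Policies are injected; the product does not implement them.
        self._policies = list(policies)

    def serve(self, rows: list[dict]) -> list[dict]:
        served = []
        for row in rows:
            for policy in self._policies:
                row = policy(row)
            served.append(row)
        return served


runtime = DataProductRuntime(policies=[mask_email])
served = runtime.serve([{"id": 1, "email": "ada@example.com"}])
```

The platform team can ship a new version of `mask_email` without touching the product's code, while the product team still decides when and where in its lifecycle the policies run.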

Data Product Lifecycle Phases:

Data Product lifecycle phases

Data Product Business Case
The first step is to tie the data product to a business process and identify a real opportunity to create value. Usually, this means aligning the data product with strategic business initiatives like OKRs or Lean Value Trees. Without a clear business case, you’re just building tech for tech’s sake.

Data Product Bootstrap
This is where you lay the groundwork: creating the software infrastructure according to the architectural blueprint set by the platform team, setting up the git repositories, and handling the DevOps side of things.

Data Product Development
Now it’s time to develop all internal processes, business logic, and external interfaces (output ports, observability, etc.), preferably based on standardized blueprints. Even if the data already exists, don’t skip this phase. In 99% of cases, the existing setup will be missing something or require adjustments.

Data Product Curation
After developing the data product, it’s time to curate the metadata. This step involves linking the data structures to the business terms defined in the business glossary or ontology. Metadata curation can include defining data-sharing agreements, value propositions, masking policies, etc. You’re still working on git, but nothing has been moved to production yet.

Data Product Validation
Since a data product has a significant application component, it must be tested and validated against all governance policies before going into production. Performing governance checks after the data has already been produced is methodologically wrong. It’s like running unit tests after the software is already live. The only tests that should run post-production are data behavior tests, such as data quality checks. Everything else should be validated upfront.
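A pre-release governance check can be as plain as a function over the product's descriptor. The sketch below assumes a hypothetical descriptor shape (`Descriptor`, `OutputPortSpec`) and three made-up rules; real governance policies would be defined by the platform and federated governance teams.

```python
from dataclasses import dataclass, field


@dataclass
class OutputPortSpec:
    name: str
    schema: dict[str, str]
    description: str = ""


@dataclass
class Descriptor:
    """Hypothetical descriptor kept in git alongside the product's code."""

    name: str
    owner: str
    domain: str
    output_ports: list[OutputPortSpec] = field(default_factory=list)


def validate(descriptor: Descriptor) -> list[str]:
    """Return governance violations; an empty list means the release may proceed."""
    violations = []
    if not descriptor.owner:
        violations.append("missing owner")
    if not descriptor.output_ports:
        violations.append("no output ports declared")
    for port in descriptor.output_ports:
        if not port.schema:
            violations.append(f"output port '{port.name}' has no schema")
        if not port.description:
            violations.append(f"output port '{port.name}' has no description")
    return violations


draft = Descriptor(
    name="risk-exposure",
    owner="",
    domain="risk-finance",
    output_ports=[OutputPortSpec(name="exposures", schema={"customer_id": "string"})],
)
violations = validate(draft)
can_release = not violations
```

Run in the delivery pipeline, a check like this fails the build before anything reaches production, which is the software equivalent of running unit tests before going live.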

Data Product Release
Once validated, the data product can be released. However, the deployment must be atomic: every component (infrastructure, internal processes, business logic, orchestration, external interfaces, and metadata) must go live simultaneously. This is the only way to truly control the product’s end-to-end lifecycle. Simply adding metadata to existing datasets and calling it a data product means nothing.
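An all-or-nothing release can be approximated by deploying components in order and rolling back everything already deployed if any step fails. This is a sketch of that compensation pattern, not a description of any specific platform: the `release` function, the `deploy` and `rollback` callbacks, and the simulated metadata failure are assumptions for illustration.

```python
from typing import Callable

COMPONENTS = [
    "infrastructure",
    "internal_processes",
    "business_logic",
    "orchestration",
    "external_interfaces",
    "metadata",
]


def release(
    components: list[str],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy every component, or undo the ones already deployed on failure."""
    deployed: list[str] = []
    try:
        for component in components:
            deploy(component)
            deployed.append(component)
    except Exception:
        # Undo in reverse order so dependencies are removed last.
        for component in reversed(deployed):
            rollback(component)
        return False
    return True


live: set[str] = set()


def deploy(component: str) -> None:
    # Simulated failure on the last component, for illustration only.
    if component == "metadata":
        raise RuntimeError("metadata failed to deploy")
    live.add(component)


def rollback(component: str) -> None:
    live.discard(component)


succeeded = release(COMPONENTS, deploy, rollback)
```

After the failed attempt nothing is left live, so consumers never see infrastructure or interfaces whose metadata is missing.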

Data Product Monitoring
Once live, a data product must be continuously monitored. This responsibility falls entirely on the data product owner and their team. Every data product represents a promise to the ecosystem, and it’s essential to maintain that promise to build trust and ensure the exchange of value.

Data Product Operations
After going live, the data product team may need to intervene with specific actions in case of incidents or extraordinary maintenance. These are the operations we discussed earlier, which are crucial for ensuring the ongoing production of data.

Data Product Change Management
Every data product should be able to evolve autonomously in response to market opportunities. The data product owner will gather consumer feedback or detect the need for new features. However, the change management of one data product should never impact others. Otherwise, you’re undermining the autonomy and ownership of the other data product owners. If you want to add new columns or tables, you need to adjust the processes generating that data and all the parts we discussed before, which is why they need to be there. Keeping the data and metadata within the product without controlling the business logic is insufficient to maintain ownership and autonomy.
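To keep one product's evolution from impacting others, a team can check a proposed output schema against the published one before releasing. The sketch below uses a deliberately simple, assumed rule set: removing a column or changing its type is breaking, adding a column is not. The `breaking_changes` function and the sample schemas are hypothetical.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes that would break consumers of the old schema."""
    issues = []
    for column, old_type in old.items():
        if column not in new:
            issues.append(f"removed column: {column}")
        elif new[column] != old_type:
            issues.append(f"type changed: {column} {old_type} -> {new[column]}")
    # Columns present only in `new` are additive and not reported.
    return issues


published = {"customer_id": "string", "exposure": "decimal"}

# Adding a column is safe for existing consumers.
additive = {"customer_id": "string", "exposure": "decimal", "rating": "string"}

# Changing a type or dropping a column is not.
incompatible = {"customer_id": "int"}

safe = breaking_changes(published, additive)
unsafe = breaking_changes(published, incompatible)
```

A non-empty result would typically mean releasing a new major version of the output port alongside the old one, rather than changing the existing one in place.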

Change management Impacts
