What Exactly is a Data Product?
This blog has taken much longer to write than I had expected. Everyone in my circles who read an initial version had a different take on what a data product is. If data products are to be successful, I feel it is essential that we have a clear and unambiguous understanding of what a data product is and, equally important, what it is not. After much deliberation, I present to you my (evolving) point of view on data products.
So, let’s go.
ChatGPT has taken the world by storm because of how eloquently it answers our questions and the outstanding experience it provides for users. Thanks, in part, to the conversational AI tool’s simple interface, we no longer have to search, receive multiple links, and then try to formulate answers ourselves, as the chatbot provides them for us. Given the results generated, can ChatGPT be considered a data product?
To get the answer, we asked the chatbot itself. This is what it replied:
ChatGPT can be considered a data product in the sense that it relies on large amounts of data, specifically text data, to generate responses to user inquiries. The model was trained on a vast corpus of text data, which it uses to understand the context and generate relevant responses. However, ChatGPT is also more than just a data product, as it is a complex system that combines machine learning algorithms, natural language processing techniques, and sophisticated software engineering to create a seamless user experience.
Let’s see how much water this answer holds.
First, ChatGPT is easily accessible to all (at the time of writing), although that may change if access becomes restricted or paid. It provides an excellent user experience, remembering the intent of previous questions and then forming a chain of answers. In this way, it truly smells and behaves like a product.
Chatbots based on large language models have huge potential to change many aspects of the digital world. But, let’s be real here. The answers can’t always be trusted. ChatGPT fails the trust (accuracy) attribute. More on that later, as I will be sharing my thoughts on what constitutes the key attributes of data products in a future blog. But first, let’s define the essential characteristics of data products.
Defining Data Products
One lesson we have learned is to stick to the problem statement and not get embroiled in “defining” stuff. Definitions are a slippery slope that no two people can agree on, and they take the focus away from solving problems.
For instance, the first question skeptics often ask is, what is a data product? And are we “data product washing” any data output because it sounds, oh, so sophisticated? For some people, a data product is not anything new. For others, a data product is less about the tangible outcome and more about how one builds and deploys it. Indeed, applying product management best practices is a crucial defining factor of data products.
Besides the product management aspects, a data product offers superior, consistent, and reliable data access which allows consumers to get answers to their questions (or a chain of questions) to support business decisions or outcomes. A data product stands out on two important characteristics: user experience and trust. It also has an owner who is accountable for its quality and reliability. It is a self-contained interface to get answers to all kinds of business questions and, in most cases, is consumed via a self-service interface. Most data products are read-only.
The figure below summarizes the components of a data product.
A data product is a combination of:
- Data sets, which may be in a table, a view, an ML model, or a stream. The data may be raw data or curated data integrated from multiple data sources. The data product must publish its data model.
- Domain model which adds a semantic layer. This layer abstracts the technical layout of the storage layer and instead exposes the business-friendly terms to the end-users. This layer also stores various calculations, metrics, and the transformation business logic.
- Access to data via APIs and other visualization options and with access control policies enforced.
A Data Products Catalog is also critical as it is used to make data products discoverable with all the necessary attributes documented. See my recent Forbes article for further details. This catalog may not be a standalone product but an extension of the existing data catalog. It acts as a marketplace for data products.
Put simply, a data product conveys trust and the product features meant to solve business problems. A data product has measurable value. It has an owner who is responsible for delivering value throughout the product’s lifecycle from design to retirement.
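To make the components above concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not a reference implementation: the class names (DataSet, DataProduct, DataProductCatalog), the field names, and the example endpoint are all hypothetical and chosen for this post.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataSet:
    name: str
    kind: str  # e.g. "table", "view", "ml_model", or "stream"
    schema: dict[str, str]  # published data model: column -> type


@dataclass
class DataProduct:
    # All names here are illustrative assumptions, not a standard.
    name: str
    owner: str  # the person or team accountable for the product
    version: str
    data_sets: list[DataSet]
    # Semantic layer: business-friendly term -> physical column or expression
    business_terms: dict[str, str] = field(default_factory=dict)
    metrics: dict[str, str] = field(default_factory=dict)
    # Access: how the product is consumed and who may consume it
    access_endpoints: list[str] = field(default_factory=list)
    allowed_roles: set[str] = field(default_factory=set)

    def missing_components(self) -> list[str]:
        """List the components that are absent from this product."""
        missing = []
        if not self.owner:
            missing.append("owner")
        if not self.data_sets:
            missing.append("data sets")
        if not self.business_terms and not self.metrics:
            missing.append("semantic layer")
        if not self.access_endpoints:
            missing.append("access endpoints")
        if not self.allowed_roles:
            missing.append("access control")
        return missing


class DataProductCatalog:
    """A tiny in-memory stand-in for a data products catalog."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        missing = product.missing_components()
        if missing:
            raise ValueError(f"{product.name} is missing: {', '.join(missing)}")
        self._products[product.name] = product

    def search(self, keyword: str) -> list[str]:
        """Find products by name or by business term."""
        keyword = keyword.lower()
        return sorted(
            name
            for name, product in self._products.items()
            if keyword in name.lower()
            or any(keyword in term.lower() for term in product.business_terms)
        )


customer_master = DataProduct(
    name="customer_master",
    owner="marketing-data-team",
    version="1.0",
    data_sets=[DataSet("customers", "table", {"cust_nm": "string"})],
    business_terms={"Customer Name": "cust_nm"},
    access_endpoints=["sql://analytics/customer_master"],  # made-up endpoint
    allowed_roles={"sales_analyst", "finance_analyst"},
)

catalog = DataProductCatalog()
catalog.register(customer_master)
```

In this sketch, a bare table with no owner, semantic layer, or access policy would be rejected by register, which mirrors the point that a data set alone is not a data product.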
The rest of the document goes into further details. But we have already committed a common faux pas, which is to jump quickly to technology to find answers. Data products are most often built for the business users. And hence, it’s more appropriate to look at the business view first, before diving back into the technical aspects.
Business characteristics
Let’s be honest: business users don’t really care how IT people label and categorize technology, because they are focused on solving the organization’s problems. So, if the IT staff has to explain a data product to the business, the explanation must be free of technical jargon.
Today, users have to go to a dashboard for analytical answers, to an ML model for prescriptive ones, and to search databases for diagnostic queries. A data product offers unified, self-contained access to get answers to different types of questions: diagnostic, predictive, prescriptive, analytical, etc.
To separate an actual data product from business lingo, let’s get some help from the physical world of products. Imagine a box of your favorite cereal. The box has the goods (say, Cinnamon Toast Crunch), and a description of its ingredients, nutrition details, expiration date, etc., and a price. The cereal is definitely a product that you can find in the designated aisle of a grocery store and purchase.
Now, imagine you can somehow procure the cereal without the box. Is it still a product? After all, the cereal hasn’t changed, but based on the above explanation, it is no longer a product.
To start with, it can’t give you the experience. If we have questions about the freshness of the content, we have no way of knowing when it might go bad, nor can we go to the brand producer and request a refund. We don’t know who the manufacturer is and, in this form, the cereal no longer carries any information about its brand, nor does it convey the “trust” and “experience” that the packaging provides.
Now, let’s apply this analogy to digital objects. Just like physical products have a brand, digital products must have an identity. This identity comprises a label, tag, user consent, purpose, and a statement of trust and reliability. Beyond its core function, a data product may include the output table schema, data dictionary, data distribution, semantic layer, metrics, agreements and contracts, and other telemetry, including intermediate snapshots, to help service the product at runtime.
Today, the documentation, business logic, metrics, etc. exist but are not a part of the table. They are an afterthought and are kept out of band, such as on a SharePoint site or in various BI tools. The result is that this documentation soon goes out of sync as the schemas evolve. Also, if an individual uses a different data access tool, then the logic may not be available. This is common in the traditional approach, which leads to duplication of effort and increases the chances of errors.
To summarize, simply publishing a data set does not make it a data product. It must have the other components — a product management process, the domain wrapper comprising a semantic layer, business logic and metrics, and access.
Data products should also serve broad domain use cases. For example, the marketing team may collect customer data from multiple available sources, such as Salesforce, SAP, Marketo, Hubspot, website logs, surveys, etc., and produce a “customer master.” This foundational data asset can then be combined with the rest of the components discussed earlier and be packaged as the marketing team’s data product. Other domains, like sales and finance, can trust its data and use it to derive their own outcomes or even build their own data products. Data products make data agreements more transparent and actionable between data producers and consumers.
Now that we have defined the data product from a business point of view, let’s turn to the technical definition of a data product.
Technical Characteristics
A data product abstracts the physical storage location of the content, which may be built using data sources that are on-premises or in multiple cloud providers. It also hides the complexity of the data pipelines from the data consumers. That pipeline may involve data movement, data virtualization, in-memory, caching, a lakehouse, or a fabric.
This abstraction is similar to how a consumer does not have to think about how their cereal was manufactured, packaged, or transported. In the past, we expected the business to understand technology to be most effective. In the fresh approach, the business can expect to get the same consistent outcome that they get every time they buy a box of Cinnamon Toast Crunch, without having to know any details.
The technical definition is incomplete without documenting the non-functional attributes that the business needs, like repeatable experience, reliability, concurrency, response time, uptime, etc. More on that later, as I will cover the process of building data products in yet another blog.
While there is very little agreement in the industry on which of the following is a data product, let’s examine each:
- Table, schema, or a view
- Data warehouse, data mart
- Report or a dashboard
- ML model, advanced analytics
Table, Schema, View
A table by itself is not a data product because it may have references to keys in other tables. In other words, it may not be self-contained. Ok. Some people disagree as a table can easily be self-contained by flattening or denormalizing it. But does that constitute a data product?
It can be if we combine it with a semantic layer as our definition states. Why is this important?
Remember, a data product is giving its user a superior self-service user experience without needing to know the physical details. In addition, it’s abstracting the user from changes in the source schema. When the schema changes, the data product owner creates a new version of the data product and makes it available in the data product catalog. In other words, product management aspects are critical for a data product to be called one.
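As a rough illustration of how a semantic layer and versioning can shield consumers from a source schema change, consider the sketch below. The class names, column names, and the rename scenario are hypothetical; a real product would also handle type changes, deprecation windows, and catalog publication.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProductVersion:
    version: int
    # Semantic mapping: business term -> physical column in the source
    term_to_column: dict[str, str]


class VersionedDataProduct:
    """Hypothetical data product that exposes business terms only."""

    def __init__(self, name: str, term_to_column: dict[str, str]) -> None:
        self.name = name
        self.versions = [ProductVersion(1, dict(term_to_column))]

    @property
    def latest(self) -> ProductVersion:
        return self.versions[-1]

    def on_source_rename(self, old_column: str, new_column: str) -> ProductVersion:
        """The owner publishes a new version that absorbs a column rename."""
        updated = {
            term: (new_column if column == old_column else column)
            for term, column in self.latest.term_to_column.items()
        }
        new_version = ProductVersion(self.latest.version + 1, updated)
        self.versions.append(new_version)
        return new_version

    def read(self, rows: list[dict], terms: list[str], version: int | None = None) -> list[dict]:
        """Return rows keyed by business terms, hiding physical column names."""
        chosen = self.latest if version is None else self.versions[version - 1]
        return [
            {term: row[chosen.term_to_column[term]] for term in terms}
            for row in rows
        ]


product = VersionedDataProduct(
    "customer_master",
    {"Customer Name": "cust_nm", "Region": "rgn_cd"},
)

# Version 1: the source table uses the original column names.
rows_v1 = [{"cust_nm": "Acme", "rgn_cd": "EMEA"}]
before = product.read(rows_v1, ["Customer Name", "Region"])

# The source renames a column; the owner publishes version 2.
product.on_source_rename("cust_nm", "customer_name")
rows_v2 = [{"customer_name": "Acme", "rgn_cd": "EMEA"}]
after = product.read(rows_v2, ["Customer Name", "Region"])
```

The consumer asks for "Customer Name" and "Region" in both cases and gets the same shape of answer; only the owner had to know that the physical column changed.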
If each table and its metrics become a data product, we will soon have an unmanageable mess. Also, what is the point of a data product if each table is one?
Data Warehouse, Data Mart
Data products should follow the shift-left principle, and be created by the domain teams for an unbounded set of use cases. A data product more closely aligns with business domain entities, events, and their interactions and behaviors. The data product owner is accountable for delivering the data product’s agreed quality, although the responsibility for defining data quality lies with the data consumer, based on their requirements.
Data marts were built to answer very specific business domain questions, so they surely must be data products, right? The answer is no. Data marts, data warehouses, data lakes, and lakehouses are data management platforms as opposed to being data products. Traditionally, a data mart is an IT deliverable that arrives after a long and tedious data warehouse build, by which time business needs may have already changed. If the product management approach were applied to a data mart, then it could be used to develop data products. In addition, a data mart product should be agile and support various modes of visualization, advanced analytics, and query engines.
Report, Dashboard
A report or a dashboard is one of the components of a data product. It has access to data and metrics. The access can be via APIs, or a language like SQL. It must have a designated product owner, and be built using product management principles.
ML Model, Advanced Analytics
An ML model, like customer churn or sentiment analysis, follows the same criteria as defined above for reports and dashboards. It is a component of a data product.
To Recap
Let’s review our understanding of the business and technical characteristics of data products by looking at yet another example. Imagine if a business user’s goal is to be able to analyze monthly active users (MAU) of their SaaS product with accurate and up-to-the-minute data. Then imagine if they want to be able to compare against historical data and also predict the MAU based on configurable parameters.
To meet this requirement, they would need unified analysis of historical data, streaming transactional data, and predictive analysis. The data producer, the marketing department in this case, is responsible not only for providing the data but also for the access control policies that adhere to the relevant regulations and for the APIs or the GUI. This is an example of a data product. It abstracts data from structured databases as well as semi-structured log files using a semantic layer which contains the necessary formulae and calculations. The API access endpoints should support various options, such as HTTP/JSON, GraphQL, SQL, etc.
As the demand for data ratchets up, fault lines are appearing in our current data architectures. Traditional architectures were built for an era where a set of tables could satisfy most requirements of reports and dashboards. But as the number of data sources, users, and use cases has grown exponentially, the toolset on top of centralized data has fragmented, as have the roles. Data consumers today are savvy and have high expectations. They want data to be responsive, high quality, reliable, and available at a predictable cost, and no longer want to be treated as beta testers by the data teams. Trust and the user experience of analyzing data are paramount.
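The MAU example can be sketched in a few lines of Python. This is a toy under stated assumptions: events are simple (user_id, timestamp) pairs, the "prediction" is a naive growth-rate projection standing in for a real model, and the class name, role names, and sample data are all made up for illustration.

```python
from __future__ import annotations

from datetime import datetime


def monthly_active_users(events: list[tuple[str, datetime]]) -> dict[str, int]:
    """Count distinct users per calendar month (the semantic-layer metric)."""
    users_by_month: dict[str, set[str]] = {}
    for user_id, timestamp in events:
        users_by_month.setdefault(timestamp.strftime("%Y-%m"), set()).add(user_id)
    return {month: len(users) for month, users in sorted(users_by_month.items())}


def forecast_next_month(mau: dict[str, int], growth_rate: float) -> int:
    """Naive projection: latest month's MAU scaled by a configurable rate."""
    latest_month = max(mau)  # "YYYY-MM" strings sort chronologically
    return round(mau[latest_month] * (1 + growth_rate))


class MauDataProduct:
    """Hypothetical product unifying historical and streaming events."""

    def __init__(self, historical, stream, allowed_roles: set[str]) -> None:
        self._historical = list(historical)
        self._stream = list(stream)
        self._allowed_roles = allowed_roles

    def query(self, role: str, growth_rate: float = 0.0) -> dict:
        # Access control is enforced by the product, not by each consumer.
        if role not in self._allowed_roles:
            raise PermissionError(f"role {role!r} may not query this product")
        mau = monthly_active_users(self._historical + self._stream)
        return {"mau": mau, "forecast": forecast_next_month(mau, growth_rate)}


historical = [
    ("u1", datetime(2023, 1, 5)),
    ("u2", datetime(2023, 1, 9)),
    ("u1", datetime(2023, 1, 20)),  # repeat visit, counted once
]
stream = [
    ("u1", datetime(2023, 2, 1)),
    ("u3", datetime(2023, 2, 2)),
    ("u2", datetime(2023, 2, 3)),
    ("u4", datetime(2023, 2, 3)),
]

product = MauDataProduct(historical, stream, allowed_roles={"product_manager"})
result = product.query("product_manager", growth_rate=0.25)
```

The consumer calls one interface and receives both the historical metric and the projection, without knowing which records came from the warehouse and which from the stream.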
There is a sense of urgency in organizations that want to drive innovation and increase their competitive advantage. The current approach to data is leaving data teams constrained and unable to deliver at the speed at which the business teams are devising new ways to drive intelligence from their data assets. The data teams need to stop obsessing about the new cloud data warehouse or the new lakehouse, but instead rethink how to delight their business counterparts, aka their customers. This is what makes the data products concept transformational.
It is crucial to get the data products definition correct so that we have a common understanding. According to Wikipedia, E.F. Codd defined his 12 rules for relational databases, published in 1985, to “prevent the vision of the original relational database from being diluted, as database vendors scrambled in the early 1980s to repackage existing products with a relational veneer.” As they say, history repeats itself.