Data Products have been widely discussed, but much of the discourse is more theoretical than practical: it focuses too much on the data aspect and neglects the product component, which is crucial to the whole concept.
This post is in two sections: a bit about why we should document our Data Products, and then how.
The Data Lake — a Library of Information
Imagine a library. An ancient library filled with rows upon rows of books and ledgers. There are cardboard boxes containing reams of information that hasn't been sorted (raw data). In the far corner there are even ticker tape machines delivering information in real-time (streams) and terminals that allow the curious visitor to retrieve very specific information, if they know how to input the right query (APIs). There may even be a locked library within the library containing information only suitable for a select few (PCI/PII data). (Fun fact: my father once spent days in such a library researching werewolves for a medical paper, but that's definitely another story!) This is your Data Lake. Perhaps there is an aisle of shelves that contains well-organised ledgers that are mostly up-to-date, with a sheaf (really) of Librarians in attendance, ready to answer questions. This is your Data Warehouse.
Now imagine a visitor to the library. The hapless visitor is looking for some information, but where to start amongst the rows and rows of data? The visitor might ask a Librarian: one of a select few who hold 'tribal' knowledge about the contents of the library. This is how libraries must have been prior to the introduction of the Dewey Decimal system and the Library Index Card.
And this is how libraries created a Product that served information to their visitors and improved the experience for everyone. OK, and probably resulted in a lot of Librarians being laid off.
And so it goes for all the Data that our companies accumulate.
We need to create the proverbial Index Cards that describe the Data that is available, how to use it, its reliability, and so on.
And this is where Data Products come in.
What do we mean by a Data Product?
For the purposes of this post — and for practical and actionable reasons, I’ll define a Data Product as:
- A Dataset, Table or View — served through a Query Engine (BigQuery, Dremio, Snowflake, etc.)
By Dataset here I mean one or more Tables that are used in conjunction — e.g. Dimension and Fact Tables
- A Stream of data served through Kafka with an associated Schema
- A Machine Learning Model
Or perhaps its output, which would then most likely fall under either Stream or Table
- An API
There are certainly other ‘things’ that could be described as a Data Product, but this will cover the majority of data assets that really benefit from a more Product focused approach.
Is the product the documentation (the index card) or the data (the books)? It makes little difference; there really shouldn’t be any consumer-facing data being created without documentation — they should be considered as one and the same.
Is a Data Product specific to one Table (or Topic, etc.)?
In a word, no: there is a many-to-many relationship between Data Products and Tables. To illustrate this, consider how central the customer is to an Organisation. We may want to look at their billing history, or perhaps the history of our contact with the customer. We might have two Data Products:
- Customer Billing History
- Customer Contact History
Both of these Data Products will likely have fact tables containing the detailed billing or contact information over time. But using either Data Product likely means that you need to join that data to the customer table, a dimension table common to both.
The Customer Dimension Table will be consumed by both Data Products, and each Data Product consists of two tables. In reality there may be more than a couple of Tables.
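The many-to-many relationship can be sketched as a simple mapping; the product and table names here are illustrative, not prescriptive:

```python
# Illustrative mapping of Data Products to the tables they consume.
# Names are hypothetical; real products may span many more tables.
data_products = {
    "customer_billing_history": ["dim_customer", "fact_billing"],
    "customer_contact_history": ["dim_customer", "fact_contact"],
}

# Invert the mapping to see which products consume each table.
table_consumers: dict[str, list[str]] = {}
for product, tables in data_products.items():
    for table in tables:
        table_consumers.setdefault(table, []).append(product)

# The shared customer dimension is consumed by both products.
print(table_consumers["dim_customer"])
# → ['customer_billing_history', 'customer_contact_history']
```

Inverting the mapping like this is also exactly what a Catalogue needs to answer "who consumes this table?" for impact analysis.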
These underlying tables may be the responsibility of other Domains and that’s a good thing: The Customer Domain are the Subject Matter Experts of Customer data and it’s their responsibility to expose a Customer Data Domain Table as a Data Product — it’s within their Bounded Context.
Why do we need to document Data Products?
One of the key tenets of Data Mesh, and a key enabler for companies to scale effectively, is the emphasis on serving data to the consumers that require it. It is not sufficient to just make a table of data available; consumers need to understand the context, quality, lineage and other aspects of metadata to make informed decisions without having to do extensive manual research and conduct their own validity tests. Data Products must be designed with affordance.
Affordance — a use or purpose that a thing can have, that people notice as part of the way they see or experience it
Build-Time and Run-Time
The type of information relevant for defining a Data Product can be loosely split into "Build Time" information, known when the Data Product is initially created (or updated), and "Run Time" information, which should be appended to the Data Product definition once it is actively being run or maintained in Production. (Although "Live" may be a better term, as Data Products running in lower environments also need to be well documented to serve developers.)
Data Mesh proposes that the Data Product is self-describing. Practically speaking, this is possible, with some limitations. Much of the Product itself is well understood when it's built: the Product Manager has an idea of what the Product does and who it's intended to serve, and the Data Developers building it know a lot of the technical considerations. This information is held in individuals' minds, on wiki pages, in user stories, design documents and code repositories. The run-time platform contains a lot of metadata that enriches the Product Documentation after it has been deployed.
The self-serve aspect of documenting a Data Product relies on automation to pull that information from the various sources and attach it to the Data Product user-facing documentation system — the Catalogue.
Standards can be a blocker to rapid delivery if they’re too onerous to adopt and can limit innovation if they are inflexible and unadaptable. This proposal is intended to offer an easy to adopt and adaptable framework for defining Data Product documentation, but is not a minimum standard that must be completed in its entirety — it is intended to evolve with the needs of the Organisation.
Without precision, documentation can be ambiguous — definitions should use a Business Glossary — or taxonomy or ontology — to standardise Naming Conventions and Business Terms.
The minimum or mandatory information to get started is highlighted with [REQUIRED] (and in red in the diagram).
(Information known when the Data Product is being designed and built)
[REQUIRED] The name of the Data Product. This should comply with Organisational naming standards and be intentional — i.e. the name should clearly relate to what the Data Product is. “Does what it says on the tin”
[REQUIRED] A detailed description of what data is included in the Data Product. It should be as descriptive as possible. Where multiple 'assets' (e.g. Dimension and Fact tables) are part of the Data Product, these should also be specified, with links to their respective documentation.
Availability/ Intended Audience
It’s worth highlighting that Data Products may have different intentions. A table, for example, may be created by a Data Engineering team for use by anyone wanting that Data. Equally, an ‘Analytics Engineer’ may create a table purely for use by their team.
If the data is filtered, the criteria should be expressly recorded.
Details of any partitions that are applied to help with read performance and efficiency.
Frequently Asked Questions
It’s helpful to consumers of a Data Product (actually any Product) to have some FAQs to refer to. The Product Manager should be able to define many of these at the Product definition stage, but these should be enriched with questions asked by users of the Data Product (think about capturing information from Slack conversations).
(Other messaging platforms exist of course!)
[REQUIRED] Which team can be contacted to answer questions about the Data Product, or to whom bug reports and new feature requests can be submitted.
This should ideally be hooked into either the email platform or Slack channel.
Feature Requests/ Bug Reports
How and where to raise feature requests and bug reports.
- Source Code
A link to where the source code of the Data Product lives. This encourages transparency and openness, allows users to inspect the underlying code without needing to engage with the Product Engineers, and allows other users to contribute features through an 'Inner Source' model.
- Project details
A link to the Jira project (other ‘thing’ trackers also exist).
- Developer Platform details
If applicable a link to the project on the Developer Platform.
[REQUIRED] Where the Data Product is located: the fully qualified location of the dataset, in the form of a link if possible.
If there are multiple locations for the data, they should all be referenced here.
The environment (Production, Development, Staging) should also be clearly marked.
A link to the contract associated with the Data Product served on a specific platform.
Shameless plug to my post on Data Contracts
The Schema of the Data Product, including any relevant descriptions and metadata. This should correspond to the Version of the Product.
If the Data Product is dependent on any upstream Data Products, this should be highlighted here. Ideally this would be derived from the Data Platform lineage graph, but it's good to be explicit when the Data Product is defined.
Are there any Access Restrictions? How does a User get access to the Data Product?
Does the Data Product contain any Sensitive Information (PII, PCI, etc.)?
How to respond to Right to be Forgotten (RTBF) or Subject Access Requests (SAR).
Are there any risks associated with this Data Product? This is possibly more applicable to ML Products. Risks could be associated with known issues in upstream data that this Product relies on. This should include links to a Risk Register if one exists.
Are there any ethical concerns with this Data Product? This is mostly applicable to ML Products. Have they been reviewed? If so, what was the conclusion?
What is the retention policy for the data?
Any business specific terms used in the documentation or applicable to the dataset. These can be defined here, or linked through to a Business Glossary.
No-one knows the Data Product better than the Data Product Manager or the Data Product Engineers who built it. There should also be an understanding of how the Data Product should be used by Users, and this should be presented in the form of Reference Queries: optimised queries that are well annotated.
Very much related to this is which keys are shared between related Data Products to allow efficient Joins. This should also include example joins.
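As an illustration, a Reference Query for the hypothetical billing product above might document the shared key explicitly in an annotated query. The schema and data here are invented; a minimal in-memory SQLite database stands in for the real Query Engine:

```python
import sqlite3

# Annotated Reference Query for a hypothetical billing Data Product.
# The comment calls out the shared key used to join fact and dimension.
REFERENCE_QUERY = """
SELECT c.region, SUM(b.amount) AS total_spend
FROM fact_billing AS b
JOIN dim_customer AS c
  ON b.customer_id = c.customer_id   -- shared key between the two tables
GROUP BY c.region
ORDER BY c.region
"""

# In-memory stand-in for the real warehouse, with illustrative data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER, region TEXT);
CREATE TABLE fact_billing (customer_id INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US');
INSERT INTO fact_billing VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")
print(conn.execute(REFERENCE_QUERY).fetchall())
# → [('EU', 150.0), ('US', 75.0)]
```

The point isn't the query itself but the annotation: a consumer can see at a glance which key joins the two tables without reverse-engineering it.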
Service Level Objective/ Agreements
What are the Service Level Objectives of the Data Product. Are there any Service Level Agreements in place? This may be in the form of a more generic ‘tiering’ — e.g. this is considered a Tier 2 Data Product and Tier 2 Data Products can tolerate 1% downtime per year.
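A tiering scheme like that can be made machine-checkable. This is a minimal sketch; the tier numbers and downtime allowances are hypothetical:

```python
# Hypothetical tiering scheme: allowed downtime as a fraction of the year.
# e.g. Tier 2 tolerates 1% downtime per year, as described above.
TIER_MAX_DOWNTIME = {1: 0.001, 2: 0.01, 3: 0.05}

def within_slo(tier: int, observed_downtime_fraction: float) -> bool:
    """Check observed downtime against the tier's yearly allowance."""
    return observed_downtime_fraction <= TIER_MAX_DOWNTIME[tier]

print(within_slo(2, 0.008))  # True: 0.8% is under the 1% Tier 2 allowance
```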
What Version of the Product is this? What are the previous Versions, and when were they released? Include Release Notes.
I recommend using semantic versioning — Semantic Versioning 2.0.0
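As a sketch of why semantic versions are useful for Data Products, the MAJOR.MINOR.PATCH components can be parsed and compared mechanically (pre-release and build metadata are ignored here for brevity):

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse a MAJOR.MINOR.PATCH string into a comparable tuple."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

# Tuples compare element by element, so newer versions sort higher.
assert parse_semver("2.0.0") > parse_semver("1.9.12")

def is_breaking_change(old: str, new: str) -> bool:
    """Under SemVer, a MAJOR bump signals a backwards-incompatible change."""
    return parse_semver(new)[0] > parse_semver(old)[0]

print(is_breaking_change("1.4.2", "2.0.0"))  # True
```

This lets consumers (and automation) decide at a glance whether an upgrade is safe to take or needs migration work.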
What’s changed in the latest Version of this Data Product. What new capabilities are there and what bugs have been fixed.
Data Products have a life cycle. They go through a testing phase — perhaps a ‘beta’ phase — and at some point they will need to be deprecated. Valid values could be something like: ‘Beta’, ‘Generally Available (GA)’, ‘End of Life’.
(Information only known when the Data Product is ‘run’ in Production)
This is where a capable Metadata Platform is invaluable.
Data Profiling is a task that the majority of users will perform almost every time they access a Data Product. Providing up-to-date Data Profiles in advance is both extremely useful to Users and cost effective, eliminating much of the compute cost associated with this task.
Data Profiling may be available from either a Data Quality tool or from the Metadata Platform.
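As a minimal sketch of what a pre-computed profile might contain, here are basic column-level statistics (row count, nulls, min/max, distinct values) over illustrative rows; real profiling tools compute far richer statistics:

```python
def profile_column(rows, column):
    """Compute basic profile statistics for one column of a row set."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "distinct": len(set(non_null)),
    }

# Hypothetical sample of the billing fact table.
rows = [
    {"customer_id": 1, "spend": 120.0},
    {"customer_id": 2, "spend": None},
    {"customer_id": 3, "spend": 80.5},
]
print(profile_column(rows, "spend"))
```

Publishing stats like these alongside the Data Product saves every consumer from re-running the same full-table scans.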
Data Quality is one of the main drivers of Trust in a Data Platform. Any issues regarding Data Quality should be appended to the Data Product — and the history recorded — so that Users can quickly assess the Quality of a Data Product and decide whether it's fit for their Use Case.
The mechanics of Data Quality are outside the scope of this paper.
Data Availability (Last update timestamp)
When was the Data Product last updated? This will be subtly different between Streaming and Data-at-Rest. For Streams it may be message-rate data (e.g. n messages processed in the last window), while for Tables it might be "data was written at a given timestamp". Bonus points for providing partition information where appropriate.
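A freshness indicator can then be derived from that timestamp automatically. This is a sketch under assumed inputs; the 26-hour allowance for a daily-batch table is hypothetical:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(last_updated: datetime, max_age: timedelta) -> str:
    """Compare the last-update timestamp against the allowed staleness."""
    age = datetime.now(timezone.utc) - last_updated
    return "fresh" if age <= max_age else "stale"

# Hypothetical daily-batch table allowed to be up to 26 hours old.
last_write = datetime.now(timezone.utc) - timedelta(hours=3)
print(freshness_status(last_write, timedelta(hours=26)))  # fresh
```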
Parse the query logs from the compute platform and extract the query data.
Common queries can then be appended to the Data Product (Table) metadata.
This is valuable from a User's perspective (understanding who else is querying that table, and how), to the Data Product Manager (Product Usage metrics), and to the Data Platform team, who can identify inefficient queries and proactively reach out to help optimise expensive queries before they incur too much cost.
Data Lineage — downstream usage of a table, as opposed to Provenance (upstream sources) — is also possible once the relevant Query Logs have been parsed. Lineage also helps the owner of the Data Product identify who is impacted by upcoming or proposed changes and notify them well in advance.
Most queries are likely to contain Joins. The parsed query log data can give an indication of which Tables are most likely to be joined with the Data Product, enabling the beginnings of a Knowledge Graph.
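A deliberately naive sketch of extracting join partners from logged SQL is below. The regex only catches table names directly after FROM/JOIN; a real implementation would use a proper SQL parser to handle aliases, subqueries and quoting. The log entries are invented:

```python
import re
from collections import Counter

def joined_tables(query: str) -> list[str]:
    """Naively extract table names that follow FROM/JOIN keywords."""
    return re.findall(r"(?:from|join)\s+([\w.]+)", query, flags=re.IGNORECASE)

# Hypothetical query log entries touching the billing Data Product.
query_log = [
    "SELECT * FROM fact_billing b JOIN dim_customer c ON b.customer_id = c.customer_id",
    "SELECT c.region, SUM(b.amount) FROM fact_billing b "
    "JOIN dim_customer c ON b.customer_id = c.customer_id GROUP BY 1",
]

join_counts = Counter()
for query in query_log:
    tables = set(joined_tables(query))
    tables.discard("fact_billing")  # count the partners of the product's own table
    join_counts.update(tables)

print(join_counts.most_common(1))  # [('dim_customer', 2)]
```

Aggregated over real logs, counts like these show which tables are most often joined to the Data Product, seeding the Knowledge Graph mentioned above.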
Ease of Adoption
Any standard for Data Product documentation has to be easy to adopt, and to make it so, emphasis should be placed on automation.
Tasks could be added to the deployment pipelines of data 'assets' to create a basic readme.md in the relevant GitHub repository, seeded with sufficient information to create a 'stub' of documentation. This would include Name, Description and Owner at least. That 'stub' would be picked up by the Data Catalogue, and a task could then be raised automatically on the team's Jira board to enrich the Data Product documentation.
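The pipeline step that seeds the stub could be as simple as templating the required fields. The field names mirror the template at the end of this post; the product, owner and location values are hypothetical:

```python
# Minimal readme.md stub covering the [REQUIRED] fields only.
STUB_TEMPLATE = """\
# Build Time

## Name [REQUIRED]
{name}

## Description [REQUIRED]
{description}

## Owner [REQUIRED]
{owner}

## Location [REQUIRED]
{location}
"""

def make_stub(name: str, description: str, owner: str, location: str) -> str:
    """Seed a readme.md with the minimum required documentation fields."""
    return STUB_TEMPLATE.format(
        name=name, description=description, owner=owner, location=location
    )

stub = make_stub(
    name="customer_billing_history",
    description="Billing events per customer, joined to the customer dimension.",
    owner="#team-billing-data",
    location="warehouse.billing.customer_billing_history",
)
print(stub)
```

The pipeline would then commit this file to the repository for the Catalogue to pick up.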
Run Time enrichment
Automation from the Data Platform team would append the run time information to the Data Product.
Thank you for making it this far!
The idea behind all of this is to propose, perhaps, a template for the proverbial Library Index Card. No doubt this will vary by Organisation and Platform, but I hope this was of interest and maybe even helps as Organisations embark on their Data-as-a-Product journey.
Please let me know what you think.
# Build Time
<!-- Information known when the Data Product is being designed and built -->
## Name [REQUIRED]
<!-- The name of the Data Product -->
## Description [REQUIRED]
<!-- This should be a detailed description of what data is included in the Data Product -->
## Availability/ Intended Audience
<!-- Who is the intended audience -->
## Filters/ Exclusions
<!-- The criteria should be expressly recorded -->
## Partition Strategy
<!-- Details of any partitions that are applied -->
## Frequently Asked Questions
<!-- Common Questions -->
## Owner [REQUIRED]
<!-- Which team can be contacted to answer questions about the Data Product -->
## Feature Requests/ Bug Reports
<!-- How and where to raise feature requests and bug reports -->
### Source Code
<!-- Code repo -->
### Project details
<!-- A link to the project board -->
### Developer Platform details
<!-- A link to the project on the Developer Platform -->
## Location [REQUIRED]
<!-- The fully qualified location of the dataset -->
## Data Contract
<!-- A link to the contract associated with the Data Product served on a specific platform -->
## Schema
<!-- The Schema of the Data Product -->
## Data Provenance
<!-- Upstream dependency graph -->
## Access Requirements
<!-- How does a User get Access to the Data Product -->
## Sensitive Data
<!-- Personally identifiable information (PII), personal health information (PHI), and payment card industry (PCI) -->
## Regulatory compliance
<!-- How to respond to Right to be Forgotten (RTBF) or Subject Access Requests (SAR) -->
## Risks
<!-- Are there any risks associated with this Data Product -->
## Ethical Concerns
<!-- Are there any ethical concerns with this Data Product -->
## Retention Policy
<!-- What is the retention policy for the data -->
## Glossary Terms
<!-- Any business specific terms used in the documentation or applicable to the dataset -->
## Reference Queries
<!-- Optimised, well-annotated queries -->
## Shared Keys
<!-- Keys shared between related Data Products -->
## Service Level Objective/ Agreements
<!-- What are the Service Level Objectives of the Data Product -->
## Version
<!-- Version of the Product -->
### Release Notes
<!-- What's changed in the latest Version of this Data Product -->
## Lifecycle Status
<!-- 'Beta', 'Generally Available (GA)', 'End of Life' -->
# Run Time
<!-- Information only known when the Data Product is 'run' in Production -->
## Data Profile
<!-- Metadata statistics at the column/ field level -->
## Data Quality
<!-- Data Quality information -->
## Data Availability (Last update timestamp)
<!-- Last update timestamp -->
## Common Queries
<!-- Common queries from the main SQL platforms -->
## Data Lineage
<!-- Downstream usage of a dataset -->
## Related Tables
<!-- Datasets closely related to the Data Product -->