Introducing Data Product Portal: An open source tool for scaling your data products

Published in conveyordata · 6 min read · Jun 28, 2024

In the fast-evolving world of data, companies are discovering that the key to scaling their data initiatives is not to rely on a single data team to handle all requests from the business. Instead, they want to enable self-service capabilities that allow each domain or department to build its own data products at its own pace and with its own budget.

However, putting this vision into practice can be very challenging, especially when you have to combine data governance, data platforms and data catalogs. It can become messy very quickly, because it is hard to provide an easy-to-understand, consistent view across all these dimensions and technologies.

Today we are very happy to announce the Data Product Portal — an open-source tool to help organisations build and manage data products at scale. It’s intuitive, flexible, and geared towards making data product management straightforward and effective.

What problem does it solve

Imagine you’re building a data pipeline. You take some input data, process it with Python, dbt or another tool, and generate output for others to use. This scenario applies whether you’re using Snowflake, AWS, Databricks, BigQuery, Microsoft Fabric or Starburst.

Typical data pipeline consuming remote data, data from files and databases

In large organisations, you typically have multiple departments or domains, and you can’t simply share all data between all departments by default due to legal, compliance, regulatory or confidentiality reasons. This means that departments or domains have to keep control over how and why their data is being used by others.

Scaling data pipelines across a large organisation with multiple departments

The moment you start sharing data across multiple departments or domains, you are immediately faced with the following questions:

Who has access to what data, when can they use it and for what reason are they using it?

If you are working with data, where do you find and access the data and tooling you need to build your data pipelines?

To answer these questions, companies start data governance initiatives in which they create per-user policies to manage access to data in tools like Ranger, AWS IAM or Snowflake. As people work on multiple use cases, all of these policies accumulate. This “spaghetti” of permissions leads to very broad access for a large group of people. After a while, you end up in the very situation you wanted to avoid in the first place:

Everybody ends up with access to all data and you no longer know why and how your data is being used.
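To make the accumulation concrete, here is a minimal sketch of how per-user grants pile up. The user, use cases and dataset names are hypothetical and only serve as an illustration; they do not come from any specific tool.

```python
from collections import defaultdict

# Hypothetical per-user grants: each one was added for a single use case
# and never revoked afterwards.
grants = [
    ("alice", "churn-model", {"crm.customers", "billing.invoices"}),
    ("alice", "sales-dashboard", {"sales.orders", "crm.customers"}),
    ("alice", "fraud-poc", {"payments.transactions", "billing.invoices"}),
]


def effective_access(grants):
    """Union all grants per user, which is what per-user policies amount to.

    The use case that justified each grant is dropped along the way, so the
    result no longer says why a user can read a dataset.
    """
    access = defaultdict(set)
    for user, _use_case, datasets in grants:
        access[user] |= datasets
    return dict(access)


access = effective_access(grants)
print(sorted(access["alice"]))
```

After three use cases, this one user can read four datasets, and nothing in the resulting permission set records which use case needed which dataset.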

Data products as a governance model

There are many different definitions of what a data product is, and every organisation or person working with data has their own opinion about it. We found the following definition to be useful and applicable to many organisational structures.

We propose to define a data product as: an initiative with a clear goal, owned by a department or domain of the business that consists of the combination of:

Input data: Access to datasets created by a combination of other data products.
Output data: Read/write access to output data that can be combined into a dataset that can be shared with other data products. This data is stored in specific locations (e.g. databases or buckets).
Private data: A safe location to store private/internal data for local processing with no intention of sharing.
Tools and logic: All code and outputs describing your transformations, scheduling and tooling configuration needed to build, access and run your data pipeline separated from other data products.
Team roles: Defined roles of team members that have specific permissions on how to interact with the data product (e.g. data product owner, data engineer, business analyst).

It is important to note that people can work on multiple data products, but they have to choose which data product they are working on at any given time. They then only get access to the tools and data in scope for that data product.
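The definition above can be sketched as a small data structure. This is an illustrative model only, not the Data Product Portal's actual schema: the class name, field names and the scope_for helper are assumptions made for this example, and for brevity every role gets the same permissions.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical model of a data product as defined above."""

    name: str
    domain: str  # owning department or domain
    goal: str
    input_datasets: set = field(default_factory=set)     # read-only
    output_datasets: set = field(default_factory=set)    # read/write, shareable
    private_locations: set = field(default_factory=set)  # read/write, never shared
    tooling: list = field(default_factory=list)          # code, schedules, configs
    roles: dict = field(default_factory=dict)            # user -> role name

    def scope_for(self, user):
        """Return what a user may touch while working on this data product."""
        if user not in self.roles:
            raise PermissionError(f"{user} is not a member of {self.name}")
        writable = self.output_datasets | self.private_locations
        return {
            "role": self.roles[user],
            "read": sorted(self.input_datasets | writable),
            "write": sorted(writable),
        }


churn = DataProduct(
    name="churn-model",
    domain="marketing",
    goal="Predict customer churn",
    input_datasets={"crm.customers"},
    output_datasets={"churn.scores"},
    private_locations={"churn.scratch"},
    roles={"alice": "data engineer"},
)
scope = churn.scope_for("alice")
print(scope)
```

Because access is derived from the data product a person has selected rather than from the person, switching to another data product yields a different, equally narrow scope.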

In this definition, data products are not only data assets that are the output for sharing with others, but also the tooling, artefacts and roles of people that interact with that data product.

When multiple data products start interacting, your data product lineage will start looking like this:

How different data products are interacting with each other

The main benefits of this approach are:

Clear Data Usage: You always know who is using your data and why.
Simplified Access Management: Easier to handle access requests and data revocations.
Natural Data Lineage: Understand how your data flows from one data product to another.
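The first and last benefits follow directly from the model: once every data product declares its inputs and outputs, lineage and usage can be derived rather than documented by hand. The sketch below illustrates this with hypothetical product and dataset names; it is not how the Data Product Portal computes lineage.

```python
# Hypothetical data products, each declaring the datasets it consumes and shares.
products = {
    "sales-ingest": {"inputs": set(), "outputs": {"sales.orders"}},
    "crm-ingest": {"inputs": set(), "outputs": {"crm.customers"}},
    "churn-model": {
        "inputs": {"sales.orders", "crm.customers"},
        "outputs": {"churn.scores"},
    },
    "retention-dashboard": {"inputs": {"churn.scores"}, "outputs": set()},
}


def lineage_edges(products):
    """Return (producer, dataset, consumer) triples derived from declarations."""
    producer = {
        dataset: name
        for name, product in products.items()
        for dataset in product["outputs"]
    }
    edges = []
    for name, product in products.items():
        for dataset in sorted(product["inputs"]):
            if dataset in producer:
                edges.append((producer[dataset], dataset, name))
    return edges


def consumers_of(dataset, products):
    """Answer 'who is using my data?' for a single dataset."""
    return sorted(
        name for name, product in products.items() if dataset in product["inputs"]
    )


edges = lineage_edges(products)
for source, dataset, target in edges:
    print(f"{source} --[{dataset}]--> {target}")
print(consumers_of("crm.customers", products))
```

The "why" comes from the consuming data product itself: each consumer is an initiative with a stated goal and an owner, not an individual user with an unexplained grant.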

Adopting the data product governance model is a powerful way to scale data initiatives across departments while keeping both control and self-service capabilities. However, this governance model is only useful if you also have something that manages and applies these principles.

Introducing the Data Product Portal

The Data Product Portal is a practical tool designed to simplify the creation and management of data products at scale. It is useful both for people working with data and for those who oversee data governance and want control over how their data is being used.

Data product model translation to policies and configuration

This is where the Data Product Portal comes into play. It helps you translate these concepts into a practical implementation that follows the model consistently across tools and technologies.

Guided Setup: Step-by-step assistance involving the right stakeholders for creating data products, requesting access, adding users and registering new data for sharing with other data products.
Tech Translation: Converts high-level concepts into specific configuration settings for platforms like AWS, Azure, Databricks, Snowflake, and others, making sure that data products are correctly separated and do not impact each other.
User-Friendly Interface: Makes it easy for business users and people working with data to understand and navigate the data landscape.
Self-service: Enables departments and teams to start new data initiatives easily without having to depend on a central team.
Comprehensive Overview: Combines data catalogs, data platforms and data governance aspects into a single 360-degree overview of all ongoing data initiatives.
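To give a feel for what "Tech Translation" means, the sketch below turns a data product's declared inputs and outputs into an AWS IAM-style policy document for S3. The to_iam_policy function and the bucket names are hypothetical, and the output is not what the Data Product Portal actually generates; it only shows the kind of mapping from model to configuration.

```python
import json


def _bucket_arns(buckets):
    """Return the bucket ARN and the object ARN pattern for each bucket."""
    arns = []
    for bucket in sorted(buckets):
        arns.append(f"arn:aws:s3:::{bucket}")
        arns.append(f"arn:aws:s3:::{bucket}/*")
    return arns


def to_iam_policy(input_buckets, output_buckets):
    """Build a policy: read-only on inputs, read/write on outputs."""
    statements = []
    if input_buckets:
        statements.append({
            "Sid": "ReadInputs",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": _bucket_arns(input_buckets),
        })
    if output_buckets:
        statements.append({
            "Sid": "ReadWriteOutputs",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
            "Resource": _bucket_arns(output_buckets),
        })
    return {"Version": "2012-10-17", "Statement": statements}


# Hypothetical buckets for a single data product.
policy = to_iam_policy({"sales-data"}, {"churn-output"})
print(json.dumps(policy, indent=2))
```

Generating one such policy per data product, rather than per user, is what keeps data products separated: anything not declared as an input or output simply does not appear in the policy.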

How to get started

The Data Product Portal is available as an open source project on GitHub. Getting started is as simple as running docker compose up and visiting localhost:8080.

We invite you to check us out and give us a star if you like what you see. Your contributions are invaluable to us — whether it’s through feedback, suggestions, or direct involvement in development.

For Kubernetes deployments, check out our Helm chart here.

Do you have questions or want to share your thoughts? Join our community on Slack and connect with us directly. We can’t wait to hear from you!

We are excited to bring the Data Product Portal into your hands as an open source initiative. Don’t hesitate to share with us what you think about it!