Why Does “Data as a Product” Need Data Discovery?

Published in Data And Beyond

7 min read · Apr 17, 2023

There is a contradiction in how we develop data products for external and internal consumers. External consumers are our customers, clients, or end users, who are the foundation of the profit we generate. Internal consumers are the data, product, operations, or marketing teams, who are the foundation the organization relies on to build successful, profitable products. While we pay enormous attention to our external consumers, as we always should, we care far less about our internal consumers.

According to Bill Inmon, the data and business units, the internal consumers, were divorced a long time ago. As a data leader, I can find various reasons why data teams have disappointed the business units in recent years. In my opinion, the most important reason is that they lost trust in the data we produce and how we serve it. That trust was broken by continuously changing KPI definitions and measurements, never-stable dashboards, black-box data science teams and machine learning products, missing lineage for data flows, and so on.

Ensuring data quality with data testing and validation, and establishing data reliability with data observability and service level agreements, are measures data teams can take behind the scenes as back-end services. If we think of our data organization as a product, having strong back-end services is not enough. We also need easy-to-use, engaging front-end services that make life easier, so we can engage and retain our internal consumers and support them with self-service tooling.

In this article, I will introduce the basics of metadata management, data catalogs, why they are not used, why we need “Google for Data” platforms through data discovery, and how data catalogs are evolving into data discovery platforms.

Metadata Management

As a data consumer, have you ever tried to answer a seemingly easy question about your data, only to end up with more questions along the way? For example: what does customer_id mean, what is the difference between user_id and customer_id, and how do we populate the created_at and updated_at columns when a user converts into a customer? If you can’t answer questions like these on your own without too much effort, your organization has serious data management issues.

Before we dive deep into the data discovery topic, we need to learn the basics of metadata management.

What is metadata?

Metadata is data about data — it provides information that helps you understand what your data is, where it comes from, how it’s structured, and how it’s used. This can include things like the data’s name, description, format, source, quality, lineage, and so on.

For example, if you have a digital photo, the metadata might include information like the date and time the photo was taken, the type of camera that was used, the resolution, and even the location where the photo was taken.
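
The same idea applies to a table in a data warehouse. As a rough sketch, the metadata for a table could be modeled as a small record that answers questions such as “what is customer_id?”. The class names, the owner, the source, and every column definition below are hypothetical and chosen only for illustration; they are not taken from any specific catalog tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnMetadata:
    name: str
    data_type: str
    description: str


@dataclass
class TableMetadata:
    name: str
    description: str
    owner: str    # team accountable for the table (hypothetical)
    source: str   # upstream system the data comes from (hypothetical)
    columns: List[ColumnMetadata] = field(default_factory=list)

    def describe(self, column_name: str) -> str:
        """Return a one-line, human-readable definition of a column."""
        for column in self.columns:
            if column.name == column_name:
                return f"{self.name}.{column.name} ({column.data_type}): {column.description}"
        raise KeyError(f"{column_name} is not documented for {self.name}")


# Illustrative definitions only; a real organization would supply its own.
customers = TableMetadata(
    name="customers",
    description="One row per user who has completed a purchase.",
    owner="crm-data-team",
    source="orders service",
    columns=[
        ColumnMetadata("customer_id", "integer", "Identifier assigned when a user first converts."),
        ColumnMetadata("user_id", "integer", "Identifier assigned at sign-up."),
        ColumnMetadata("created_at", "timestamp", "Time the customer row was created."),
    ],
)

print(customers.describe("customer_id"))
```

With a record like this in place, a consumer can look up a column definition without asking the data team.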

What is a data catalog?

In simple terms, a data catalog is like a library catalog for data. It’s a centralized inventory of all the different types of data that an organization has, including information about where the data came from, how it was collected, what it represents, and who is allowed to access it.

You can think of a data catalog as a metadata management tool. A data catalog uses metadata to create a searchable, organized inventory of an organization’s data assets. By collecting, managing, and presenting metadata in a standardized way, a data catalog makes it easier for users to discover and understand the data they need to do their jobs.
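
To make the “searchable inventory” idea concrete, here is a minimal in-memory sketch. It assumes a catalog is just a collection of entries that can be searched by keyword across names, descriptions, and tags. The `DataCatalog` class, the entry fields, and the sample tables are all hypothetical; real catalog products are far richer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    description: str
    owner: str
    tags: Tuple[str, ...] = ()


class DataCatalog:
    """A toy catalog: register data assets, then search their metadata."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return the sorted names of entries whose metadata mentions the keyword."""
        needle = keyword.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if needle in entry.name.lower()
            or needle in entry.description.lower()
            or any(needle in tag.lower() for tag in entry.tags)
        )


catalog = DataCatalog()
# Sample assets are invented for this example.
catalog.register(CatalogEntry("crm.customers", "Users who completed a purchase.", "crm-data-team", ("pii",)))
catalog.register(CatalogEntry("sales.orders", "One row per order placed by a customer.", "sales-data-team"))
catalog.register(CatalogEntry("web.page_views", "Raw clickstream events.", "web-analytics-team"))

print(catalog.search("customer"))  # ['crm.customers', 'sales.orders']
```

The search only works as well as the descriptions and tags that were entered, which is why the quality of the metadata matters more than the tool itself.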

How to produce metadata for the data catalog?

The term “shift left” originally comes from the software development industry and refers to the idea of moving testing and quality assurance processes earlier in the development cycle. In the context of data catalog tools, “shift left” refers to the practice of starting metadata management and governance processes as early as possible in the data lifecycle.

By shifting metadata management and governance processes to the left — i.e., earlier in the data lifecycle — organizations can improve data quality, reduce the risk of compliance violations, and make it easier for users to find and use data effectively.

For example, a data catalog tool might incorporate shift left capabilities by providing automated metadata discovery features that scan and classify data assets as they are created or ingested, rather than waiting until the data has already been processed. This can help to ensure that metadata is accurate and up-to-date from the outset, and can make it easier to find and use data effectively.
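
A minimal sketch of this shift-left idea, assuming rows arrive as Python dictionaries and the catalog is a plain dictionary: metadata is captured inside the ingestion step itself rather than in a later documentation pass. The `ingest` and `infer_schema` functions and the recorded fields are hypothetical, not the behavior of any particular tool.

```python
from datetime import datetime, timezone


def infer_schema(rows):
    """Infer a column -> type-name mapping from a list of dict rows."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            if value is None:
                # A null tells us nothing about the type; keep what we know.
                schema.setdefault(column, "unknown")
                continue
            type_name = type(value).__name__
            previous = schema.get(column, "unknown")
            # Conflicting types across rows are flagged as "mixed".
            schema[column] = type_name if previous in ("unknown", type_name) else "mixed"
    return schema


def ingest(table_name, rows, storage, catalog):
    """Store the rows and refresh the table's metadata in the same step."""
    storage.setdefault(table_name, []).extend(rows)
    all_rows = storage[table_name]
    catalog[table_name] = {
        "columns": infer_schema(all_rows),
        "row_count": len(all_rows),
        "last_ingested_at": datetime.now(timezone.utc).isoformat(),
    }


storage, catalog = {}, {}
ingest(
    "customers",
    [
        {"customer_id": 1, "email": "a@example.com"},
        {"customer_id": 2, "email": None},
    ],
    storage,
    catalog,
)
print(catalog["customers"]["columns"])  # {'customer_id': 'int', 'email': 'str'}
```

Because the catalog entry is refreshed on every ingestion, the metadata cannot drift far from the data it describes.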

Why is the Adoption of Data Catalogs Too Low?

Despite the potential of data catalogs, their adoption remains low. Now and then we hear stories of organizations that make big investments in data catalogs and data governance tools but stop investing after one or two years of work. But why? Here are my thoughts:

For whom they are built: As always, data teams make their usual mistake and develop products without considering who they are building them for. Data catalogs are a perfect example of how ignoring product thinking can ruin a great idea and its adoption. While we ask our stakeholders to use the data catalogs more often so they don’t bother us now and then, we give them unattractive tools that are nearly impossible for non-technical folks to use.

No MVP Mindset: An MVP mindset is essential in any product development. People need to interact with the outcomes between releases, and you need to collect as much feedback as possible and optimize the process. If you start the data catalog project with a data governance and GDPR mindset, either you are going to be fired after two years for not delivering any impact on the business, or you are going to be the only person who admires the work.

Lack of advertising from data teams: Every product should be advertised properly. We can’t expect users to knock on our doors to learn how to use these products. Unfortunately, data teams don’t pay much attention to spreading the word about the benefits of data catalogs and what can be achieved when they are used correctly. As a result, many organizations may not fully understand the value of data catalogs in managing and using data assets.

Why Does “Data as a Product” Need Data Discovery?

Data served as a product is one of the most complex products a data team can build. The data team needs to build end-to-end data quality, data reliability, and observability functionalities to make sure the data is populated and served as expected.

Even when these are handled in the best possible way, using the data product can still be complex. End users may not understand complex data warehouse or data lake concepts, column definitions can be inadequate, the lineage of columns can be hidden, and accessing the required information can be harder than applying for citizenship as an expat.

ChatGPT has broken down the walls around the complexity of AI and brought it to millions of non-technical people around us. In the same manner, when we serve data as a product, we need to establish easy-to-use access and discovery capabilities over the data itself so that everyone from data scientists to marketing team members can benefit equally.

What is a data discovery platform?

A data discovery platform is a solution that enables organizations to explore and analyze their data to extract insights and gain a better understanding of their data assets.

Data discovery platforms typically offer a range of functionalities, such as data profiling, data visualization, data search, and machine learning, to help users explore and analyze their data more intuitively and efficiently. These platforms can help organizations uncover hidden patterns or insights in their data, and ultimately make better, data-driven decisions.

Data discovery platforms are especially useful for organizations that generate and collect large amounts of data from a variety of distributed sources, as they enable users to easily access, explore, and analyze this data, regardless of its format or location.
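
Data profiling, one of the functionalities mentioned above, can be illustrated with a very small sketch. It assumes the data is a list of dictionaries and computes only a few basic statistics per column. The `profile_column` function and the sample rows are hypothetical; real platforms compute far more and do so at scale.

```python
def profile_column(rows, column):
    """Compute basic profile statistics for one column of dict rows."""
    values = [row.get(column) for row in rows]
    non_null = [value for value in values if value is not None]
    return {
        "row_count": len(values),
        # Share of rows where the column is missing or null.
        "null_ratio": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
    }


# Sample rows are invented for this example.
rows = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 2, "country": "DE"},
    {"customer_id": 3, "country": None},
    {"customer_id": 4, "country": "TR"},
]

print(profile_column(rows, "country"))
# {'row_count': 4, 'null_ratio': 0.25, 'distinct_count': 2}
```

Surfacing numbers like these next to a column’s definition helps a consumer judge whether the data is usable before writing a single query.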

Data Discovery vs Data Catalogs

While data discovery platforms and data catalog platforms share some similarities, they serve different purposes and offer different functionalities. Here are some key differences between the two:

Scope and focus: Data discovery platforms are primarily focused on exploring and analyzing data to uncover insights and patterns, while data catalog platforms are focused on organizing and managing data assets.
Data exploration vs. data management: Data discovery platforms are designed to support data exploration and analysis, often through data visualization tools, data profiling, and machine learning algorithms. Data catalog platforms, on the other hand, are designed to support data management and governance, often through metadata management, data lineage tracking, and data cataloging.
User audience: Data discovery platforms are typically used by data analysts and data scientists who need to explore and analyze data to extract insights and make informed decisions. Data catalog platforms, on the other hand, are typically used by data stewards, data managers, and other data professionals who need to manage and govern data assets across the organization.
Technical capabilities: Data discovery platforms often have more advanced analytical and visualization capabilities, as well as support for integrating and analyzing complex and large data sets. Data catalog platforms, on the other hand, often have more advanced metadata management capabilities, such as support for data lineage tracking, data quality profiling, and data classification.

Conclusion

In conclusion, migrating from data catalog platforms to data discovery platforms can bring several benefits to organizations. While data catalogs help manage and organize data, data discovery platforms go a step further by offering self-service access, data exploration, and insights. These platforms provide a more comprehensive view of the data and enable users to interact with it in real time, leading to faster and more informed decision-making. Furthermore, data discovery platforms often incorporate machine learning and artificial intelligence capabilities that allow for more advanced data analytics and predictive modeling. In today’s data-driven business landscape, organizations stay competitive by leveraging their data assets to their fullest potential, and migrating to data discovery platforms can help them achieve this goal and support business growth.

Written by Seckin Dinc