Data Discovery with DataHub

Seckin Dinc
11 min read · Apr 19, 2023


Photo by Callum Chapman on Unsplash

In today’s data-driven world, companies generate and collect massive amounts of data that can be leveraged to gain insights into their business operations, customers, and competitors. However, extracting valuable insights from this data requires the right tools and technologies. This is where data discovery platforms come into play.

Data discovery platforms are powerful tools that allow businesses to access, search, and analyze large amounts of data from various sources in real time. They provide an intuitive and user-friendly interface that enables users to discover, visualize, and share insights quickly and easily.

In this article, I will introduce you to DataHub, The #1 Open Source Data Catalog.

What is DataHub?

Image courtesy of https://github.com/datahub-project/datahub

DataHub is a modern data catalog built to enable end-to-end data discovery, data observability, and data governance. This extensible metadata platform is built for developers to tame the complexity of their rapidly evolving data ecosystems and for data practitioners to leverage the total value of data within their organization.

Installation and Deployment

DataHub offers self-service and managed deployment options. Self-service is the open-source option, and it is the one we will proceed with.

Installing the DataHub CLI:

python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub

Running the DataHub containers:

datahub docker quickstart
Image by the author

After the deployment is done, we can access DataHub through our browser:

http://localhost:9002/

Image by the author

DataHub Architecture Overview

Image courtesy of https://datahubproject.io/docs/architecture/architecture

There are three main highlights of DataHub’s architecture:

Schema-first approach to Metadata Modeling

DataHub’s metadata model is described using a serialization-agnostic language. Both REST and GraphQL APIs are supported. In addition, DataHub supports an Avro-based API over Kafka to communicate metadata changes and subscribe to them.
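
As a quick illustration of pushing metadata through these APIs, below is a minimal sketch using the acryl-datahub Python emitter installed earlier. It writes a dataset description over REST; the dataset name, the description text, and the quickstart metadata service address (http://localhost:8080) are assumptions on my side, not prescribed values.

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Connect to the quickstart metadata service (assumed to be on localhost:8080)
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Build a metadata change proposal that sets a description on a dataset
mcp = MetadataChangeProposalWrapper(
    entityUrn=builder.make_dataset_urn(
        platform="postgres", name="dvdrental.public.actor", env="PROD"
    ),
    aspect=DatasetPropertiesClass(description="Actors appearing in rental films"),
)
emitter.emit(mcp)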

Stream-based Real-time Metadata Platform

DataHub’s metadata infrastructure is stream-oriented, which allows changes in metadata to be communicated and reflected within the platform within seconds. You can also subscribe to changes happening in DataHub’s metadata, allowing you to build real-time metadata-driven systems. For example, you can build an access-control system that observes a previously world-readable dataset adding a new schema field containing PII and locks that dataset down for access-control review.
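
To make the streaming idea concrete, here is a rough sketch (not the official DataHub Actions framework) of listening to the metadata change log on Kafka. It assumes the quickstart’s Kafka broker is reachable on localhost:9092 and that the default topic name is in use; a real PII watchdog would decode the Avro payloads via the schema registry and inspect the schema aspects.

from confluent_kafka import Consumer

# The consumer group name is purely illustrative
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "pii-watchdog",
    "auto.offset.reset": "latest",
})
# Default topic DataHub uses for versioned metadata change log events
consumer.subscribe(["MetadataChangeLog_Versioned_v1"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Payloads are Avro-encoded; decode them with the schema registry before
    # checking whether a newly added field is classified as PII.
    print(f"metadata change event received ({len(msg.value())} bytes)")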

Federated Metadata Serving

DataHub comes with a single metadata service (gms) as part of the open-source repository. However, it also supports federated metadata services which can be owned and operated by different teams — in fact, that is how LinkedIn runs DataHub internally. The federated services communicate with the central search index and graph using Kafka, to support global search and discovery while still enabling decoupled ownership of metadata. This kind of architecture is very amenable to companies that are implementing a data mesh.

What makes DataHub different?

DataHub is not the only open-source data catalog in the market. There are other tools developed by big organizations and used by many companies, e.g. Amundsen, Apache Atlas, etc. Beyond the technology comparisons, DataHub has set itself apart through the capabilities below:

LinkedIn effect

LinkedIn is one of the few companies whose mission and vision I am aligned with. Besides their strong commercial product, their contribution to the open-source community is mind-blowing. As a data leader over the last 15 years, my path has crossed with many teams using the tools they developed, such as Kafka.

Strong and connected community

It is impossible not to be impressed by the commitment and continuous contribution of the DataHub community. Most open-source projects give their contributors a Slack channel to communicate, but I haven’t seen such continuous dedication to monthly Town Halls before. If you check the previous meeting notes, you can easily compare them to commercial product teams that don’t make half of these efforts!

Feature and roadmap transparency

Open-source projects are easy to start but hard to scale in a sustainable way. Many open-source projects face the same fate: they are no longer embraced by the community after a period of time. DataHub excels at this point and keeps its roadmap not only transparent but also up to date.

Managed demo service

As a technical leader, when I want to deep dive into a product, I like to read through the documentation, install it on my laptop in an isolated environment, and test the capabilities quickly. Most of the time this strategy works fine, but sometimes, due to dependency constraints or hardware problems, it can be a nightmare.

Also, I don’t much like the “quickly book a demo session” commercial approach, which creates too much artificial pressure and commitment over a session with tech and salespeople from that company.

DataHub serves as a perfect solution for people like me: https://demo.datahubproject.io/. They give you a fully managed service with everything installed upfront so you can play with the tool on your own. It is a victory for introverted technology geeks like me; something we have been craving for a long time!

End-to-End Data Discovery Project with DataHub

Database Configurations

In this project, I will use my local Postgres database as the main data source to connect and scan.

If you don’t have a Postgres database on your local machine, you can install it from this link. For the sample database, I will use the DVD rental database, which contains the 15 tables below:

  • actor — stores actors’ data including first name and last name.
  • film — stores film data such as title, release year, length, rating, etc.
  • film_actor — stores the relationships between films and actors.
  • category — stores film’s categories data.
  • film_category — stores the relationships between films and categories.
  • store — contains the store data including manager staff and address.
  • inventory — stores inventory data.
  • rental — stores rental data.
  • payment — stores customer’s payments.
  • staff — stores staff data.
  • customer — stores customer data.
  • address — stores address data for staff and customers.
  • city — stores city names.
  • country — stores country names.
  • language — stores the languages of films.

Example 1: Metadata ingestion

DataHub supports both push-based and pull-based metadata integration.

Push-based integrations allow you to emit metadata directly from your data systems when metadata changes, while pull-based integrations allow you to “crawl” or “ingest” metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner. Supporting both mechanisms means that you can integrate with all your systems in the most flexible way possible.

DataHub supports various types of ingestion sources, e.g. Postgres, BigQuery, Great Expectations, dbt, etc. Ingestion can be done either through the UI or programmatically. For simplicity, I will use the UI approach:
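
For completeness, the same run can also be expressed programmatically. The sketch below uses DataHub’s ingestion Pipeline from the Python package; the connection details are placeholders for my local dvdrental database, and running it also requires the Postgres plugin (pip install 'acryl-datahub[postgres]').

from datahub.ingestion.run.pipeline import Pipeline

# Recipe equivalent of the UI wizard: a Postgres source with column profiling
# enabled, writing into the local DataHub instance over REST.
pipeline = Pipeline.create({
    "source": {
        "type": "postgres",
        "config": {
            "host_port": "localhost:5432",
            "database": "dvdrental",
            "username": "postgres",
            "password": "postgres",          # placeholder credentials
            "profiling": {"enabled": True},  # "Enable Column Profiling" in the UI
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
})
pipeline.run()
pipeline.raise_from_status()

Now, back to the UI flow.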

On the upper-right panel, we click on the Ingestion button.

Image by the author

As we click the Ingestion button, we are presented with an ingestion wizard. The first step is to define the source from which we are going to ingest metadata. As you can see below, DataHub supports a wide range of sources, from reporting tools to event streaming platforms.

We search for Postgres and proceed.

Image by the author

In the second step, we are setting up our connection parameters.

Image by the author

In the Advanced section, “Enable Column Profiling” is unselected by default. I select it to generate column profiling.

Image by the author

In the third step, we are configuring our scheduling parameters.

Image by the author

In the fourth and last step, I give the ingestion a name, and that is all.

Image by the author

After you click Save & Run, a job is automatically created that runs the metadata ingestion from Postgres. Below is a screenshot showing how I ran it several times against the same source:

Image by the author

Example 2: Post-ingestion analyses

Now that we have run the ingestion, the next step is to view the information retrieved from our data source. In order to do that, we need to go back to the home page, where the icon below pops up to indicate that we have successfully ingested Postgres and that there is information behind it.

Image by the author

When we click on the icon, we are taken to the homepage of the related source. On this page, we can filter our databases, schemas, tables, etc.

Image by the author

As an alternative, we can use the search bar to find the information we need, like a “Google for data”.

Image by the author
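
The same search is also exposed through DataHub’s GraphQL API, which is handy for scripting. The sketch below assumes the quickstart GraphQL endpoint at http://localhost:8080/api/graphql with authentication disabled, and uses “payment” purely as an example search term.

import requests

# GraphQL search for datasets matching a keyword
query = """
query {
  search(input: { type: DATASET, query: "payment", start: 0, count: 5 }) {
    searchResults {
      entity { urn }
    }
  }
}
"""

resp = requests.post("http://localhost:8080/api/graphql", json={"query": query})
resp.raise_for_status()

# Print the URNs of the matching datasets
for result in resp.json()["data"]["search"]["searchResults"]:
    print(result["entity"]["urn"])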

Example 3: 360-degree view of our table

DataHub provides an easy-to-navigate, easy-to-manage, and easy-to-use UI. When we proceed to the table section, we are welcomed with various information about our table.

On the left panel, we have sections ranging from Schema to Validation. These sections help us understand all the relevant information about our table. On the right panel, we can add more metadata about the table, such as Owners, Tags, and Glossary Terms.
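
The right-panel metadata can also be managed from code. Below is a minimal sketch, reusing the Python emitter from earlier, that attaches an illustrative “pii” tag and an owner to one of the ingested tables; the tag name, user, and table are my own placeholders.

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    GlobalTagsClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
    TagAssociationClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = builder.make_dataset_urn("postgres", "dvdrental.public.customer", "PROD")

# Tag the table as containing personal data and assign a data owner
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=builder.make_tag_urn("pii"))])
owners = OwnershipClass(
    owners=[
        OwnerClass(owner=builder.make_user_urn("seckin"), type=OwnershipTypeClass.DATAOWNER)
    ]
)

for aspect in (tags, owners):
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect))

Back in the UI, each left-panel section surfaces this information visually.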

In the Schema section, we can see the column types and their related integrity information.

Image by the author

The Documentation tab enables us to enter any semantic information about the table and potential links.

Image by the author

The Stats section captures descriptive statistics about each column along with sample values.

Image by the author

Example 4: Schema change detection

Thanks to its active metadata collection architecture, DataHub can capture metadata changes in the sources through the defined ingestions. To demonstrate this capability, I will create a dummy table in the database and alter its columns between ingestion runs.

-- Creating new table 
create table schema_change_test (a int, b int);

insert into schema_change_test values (1,2);

-- Alter 1: changing column information
alter table schema_change_test alter column b type float;

-- Alter 2: Dropping column
alter table schema_change_test drop column b;

When I proceed to the table, the Schema section now shows only column “a”. In order to see the versions of the table, I click the item on the far right under the Schema section. By default, we see the latest version of the schema.

Image by the author

If we want to see a previous version of the table, we can just select the intended one. Below, I selected the initial version, and we can see column “b”.

Image by the author

Example 5: Data Landscape and DataHub usage statistics

DataHub not only collects metadata from the sources but also applies the same approach to collecting platform usage statistics. In the top-right section, we can click the Analytics button to get a full view of the available usage statistics.

Image by the author

Example 6: User group and permission management

DataHub enables user and permission management within the platform itself. We can invite users to the platform, define groups, and assign permissions to roles and policies.

Image by the author
Image by the author
Image by the author

Example 7: DataHub Lineage

Lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream.

In order to demonstrate the lineage capabilities, I will use an open-source COVID-19 dataset in BigQuery. Below you can see the Lineage section of our table. In the Filter section, we can choose which degree of dependencies we want to see.

Image by the author

If we want to visualize the lineage, we just need to click the Visualize Lineage button.

Image by the author
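
Lineage edges can also be pushed programmatically when your pipelines know their own inputs and outputs. The sketch below uses the lineage helper from the DataHub Python package; the upstream and downstream table names are hypothetical and only illustrate the shape of the call.

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical tables: daily_rentals is derived from the rental table
upstream = builder.make_dataset_urn("postgres", "dvdrental.public.rental", "PROD")
downstream = builder.make_dataset_urn("postgres", "dvdrental.public.daily_rentals", "PROD")

# Build a metadata change event declaring the upstream -> downstream dependency
lineage_mce = builder.make_lineage_mce([upstream], downstream)
emitter.emit(lineage_mce)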

For further lineage capabilities and recorded demo sessions, you can check this link.

Conclusion

In conclusion, if you are looking for a reliable and efficient data catalog platform, DataHub is an excellent choice. With its user-friendly interface, powerful search capabilities, and seamless integration with various data sources, DataHub makes it easy to find, access, and manage data assets within your organization. Whether you are a data analyst, scientist, or engineer, DataHub can help you save time and effort by streamlining the data discovery process. With its intuitive features and robust functionality, DataHub is sure to become an indispensable tool for your organization’s data-driven success.

Thanks a lot for reading 🙏

If you are interested in data quality and data product topics, you can check my other articles.

If you want to get in touch, you can find me on Linkedin and Mentoring Club!
