No more pain in finding the right data: Data Discovery, Data Catalog, and Data Lineage with DataHub

Paul Anekpattanakij
6 min read · Nov 2, 2022


DataHub is a 3rd-generation metadata platform that enables Data Discovery, Collaboration, Governance, and end-to-end Observability, built for the Modern Data Stack.

Why is Data Discovery important for data management in an organization?

Have you ever seen this situation: a new data scientist joins the company, is assigned to a project, and is given a long list of requirements to build a new model?

The first question he will have is: “Where should I get all the related data?”

Photo credit: unsplash.com

The easy and simple method: he reaches out to a colleague, a data scientist, data engineer, data analyst, or whoever he can ask. Unfortunately, he may or may not find the related data, and even if he does find it, he cannot tell where it came from. Is it identical to the source, or has it already been transformed?

As a result, after spending a lot of time asking a lot of people, he still cannot be confident that he has the right data to use.

This is where Data Catalog, Data Discovery, and Data Lineage come in to solve these issues.

What are Data Catalog and Data Discovery? And how are they different?

Oracle defines a Data Catalog as:

“Simply put, a data catalog is an organized inventory of data assets in the organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.”

A Data Catalog is a way to organize an organization’s data by using its metadata. Data Discovery, meanwhile, simply put, surfaces a real-time understanding of the data’s current state as opposed to its ideal or “cataloged” state.

Example of metadata from a public dataset on Google BigQuery

That is why every company should have a data dictionary: it helps the related teams (data scientists, data analysts, business data users, etc.) explore and use data efficiently and with confidence, in less time. Additionally, most data dictionary platforms update themselves dynamically and automatically, so we spend less time on old-fashioned manual documentation and end up with more accurate documents.

Currently, there are tools from many companies that aim to solve this issue.

For example:

  • Amundsen from Lyft
  • DataHub from LinkedIn
  • Metacat from Netflix
  • Atlas from Apache
  • Dataportal from Airbnb
  • etc.

Most of them are open-source projects; however, you have to make sure you pick the right platform for your organization, because each one has its own capabilities and limitations.

For this example, we are going to use DataHub, since it is easy to install (it is Docker-based). Additionally, the DataHub framework is highly configurable and scalable, able to support anything from small companies to large enterprises.

First glance at DataHub

On the dataset view dashboard, we can see the schema of each table in detail: field types, descriptions, and the data owner of the dataset. Users can then explore the data with a clear explanation and reach out to its owner if they have any questions about the data.

Setting up DataHub

There are two prerequisites that you have to prepare before installation:

  • Python 3.x or higher
  • Docker

First, we have to install the DataHub CLI package from pip:

$ pip install acryl-datahub

Then we have to pull the Docker images with the command below. (This step may take quite a long time.)

$ datahub docker quickstart

If the installation succeeds, you will see the message below.

✔ DataHub is now running
Ingest some demo data using `datahub docker ingest-sample-data`,
or head to http://localhost:9002 (username: datahub, password: datahub) to play around with the frontend.
Need support? Get in touch on Slack: https://slack.datahubproject.io/

Now that we have the system ready, let's learn the core concept of DataHub.

The core concept: Source and Sink

The concept is very simple: where does the metadata come from (the source), and where is its destination (the sink)? We put this information in a recipe to let DataHub know how to manage the metadata for us.

DataHub already supports the most frequently used data warehouses, databases, and even BI tools as data sources.

And we can set the sink destination to be DataHub itself or a JSON file.
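To make the source-and-sink idea concrete, here is a minimal sketch of a recipe expressed through DataHub's Python ingestion API instead of the UI. It assumes the BigQuery plugin is installed (pip install 'acryl-datahub[bigquery]'); the project name is a placeholder, the exact source options depend on the connector version, and the server address is the local quickstart deployment.

from datahub.ingestion.run.pipeline import Pipeline

# A recipe is simply a "source" plus a "sink".
pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery",
            # Placeholder project; credentials usually come from
            # GOOGLE_APPLICATION_CREDENTIALS or explicit config.
            "config": {"project_id": "my-gcp-project"},
        },
        "sink": {
            "type": "datahub-rest",
            # The metadata service started by `datahub docker quickstart`.
            "config": {"server": "http://localhost:8080"},
        },
    }
)

pipeline.run()                # pull metadata from the source, push it to the sink
pipeline.raise_from_status()  # fail loudly if the ingestion reported errors

The same recipe can be written as a YAML file and run with the datahub CLI, or configured entirely from the UI, which is what we will do next.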

Let's get started with DataHub

First, we have to create an ingestion recipe for our data sources.

In the configuration, we can choose the data source type, such as Google BigQuery; then we have to provide the project and credential information for connecting to the data source.

We can set a schedule to let DataHub refresh metadata from the data source. This keeps our data catalog updated automatically, without any manual work.

If the import runs properly, we will find the new datasets in the dataset list.

Not only metadata: let's explore Data Lineage on DataHub

One of the cool features of DataHub is data lineage. It gives data users a better picture of where the data comes from, and it gives the data engineering team information they can use to investigate issues.

If the company is already using Apache Airflow, we can install the Airflow plugin:

$ pip install acryl-datahub[airflow]

Then add the DataHub connection information in Airflow and modify airflow.cfg to send lineage data to DataHub; the data lineage will then be synced automatically.

Even if you do not use Airflow, you can still write code to emit custom lineage to ensure that you have the proper information in DataHub.
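For instance, here is a minimal sketch of emitting custom lineage with the acryl-datahub Python emitter. The table names are placeholders, and the server address assumes the local quickstart deployment.

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Declare that one downstream table is built from two upstream tables.
lineage_mce = builder.make_lineage_mce(
    [
        builder.make_dataset_urn("bigquery", "my_project.my_dataset.upstream_table_1"),
        builder.make_dataset_urn("bigquery", "my_project.my_dataset.upstream_table_2"),
    ],
    builder.make_dataset_urn("bigquery", "my_project.my_dataset.downstream_table"),
)

# Send the lineage metadata to the DataHub metadata service.
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)

After the script runs, the lineage view of downstream_table in the DataHub UI shows both upstream tables.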

Additional tip: the pipeline concept in DataHub

https://datahubproject.io/docs/actions/concepts/

When DataHub processes data from a source and sinks it into the platform, it uses the pipeline concept: a continuously running process that performs the following functions:

  1. Polls events from a configured Event Source (described below)
  2. Applies configured Transformation + Filtering to the Event
  3. Executes the configured Action on the resulting Event

Thus, we can add additional tasks while events are being processed, e.g. we can filter and transform data before storing it in DataHub.
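As an illustration of step 3, below is a minimal sketch of a custom Action, assuming the separate acryl-datahub-actions package and its Action interface. The class is a placeholder that simply prints every event delivered by the pipeline; in a real deployment you would reference it from the pipeline's configuration file.

from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext


class PrintEventAction(Action):
    """A placeholder Action that prints every event it receives."""

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
        # The pipeline calls this factory with the action's config block.
        return cls(ctx)

    def __init__(self, ctx: PipelineContext):
        self.ctx = ctx

    def act(self, event: EventEnvelope) -> None:
        # React to the event here, e.g. send a notification; for now, just print it.
        print(event)

    def close(self) -> None:
        # Release any resources (connections, threads) held by the action.
        pass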

If you are interested in DataHub, you can try it at demo.datahubproject.io right now.
