DataHub — Why is it popular?

Vedran Rukavina
ReversingLabs Engineering
6 min read · Mar 7, 2023
Image source: https://pixabay.com/illustrations/big-data-abstract-7644538/

Introduction

We live in a world where data is one of the most important building blocks of most companies. As companies grow, and due to the needs of the dynamic market, the need for new services and platforms that use different types of data also grows.

Some of the challenges are:

  • where the data comes from and in what form
  • what the nature of the data is
  • what value the data brings to us
  • how the data is interconnected with other data, services, platforms, …
  • how to monitor continuously changing metadata across different systems
  • and more

Many companies use different platforms and repositories to store certain properties of the data they use, but this can create problems, such as:

  • how to get information of interest
  • how to access data of interest
  • dispersion of metadata across multiple platforms and repositories
  • the interconnection between pieces of data, between data and services, and so on

DataHub solves many of these problems.

What is DataHub?

DataHub is the Metadata Platform for the Modern Data Stack. It is the one-stop shop for documentation, schemas, ownership, lineage, pipelines, and usage information. Data ecosystems are diverse — too diverse. DataHub’s extensible metadata platform enables data discovery, data observability, and federated governance that helps you tame this complexity.

History of DataHub

DataHub was originally built at LinkedIn and subsequently open-sourced under the Apache 2.0 License. It now has a thriving community with over a hundred contributors and is widely used by many companies.

What does DataHub offer?

  • Forward-Looking Architecture — DataHub follows a push-based architecture, which means it’s built for continuously changing metadata. The modular design lets it scale with data growth at any organization, from a single database under your desk to multiple data centers spanning the globe.
  • Massive Ecosystem — DataHub has pre-built integrations with systems like Kafka, Airflow, MySQL, SQL Server, Postgres, LDAP, Snowflake, Hive, BigQuery, and many others. The community is continuously adding more integrations, so this list keeps getting longer and longer.
  • Automated Metadata Ingestion — Push-based ingestion can use a prebuilt emitter or can emit custom events using DataHub’s framework. Pull-based ingestion crawls a metadata source. Ingestion can be automated using the DataHub Airflow integration or another scheduler of choice.
  • Discover Trusted Data — Browse and search over a continuously updated catalog of datasets, dashboards, charts, ML models, and more.

DataHub metadata ingestion

Let’s talk about DataHub ingestion, ingestion constraints, and how to avoid some of them.

DataHub ingestion architecture (image: https://datahubproject.io/assets/images/ingestion-architecture-cd631d7c4a648ceb82908ce25b9f93b9.png)

DataHub supports two types of metadata integration:

  • Push-based integrations — allow you to emit metadata directly from your data systems when metadata changes.
  • Pull-based integrations — allow you to “crawl” or “ingest” metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner.

Supporting both mechanisms means that you can integrate with almost all your systems in the most flexible way possible.

Examples:

  • push-based integrations include Airflow, Spark, Great Expectations, and Protobuf Schemas. This allows you to get low-latency metadata integration from the “active” agents in your data ecosystem.
  • pull-based integrations include BigQuery, Snowflake, Looker, Tableau, and many others.

In the examples below, we will use the Python programming language and the appropriate DataHub Python modules. It is encouraged to create a Python virtual environment.

Steps for creating and activating a Python virtual environment are:

python3 -m venv datahub-env        # create the environment
source datahub-env/bin/activate    # activate the environment

# Requires Python 3.7+
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version
# If you see "command not found", try running this instead:
# python3 -m datahub version

Example 1.

Let’s say we want to ingest metadata from some SQL-like source, for example, MySQL.

Inside the Python virtual environment, we need to do the following:

  1. Install the appropriate plugin — the MySQL source works out of the box with acryl-datahub; if the MySQL connector is not bundled in your version, install the mysql extra with pip install 'acryl-datahub[mysql]'
  2. Write a recipe (.yaml) like:
source:
  type: mysql
  config:
    # Coordinates
    host_port: <host_1>:<port_1>
    database: <dbname_1>

    # Credentials
    username: <username_1>
    password: <password_1>

sink:
  # sink configs
3. Execute:

datahub ingest -c <path_to_yaml_recipe_from_step_2>
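
The sink block in the recipe above is left as a placeholder. As an illustration, a minimal sketch of a datahub-rest sink, assuming the DataHub GMS endpoint is reachable at http://localhost:8080, could look like this:

sink:
  type: datahub-rest
  config:
    # Assumption: DataHub GMS runs locally on port 8080;
    # replace with your own GMS address.
    server: http://localhost:8080

Any other supported sink, for example a file sink used to inspect the extracted metadata before pushing it, can be plugged in here in the same way.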

Example 2.

Let’s say we want to ingest metadata from another MySQL instance.

We will follow the same steps as in Example 1, except we will write a different recipe, which will look something like this:

source:
  type: mysql
  config:
    # Coordinates
    host_port: <host_2>:<port_2>
    database: <dbname_2>

    # Credentials
    username: <username_2>
    password: <password_2>

sink:
  # sink configs

This will work in most cases, but what if we had databases and tables with the same names in different MySQL instances? What would happen?

The latest ingestion would overwrite the previously ingested metadata. The reason lies in how metadata is stored and identified within DataHub.

In our case, we use the dataset entity. In DataHub, datasets are identified by three pieces of information:

  • The platform that they belong to — the specific data technology that hosts the dataset. Examples are hive, bigquery, redshift, etc. In our case, the platform is mysql.
  • The name of the dataset in the specific platform — each platform will have a unique way of naming assets within its system. Relational datasets are usually named by combining the structural elements of the name and separating them by full stops: <db>.<schema>.<table>, with the exception of platforms like MySQL which do not have the concept of a schema — <db>.<table>. In cases where the specific platform can have multiple instances (e.g. there are multiple different instances of MySQL databases that have different data assets in them), names can also include instance ids, making the general pattern for a name <platform_instance>.<db>.<schema>.<table>.
  • The environment or fabric in which the dataset belongs — this is an additional qualifier available on the identifier, to allow disambiguating datasets that live in Production environments from datasets that live in Non-production environments, such as Staging, QA, etc. The full list of supported environments/fabrics is available in FabricType.pdl.

So, to differentiate metadata coming from multiple instances of the same platform, we need to add an extra property, platform_instance, inside the configuration block.
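
To illustrate why the first two recipes would collide and how platform_instance resolves it, here is a minimal sketch using DataHub’s Python URN builders; the database, table, and instance names (shop.orders, host_1, host_2) are hypothetical:

from datahub.emitter.mce_builder import (
    make_dataset_urn,
    make_dataset_urn_with_platform_instance,
)

# Without a platform instance, both MySQL servers map to the same URN,
# so the second ingestion overwrites the first.
urn_a = make_dataset_urn(platform="mysql", name="shop.orders", env="PROD")
urn_b = make_dataset_urn(platform="mysql", name="shop.orders", env="PROD")
print(urn_a == urn_b)  # True -> the datasets collide

# With platform_instance set, the instance becomes part of the identifier,
# so the two datasets stay distinct.
urn_1 = make_dataset_urn_with_platform_instance(
    platform="mysql", name="shop.orders", platform_instance="host_1", env="PROD"
)
urn_2 = make_dataset_urn_with_platform_instance(
    platform="mysql", name="shop.orders", platform_instance="host_2", env="PROD"
)
print(urn_1 == urn_2)  # False -> no collision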

The YAML recipes from Example 1 and Example 2 could then look like this:

Example 1:

source:
  type: mysql
  config:
    # Coordinates
    host_port: <host_1>:<port_1>
    database: <dbname_1>

    # Credentials
    username: <username_1>
    password: <password_1>

    # Platform instances
    platform_instance: <host_1>

sink:
  # sink configs

Example 2:

source:
  type: mysql
  config:
    # Coordinates
    host_port: <host_2>:<port_2>
    database: <dbname_2>

    # Credentials
    username: <username_2>
    password: <password_2>

    # Platform instances
    platform_instance: <host_2>

sink:
  # sink configs

Example 3.

Let’s say we want to ingest metadata from some source that the DataHub ingestion framework doesn’t support at the moment. There are several ways to do it, and we will focus on the Python REST emitter.

First, inside your Python virtual environment, install the datahub-rest plugin:

pip install -U 'acryl-datahub[datahub-rest]'

The next step is to write the appropriate script. For demonstration purposes, we wrote a simple script:

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

# Create an emitter to DataHub over REST
emitter = DatahubRestEmitter(gms_server="http://<datahub-gms-address>", extra_headers={})

# Test the connection
emitter.test_connection()

# Construct a dataset properties object
dataset_properties = DatasetPropertiesClass(
    description="Description of some table.",
    customProperties={"governance": "ENABLED"},
)

# Construct a MetadataChangeProposalWrapper object
metadata_event = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_dataset_urn(
        "some_dataset_platform", "some_platform_instance.some_database.some_table"
    ),
    aspectName="datasetProperties",
    aspect=dataset_properties,
)

# Emit metadata! This is a blocking call
emitter.emit(metadata_event)

More details can be found in the DataHub Python Emitter docs.
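
As a quick sanity check, the emitted aspect can also be read back programmatically. The following is only a sketch: it assumes a reasonably recent acryl-datahub version that exposes DataHubGraph.get_aspect, and the GMS address is again a placeholder.

import datahub.emitter.mce_builder as builder
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Assumption: DataHubGraph.get_aspect is available in the installed version
graph = DataHubGraph(DatahubClientConfig(server="http://<datahub-gms-address>"))

dataset_urn = builder.make_dataset_urn(
    "some_dataset_platform", "some_platform_instance.some_database.some_table"
)
properties = graph.get_aspect(entity_urn=dataset_urn, aspect_type=DatasetPropertiesClass)
print(properties.description if properties else "aspect not found")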

Final thoughts

As we can see, the DataHub platform is really powerful and flexible in the way metadata is entered. We can write our own metadata discovery mechanism and create an event for pushing the metadata into DataHub or, for example, change an existing one. There are plenty of possibilities for what we can do with metadata and how we can connect it. We can see a visual representation of data pipelines through the lineage feature, and much more.

References

[1]: Datahub — https://datahubproject.io/
