Google Cloud Data Catalog — Keep Up With Your On-Prem Hive Server

Code samples with a practical approach on how to ingest metadata from an on-premise Hive server into Google Cloud Data Catalog

Background by JJ Ying on Unsplash

The Challenge

We hardly find large organizations storing all their data at the same place, sometimes it’s because of compliance or even a strategic reason. That leads to many customers having their data assets spread across multiple silos, dealing with data that resides on hybrid clouds and/or on-premise environments.
By taking a metadata management perspective, we need to enable data discovery in a centralized place, no matter where the data is.

Data Catalog

Last year Google Cloud announced their metadata management tool

Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, manage, and understand all their data in Google Cloud

But how do they deal with the given challenge?

Google Cloud take on this

Recently Google Cloud released the Data Catalog custom entries API, allowing users to ingest virtually anything into Data Catalog. It’s a flexible API with open string fields, that allow users to specify the type of their assets, names, and schemas with their data types.

There are code samples for NodeJS, Java and Python. Please take a look at the official docs, if you want to know more about it.

The Hive Server

There are many ways of having a Hive server running on production, and connecting to it. In this blog post, we will work with the usually recommended production environment, where we have separate workloads for the hive server, hive metastore, and the underlying RDBMS (stores the metadata on behalf of hive).

Hive Environment

Test data

It’s also important to have a large amount of test data to simulate a more realistic scenario, so we will be using a pre-populated Hive server with ~1000 assets. To generate the data, this script was used, which creates random tables, with diverse column types.

Let's take a look at it

Databases

Then connecting to the on_prem_warehouse90281 database.

Tables — some tables were suppressed for better readability

Now let’s see the table school_infoa3c9b12a columns:

Columns — some columns were suppressed for better readability

Hive to Data Catalog connector

Connector Architecture

To ingest all that metadata from Hive to Data Catalog, we will use a Python script (referenced as connector in this blog post), divided into 3 Steps: Scrape, Prepare and Ingest.

Scrape

At this stage, the script connects to the source system — a PostgreSQL database in this case and retrieves the metadata using a SQL query, following the Hive metastore schema definition.

Prepare

The transform logic happens here, where the source system assets are converted into Data Catalog assets.

Ingest

Finally, the in-memory Data Catalog assets, are synced and persisted into the specified Google Cloud Data Catalog instance, updated/created/deleted as needed according to the source system state. The Data Catalog custom types API is called for that.

Executing the connector

After setting up the connector environment, by following the instructions at the Github repo, let’s execute it using its command line args:

# Environment variables
export DATACATALOG_PROJECT_ID=hive2dc-gcp-project
export DATACATALOG_LOCATION_ID=us-central1
export HIVE_METASTORE_DB_HOST=localhost
export HIVE_METASTORE_DB_USER=hive
export HIVE_METASTORE_DB_PASS=hive
export HIVE_METASTORE_DB_NAME=metastore
export HIVE_METASTORE_DB_TYPE=postgresql
export GOOGLE_APPLICATION_CREDENTIALS=<CREDENTIALS_FILES_FOLDER>/hive2dc-datacatalog-credentials.json
# Command line execution
google-datacatalog-hive-connector \
--datacatalog-project-id=$HIVE2DC_DATACATALOG_PROJECT_ID \
--datacatalog-location-id=$HIVE2DC_DATACATALOG_LOCATION_ID \
--hive-metastore-db-host=$HIVE2DC_HIVE_METASTORE_DB_HOST \
--hive-metastore-db-user=$HIVE2DC_HIVE_METASTORE_DB_USER \
--hive-metastore-db-pass=$HIVE2DC_HIVE_METASTORE_DB_PASS \
--hive-metastore-db-name=$HIVE2DC_HIVE_METASTORE_DB_NAME \
--hive-metastore-db-type=$HIVE2DC_HIVE_METASTORE_DB_TYPE

Results

Once the connector finishes we can go into Data Catalog search UI and look for the ingested assets

databases at Search UI

Same 8 databases shown on the Hive Server. And if we search for tables

Tables — some results were suppressed for better readability

Now let’s see the table we saw before school_infoa3c9b12a

Table school_infoa3c9b12a — some columns were suppressed for better readability

Filtering the column cpf61825

Column cpf61825

We can then use Data Catalog Tags to mark it as PII

Column cpf61825 tagged as PII

Execution Metrics

Finally, let’s look at some metrics generated from the execution. Metrics were collected by running the connector on a Hive Metastore version 2.3.0, backed by a PostgreSQL database instance populated with 993 tables distributed into 8 databases.

Metrics summary
Data Catalog API calls Drill down

The sample connector

All topics discussed in this article are covered in a sample connector, available on GitHub: hive-connectors. Feel free to get it and run according to the instructions. Contributions are welcome, by the way!

Closing thoughts

In this article, we have covered how to ingest metadata from Hive into Google Cloud Data Catalog, enabling users to centralize their Metadata management, even when it resides within an on-premise environment. Stay tuned for new posts showing how to do the same with other source systems! Cheers!

References

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Marcelo Costa

software engineer & google cloud certified architect and data engineer | love to code, working with open source and writing @ alvin.ai