Google Cloud Data Catalog — Keep Up With Your On-Prem Hive Server

Code samples with a practical approach on how to ingest metadata from an on-premise Hive server into Google Cloud Data Catalog

Background by JJ Ying on Unsplash

Disclaimer: All opinions expressed are my own, and represent no one but myself…. They come from the experience of participating in the development of fully operational sample connectors, available at: github.

The Challenge

We hardly find large organizations storing all their data at the same place, sometimes it’s because of compliance or even a strategic reason. That leads to many customers having their data assets spread across multiple silos, dealing with data that resides on hybrid clouds and/or on-premise environments.
By taking a metadata management perspective, we need to enable data discovery in a centralized place, no matter where the data is.

Data Catalog

Last year Google Cloud announced their metadata management tool

Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, manage, and understand all their data in Google Cloud

But how do they deal with the given challenge?

Google Cloud take on this

Recently Google Cloud released the Data Catalog custom entries API, allowing users to ingest virtually anything into Data Catalog. It’s a flexible API with open string fields, that allow users to specify the type of their assets, names, and schemas with their data types.

There are code samples for NodeJS, Java and Python. Please take a look at the official docs, if you want to know more about it.

The Hive Server

There are many ways of having a Hive server running on production, and connecting to it. In this blog post, we will work with the usually recommended production environment, where we have separate workloads for the hive server, hive metastore, and the underlying RDBMS (stores the metadata on behalf of hive).

Hive Environment

If you want to quickly set up a Hive environment for testing/dev purposes check this post.

Test data

It’s also important to have a large amount of test data to simulate a more realistic scenario, so we will be using a pre-populated Hive server with ~1000 assets. To generate the data, this script was used, which creates random tables, with diverse column types.

Let's take a look at it

Databases

Then connecting to the on_prem_warehouse90281 database.

Tables — some tables were suppressed for better readability

Now let’s see the table school_infoa3c9b12a columns:

Columns — some columns were suppressed for better readability

Hive to Data Catalog connector

Connector Architecture

To ingest all that metadata from Hive to Data Catalog, we will use a Python script (referenced as connector in this blog post), divided into 3 Steps: Scrape, Prepare and Ingest.

Scrape

At this stage, the script connects to the source system — a PostgreSQL database in this case and retrieves the metadata using a SQL query, following the Hive metastore schema definition.

It’s important to point out that this script does not access the actual data when it connects to the RDBMS metastore, it retrieves only the metadata, which are the databases, tables and columns definitions.

Prepare

The transform logic happens here, where the source system assets are converted into Data Catalog assets.

Ingest

Finally, the in-memory Data Catalog assets, are synced and persisted into the specified Google Cloud Data Catalog instance, updated/created/deleted as needed according to the source system state. The Data Catalog custom types API is called for that.

Executing the connector

After setting up the connector environment, by following the instructions at the Github repo, let’s execute it using its command line args:

# Environment variables
export DATACATALOG_PROJECT_ID=hive2dc-gcp-project
export DATACATALOG_LOCATION_ID=us-central1
export HIVE_METASTORE_DB_HOST=localhost
export HIVE_METASTORE_DB_USER=hive
export HIVE_METASTORE_DB_PASS=hive
export HIVE_METASTORE_DB_NAME=metastore
export HIVE_METASTORE_DB_TYPE=postgresql
export GOOGLE_APPLICATION_CREDENTIALS=<CREDENTIALS_FILES_FOLDER>/hive2dc-datacatalog-credentials.json
# Command line execution
google-datacatalog-hive-connector \
--datacatalog-project-id=$HIVE2DC_DATACATALOG_PROJECT_ID \
--datacatalog-location-id=$HIVE2DC_DATACATALOG_LOCATION_ID \
--hive-metastore-db-host=$HIVE2DC_HIVE_METASTORE_DB_HOST \
--hive-metastore-db-user=$HIVE2DC_HIVE_METASTORE_DB_USER \
--hive-metastore-db-pass=$HIVE2DC_HIVE_METASTORE_DB_PASS \
--hive-metastore-db-name=$HIVE2DC_HIVE_METASTORE_DB_NAME \
--hive-metastore-db-type=$HIVE2DC_HIVE_METASTORE_DB_TYPE

Results

Once the connector finishes we can go into Data Catalog search UI and look for the ingested assets

databases at Search UI

Same 8 databases shown on the Hive Server. And if we search for tables

Tables — some results were suppressed for better readability

Now let’s see the table we saw before school_infoa3c9b12a

Table school_infoa3c9b12a — some columns were suppressed for better readability

Filtering the column cpf61825

Column cpf61825

We can then use Data Catalog Tags to mark it as PII

Column cpf61825 tagged as PII

Execution Metrics

Finally, let’s look at some metrics generated from the execution. Metrics were collected by running the connector on a Hive Metastore version 2.3.0, backed by a PostgreSQL database instance populated with 993 tables distributed into 8 databases.

The following metrics are not a guarantee, they are approximations that may change depending on the environment, network and execution.

Metrics summary
Data Catalog API calls Drill down

The sample connector

All topics discussed in this article are covered in a sample connector, available on GitHub: hive-connectors. Feel free to get it and run according to the instructions. Contributions are welcome, by the way!

It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Closing thoughts

In this article, we have covered how to ingest metadata from Hive into Google Cloud Data Catalog, enabling users to centralize their Metadata management, even when it resides within an on-premise environment. Stay tuned for new posts showing how to do the same with other source systems! Cheers!

References

--

--

--

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Recommended from Medium

.NET Core —Adding JWT Claims To Web API — Part 1 The Setup

Elasticsearch Cluster Migration Across Data centers

How TCP Works?

First 100 to move from Beefy and APESWAP will win 10 kUSDC

RxJava to Coroutines on Android by Example

Bifrost Dapp v2.0.1 version is under testing | Weekly Report 55

Booleans and If Statements

Migration and Modernization of HR Management Application / Oracle Database, from On-Prem…

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Marcelo Costa

Marcelo Costa

software engineer & google cloud certified architect and data engineer | love to code, working with open source and writing @ alvin.ai

More from Medium

Dataproc Serverless for Spark on GCP

Reduce your BigQuery bills with BI Engine capacity orchestration

Performance Optimization with BigQuery

Why you need to aggregate your data and how BigQuery scheduled queries can help you