Google Cloud Data Catalog — Keep Up With Your On-Prem Hive Server
Code samples and a practical approach to ingesting metadata from an on-premise Hive server into Google Cloud Data Catalog
Disclaimer: All opinions expressed are my own and represent no one but myself. They come from the experience of participating in the development of fully operational sample connectors, available on GitHub.
We rarely find large organizations storing all their data in the same place, whether for compliance reasons or by strategic choice. As a result, many customers have their data assets spread across multiple silos, dealing with data that resides in hybrid clouds and/or on-premise environments.
From a metadata management perspective, we need to enable data discovery in a centralized place, no matter where the data lives.
Last year Google Cloud announced Data Catalog, their metadata management tool, which is now generally available. But how does it address this challenge?
Google Cloud's take on this
Recently Google Cloud released the Data Catalog custom entries API, allowing users to ingest virtually anything into Data Catalog. It's a flexible API with open string fields that let users specify their assets' types, names, and schemas with data types.
There are code samples for Node.js, Java, and Python. Please take a look at the official docs if you want to know more about it.
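To give an idea of what the custom entries API looks like in Python, here is a minimal sketch based on the official samples. The project, location, ids, and column definitions are placeholders, and small details (for example the type_ field name) may differ between client library versions.

from google.cloud import datacatalog_v1

datacatalog = datacatalog_v1.DataCatalogClient()

# Placeholders: replace with your own project, location and ids.
project_id = "my-project"
location = "us-central1"

# 1. Create an Entry Group to hold the custom entries.
entry_group = datacatalog.create_entry_group(
    parent=datacatalog_v1.DataCatalogClient.common_location_path(project_id, location),
    entry_group_id="my_hive_server",
    entry_group=datacatalog_v1.types.EntryGroup(display_name="My Hive server"),
)

# 2. Build a custom Entry: system, type and schema are open string fields.
entry = datacatalog_v1.types.Entry()
entry.user_specified_system = "hive"
entry.user_specified_type = "table"
entry.display_name = "my_table"
entry.description = "Table ingested from an on-premise Hive server."
entry.schema.columns.append(
    datacatalog_v1.types.ColumnSchema(column="id", type_="INT")
)
entry.schema.columns.append(
    datacatalog_v1.types.ColumnSchema(column="name", type_="STRING")
)

# 3. Persist the Entry inside the Entry Group.
entry = datacatalog.create_entry(
    parent=entry_group.name, entry_id="my_table", entry=entry
)
print(entry.name)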
The Hive Server
There are many ways of running a Hive server in production and connecting to it. In this blog post, we will work with the usually recommended production setup, with separate workloads for the Hive server, the Hive metastore, and the underlying RDBMS (which stores the metadata on behalf of Hive).
If you want to quickly set up a Hive environment for testing/dev purposes, check this post.
It's also important to have a large amount of test data to simulate a more realistic scenario, so we will be using a Hive server pre-populated with ~1000 assets. To generate the data, this script was used; it creates random tables with diverse column types.
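The linked script has the details; as a rough illustration of the idea (not the actual script), random tables could be generated with PyHive, assuming a reachable HiveServer2 and placeholder connection parameters:

import random
import string

from pyhive import hive

# Connection parameters are placeholders for illustration.
conn = hive.connect(host="hive-server", port=10000, username="hive")
cursor = conn.cursor()

COLUMN_TYPES = ["INT", "BIGINT", "DOUBLE", "STRING", "BOOLEAN", "TIMESTAMP"]

def random_name(prefix, size=8):
    return prefix + "".join(random.choices(string.ascii_lowercase, k=size))

# Create a handful of tables with a random number of columns and random types.
for _ in range(10):
    columns = ", ".join(
        "%s %s" % (random_name("col_"), random.choice(COLUMN_TYPES))
        for _ in range(random.randint(2, 10))
    )
    cursor.execute("CREATE TABLE %s (%s)" % (random_name("tbl_"), columns))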
Let's take a look at it: connecting to the Hive server, we can list its databases and tables, and then inspect one of the tables and its columns.
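The same inspection can also be done programmatically; the sketch below uses PyHive with placeholder host, database, and table names:

from pyhive import hive

# Same placeholder connection parameters as above.
conn = hive.connect(host="hive-server", port=10000, username="hive")
cursor = conn.cursor()

cursor.execute("SHOW DATABASES")
print(cursor.fetchall())          # the databases on the server

cursor.execute("SHOW TABLES IN my_database")
print(cursor.fetchall())          # the generated tables

cursor.execute("DESCRIBE my_database.my_table")
print(cursor.fetchall())          # column names and types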
Hive to Data Catalog connector
To ingest all that metadata from Hive into Data Catalog, we will use a Python script (referred to as the connector in this blog post), divided into three steps: Scrape, Prepare, and Ingest.
In the Scrape step, the script connects to the source system (a PostgreSQL database in this case) and retrieves the metadata with a SQL query that follows the Hive metastore schema definition.
It's important to point out that the script does not access the actual data when it connects to the RDBMS metastore; it retrieves only the metadata, that is, the database, table, and column definitions.
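To illustrate what such a scrape query can look like (a sketch, not the connector's actual query): the Hive metastore keeps databases, tables, and columns in the DBS, TBLS, SDS, and COLUMNS_V2 tables. Connection parameters are placeholders, and table/column names may vary between metastore versions:

import psycopg2

# Connection parameters are placeholders for illustration.
conn = psycopg2.connect(
    host="metastore-db-host", dbname="metastore", user="hive", password="hive-pass"
)

# Read only metadata: database, table and column definitions.
# Names follow the Hive metastore schema (quoted uppercase on PostgreSQL).
query = """
    SELECT d."NAME"        AS database_name,
           t."TBL_NAME"    AS table_name,
           c."COLUMN_NAME" AS column_name,
           c."TYPE_NAME"   AS column_type
      FROM "DBS" d
      JOIN "TBLS" t ON t."DB_ID" = d."DB_ID"
      JOIN "SDS"  s ON s."SD_ID" = t."SD_ID"
      JOIN "COLUMNS_V2" c ON c."CD_ID" = s."CD_ID"
     ORDER BY database_name, table_name, c."INTEGER_IDX"
"""

with conn, conn.cursor() as cursor:
    cursor.execute(query)
    rows = cursor.fetchall()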
The transform logic happens in the Prepare step, where the source system assets are converted into Data Catalog assets.
Finally, in the Ingest step, the in-memory Data Catalog assets are synced and persisted into the specified Google Cloud Data Catalog instance, and stale entries are deleted as needed according to the source system state. The Data Catalog custom entries API is called for that.
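As a rough sketch of that sync behavior (not the connector's actual code), assuming the datacatalog_v1 client from the earlier example and a dict of scraped entries keyed by entry id:

from google.cloud import datacatalog_v1

def sync_entries(datacatalog, entry_group_name, scraped_entries):
    """scraped_entries: dict mapping entry_id -> datacatalog_v1.types.Entry."""
    existing = {
        entry.name.split("/")[-1]: entry.name
        for entry in datacatalog.list_entries(parent=entry_group_name)
    }

    # Delete entries that no longer exist in the source system.
    for entry_id, entry_name in existing.items():
        if entry_id not in scraped_entries:
            datacatalog.delete_entry(name=entry_name)

    # Create new entries and update the ones that already exist.
    for entry_id, entry in scraped_entries.items():
        if entry_id in existing:
            entry.name = existing[entry_id]
            datacatalog.update_entry(entry=entry)
        else:
            datacatalog.create_entry(
                parent=entry_group_name, entry_id=entry_id, entry=entry
            )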
Executing the connector
After setting up the connector environment by following the instructions in the GitHub repo, let's execute it using its command-line args:
# Environment variables
export GOOGLE_APPLICATION_CREDENTIALS=<CREDENTIALS_FILES_FOLDER>/hive2dc-datacatalog-credentials.json

# Command line execution
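# The exact command and flags are documented in the repo README; the invocation
# below is an approximation, with flag names and values as placeholders that may
# differ between connector versions.
google-datacatalog-hive-connector \
  --datacatalog-project-id=<YOUR_DATACATALOG_PROJECT_ID> \
  --hive-metastore-db-host=<YOUR_METASTORE_DB_HOST> \
  --hive-metastore-db-user=<YOUR_METASTORE_DB_USER> \
  --hive-metastore-db-pass=<YOUR_METASTORE_DB_PASS> \
  --hive-metastore-db-name=<YOUR_METASTORE_DB_NAME> \
  --hive-metastore-db-type=postgresql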
Once the connector finishes, we can go to the Data Catalog search UI and look for the ingested assets. The same 8 databases shown on the Hive server appear, and if we search for tables, we find the ingested tables as well. Now let's look at the table we saw earlier on the Hive side.
Filtering down to a specific column, we can then use Data Catalog Tags to annotate it.
Finally, let's look at some metrics generated from the execution. They were collected by running the connector against a Hive Metastore version 2.3.0, backed by a PostgreSQL database instance populated with 993 tables distributed across 8 databases.
The following metrics are not a guarantee; they are approximations that may change depending on the environment, network, and execution.
The sample connector
All topics discussed in this article are covered in a sample connector, available on GitHub: datacatalog-connectors-hive. Feel free to get it and run it according to the instructions. Contributions are welcome, by the way!
It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
In this article, we have covered how to ingest metadata from Hive into Google Cloud Data Catalog, enabling users to centralize their metadata management even when it resides in an on-premise environment. Stay tuned for new posts showing how to do the same with other source systems! Cheers!
- Connector GitHub repo: https://github.com/GoogleCloudPlatform/datacatalog-connectors-hive
- Data Catalog GA blog post: https://cloud.google.com/blog/products/data-analytics/data-catalog-metadata-management-now-generally-available
- Data Catalog official docs: https://cloud.google.com/data-catalog/
- Code Samples: https://cloud.google.com/data-catalog/docs/how-to/custom-entries#data-catalog-custom-entry-python
- Create Hive test/dev env: https://dev.to/mesmacosta/quickly-set-up-a-hive-environment-on-gcp-38j8
- Create Hive test data: https://dev.to/mesmacosta/using-python-to-create-hive-tables-with-random-schema-2e5o