Google Cloud Data Catalog — Live Sync Your On-Prem Hive Server Metadata Changes
Code samples with a practical approach on how to incrementally ingest metadata changes from an on-premise Hive server into Google Cloud Data Catalog
Disclaimer: All opinions expressed are my own, and represent no one but myself… They come from the experience of participating in the development of fully operational sample connectors, available on GitHub.
Entering the big data world is no easy task; the amount of data can quickly get out of hand. Look at Uber's story of how they deal with 100 petabytes of data using the Hadoop ecosystem. Imagine if a full run were executed every time they synced their on-premise metadata into Data Catalog: that would be impractical.
We need a way to monitor changes executed on the Hive server, so that whenever a Table or Database is modified we capture just that change and incrementally persist it in our Data Catalog.
If you missed the last post, we showcased ingesting on-premise Hive metadata into Data Catalog; in that case we didn't use an incremental solution.
To grasp the situation: a full run with ~1000 tables took almost 20 minutes, even if only 1 table had changed. In the Uber story, that would be no fun, right?
Sidenote: This article assumes that you have some understanding of what Data Catalog and Hive are. If you want to know more about Data Catalog, please read the official docs.
Live Sync Architecture
There are multiple ways of listening to changes executed on a Hive server; this article compares two approaches: Hive Hooks vs. Hive Metastore Listeners.
The architecture presented uses a Hive Metastore Listener, for the simplicity of having the metadata already parsed.
On-prem Hadoop environment side
The main component here is an agent written in Java that listens to 5 Metastore events, covering database and table changes:
The code is quite simple: it captures the event and sends it to a Pub/Sub topic. For details on how to set it up, and on other events, please take a look at the GitHub repo.
The agent runs inside the Hive Metastore process, which must be on a network able to reach the Google Cloud project. The Service Account configured for it also needs the Pub/Sub Publisher role on the topic.
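Although the real agent is written in Java, its publish logic can be sketched in Python. The payload field names below are illustrative assumptions, not the connector's exact wire format; check the GitHub repo for the real schema.

```python
import json


def build_event_message(event_type: str, database: str, table: str = None) -> bytes:
    """Serialize a Metastore event as a JSON Pub/Sub payload (illustrative fields)."""
    payload = {"event": event_type, "database": database}
    if table:
        payload["table"] = table
    return json.dumps(payload).encode("utf-8")


def publish_event(project_id: str, topic_id: str, message: bytes) -> None:
    """Publish a serialized event to the Pub/Sub topic the connector listens on."""
    # Imported lazily so the payload helper above has no GCP dependency.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    # result() blocks until Pub/Sub acknowledges the message.
    publisher.publish(topic_path, data=message).result()
```

Calling `publish_event` requires a Service Account with the Pub/Sub Publisher role, as described above.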
Google Cloud Platform side
The main components here are Pub/Sub and the Hive to Data Catalog connector.
- Pub/Sub: works as a durable event ingestion and delivery layer.
- Connector (Scrape/Prepare/Ingest): this layer transforms the Hive Metastore message into a Data Catalog asset and persists it — for details on how it works, please take a look at this post.
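The prepare/ingest steps can be sketched as follows, assuming a simplified event payload with `database` and `table` fields; the real connector handles more event types and many more fields. The `linked_resource` format and entry-group naming below are illustrative assumptions.

```python
def prepare_entry_fields(event: dict) -> dict:
    """Map a simplified Metastore event to Data Catalog custom-entry fields."""
    return {
        "display_name": event["table"],
        "user_specified_system": "hive",
        "user_specified_type": "table",
        # Illustrative resource path, not the connector's real convention.
        "linked_resource": f"//hive/{event['database']}/{event['table']}",
    }


def ingest_entry(project: str, location: str, entry_group_id: str,
                 entry_id: str, fields: dict):
    """Persist the prepared fields as a Data Catalog custom entry."""
    # Imported lazily so the pure "prepare" step above needs no GCP deps.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    parent = datacatalog_v1.DataCatalogClient.entry_group_path(
        project, location, entry_group_id)
    entry = datacatalog_v1.Entry(**fields)
    return client.create_entry(parent=parent, entry_id=entry_id, entry=entry)
```

Keeping "prepare" pure makes it easy to test without touching the Data Catalog API.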
We also have Cloud Run, which works as a side-car web server, receiving messages from Pub/Sub and triggering the connector.
The code is simple: it calls the Synchronizer class from the hive2datacatalog Python module, which triggers the connector.
For details on how to set up the Cloud Run side-car, please take a look at the connector's GitHub repo.
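A minimal sketch of what the side-car does: Pub/Sub push delivery wraps the original message in an envelope of the form `{"message": {"data": "<base64 payload>"}}`, which must be decoded before handing the event to the connector. The import path and call signature for the Synchronizer are assumptions; the exact names live in the connectors repo.

```python
import base64
import json


def extract_metastore_message(envelope: dict) -> dict:
    """Decode the Hive Metastore event from a Pub/Sub push envelope."""
    data = base64.b64decode(envelope["message"]["data"])
    return json.loads(data)


def handle_push(envelope: dict) -> None:
    """Entry point a web handler would call for each Pub/Sub push request."""
    message = extract_metastore_message(envelope)
    # Hypothetical import path and call signature; see the hive2datacatalog
    # module in the connectors repo for the real ones.
    from hive2datacatalog import Synchronizer
    Synchronizer().run(message)
```

In the real side-car, a web framework route would parse the HTTP request body into the envelope dict and return a 2xx status so Pub/Sub does not redeliver the message.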
Triggering the connector
Let’s create a new table to see it working.
Checking the Hive Metastore logs, we can see two messages sent to Pub/Sub.
Going to Cloud Run, we can look at the execution log.
Finally, let’s open the new Entries using the Data Catalog UI.
In a matter of seconds, we are able to search for the newly created entries.
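The same lookup can be done programmatically with the Data Catalog search API. This is a hedged sketch: the query string and project ID are placeholders, and building the request is separated from executing it so the former needs no GCP credentials.

```python
def build_search_request(project_id: str, query: str) -> dict:
    """Build a catalog search request scoped to a single project."""
    return {
        "scope": {"include_project_ids": [project_id]},
        "query": query,
    }


def search_entries(project_id: str, query: str) -> list:
    """Run the search against Data Catalog and return the result list."""
    # Imported lazily so the request builder stays dependency-free.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    return list(client.search_catalog(request=build_search_request(project_id, query)))
```

For example, `search_entries("my-project", "name:orders")` would return the newly ingested entries matching that name.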
The sample connector
All topics discussed in this article are covered in a sample connector, available on GitHub: hive-connectors. Feel free to get it and run according to the instructions. Contributions are welcome, by the way!
It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
In this article, we covered how to incrementally ingest metadata from Hive into Google Cloud Data Catalog in a scalable and efficient way, enabling users to centralize their metadata management. Stay tuned for new posts showing how to do the same with other source systems! Cheers!
- Connector Github Repo: https://github.com/GoogleCloudPlatform/datacatalog-connectors-hive
- Data Catalog GA blog post: https://cloud.google.com/blog/products/data-analytics/data-catalog-metadata-management-now-generally-available
- Data Catalog official docs: https://cloud.google.com/data-catalog/
- Code Samples: https://cloud.google.com/data-catalog/docs/how-to/custom-entries#data-catalog-custom-entry-python
- Hive Hooks vs. Hive Metastore Listeners post: https://towardsdatascience.com/apache-hive-hooks-and-metastore-listeners-a-tale-of-your-metadata-903b751ee99f