Dataplex: Accessing the Dataproc Metastore via PySpark

Kartik Malik
Google Cloud - Community
Dec 25, 2022

This article describes how metadata managed within Dataplex can be accessed via standard interfaces, such as Hive Metastore, to power Spark queries. Specifically, we will talk about accessing the gRPC-based Hive metastore from PySpark queries.

In this article we are going to talk about the following things:

  1. Why gRPC and Dataproc Metastore?
  2. Activating Dataplex and attaching a gRPC-enabled Dataproc Metastore to it
  3. Dataplex's Discovery UI Explore tab, and modifying the discovered metadata with PySpark
  4. Search Metadata: the Data Catalog component in Dataplex

Background

Dataplex is an intelligent data fabric that enables enterprises to rapidly curate, secure, integrate, and analyze their data at any scale. Dataplex also includes the GCP Data Catalog as a built-in component, solving enterprise metadata management needs out of the box.

Why gRPC and Dataproc Metastore?

Dataproc Metastore is a managed Apache Hive Metastore service. It offers 100% OSS compatibility when accessing database and table metadata stored in the service.

For example, you might have a table stored in Parquet files on Google Cloud Storage. You can define a table over those files and store that metadata in a Dataproc Metastore instance. Then you can connect a Cloud Dataproc cluster to your Dataproc Metastore service instance and query that table using Hive, SparkSQL, or other query engines.
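For instance, here is a minimal PySpark sketch of that flow; the bucket and table names are hypothetical:

```python
from pyspark.sql import SparkSession

# Connect to the Hive-compatible metastore (here, the Dataproc Metastore
# instance configured as the cluster's metastore).
spark = (
    SparkSession.builder
    .appName("parquet-over-gcs")
    .enableHiveSupport()
    .getOrCreate()
)

# Register an external table over Parquet files already sitting in GCS.
# The table definition lands in the Dataproc Metastore; the data stays in GCS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_orders (
        order_id BIGINT,
        amount   DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'gs://my-demo-bucket/sales_orders/'
""")

# Any engine attached to the same metastore can now query the table.
spark.sql("SELECT COUNT(*) FROM sales_orders").show()
```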

gRPC is a modern open-source high-performance RPC framework that can run in any environment and can efficiently connect services in and across data centers.

Providing gRPC as an option to access the Metastore brings many benefits. Compared to traditional Thrift, gRPC supports streaming, which provides better performance for large requests. In addition, it is extensible to more advanced authentication features and is fully compatible with Google’s IAM service, which supports fine-grained permission checks. A path to integrate gRPC with Hive Metastore is sketched out in this proposal.

Metadata layer in Dataplex

Dataplex scans the following:

  • Table entities: structured and semi-structured data assets within data lakes, whose table metadata is extracted into table entities
  • Fileset entities: unstructured data, such as images and text, whose fileset metadata is extracted into fileset entities
  • Partitions: metadata for a subset of data within a table or fileset entity, identified by a set of key/value pairs and a data location

You can use the Dataplex Metadata API to do either of the following (a minimal Python sketch follows this list):

  • View, edit, and delete table and fileset entity metadata
  • Create your own table or fileset entity metadata
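As an illustration of the read path, the sketch below uses the google-cloud-dataplex Python client to list the table entities in a zone; the project, location, lake, and zone IDs are placeholders:

```python
from google.cloud import dataplex_v1

# The Dataplex Metadata API is served by MetadataServiceClient.
client = dataplex_v1.MetadataServiceClient()

# Placeholder resource name -- substitute your own project, lake, and zone.
parent = "projects/my-project/locations/us-central1/lakes/demo-dataplex/zones/raw-zone"

# List the table entities that Dataplex discovery created in this zone.
request = dataplex_v1.ListEntitiesRequest(
    parent=parent,
    view=dataplex_v1.ListEntitiesRequest.EntityView.TABLES,
)
for entity in client.list_entities(request=request):
    print(entity.name, entity.data_path)
```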

You can also analyze Dataplex metadata through either of the following:

  • Data Catalog, for searching and tagging
  • Dataproc Metastore and BigQuery, for table metadata querying and analytics processing

Dataplex automatic metadata discovery scan.

To read more about metadata management, see: https://cloud.google.com/dataproc-metastore/docs/overview

Activating Dataplex and attaching a gRPC-enabled Dataproc Metastore

In this section, we are going to talk about how we can activate Metastore discovery on your lake.

Before you begin

Enable the required APIs (Dataplex and Dataproc Metastore) for your project.

Creating a new lake and zone, and attaching a Dataproc Metastore

  1. Go to the Dataplex page and click Create at the top to create a new lake, “demo-dataplex”, and a zone. For more information, see Creating a lake.
  2. After creation, start staging your assets. For more details on staging your assets, refer to Adding your assets to a Dataplex lake.
  3. Create a Dataproc Metastore service, with gRPC enabled, in the same project as your lake.

NOTE: Each Dataproc Metastore can only be attached to a single Dataplex lake. To examine metadata across multiple lakes, either use the “Discovery” section in Dataplex or query the metadata in BigQuery.

Steps to create a Dataproc Metastore service in your project.

4. Attach your gRPC-enabled Metastore to your demo-dataplex lake.

Attaching the gRPC-enabled Metastore to the newly created lake in Dataplex.
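If you prefer scripting over the console, the same flow can be sketched with the Python clients for Dataproc Metastore and Dataplex; the resource IDs below are placeholders, and the exact fields may vary with client-library versions:

```python
from google.cloud import dataplex_v1, metastore_v1

project, location = "my-project", "us-central1"

# 1. Create a Dataproc Metastore service with the gRPC endpoint protocol.
dpms = metastore_v1.DataprocMetastoreClient()
service = metastore_v1.Service(
    hive_metastore_config=metastore_v1.HiveMetastoreConfig(
        version="3.1.2",  # placeholder Hive version
        endpoint_protocol=metastore_v1.HiveMetastoreConfig.EndpointProtocol.GRPC,
    )
)
dpms_op = dpms.create_service(
    parent=f"projects/{project}/locations/{location}",
    service=service,
    service_id="demo-metastore",
)
dpms_op.result()  # create_service is a long-running operation

# 2. Create the lake with that metastore attached.
dataplex = dataplex_v1.DataplexServiceClient()
lake = dataplex_v1.Lake(
    metastore=dataplex_v1.Lake.Metastore(
        service=f"projects/{project}/locations/{location}/services/demo-metastore"
    )
)
lake_op = dataplex.create_lake(
    parent=f"projects/{project}/locations/{location}",
    lake=lake,
    lake_id="demo-dataplex",
)
lake_op.result()
```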

Dataplex’s Discovery UI Explore tab

Dataplex’s Discover page has two tabs, Search and Explore. To explore the metadata stored in the Dataproc gRPC metastore, we have to use the Explore tab.

Dataplex’s discovery UI.

Once you click on it, you will get options to create an environment for notebooks and Spark, as shown below.

Open-source tools can be used to interact with metadata stored in Dataplex.

Accessing Metadata with a Custom PySpark Job

Metadata managed within Dataplex can be accessed via standard interfaces, such as Hive Metastore, to power Spark queries. The queries run on the Dataproc cluster in the background.

One way to do this is to use the auto-infra provisioning feature under the Dataplex Task UI. We can go to the Manage Lake → Manage → Process tab to submit our custom PySpark job stored in GCS, as shown in the picture below.

Dataplex takes care of creating the required infrastructure for running the Spark job in the background.

Submitting our custom PySpark job with Dataplex auto-infra provisioning.
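The same submission can also be scripted. Below is a rough sketch using the Dataplex tasks API; the script path and service account are placeholders:

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# A Dataplex task wraps our PySpark script; Dataplex provisions the
# serverless Spark infrastructure when the task runs.
task = dataplex_v1.Task(
    spark=dataplex_v1.Task.SparkTaskConfig(
        python_script_file="gs://my-demo-bucket/jobs/alter_table.py",
    ),
    trigger_spec=dataplex_v1.Task.TriggerSpec(
        type_=dataplex_v1.Task.TriggerSpec.Type.ON_DEMAND,
    ),
    execution_spec=dataplex_v1.Task.ExecutionSpec(
        service_account="spark-runner@my-project.iam.gserviceaccount.com",
    ),
)

operation = client.create_task(
    parent="projects/my-project/locations/us-central1/lakes/demo-dataplex",
    task=task,
    task_id="alter-table-demo",
)
operation.result()  # wait for task creation to complete
```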

We will try to run some DDL statements via PySpark code (Spark SQL) on the metastore. Dataplex does not allow you to alter the location of a table or edit the partition columns of a table. Altering a table does not automatically set user-managed to true.

In Spark SQL, you can rename a table, add columns, and set the file format of a table.

Each Dataplex zone within the lake maps to a metastore database.

ALTER TABLE example: we will use the PySpark code below, stored in GCS, to modify metadata stored in the gRPC metastore associated with Dataplex.

For Parquet data, set the Spark property spark.sql.hive.convertMetastoreParquet to false to avoid execution errors.
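A minimal version of such a job might look like this; the zone database and table names are placeholders for the assets in your lake:

```python
from pyspark.sql import SparkSession

# A Dataplex-provisioned Spark environment is already wired to the lake's
# gRPC metastore, so enabling Hive support is enough to see the zone databases.
spark = (
    SparkSession.builder
    .appName("dataplex-metadata-ddl")
    .config("spark.sql.hive.convertMetastoreParquet", "false")  # avoid errors on Parquet tables
    .enableHiveSupport()
    .getOrCreate()
)

# Each Dataplex zone maps to a metastore database.
spark.sql("SHOW DATABASES").show()
spark.sql("USE raw_zone")
spark.sql("SHOW TABLES").show()

# Rename a discovered table; the change propagates to the metadata
# Dataplex exposes through BigQuery external tables.
spark.sql("ALTER TABLE sales_orders RENAME TO sales_orders_renamed")
spark.sql("SHOW TABLES").show()
```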

Metadata view 1: before running the PySpark job. We are going to check the metadata stored in BigQuery, available via external tables from the Dataproc gRPC metastore.

Explorer tab: metadata visible in BigQuery before running PySpark.
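One quick way to take this snapshot programmatically is to list the tables in the zone's auto-created BigQuery dataset; the project and dataset names below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Dataplex publishes each zone as a BigQuery dataset of external tables.
for table in client.list_tables("my-project.raw_zone"):
    print(table.table_id)
```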

Metadata view 2: after running the PySpark job. We can see that the table name has been changed in the underlying asset’s metadata.

Explorer tab: metadata visible in BigQuery after running PySpark.

We can see that the ALTER TABLE statement was able to change the table name. Similarly, we can make other modifications to the metadata, as shown here.

Search Metadata: the Data Catalog component in Dataplex

The Discover experience is integrated with Data Catalog and shows you the resources you have access to across all lakes within a specific project. To learn more, see the Discovering Data docs. Search for the data you attached to a zone earlier and click on the asset. The Entity details page shows specific information about the selected asset, including links to the data.
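As a small preview, a search like the one in the UI can also be issued through the Data Catalog Python client; the project ID and query string below are placeholders:

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Scope the search to our project; the query string is a placeholder.
scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["my-project"],
)
results = client.search_catalog(scope=scope, query="sales_orders")
for result in results:
    print(result.relative_resource_name)
```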

We will cover this in more detail in our next article.

Conclusion

In this article, we discussed activating Dataplex and attaching a gRPC-enabled Hive metastore to it in GCP. We also talked about how we can run Spark SQL to access and modify the metastore data in Dataplex with a custom PySpark job, and showed the various UI components available in the Discovery feature of Dataplex.

Acknowledgments

Hope you liked this article and found it useful. Thanks to Shashanktp for your inputs. You can reach me on LinkedIn.
