Processing and migrating large data tables from Hive to GCS using Java and Dataproc Serverless

Neha Sharma
Google Cloud - Community
3 min read · Oct 11, 2022

Dataproc is a fully managed cloud service for running Apache Spark workloads on Google Cloud Platform. Instead of provisioning and maintaining a long-lived cluster for all our jobs, Dataproc Serverless spins up transient compute only for the duration of each workload.

The Dataproc Templates provide a flexible and easy-to-use mechanism for managing and executing common use cases on Dataproc Serverless without having to develop them from scratch.

These templates implement common Spark workloads, letting us customize and run them easily.

Objective

This blog post explains how to process and migrate large volumes of data from Apache Hive tables to Google Cloud Storage (GCS) using Dataproc Serverless.

Pre-requisites

For running these templates, we will need:

  • The Google Cloud SDK installed and authenticated
  • A VPC subnet with Private Google Access enabled. The default subnet is suitable, as long as Private Google Access is enabled (if it is not, it can be turned on with gcloud, as shown below). You can review all the Dataproc Serverless networking requirements here.
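If Private Google Access is not already turned on for the subnet you plan to use, it can be enabled with gcloud. A minimal sketch, assuming the default subnet in us-central1 (substitute your own project, region and subnet):

# Authenticate the Google Cloud SDK and select the project (project id is a placeholder)
gcloud auth login
gcloud config set project <gcp-project-id>

# Enable Private Google Access on the subnet used by the Dataproc Serverless job
gcloud compute networks subnets update default \
  --region=us-central1 \
  --enable-private-ip-google-access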

Key Benefits

  • Use Dataproc Serverless to run Spark batch workloads without provisioning and managing your own cluster.
  • The Hive to GCS template is open source, fully customizable and ready to use for simple jobs.
  • You can ingest data from Hive to GCS in Avro, CSV, Parquet, ORC and JSON formats.

Configuration Parameters

The following properties are included within the template to configure the execution (a sample set of flag values follows the list):

  • spark.sql.warehouse.dir=<warehouse-path> : Location path to Spark SQL Hive Warehouse where Spark SQL persists tables.
  • hive.gcs.output.path=<gcs-output-path> : GCS output path where the data will be written.
  • hive.input.table=<hive-input-table> : Hive input table name.
  • hive.input.db=<hive-input-db> : Hive input database name.
  • hive.gcs.output.format=avro : GCS output file format. This can be avro, csv, parquet, json or orc. The default output format is avro.
  • hive.partition.col=<hive-partition-col> : Column name to partition hive data.
  • hive.gcs.save.mode=overwrite : Save mode for writing to GCS. The default value is overwrite. You can read about how each save mode behaves here.
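As an illustration, these properties are passed to the template as --templateProperty flags at submission time (the full command appears in the Usage section below); the values here are placeholders, not defaults:

--templateProperty hive.gcs.output.format=csv \
--templateProperty hive.partition.col=<hive-partition-col> \
--templateProperty hive.gcs.save.mode=append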

There are two other optional properties in the “Hive to GCS” template for applying Spark SQL transformations before loading the data into GCS:

--templateProperty hive.gcs.temp.table='temporary_view_name' 
--templateProperty hive.gcs.temp.query='select * from global_temp.temporary_view_name'

Note: When using the transformation properties, the name of the Spark temporary view and the table name referenced in the query must match exactly to avoid a “Table or view not found” error. An extended submission example that uses these properties appears in step 4 below.

Usage

  1. Create a staging bucket in GCS to store the dependencies required by the Dataproc Serverless job.
export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET
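Optionally, the bucket can be created in the same region where the Dataproc Serverless job will run; a hedged variant of the same command, assuming us-central1:

# Create the staging bucket in a specific location (us-central1 is illustrative)
gsutil mb -l us-central1 gs://$GCS_STAGING_BUCKET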

2. Clone the Dataproc Templates repository and navigate to the Java template folder.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
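The submission script builds the Java templates with Maven before submitting the batch (an assumption based on the repository's build setup), so it is worth confirming that a JDK and Maven are available locally:

# Confirm the local build toolchain used by bin/start.sh
java -version
mvn -version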

3. Configure the Dataproc Serverless job by exporting the variables needed for submission.

We will use the provided bin/start.sh script and configure the following mandatory environment variables to submit the job to Dataproc Serverless:

  • GCP_PROJECT : GCP project id to run Dataproc Serverless on.
  • REGION : Region to run Dataproc Serverless in.
  • SUBNET : Subnet where the Hive warehouse exists, so that the job is launched in the same subnet.
  • GCS_STAGING_LOCATION : GCS staging location where Dataproc will store staging assets (a path within the bucket created earlier).
# GCP project id to run the Dataproc Serverless job
export GCP_PROJECT=<gcp-project-id>
# GCP region where the job needs to be submitted
export REGION=<region>
# Subnet where the Hive warehouse exists
export SUBNET=<subnet>
# Staging storage location for Dataproc Serverless (bucket created in step 1)
export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder>
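If you are unsure which subnet the Hive warehouse lives in, gcloud can list the subnets in the chosen region; a small helper, not part of the template itself:

# List subnets in the region to find the one to pass as SUBNET
gcloud compute networks subnets list --regions=<region>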

4. Execute the Hive to GCS Dataproc template.

After configuring the job, we will now trigger bin/start.sh, specifying the template we want to run along with the parameter values for the execution.

Note: The Dataproc API must be enabled before submitting the job.
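If the API is not yet enabled on the project, it can be turned on with gcloud (assuming the project is already set in your configuration):

# Enable the Dataproc API for the current project
gcloud services enable dataproc.googleapis.com

With the API enabled, submit the template: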

bin/start.sh \
--properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
-- --template HIVETOGCS \
--templateProperty hive.input.table=<table> \
--templateProperty hive.input.db=<database> \
--templateProperty hive.gcs.output.path=<gcs-output-path>
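The same command can be extended with the optional properties described earlier, including the Spark SQL transformation pair; a hedged sketch in which the output format, view name and query are illustrative only:

bin/start.sh \
--properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
-- --template HIVETOGCS \
--templateProperty hive.input.table=<table> \
--templateProperty hive.input.db=<database> \
--templateProperty hive.gcs.output.path=<gcs-output-path> \
--templateProperty hive.gcs.output.format=csv \
--templateProperty hive.gcs.save.mode=overwrite \
--templateProperty hive.gcs.temp.table='temp_view' \
--templateProperty hive.gcs.temp.query='select * from global_temp.temp_view'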

5. Monitor and view the Spark batch job

After submitting the job, you can monitor logs and view metrics in the Dataproc Batches UI.
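The batch can also be inspected from the command line with gcloud; for example (the batch id is a placeholder):

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=<region>
# Describe a specific batch to see its state and configuration
gcloud dataproc batches describe <batch-id> --region=<region>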
