Using Dataproc Serverless to migrate your Hive data to BigQuery

Vasu Mittal
Google Cloud - Community
3 min read · Dec 13, 2022

We can use Dataproc Serverless to run Spark batch workloads without provisioning and managing our own cluster. We can specify workload parameters, and then submit the workload to the Dataproc Serverless service.

Dataproc Serverless takes the whole job of infrastructure management off users' hands: they do not have to create or tune a cluster before executing their Apache Spark workloads. Users can simply pick the template that fits their use case and run the job with a few clicks and commands.
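For context, this is roughly what submitting a plain (non-template) Spark batch to Dataproc Serverless looks like; the bucket, JAR and main class below are hypothetical placeholders, not something used later in this guide:

# Sketch only: submit a standalone Spark batch to Dataproc Serverless.
# The JAR path and main class are placeholders for illustration.
gcloud dataproc batches submit spark \
  --region=us-central1 \
  --jars=gs://my-bucket/jobs/my-spark-job.jar \
  --class=com.example.MySparkJob \
  -- arg1 arg2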

Hive to BigQuery Migration using Dataproc Serverless

Objective

This blog post walks through, end to end, how you can use the "Hive to BigQuery" Dataproc Serverless template for data migration. The template reads data from Apache Hive tables and writes it to BigQuery tables.

Set up your GCP Project and Infra

  1. Log in to your GCP project and enable the Dataproc API (if it is disabled).
  2. Make sure the subnet has Private Google Access enabled. Even if you use the "default" VPC network generated by GCP, you will still have to enable private access, as shown below:
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
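
If you want to confirm the change, the flag can be read back from the subnet (same region as in the command above):

# Prints True once Private Google Access is enabled on the subnet.
gcloud compute networks subnets describe default \
  --region=us-central1 \
  --format="value(privateIpGoogleAccess)"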

3. Create a GCS bucket and staging location for jar files.

export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET

4. To configure the Dataproc Serverless job, you need to export the following variables:

GCP_PROJECT : GCP project id to run Dataproc Serverless on.

REGION : Region to run Dataproc Serverless in.

GCS_STAGING_LOCATION : GCS staging bucket location, where Dataproc will store staging assets (See Step 3).
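
For example, GCS_STAGING_LOCATION can simply point at a folder inside the bucket created in Step 3 (the "staging" folder name here is just an illustration):

# Reuses the bucket from Step 3; the subfolder name is arbitrary.
export GCS_STAGING_LOCATION="gs://$GCS_STAGING_BUCKET/staging"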

Steps to execute Dataproc Template

  1. Clone the Dataproc Templates repository and navigate to the Java template folder.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

2. Get authentication credentials (to submit the job).

gcloud auth application-default login
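
If you would rather not submit with your own user credentials, application-default credentials can also impersonate a service account; the account below is a hypothetical example and needs the appropriate IAM permissions (e.g. Service Account Token Creator):

# Hypothetical service account used for illustration only.
gcloud auth application-default login \
  --impersonate-service-account=dataproc-templates@my-project.iam.gserviceaccount.com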

3. Configure the Dataproc Serverless job by exporting the variables needed for submission (as explained in Step 4 of "Set up your GCP Project and Infra").

export GCP_PROJECT=<project_id> # your Google Cloud project
export REGION=<region> # your region for ex: us-central1
export JARS="gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar" # JAR dependencies
export SUBNET=<subnet> # optional if you are using default
# export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder> # already done at step 3(Under Setup GCP Project & Infra)

4. Execute the below command:

GCP_PROJECT=<gcp-project-id> \
REGION=<region> \
SUBNET=<subnet> \
GCS_STAGING_LOCATION=<gcs-staging-bucket-folder> \
HISTORY_SERVER_CLUSTER=<history-server> \
bin/start.sh \
--properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
-- --template HIVETOBIGQUERY \
--templateProperty hivetobq.bigquery.location=<required_bigquery_destination> \
--templateProperty hivetobq.sql=<hive_sql> \
--templateProperty hivetobq.write.mode=<Append|Overwrite|ErrorIfExists|Ignore> \
--templateProperty hivetobq.temp.gcs.bucket=<gcs_bucket_path>

Note: Please enclose the SQL query within double quotes. For example:

--templateProperty  hivetobq.sql="select * from dbname.tablename"
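
Putting it all together, a filled-in invocation could look like the sketch below; the project, subnet, bucket, dataset, table and metastore host are all hypothetical values, and the HISTORY_SERVER_CLUSTER variable is left out here:

GCP_PROJECT=my-gcp-project \
REGION=us-central1 \
SUBNET=projects/my-gcp-project/regions/us-central1/subnetworks/default \
GCS_STAGING_LOCATION=gs://my-gcs-staging-bucket/staging \
bin/start.sh \
--properties=spark.hadoop.hive.metastore.uris=thrift://10.128.0.5:9083 \
-- --template HIVETOBIGQUERY \
--templateProperty hivetobq.bigquery.location=my_dataset.hive_orders_copy \
--templateProperty hivetobq.sql="select * from my_hive_db.orders" \
--templateProperty hivetobq.write.mode=Append \
--templateProperty hivetobq.temp.gcs.bucket=my-gcs-staging-bucket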

The "Hive to BigQuery" template also supports two optional properties, shown below:

--templateProperty hivetobq.temp.table='temporary_view_name' 
--templateProperty hivetobq.temp.query='select * from global_temp.temporary_view_name'

These properties apply a Spark SQL transformation before loading the data into BigQuery. The one thing to remember is that the name of the Spark temporary view and the table name referenced in the query must match exactly; otherwise the job fails with a "Table or view not found" error.
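
For instance, with hypothetical names, the view registered through hivetobq.temp.table has to be the same one referenced under global_temp in hivetobq.temp.query:

--templateProperty hivetobq.temp.table='hive_source_view' \
--templateProperty hivetobq.temp.query='select id, upper(name) as name from global_temp.hive_source_view'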

In addition, if you need to specify Spark properties supported by Dataproc Serverless (for example, to adjust the number of executors, cores, or driver resources), you can edit the OPT_PROPERTIES values in the start.sh file.
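
Alternatively, assuming start.sh passes the flag through unchanged to the underlying batch submission, extra Spark properties can be appended to the same --properties flag as comma-separated pairs; the values below are only an illustration and depend on your workload:

--properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083,spark.executor.instances=10,spark.driver.cores=4,spark.executor.memory=8g \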

5. Monitor the Spark batch job

After submitting the job, we will be able to see it in the Dataproc Batches UI. From there, we can view both metrics and logs for the job.
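
The same information is available from the command line, for example:

# List recent batches in the region, then inspect a specific one.
gcloud dataproc batches list --region=us-central1
gcloud dataproc batches describe <batch-id> --region=us-central1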

References

For any queries/suggestions please reach out to: dataproc-templates-support-external@googlegroups.com
