Importing data from GCS to JDBC Databases using Dataproc Serverless

Ritika Neema
Google Cloud - Community
3 min read · Dec 23, 2022

Spark is one of the most popular distributed processing frameworks for big data. If you are looking to migrate an existing Spark workload to the cloud or create a new Spark job on the cloud, you no longer need to worry about retuning job properties or right-sizing clusters. Dataproc Serverless accelerates your journey on Google Cloud by running Spark workloads without you having to provision and manage the infrastructure.

The Google Cloud Community has set up an open-source repository featuring the most common use cases for Dataproc Serverless. In this blog post, we walk through the GCS to JDBC PySpark template from the repository.

Before we get into any details, let’s go through key benefits of using the templates:

  1. Dataproc Serverless templates are open source and can easily be cloned and customised for workload migrations.
  2. No need to provision or manage infrastructure.
  3. Multiple file formats are supported, namely JSON, Avro, Parquet and CSV.
  4. These templates are configuration-driven and can be reused for similar use cases very easily by just changing the connection parameters.

Prerequisites

For running these templates, we will need:

  • The Google Cloud SDK installed and authenticated.
  • Python 3.7+ installed.
  • A VPC subnet with Private Google Access enabled. The default subnet is suitable, as long as Private Google Access is enabled (see the command sketch after this list).
  • The JDBC driver jar for the target database.
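
If Private Google Access is not yet enabled on the subnet you plan to use, it can be switched on with gcloud. A minimal sketch, assuming the default subnet and a placeholder region:

# Enable Private Google Access on the default subnet
gcloud compute networks subnets update default \
  --region=<region> \
  --enable-private-ip-google-access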

Steps to execute Dataproc Template

1. Clone the Dataproc Templates repository and navigate to Python templates folder.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/python

2. Obtain authentication credentials to submit the job.

gcloud auth application-default login

3. Create a GCS bucket and staging location for jar files.

export STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$STAGING_BUCKET

4. Configure the Dataproc Serverless job by exporting the variables needed for submission:

GCP_PROJECT : GCP project id to run Dataproc Serverless on

REGION : Region to run Dataproc Serverless in

GCS_STAGING_LOCATION : GCS staging bucket location, created in Step 3

SUBNET : The VPC subnet to run Dataproc Serverless on, if not using the default subnet (format: projects/<project_id>/regions/<region>/subnetworks/<subnetwork> )

JARS : Comma-separated list of jars used in the job. For this use case, it should include the JDBC driver jar for the target database.

# Project ID to run the Dataproc Serverless Job
export GCP_PROJECT=<project_id>

# GCP region where the job should be submitted
export REGION=<region>

# The staging location for Dataproc
export GCS_STAGING_LOCATION=gs://$STAGING_BUCKET/staging

# Optional if using the default subnet
export SUBNET=<subnet>

# JDBC driver jar(s) for the target database(s)
export JARS="<gcs_path_to_jdbc_jar_files>/mysql-connector-java-8.0.29.jar,<gcs_path_to_jdbc_jar_files>/postgresql-42.2.6.jar,<gcs_path_to_jdbc_jar_files>/mssql-jdbc-6.4.0.jre8.jar"

5. Gather the values for the following parameters, as per your environment:

gcs.jdbc.input.location : GCS location of the input files

gcs.jdbc.input.format : Input file format (one of: avro, parquet, csv, json)

gcs.jdbc.output.table : JDBC output table name

gcs.jdbc.output.mode : Output write mode (one of: append, overwrite, ignore, errorifexists), defaults to append

gcs.jdbc.output.url : JDBC output URL; the format changes with the target database (see the examples after this list)

gcs.jdbc.output.driver : JDBC output driver class name

gcs.jdbc.batch.size : JDBC output batch size, which determines how many rows to insert per round trip. Defaults to 1000
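
For reference, the JDBC output URL typically follows one of these patterns for common target databases (placeholders for host, port, database and credentials; adjust to your environment):

MySQL: jdbc:mysql://<hostname>:<port>/<dbname>?user=<username>&password=<password> (driver: com.mysql.cj.jdbc.Driver)
PostgreSQL: jdbc:postgresql://<hostname>:<port>/<dbname>?user=<username>&password=<password> (driver: org.postgresql.Driver)
SQL Server: jdbc:sqlserver://<hostname>:<port>;databaseName=<dbname>;user=<username>;password=<password> (driver: com.microsoft.sqlserver.jdbc.SQLServerDriver)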

6. Let's execute the template now. Note that there are four approaches to submitting the job. In this blog post, we submit the Dataproc Serverless Spark job through the bin/start.sh script in the repository.

Sample Execution

export GCP_PROJECT=<gcp-project-id> 
export REGION=<region>
export GCS_STAGING_LOCATION=<gcs staging location>
export SUBNET=<subnet>
export JARS="<gcs_path_to_jdbc_jar_files>/mysql-connector-java-8.0.29.jar"

./bin/start.sh \
-- --template=GCSTOJDBC \
--gcs.jdbc.input.location="gs://<gcs_bucket>/<path_to_input_files>" \
--gcs.jdbc.input.format="<avro/parquet/csv/json>" \
--gcs.jdbc.output.url="jdbc:mysql://<hostname>:<port>/<dbname>?user=<username>&password=<password>" \
--gcs.jdbc.output.driver="com.mysql.cj.jdbc.Driver" \
--gcs.jdbc.output.table="<output table name>" \
--gcs.jdbc.output.mode="<append/overwrite/ignore/errorifexists>" \
--gcs.jdbc.batch.size=5000

7. Monitor the Spark job. Job metrics and logs are available in the Dataproc Batches UI after the job is submitted.
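
If you prefer the command line, the batch can also be inspected with gcloud. A minimal sketch, with a placeholder batch ID:

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=<region>

# Show the state and runtime details of a specific batch
gcloud dataproc batches describe <batch_id> --region=<region>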

Additional Properties

Additional Spark properties needed by the job can be passed in through the --properties flag:

--properties=<spark.something.key>=<value>

  • In case you need to use properties supported by the Spark JDBC writer, you can clone and customise the code accordingly.
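
For example (an illustrative value only; any Spark property key supported by Dataproc Serverless can be set this way):

--properties=spark.executor.instances=5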
