Importing data from GCS to Databases (via JDBC) using Dataproc Serverless

Hitesh Hasija
Google Cloud - Community
3 min read · Jun 4, 2022

Managing servers while running Spark jobs is always a challenge. Using fully managed, on-demand infrastructure for Spark jobs is the need of today’s era: it lets developers concentrate on core application logic instead of spending time managing the framework. Dataproc Serverless is one such product provided by Google Cloud Platform.

Most transactional data still resides on relational database servers, which can be connected to using a JDBC driver. MySQL, Oracle and SQL Server are predominantly used for this.

Today’s world is moving towards cloud-based storage services for storing data, which has driven the adoption of Google Cloud Storage buckets.

This article is about transferring data from GCS buckets to JDBC databases via Dataproc Serverless.

Key Benefits

  1. Use Dataproc Serverless to run Spark batch workloads without managing the Spark framework. The batch insert size is also configurable in this template.
  2. The GCSToJDBC template is open source, configuration driven, and ready to use. Only JDBC and GCS credentials are required to execute the code.
  3. Supported file formats are Avro, Parquet and ORC.
  4. The JDBCToGCS template can be used for the reverse direction, i.e. exporting data from databases to GCS buckets via JDBC.

Usage

  1. If you are going to use the “default” VPC network generated by GCP, ensure that its subnet has Private Google Access enabled. You will still need to enable private access as below.
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access

2. Create a GCS bucket and staging location for jar files.
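For example, a bucket can be created with gsutil (the bucket name and region below are placeholders; adjust them to your project):

gsutil mb -l us-central1 gs://staging-bucket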

3. Clone the git repo in Cloud Shell, which comes pre-installed with the required tools. Alternatively, use any machine pre-installed with JDK 8+, Maven and Git.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

4. Obtain authentication credentials (to submit the job).

gcloud auth application-default login

5. Execute the GCSToJDBC template.
Eg:

export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://staging-bucket
export JARS=gs://jar-location/mysql-connector-java-8.0.29.jar
bin/start.sh \
-- --template GCSTOJDBC \
--templateProperty project.id=my-gcp-project \
--templateProperty gcs.jdbc.input.location=gs://inputBucket/empavro \
--templateProperty gcs.jdbc.input.format=avro \
--templateProperty gcs.jdbc.output.table=demo \
--templateProperty gcs.jdbc.output.saveMode=Overwrite \
--templateProperty gcs.jdbc.output.url='jdbc:mysql://12.345.678.9:3306/test?user=root&password=root' \
--templateProperty gcs.jdbc.output.driver='com.mysql.jdbc.Driver' \
--templateProperty gcs.jdbc.output.batchInsertSize=1000
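To verify that the data landed in the target table after the batch succeeds, a quick check could be run with the mysql client (assuming it is installed and the example instance above is reachable from your machine):

mysql -h 12.345.678.9 -u root -p -e 'SELECT COUNT(*) FROM test.demo;'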

NOTE: You will be asked to enable the Dataproc API, if it is not already enabled.
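If you prefer to enable it up front, the API can also be enabled explicitly:

gcloud services enable dataproc.googleapis.com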

Setting up daily incremental data export

End-of-day/week/month exports are sometimes needed. On top of that, you may want to extract only incremental changes, as exporting the whole dataset at every iteration might be overkill.

NOTE: Dataproc Serverless does not support real-time jobs. Hence, if the goal is to replicate changes in real time from GCS to JDBC, you will need to look into Change Data Capture (CDC) alternatives.

  1. Using a standard database commit-timestamp feature, you can keep track of rows inserted / updated (deletes cannot be captured).
  2. Re-write the SQL query submitted to Spark to capture changes since the last execution. Eg:
jdbctogcs.sql='select trip_id, bikeid, duration_minutes from bikeshare where LastUpdateTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)'

3. Schedule the batch job.
GCP natively provides Cloud Scheduler + Cloud Functions, which can be used to submit Spark batch jobs. Alternatively, self-managed software such as Linux crontab, Jenkins, etc. can be used as well.
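As an illustration, a crontab entry on a self-managed machine could re-run the export nightly. Here run_gcstojdbc.sh is a hypothetical wrapper script that exports the environment variables shown earlier and calls bin/start.sh with the template properties; the paths are assumptions:

# Run the GCS-to-JDBC export every day at 01:00
0 1 * * * /home/user/dataproc-templates/java/run_gcstojdbc.sh >> /tmp/gcstojdbc.log 2>&1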

Setting additional spark properties

In case you need to specify Spark properties supported by Dataproc Serverless, such as adjusting the number of drivers, cores, executors, etc., you can edit the OPT_PROPERTIES values in the start.sh file.
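As a rough sketch, the properties follow the standard comma-separated key=value format accepted by Dataproc Serverless; the exact variable assignment inside start.sh may differ from what is shown here, and the values below are illustrative only:

spark.executor.instances=4,spark.executor.cores=4,spark.driver.cores=4,spark.executor.memory=8g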

References
https://medium.com/google-cloud/cloud-spanner-export-query-results-using-dataproc-serverless-6f2f65b583a4
https://cloud.google.com/pubsub/docs/overview
https://github.com/GoogleCloudPlatform/dataproc-templates
