Cloud Spanner export query results using Dataproc Serverless
Cloud Spanner is a fully managed relational database with unlimited scale, strong consistency, and up to 99.999% availability. In this post we will explore exporting data from a Spanner table or SQL query using Dataproc Serverless.
- Dataproc Serverless is fully managed, serverless, and autoscaling.
- The SpannerToGCS template is open source, fully customizable, and ready to use for simple jobs.
- Ephemeral: Spark resources are released once the job has ended.
- You can export data in Avro, Parquet, CSV, and ORC formats.
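The output format is selected with a template property. A minimal sketch, assuming the property is named spanner.gcs.output.format (confirm the exact property name against the SpannerToGCS template's README):

```shell
# Assumed property name and value; verify against the template documentation.
--templateProperty spanner.gcs.output.format=avro
```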
1. Ensure Private Google Access is enabled on the subnet. Even if you use the “default” VPC network generated by GCP, you will still need to enable private access as below. (Refer here for details)
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
2. Create a GCS bucket and staging location for jar files.
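For example, with the gsutil CLI (bucket name and region are placeholders; bucket names must be globally unique):

```shell
# Create the bucket that will hold the exported data and the staging jars.
gsutil mb -l us-central1 gs://my-test-gcp-proj1

# The staging location is just a path prefix inside the bucket; GCS has a
# flat namespace, so it does not need to be created explicitly.
export GCS_STAGING_LOCATION=gs://my-test-gcp-proj1/staging
```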
3. Clone git repo in a cloud shell which is pre-installed with various tools. Alternatively use any machine pre-installed with JDK 8+, Maven and Git.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
4. Obtain authentication credentials (to submit the job).
gcloud auth application-default login
5. Execute SpannerToGCS template.
export GCS_STAGING_LOCATION=gs://my-test-gcp-proj1/staging
bin/start.sh \
-- --template SPANNERTOGCS \
--templateProperty project.id=my-test-gcp-proj1 \
--templateProperty spanner.gcs.input.spanner.id=spann \
--templateProperty spanner.gcs.input.database.id=testdb \
--templateProperty "spanner.gcs.input.table.id=(select trip_id, bikeid, duration_minutes from bikeshare)" \
--templateProperty spanner.gcs.output.gcs.path=gs://my-test-gcp-proj1/bikeshare/spark_export/
NOTE: It will ask you to enable the Dataproc API, if it is not enabled already.
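Once the batch job succeeds, you can sanity-check the export (the path below is the example output location used above):

```shell
# List the files written by the job to the output path.
gsutil ls gs://my-test-gcp-proj1/bikeshare/spark_export/
```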
Setting up daily incremental data export
End-of-day/week/month exports are sometimes needed. In addition, you may want to extract only incremental changes, as exporting the whole dataset on every iteration might be overkill.
NOTE: Dataproc Serverless does not support real-time jobs. Hence, if the goal is to replicate changes in real time from Spanner to GCS, you will need to look into Change Data Capture (CDC) alternatives.
1. Using Cloud Spanner’s commit timestamp feature (a TIMESTAMP column defined with OPTIONS (allow_commit_timestamp=true)), you can keep track of rows inserted/updated (deletes cannot be captured).
2. Re-write the SQL query submitted to Spark to capture changes since the last execution. E.g.:
"spanner.gcs.input.table.id=(select trip_id, bikeid, duration_minutes from bikeshare where LastUpdateTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);"
3. Schedule the batch job.
GCP natively provides Cloud Scheduler + Cloud Functions, which can be used to submit Spark batch jobs. Alternatively, self-managed software such as Linux crontab, Jenkins, etc. can be used as well.
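As a minimal sketch of the crontab route, you can wrap the same start.sh invocation in a script. All project, instance, and bucket names below are the example values from above, and the clone location /opt/dataproc-templates is hypothetical:

```shell
#!/bin/bash
# export_bikeshare.sh -- daily incremental Spanner-to-GCS export (example values).
set -euo pipefail

export GCS_STAGING_LOCATION=gs://my-test-gcp-proj1/staging
cd /opt/dataproc-templates/java

bin/start.sh \
-- --template SPANNERTOGCS \
--templateProperty project.id=my-test-gcp-proj1 \
--templateProperty spanner.gcs.input.spanner.id=spann \
--templateProperty spanner.gcs.input.database.id=testdb \
--templateProperty "spanner.gcs.input.table.id=(select trip_id, bikeid, duration_minutes from bikeshare where LastUpdateTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY))" \
--templateProperty spanner.gcs.output.gcs.path=gs://my-test-gcp-proj1/bikeshare/spark_export/
```

Then register it in the crontab, e.g. `0 1 * * * /opt/export_bikeshare.sh` to run the export daily at 01:00.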
Setting additional spark properties
In case you need to specify Spark properties supported by Dataproc Serverless, such as adjusting the number of drivers, cores, or executors, you can edit the OPT_PROPERTIES values in the start.sh file.
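For example, a sketch of such an edit, assuming OPT_PROPERTIES holds a gcloud-style --properties flag with comma-separated key=value pairs (the exact values you need depend on your workload):

```shell
# Hypothetical OPT_PROPERTIES edit in start.sh: tune driver/executor resources.
OPT_PROPERTIES="--properties=spark.driver.cores=4,spark.executor.cores=4,spark.executor.instances=10"
```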