Cloud Spanner export query results using Dataproc Serverless

Cloud Spanner is fully managed relational database with unlimited scale, strong consistency, and up to 99.999% availability. In this post we will explore exporting data for a Spanner Table or SQL Query using Dataproc Serverless.

Key Benefits

  1. Dataproc Serverless is fully managed, serverless and autoscaling.
  2. SpannerToGCS template is open source, fully customizable and ready to use for simple jobs.
  3. Ephemeral, spark resources are release once job has ended.
  4. You can export data in avro, parquet, csv and orc formats.

Simple Usage

1. Create a GCS bucket and staging location for jar files.

2. Clone git repo in a cloud shell which is pre-installed with various tools. Alternatively use any machine pre-installed with JDK 8+, Maven and Git.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.gitcd dataproc-templates/java

3. Obtain authentication credentials (to submit the job).

gcloud auth application-default login

4. Execute SpannerToGCS template.
Eg:

export GCP_PROJECT=my-test-gcp-proj1
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://my-test-gcp-proj1/staging
bin/start.sh \
-- --template SPANNERTOGCS \
--templateProperty project.id=my-test-gcp-proj1 \
--templateProperty spanner.gcs.input.spanner.id=spann \
--templateProperty spanner.gcs.input.database.id=testdb \
--templateProperty "spanner.gcs.input.table.id=(select trip_id, bikeid, duration_minutes from bikeshare)" \
--templateProperty spanner.gcs.output.gcs.path=gs://my-test-gcp-proj1/bikeshare/spark_export/ \
--templateProperty spanner.gcs.output.gcs.saveMode=Overwrite

NOTE: It will ask you to enable Dataproc Api, if not enabled already.

Setting up daily incremental data export

End of the day/week/month exports are sometimes needed. On top of it, you may want to extract only incremental changes as exporting whole data at every iteration might be overkill.

NOTE: Serverless Dataproc does not support realtime jobs. Hence, if the goal is to replicate changes realtime from Spanner to GCS, then you will need to look into Change Data Capture (CDC) alternatives.

  1. Using Cloud Spanner’s commit timestamp feature you can keep track rows inserted / updated (deletes cannot be captured).
  2. Re-write sql query, submitted to spark, to capture changes since last execution. Eg:
"spanner.gcs.input.table.id=(select trip_id, bikeid, duration_minutes from bikeshare where LastUpdateTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);"

3. Schedule the batch job.
GCP natively provides Cloud Scheduler + Cloud Function which can be used to submit spark batch jobs. Alternatively self managed softwares like linux cron tab, jenkins etc can be used as well.

Setting additional spark properties

In case you need to specify spark properties supported by Dataproc Serverless like adjust the number of drivers, cores, executors etc.

You can edit the OPT_PROPERTIES values in start.sh file.

Other alternatives

  1. GUI based tool DBeaver to export sql query results into csv format.
  2. Cloud Spanner federated queries to export data using BigQuery into Avro, CSV and other supported formats.
  3. Dataflow template for Spanner to Text to export as csv files or Spanner to Avro

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store