Migrate GCS to GCS using Dataproc Serverless

Ankul Jain
Google Cloud - Community
4 min read · Nov 17, 2022

Dataproc Serverless takes care of infrastructure management for you: to run Apache Spark workloads, users no longer need to create and manage a cluster before they can execute anything.

Additionally, Dataproc Templates provides predefined jobs that can be used as-is or customized to your requirements. Together, these reduce the time spent both managing infrastructure and writing Spark code.

GCS to GCS migration using Dataproc

Objective

This blog post covers how you can use the GCSToGCS Dataproc Serverless template for data migration. It walks through the steps for both supported languages: Python and Java.

Setup your GCP Project and Infra

  1. Log in to your GCP project and enable the Dataproc API, if it is disabled.
  2. Make sure the subnet has Private Google Access enabled. Even if you are using the “default” VPC network generated by GCP, you will still need to enable private access as below:
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access

3. Create a GCS bucket and staging location for jar files.

export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET

Steps to execute Dataproc Template

  1. Clone the Dataproc Templates repository and navigate to the Java template folder.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

To navigate to the Python template folder instead, execute the commands below:

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/python

2. Obtain authentication credentials (to submit the job).

gcloud auth application-default login

3. Configure the Dataproc Serverless job by exporting the variables needed for submission:

  • GCP_PROJECT : GCP project ID to run Dataproc Serverless on.
  • REGION : Region to run Dataproc Serverless in.
  • GCS_STAGING_LOCATION : GCS staging bucket location, where Dataproc will store staging assets (see step 3 in the GCP project setup section).
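For example, the exports might look like this (a minimal sketch; the project ID and region below are placeholders, and the staging location reuses the bucket created earlier):

export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://$GCS_STAGING_BUCKET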

4. [Python Template] Gather the values for the optional parameters below:

  • GCS.TO.GCS.INPUT.FORMAT : GCS input file format (one of: avro,parquet,csv,json)
  • GCS.TO.GCS.INPUT.LOCATION : GCS location of the input files
  • GCS.TO.GCS.OUTPUT.FORMAT : GCS output file format (one of: avro,parquet,csv,json)
  • GCS.TO.GCS.OUTPUT.LOCATION : GCS location of the destination files
  • GCS.TO.GCS.OUTPUT.MODE : Output write mode (one of: append,overwrite,ignore,errorifexists) (defaults to append)
  • GCS.TO.GCS.TEMP.VIEW.NAME : Temp view name for creating a Spark SQL view on the source data.
  • GCS.TO.GCS.SQL.QUERY : SQL query for data transformation.

Note: When using the transformation properties, the name of the Spark temporary view and the name of the view referenced in the query must match exactly to avoid a “table/view not found” error.

Execute the provided bin/start.sh script with the mandatory environment variables set to submit the job to Dataproc Serverless. Following is a sample execution command for the Python template:

./bin/start.sh \
-- --template=GCSTOGCS \
--gcs.to.gcs.input.location="<gs://bucket/path>" \
--gcs.to.gcs.input.format="<json|csv|parquet|avro>" \
--gcs.to.gcs.output.location="<gs://bucket/path>" \
--gcs.to.gcs.output.format="<json|csv|parquet|avro>" \
--gcs.to.gcs.output.mode="<append|overwrite|ignore|errorifexists>" \
--gcs.to.gcs.temp.view.name="temp" \
--gcs.to.gcs.sql.query="select *, 1 as col from temp"

5. [Java Template] Gather the values for the optional parameters below:

  • GCS.TO.GCS.INPUT.FORMAT : GCS input file format (one of: avro,parquet,csv,json)
  • GCS.TO.GCS.INPUT.LOCATION : GCS location of the input files
  • GCS.TO.GCS.OUTPUT.FORMAT : GCS output file format (one of: avro,parquet,csv,json)
  • GCS.TO.GCS.OUTPUT.LOCATION : GCS location of the destination files
  • GCS.TO.GCS.OUTPUT.MODE : Output write mode (one of: append,overwrite,ignore,errorifexists) (defaults to append)
  • GCS.TO.GCS.TEMP.TABLE : Temp table name for creating a Spark SQL view on the source data.
  • GCS.TO.GCS.TEMP.QUERY : SQL query for data transformation.

Note: When using the transformation properties, the name of the Spark temporary table and the name of the table referenced in the query must match exactly to avoid a “table/view not found” error.

Execute the provided bin/start.sh script with the mandatory environment variables set to submit the job to Dataproc Serverless. Following is a sample execution command for the Java template:

bin/start.sh \
-- --template GCSTOGCS \
--templateProperty project.id=my-gcp-project \
--templateProperty gcs.gcs.input.location=gs://my-gcp-project-input-bucket/filename.avro \
--templateProperty gcs.gcs.input.format=avro \
--templateProperty gcs.gcs.output.location=gs://my-gcp-project-output-bucket \
--templateProperty gcs.gcs.output.format=csv \
--templateProperty gcs.gcs.output.mode=overwrite \
--templateProperty gcs.gcs.temp.table=temp \
--templateProperty gcs.gcs.temp.query='select *, 1 as col from temp'

Providing Spark Properties

In case you need to specify Spark properties supported by Dataproc Serverless, e.g. to adjust the number of drivers, cores, executors, etc., you can edit the OPT_PROPERTIES values in the start.sh file.
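As a rough sketch (the exact format of OPT_PROPERTIES inside start.sh may differ; the values below are standard Spark properties honored by Dataproc Serverless and are only illustrative):

# Hypothetical edit inside start.sh to tune executors and cores
OPT_PROPERTIES="--properties=spark.executor.instances=5,spark.executor.cores=4,spark.driver.cores=4,spark.executor.memory=8g"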

6. Monitor the Spark batch job

After submitting the job, we will be able to see it in the Dataproc Batches UI. From there, we can view both metrics and logs for the job.
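Alternatively, you can check the batch from the command line; for example (the batch ID below is a placeholder):

gcloud dataproc batches list --region=$REGION
gcloud dataproc batches describe <BATCH_ID> --region=$REGION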

References

For any queries/suggestions, reach out to: dataproc-templates-support-external@googlegroups.com
