How to Migrate Data from BigQuery to Cloud Storage using Dataproc Serverless [Java]

Abhishek Pratap Singh
Google Cloud - Community
3 min read · Dec 8, 2023

Dataproc Serverless

Dataproc Serverless allows users to run Spark batch workloads without provisioning and managing their own clusters. They can simply specify the workload parameters and submit the workload to the Dataproc Serverless service, which in turn runs it on a managed compute infrastructure that autoscales resources as required.
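To illustrate what "submitting a workload" looks like, the following is a minimal sketch of submitting a generic Spark batch to Dataproc Serverless with the gcloud CLI; the region, JAR path, and main class are placeholder values, not part of this tutorial's template flow.

# Illustrative only: submit a Spark batch workload to Dataproc Serverless
# (region, jar, and main class below are placeholder values)
gcloud dataproc batches submit spark \
  --region=us-central1 \
  --jars=gs://my-bucket/my-spark-job.jar \
  --class=com.example.MySparkJob \
  -- arg1 arg2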


Dataproc Templates

Dataproc Templates is an open-source repository maintained by Google Cloud Platform. It provides templates in both Java and Python that can be used to run Spark workloads on Dataproc Serverless for various common use cases.

One of the common use cases is using Dataproc Serverless to migrate data from one data source to another. In this blog post, we will cover how to move data from BigQuery to Cloud Storage using the BIGQUERYTOGCS Java Dataproc Template.

Setting up your GCP Project

Perform the following steps to set up your GCP project:

  1. Log in to your GCP project and enable the Dataproc API by clicking the “Enable” button.

2. Make sure that “Private Google Access” is switched on under your subnet settings. This applies even if you are using the default subnet.
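If you prefer the CLI over the console, you can check and enable Private Google Access on a subnet with commands along these lines (the subnet name and region below are placeholders):

# Check whether Private Google Access is enabled on the subnet (placeholder names)
gcloud compute networks subnets describe default \
  --region=us-central1 \
  --format="get(privateIpGoogleAccess)"

# Enable it if the command above returns False
gcloud compute networks subnets update default \
  --region=us-central1 \
  --enable-private-ip-google-access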

3. Create a staging bucket. This can be done either from the Cloud Console or with the following commands:

export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET
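Alternatively, the newer gcloud storage CLI can create the same bucket; the location shown here is an illustrative choice, not a requirement of the template.

# Equivalent bucket creation with the gcloud storage CLI (location is illustrative)
gcloud storage buckets create gs://$GCS_STAGING_BUCKET --location=us-central1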

Executing the Dataproc Template

Follow these steps to execute the Dataproc template:

  1. Clone the Dataproc Templates repository to your system:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git

2. Navigate to the Java folder within the repository

cd dataproc-templates/java

3. Provide authentication credentials in order to submit the job

gcloud auth application-default login
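It can also help to point gcloud at the project you intend to work in before submitting anything; the project ID below is a placeholder.

# Optional: set the active project so subsequent gcloud commands default to it
gcloud config set project <PROJECT_ID>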

4. Make sure Maven is installed and added to the PATH variable. In case it is not, locate the `bin` directory of your Apache Maven installation, copy its path, and run the following command:

export PATH=<PATH_OF_APACHE_MAVEN_BIN_DIRECTORY>:$PATH
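You can confirm that Maven resolves correctly before proceeding:

# Verify that Maven is picked up from the PATH and prints its version
mvn -version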

5. Export the following variables (use the GCP Console for reference; an illustrative example follows):

export GCP_PROJECT=<PROJECT_ID>
export REGION=<PROJECT_REGION>
export GCS_STAGING_LOCATION=<GCS_BUCKET_LOCATION>
export SUBNET=<SUBNET_NAME>
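For example, with purely illustrative values (yours will differ; check the template documentation for the exact subnet format it expects):

# Example values for illustration only
export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://my-gcs-staging-bucket
export SUBNET=projects/my-gcp-project/regions/us-central1/subnetworks/default  # may also be just the subnet name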

6. Gather values for the following variables (illustrative examples follow the list):

  • bigquery_input_table_name (in the format of project_id.dataset.table)
  • gcs_output_bucket (in the format gs://bucket_name)
  • output_format (choose between csv, parquet, avro, json)
  • output_mode (choose between Append, Overwrite, ErrorIfExists, Ignore)
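For instance, the gathered values might look like this (made-up names, shown only to clarify the expected formats):

# Illustrative values only
# bigquery_input_table_name : my-gcp-project.sales_dataset.transactions
# gcs_output_bucket         : gs://my-output-bucket
# output_format             : parquet
# output_mode               : Overwrite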

7. Run the following command (refer to the BigQuery to Cloud Storage template documentation for an example submission):

bin/start.sh \
-- --template BIGQUERYTOGCS \
--templateProperty project.id=<PROJECT_ID> \
--templateProperty bigquery.gcs.input.table=<BQ_INPUT_TABLE_NAME> \
--templateProperty bigquery.gcs.output.format=<OUTPUT_FORMAT> \
--templateProperty bigquery.gcs.output.location=<GCS_OUTPUT_BUCKET> \
--templateProperty bigquery.gcs.output.mode=<OUTPUT_MODE>
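As a concrete illustration, a submission with the placeholders filled in might look like the following; the project, table, and bucket names are made up.

# Illustrative submission with made-up values
bin/start.sh \
-- --template BIGQUERYTOGCS \
--templateProperty project.id=my-gcp-project \
--templateProperty bigquery.gcs.input.table=my-gcp-project.sales_dataset.transactions \
--templateProperty bigquery.gcs.output.format=parquet \
--templateProperty bigquery.gcs.output.location=gs://my-output-bucket \
--templateProperty bigquery.gcs.output.mode=Overwrite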

8. Monitor the Spark Job on Dataproc Batches in GCP Console.
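If you prefer the CLI over the console, recent batch runs can also be inspected with gcloud; the region and batch ID below are placeholders.

# List recent Dataproc Serverless batch workloads in the region (placeholder region)
gcloud dataproc batches list --region=us-central1

# Inspect a specific batch by its ID (placeholder ID)
gcloud dataproc batches describe <BATCH_ID> --region=us-central1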
