How to Migrate Data from BigQuery to Cloud Storage using Dataproc Serverless [Java]

Abhishek Pratap Singh
Google Cloud - Community
3 min read · Dec 8, 2023

Dataproc Serverless

Dataproc Serverless allows users to run Spark batch workloads without provisioning and managing their own clusters. They can simply specify the workload parameters and submit the workload to the Dataproc Serverless service, which in turn runs it on a managed compute infrastructure that autoscales resources as required.
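To illustrate what "submitting a workload" looks like, the following is a minimal sketch of submitting a generic Spark batch to Dataproc Serverless with the gcloud CLI; the region, JAR path, and main class are placeholder values, not part of this tutorial's template flow.

# Illustrative only: submit a Spark batch workload to Dataproc Serverless
# (region, jar, and main class below are placeholder values)
gcloud dataproc batches submit spark \
  --region=us-central1 \
  --jars=gs://my-bucket/my-spark-job.jar \
  --class=com.example.MySparkJob \
  -- arg1 arg2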


Dataproc Templates

Dataproc Templates is an open-source repository maintained by Google Cloud Platform. It provides templates in both Java and Python that can be used to run Spark workloads on Dataproc Serverless for various common use cases.

One of the common use cases is using Dataproc Serverless to migrate data from one data source to another. In this blog post, we will cover how to move data from BigQuery to Cloud Storage using the BIGQUERYTOGCS Java Dataproc Template.

Setting up your GCP Project

Perform the following steps to set up your GCP project:

  1. Log in to your GCP project and enable the Dataproc API by clicking the “Enable” button.

2. Make sure that “Private Google Access” is switched on under your subnet settings. This applies even if you are using the default subnet.
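If you prefer the CLI over the console, you can check and enable Private Google Access on a subnet with commands along these lines (the subnet name and region below are placeholders):

# Check whether Private Google Access is enabled on the subnet (placeholder names)
gcloud compute networks subnets describe default \
  --region=us-central1 \
  --format="get(privateIpGoogleAccess)"

# Enable it if the command above returns False
gcloud compute networks subnets update default \
  --region=us-central1 \
  --enable-private-ip-google-access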

3. Create a staging bucket. This can be done either from the Cloud Console or with the following commands:

export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET
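Alternatively, the newer gcloud storage CLI can create the same bucket; the location shown here is an illustrative choice, not a requirement of the template.

# Equivalent bucket creation with the gcloud storage CLI (location is illustrative)
gcloud storage buckets create gs://$GCS_STAGING_BUCKET --location=us-central1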

Executing the Dataproc Template

Follow these steps to execute the Dataproc template:

  1. Clone the Dataproc Templates repository to your system:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git

2. Navigate to the Java folder within the repository

cd dataproc-templates/java

3. Provide authentication credentials in order to submit the job

gcloud auth application-default login
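It can also help to point gcloud at the project you intend to work in before submitting anything; the project ID below is a placeholder.

# Optional: set the active project so subsequent gcloud commands default to it
gcloud config set project <PROJECT_ID>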

4. Make sure Maven is installed and added to the PATH variable. In case it is not, locate the `bin` directory of your Apache Maven installation, copy its path, and run the following command:

export PATH=<PATH_OF_APACHE_MAVEN_BIN_DIRECTORY>:$PATH
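You can confirm that Maven resolves correctly before proceeding:

# Verify that Maven is picked up from the PATH and prints its version
mvn -version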

5. Export the following variables (use the GCP Console for reference; an illustrative example follows):

export GCP_PROJECT=<PROJECT_ID>
export REGION=<PROJECT_REGION>
export GCS_STAGING_LOCATION=<GCS_BUCKET_LOCATION>
export SUBNET=<SUBNET_NAME>
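For example, with purely illustrative values (yours will differ; check the template documentation for the exact subnet format it expects):

# Example values for illustration only
export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://my-gcs-staging-bucket
export SUBNET=projects/my-gcp-project/regions/us-central1/subnetworks/default  # may also be just the subnet name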

6. Gather values for the following variables (illustrative examples follow the list):

  • bigquery_input_table_name (in the format of project_id.dataset.table)
  • gcs_output_bucket (in the format gs://bucket_name)
  • output_format (choose between csv, parquet, avro, json)
  • output_mode (choose between Append, Overwrite, ErrorIfExists, Ignore)
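For instance, the gathered values might look like this (made-up names, shown only to clarify the expected formats):

# Illustrative values only
# bigquery_input_table_name : my-gcp-project.sales_dataset.transactions
# gcs_output_bucket         : gs://my-output-bucket
# output_format             : parquet
# output_mode               : Overwrite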

7. Run the following command (refer to the BigQuery to Cloud Storage template documentation for an example submission):

bin/start.sh \
-- --template BIGQUERYTOGCS \
--templateProperty project.id=<PROJECT_ID> \
--templateProperty bigquery.gcs.input.table=<BQ_INPUT_TABLE_NAME> \
--templateProperty bigquery.gcs.output.format=<OUTPUT_FORMAT> \
--templateProperty bigquery.gcs.output.location=<GCS_OUTPUT_BUCKET> \
--templateProperty bigquery.gcs.output.mode=<OUTPUT_MODE>
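As a concrete illustration, a submission with the placeholders filled in might look like the following; the project, table, and bucket names are made up.

# Illustrative submission with made-up values
bin/start.sh \
-- --template BIGQUERYTOGCS \
--templateProperty project.id=my-gcp-project \
--templateProperty bigquery.gcs.input.table=my-gcp-project.sales_dataset.transactions \
--templateProperty bigquery.gcs.output.format=parquet \
--templateProperty bigquery.gcs.output.location=gs://my-output-bucket \
--templateProperty bigquery.gcs.output.mode=Overwrite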

8. Monitor the Spark Job on Dataproc Batches in GCP Console.
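If you prefer the CLI over the console, recent batch runs can also be inspected with gcloud; the region and batch ID below are placeholders.

# List recent Dataproc Serverless batch workloads in the region (placeholder region)
gcloud dataproc batches list --region=us-central1

# Inspect a specific batch by its ID (placeholder ID)
gcloud dataproc batches describe <BATCH_ID> --region=us-central1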
