Exporting data from Redshift to GCS using GCP Dataproc Serverless (Java)

Sumit Banerjee
Google Cloud - Community
3 min read · Nov 21, 2023

With Dataproc Serverless, you can run Spark batch workloads without having to provision and manage your own cluster. The service runs the workload on managed compute infrastructure, automatically scaling resources as needed. To use Dataproc Serverless, simply specify the workload parameters and submit the workload to the service.

Dataproc Templates provide solutions to common use cases and run on Dataproc Serverless, with implementations in both Java and Python, so we can customize our workloads and run them with ease. One such use case in today's multi-cloud world is exporting data from Redshift to Google Cloud Storage (GCS). In this blog, we will walk through exporting data from Redshift to GCS using Dataproc Serverless.

Key Benefits

  • Run Spark batch workloads without provisioning and managing your own cluster with Dataproc Serverless.
  • The Redshift to GCS template is open source, customizable, and ready to use for the jobs you need.
  • Data from Redshift can be ingested into GCS in Parquet, AVRO, CSV, and JSON formats.

Prerequisites

To run these templates, you will need:

  • The Google Cloud SDK installed and authenticated (see the setup sketch after this list)
  • Java 8+
  • Connectivity between Redshift in AWS and Dataproc Serverless in GCP
  • An AWS Access Key and Secret Key for accessing the temporary S3 location, plus the Redshift IAM role
  • The required JARs, which are listed here
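
For the first prerequisite, here is a minimal setup sketch, assuming the gcloud CLI is already installed and <gcp-project-id> is a placeholder for your project:

# Authenticate gcloud and point it at your project
gcloud auth login
gcloud config set project <gcp-project-id>

# (Optional) Application Default Credentials for local tooling
gcloud auth application-default login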

Usage

  1. Create a staging GCS bucket to store/push JAR files
export STAGING_BUCKET="<dataproc-staging-bucket>"
gsutil mb gs://$STAGING_BUCKET
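
If you prefer to pin the staging bucket to a specific location, a hedged variant (where <region> is a placeholder) is:

gsutil mb -l <region> gs://$STAGING_BUCKET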

2. Clone the Dataproc Templates GitHub repository and switch to the java directory.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

3. Configure the required environment variables for the Dataproc Serverless job

  • GCP_PROJECT: The GCP project in which to run the job.
  • REGION: The region in which to run the job.
  • GCS_STAGING_LOCATION: A GCS location where Dataproc will store staging assets. This should be within the bucket that you created earlier.
  • SUBNET: The subnet in which to run the job.
export GCP_PROJECT=<gcp-project-id>
export REGION=<region>
export SUBNET=<subnet>
export GCS_STAGING_LOCATION=gs://$STAGING_BUCKET/staging
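
Optionally, you can sanity-check that the subnet exists and has Private Google Access enabled, which Dataproc Serverless requires. A hedged check, assuming SUBNET holds the subnet name or full resource path:

gcloud compute networks subnets describe $SUBNET \
--region=$REGION \
--format="value(privateIpGoogleAccess)"
# Expected output: True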

The Redshift to GCS template requires the JARs listed in the prerequisites to be on the classpath. You can store the JAR files in a GCS bucket and add them using the JARS environment variable.

export JARS=gs://<cloud-storage-bucket-name>/spark-redshift_<version>.jar,gs://<cloud-storage-bucket-name>/redshift_jdbc_<version>.jar,gs://<cloud-storage-bucket-name>/minimal_json<version>.jar
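
If the JARs are currently on your local machine, you can upload them to the bucket first; a sketch, where the local file names are placeholders matching the ones above:

gsutil cp spark-redshift_<version>.jar \
redshift_jdbc_<version>.jar \
minimal_json<version>.jar \
gs://<cloud-storage-bucket-name>/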

4. Execute the Redshift to GCS Dataproc template

After configuring the required environment variables, we are ready to trigger the job. We run the bin/start.sh script, specifying the template to run and the argument values for the execution.

bin/start.sh \
-- --template REDSHIFTTOGCS \
--templateProperty project.id=<gcp-project-id> \
--templateProperty redshift.aws.input.url=<jdbc:redshift://host-name:port-number/> \
--templateProperty redshift.aws.input.table=<Redshift-table-name> \
--templateProperty redshift.aws.input.temp.dir=<AWS-temp-directory> \
--templateProperty redshift.aws.input.iam.role=<Redshift-S3-IAM-role> \
--templateProperty redshift.aws.input.access.key=<Access-key> \
--templateProperty redshift.aws.input.secret.key=<Secret-key> \
--templateProperty redshift.gcs.output.file.format=<Output-File-Format> \
--templateProperty redshift.gcs.output.file.location=<Output-GCS-location> \
--templateProperty redshift.gcs.output.mode=<Output-GCS-Save-mode>
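
Purely as an illustration, here is what a filled-in invocation might look like. Every value below (project, host, table, S3 path, role ARN, bucket) is a hypothetical placeholder, and the exact accepted values for the output format and save mode are documented with the template:

bin/start.sh \
-- --template REDSHIFTTOGCS \
--templateProperty project.id=my-gcp-project \
--templateProperty redshift.aws.input.url=jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev \
--templateProperty redshift.aws.input.table=public.orders \
--templateProperty redshift.aws.input.temp.dir=s3a://my-temp-bucket/redshift-unload \
--templateProperty redshift.aws.input.iam.role=arn:aws:iam::123456789012:role/redshift-s3-role \
--templateProperty redshift.aws.input.access.key=<Access-key> \
--templateProperty redshift.aws.input.secret.key=<Secret-key> \
--templateProperty redshift.gcs.output.file.format=parquet \
--templateProperty redshift.gcs.output.file.location=gs://my-output-bucket/redshift-export \
--templateProperty redshift.gcs.output.mode=overwrite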

Note: When you submit the job, you will be prompted to enable the Dataproc API if it is not already enabled.

5. Monitor the Spark batch job

After submitting the job, you will be able to see it in the Dataproc Batches UI, where you can monitor its progress.
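
You can also check on the batch from the command line; a hedged example, where the batch ID is whatever the submission printed:

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=$REGION

# Inspect a specific batch
gcloud dataproc batches describe <batch-id> --region=$REGION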
