Export data from AWS S3 to BigQuery using Dataproc Serverless

Neerajshivhare
Google Cloud - Community
3 min read · Dec 13, 2022

Dataproc Serverless for Spark runs batch workloads without requiring you to provision and manage a cluster. It dynamically scales workload resources, such as the number of executors, to run your workload efficiently. This lets data developers focus on business logic instead of spending time managing infrastructure.

In contrast, Dataproc on Compute Engine offers a managed Hadoop and Spark service on GCP. It is ideal for users who want to provision and manage their own infrastructure and then execute workloads on Spark.
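
With the serverless option, submitting a Spark batch is a single gcloud command, with no cluster to create beforehand. A minimal sketch with placeholder values:

gcloud dataproc batches submit spark \
  --region=<region> \
  --jars=<path-to-application-jar> \
  --class=<main-class> \
  -- <application-arguments>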

Dataproc Serverless Templates are ready-to-use, open-source, customisable templates built on Dataproc Serverless for Spark. They help data engineers further simplify development on Dataproc Serverless by consuming and customising the existing templates to fit their requirements.

In this post we will explore how to perform a batch load of data from an AWS S3 bucket to BigQuery using Dataproc Serverless for Spark.

Prerequisites

  • Google Cloud SDK installed and authenticated
  • Cloud Shell or a machine with Java 8, Maven 3 and Git pre-installed

S3 to BigQuery template

The template reads data from an Amazon S3 bucket and writes it to a BigQuery table. It uses the Spark BigQuery connector and the Spark S3 connector.

The template allows the following parameters to be configured through the execution command:

s3.bq.access.key: Access key for the S3 bucket.
s3.bq.secret.key: Secret key required to access the S3 bucket.
s3.bq.input.format: File format of the input data. Example: avro, parquet, csv, json
s3.bq.input.location: S3 input location. It must begin with s3a://
s3.bq.output.dataset.name: BigQuery output dataset where the table resides or needs to be created.
s3.bq.output.table.name: BigQuery output table name.
s3.bq.ld.temp.bucket.name: Name of a pre-existing GCS bucket where temporary files are staged. Example: templates-s3tobq
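
For illustration, a filled-in set of properties might look like this (the bucket, dataset and table names below are hypothetical):

s3.bq.access.key=<s3-access-key>
s3.bq.secret.key=<s3-secret-key>
s3.bq.input.format=parquet
s3.bq.input.location=s3a://my-source-bucket/sales/2022/
s3.bq.output.dataset.name=analytics
s3.bq.output.table.name=sales_raw
s3.bq.ld.temp.bucket.name=templates-s3tobq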

Key Benefits

  1. Use Dataproc Serverless to run Spark batch workloads without managing the Spark framework.
  2. The S3ToBigquery template is open source, configuration driven, and ready to use. Only AWS S3 credentials are required to execute the code.
  3. Supported file formats are Avro, Parquet, CSV and JSON.

Basic Usage

1. Ensure the subnet has Private Google Access enabled. Even if you are using the “default” VPC network generated by GCP, you still need to enable private access, as shown below.
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
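
To confirm the flag is set, you can describe the subnet (adjust the region to match your setup):

gcloud compute networks subnets describe default \
  --region=us-central1 \
  --format="value(privateIpGoogleAccess)"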

2. Dataproc Serverless requires Cloud NAT in order to have access beyond GCP (for example, to reach AWS S3). To enable this, follow these steps; in step (4) select the VPC from which you intend to run the Dataproc Serverless job.
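
Cloud NAT can also be created from the command line. A minimal sketch, assuming the “default” network and the us-central1 region (the router and NAT names are placeholders):

gcloud compute routers create nat-router \
  --network=default \
  --region=us-central1
gcloud compute routers nats create nat-config \
  --router=nat-router \
  --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges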

3. Obtain the access key and secret key for AWS credentials that have permission to read the S3 bucket folder.
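
If you need to generate fresh credentials for an IAM user that already has read access to the bucket, one way is via the AWS CLI (the user name here is hypothetical):

aws iam create-access-key --user-name s3-to-bq-reader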

4. Clone the git repo in Cloud Shell, which comes pre-installed with the required tools. Alternatively, use any machine with JDK 8+, Maven and Git installed.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

5. Obtain authentication credentials (to submit the job).

gcloud auth application-default login
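
If not already configured, you may also want to point gcloud at your project and make sure the Dataproc API is enabled:

gcloud config set project <gcp-project-id>
gcloud services enable dataproc.googleapis.com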

6. Execute the AWS S3 to BigQuery Dataproc Serverless template:

GCP_PROJECT=<gcp-project-id> \
REGION=<region> \
SUBNET=<subnet> \
GCS_STAGING_BUCKET=<gcs-staging-bucket-folder> \
HISTORY_SERVER_CLUSTER=<history-server> \
bin/start.sh \
-- --template S3TOBIGQUERY \
--templateProperty s3.bq.access.key=<s3-access-key> \
--templateProperty s3.bq.secret.key=<s3-secret-key> \
--templateProperty s3.bq.input.format=<avro,parquet,csv,json> \
--templateProperty s3.bq.input.location=<s3-input-location> \
--templateProperty s3.bq.output.dataset.name=<bq-dataset-name> \
--templateProperty s3.bq.output.table.name=<bq-output-table> \
--templateProperty s3.bq.ld.temp.bucket.name=<temp-bucket>
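
Once submitted, the batch can be monitored from the Dataproc Batches page in the console or with gcloud (the batch ID and region are placeholders):

gcloud dataproc batches list --region=<region>
gcloud dataproc batches describe <batch-id> --region=<region>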

7. Optionally update the following properties in the template.properties file. Note that template properties provided as arguments in the execution command will have priority over those specified in the template.properties file.

s3.bq.access.key=<s3-access-key>
s3.bq.secret.key=<s3-secret-key>
s3.bq.input.format=<avro,parquet,csv,json>
s3.bq.input.location=<s3-input-location>
s3.bq.output.dataset.name=<bq-dataset-name>
s3.bq.output.table.name=<bq-output-table>
s3.bq.ld.temp.bucket.name=<temp-bucket>
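
After the batch completes, a quick way to sanity-check the load is to count the rows in the destination table with the bq CLI (dataset and table names are placeholders):

bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS row_count FROM `<bq-dataset-name>.<bq-output-table>`'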
