GCS to BigQuery via Dataproc Serverless: Part 2 (Development)

Amandeep Saluja
5 min read · Nov 8, 2023

Ciao 👋

In Part 1, we saw an overview of our ETL pipeline. In this post, we are going to focus on development and deployment of our pipeline. This post is part of GCS to BigQuery Pipeline via Different GCP Services project.

Folder Structure

Below is how my repo is structured:

📦gcs-to-bigquery-via-dataproc-serverless
┣ 📂.github
┃ ┗ 📂workflows
┃ ┃ ┗ 📜create-dataproc-spark-job.yml
┣ 📂infra
┃ ┣ 📜main.tf
┃ ┣ 📜providers.tf
┃ ┗ 📜variables.tf
┣ 📂src
┃ ┣ 📜config.yaml
┃ ┣ 📜main.py
┃ ┗ 📜requirements.txt
┗ 📜README.md
  • infra folder contains the Terraform files used to create and deploy the Cloud Function that submits the Dataproc job.
  • src folder contains the source code.

Source Code

In this section, we will work on the code for the Cloud Function that creates a Dataproc Serverless job via a Cloud Storage trigger.

requirements.txt

Standard stuff: functions-framework gives us the Cloud Function entry point, google-api-python-client and google-auth let us call the Dataproc API, and PyYAML reads the config file.

functions-framework==3.2.0
google-api-python-client==2.105.0
google-auth==2.23.3
PyYAML==6.0.1

config.yaml

# General Config
PROJECT_ID: "gcp-practice-project-aman"
REGION: "us-central1"

# BigQuery Config
BQ_DATASET: "raw_layer"
BQ_TABLE: "xlxs_to_csv_pipeline"

# Cloud Function Config
EXCEL_TO_CSV_CLOUD_FUNCTION…

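To see how these pieces fit together, below is a minimal sketch of a Cloud Function that submits a Dataproc Serverless batch whenever a file lands in the bucket. This is an illustrative sketch, not the repo's actual main.py: the spark_job.py URI, the function name, and the batch arguments are placeholder assumptions.

import functions_framework
import yaml
from googleapiclient.discovery import build

# Load pipeline settings shipped alongside the function code
with open("config.yaml") as f:
    config = yaml.safe_load(f)


@functions_framework.cloud_event
def submit_dataproc_batch(cloud_event):
    """Fires on a GCS object event and submits a Dataproc Serverless batch."""
    data = cloud_event.data
    gcs_uri = f"gs://{data['bucket']}/{data['name']}"

    parent = f"projects/{config['PROJECT_ID']}/locations/{config['REGION']}"

    # PySpark batch that reads the new file and writes to the BigQuery table
    # defined in config.yaml. The mainPythonFileUri below is a placeholder.
    batch = {
        "pysparkBatch": {
            "mainPythonFileUri": "gs://<your-code-bucket>/spark_job.py",  # placeholder
            "args": [
                f"--input={gcs_uri}",
                f"--output={config['PROJECT_ID']}.{config['BQ_DATASET']}.{config['BQ_TABLE']}",
            ],
        },
    }

    # google-api-python-client discovers the Dataproc v1 REST API and uses
    # the function's default service account credentials via google-auth.
    dataproc = build("dataproc", "v1")
    dataproc.projects().locations().batches().create(parent=parent, body=batch).execute()

The Terraform in the infra folder is what deploys this function and points its Cloud Storage trigger at the landing bucket, so the event-to-batch wiring lives in infrastructure code rather than in the function itself.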