GCS to BigQuery via Dataproc Serverless: Part 1 (Overview)

Amandeep Saluja
2 min read · Nov 8, 2023

Bonjour 👋

As part of our GCS to BigQuery Pipeline via Different GCP Services project, we will be using Dataproc Serverless to process our Excel file.

In this article, we will focus on understanding the flow of our process. So, let's get started :)

Technologies Used

  1. GCP Services
    - BigQuery
    - Cloud Functions
    - Cloud Storage
    - Dataproc Serverless (Apache Spark)
    - Workload Identity Federation
  2. GitHub Actions
  3. Python
  4. Terraform

ETL Flow

Okay, let's see what we are trying to do here.

Step 1: We will be dropping Excel files to a GCS bucket.

Step 2: The dropped Excel file will trigger a Cloud Function that creates a Dataproc Serverless job using the Google-provided template.

However, the Dataproc Serverless template doesn't support Excel files as input, so we first need to convert the file to CSV.

Step 3: Trigger the XLSX-to-CSV Cloud Function which we developed in this post.
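As a rough sketch, the conversion step might look like the following. This is illustrative only (the actual function from the earlier post may differ); it assumes pandas with the openpyxl engine, and the paths and function name are placeholders:

```python
# Illustrative XLSX-to-CSV conversion; requires pandas plus the
# openpyxl engine for reading .xlsx files.
import pandas as pd

def xlsx_to_csv(xlsx_path: str, csv_path: str) -> str:
    """Read the first sheet of an Excel workbook and write it out as CSV."""
    df = pd.read_excel(xlsx_path)  # uses openpyxl under the hood for .xlsx
    df.to_csv(csv_path, index=False)
    return csv_path
```

In the real pipeline this would run inside the Cloud Function, reading from and writing back to GCS rather than local paths.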

Step 4: Now that we have the CSV available, we will create the Dataproc Serverless job via the same function used in Step 2.
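The batch submission shared by Steps 2 and 4 can be sketched roughly as follows. This is an outline, not the final code: the project, region, template path, and dataset/table names are placeholders, and the `--gcs.bigquery.*` argument names are assumptions based on the Google-provided GCSTOBIGQUERY template from the Dataproc templates repo:

```python
# Illustrative sketch of the Cloud Function from Steps 2 and 4.
# PROJECT_ID, REGION, the template path, and the BigQuery dataset/table
# are placeholders; the --gcs.bigquery.* flags follow the convention of
# the Google-provided GCSTOBIGQUERY Dataproc Serverless template.
PROJECT_ID = "my-project"
REGION = "us-central1"

def build_batch_request(bucket: str, name: str) -> dict:
    """Build the Dataproc Serverless batch payload for a CSV sitting in GCS."""
    return {
        "pyspark_batch": {
            # Entry point of the Dataproc templates repo (placeholder path).
            "main_python_file_uri": "gs://my-templates-bucket/main.py",
            "args": [
                "--template=GCSTOBIGQUERY",
                f"--gcs.bigquery.input.location=gs://{bucket}/{name}",
                "--gcs.bigquery.input.format=csv",
                "--gcs.bigquery.output.dataset=my_dataset",
                "--gcs.bigquery.output.table=my_table",
                "--gcs.bigquery.output.mode=append",
            ],
        },
    }

def submit_batch(event, context):
    """Cloud Function entry point, triggered by a GCS object-finalize event."""
    from google.cloud import dataproc_v1  # requires google-cloud-dataproc

    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    return client.create_batch(
        parent=f"projects/{PROJECT_ID}/locations/{REGION}",
        batch=build_batch_request(event["bucket"], event["name"]),
    )
```

Splitting the request-building out of the submission keeps the function easy to unit-test without hitting the Dataproc API.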

Step 5: Load the results into BigQuery.

Now that we have our steps laid out, let's jump into the development. See you there.

Au revoir! 👋
