GCS to BigQuery Pipeline via Different GCP Services

Amandeep Saluja
2 min read · Oct 31, 2023

Welcome back to another big data engineering project that we are going to build together.

In this post, I’m going to list (and link to) all the Medium articles I will publish in the coming days, each using different GCP services. The goal is to get familiar with the different services and see how they work.

The goal of all upcoming projects will be the same: process an Excel file dropped into a bucket and load its data into BigQuery.

Why Excel? Well, every business uses it on a daily basis, and plenty of pre-built templates already exist for processing CSV or Parquet files. We are going to come across a lot of challenges when building data pipelines for Excel. So, buckle up :)

Technologies Used

  1. Apache Airflow
  2. Apache Beam
  3. Apache Spark
  4. Docker
  5. GCP Services
    - BigQuery
    - Cloud Composer (Apache Airflow)
    - Cloud Functions
    - Cloud Storage
    - Dataflow Flex Template (Apache Beam)
    - Dataflow Google Provided Template (Apache Beam)
    - Dataproc Serverless (Apache Spark)
    - Eventarc
    - Workflows
    - Workload Identity Federation
  6. GitHub Actions
  7. Python
  8. Terraform

Below are links to all the resources I have published (and will publish).

Before we start anything, we will create a generic HTTP Cloud Function (XLSX to CSV Cloud Function) that can be used across all pipelines. We need it because Dataproc Serverless and Dataflow do not have pre-built templates that handle Excel as input.
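To give a rough idea of what such a function might do, here is a minimal sketch. It is not the function built in the linked articles. The request shape (a JSON body with `bucket` and `name`), the helper names, and the choice of `openpyxl` for reading the workbook are all assumptions made for illustration. Downloading from and uploading to Cloud Storage, and wiring the handler into an actual Cloud Functions entry point, are left out.

```python
from __future__ import annotations

import csv
import io
import posixpath

# Assumption: only these extensions are treated as Excel input.
EXCEL_SUFFIXES = (".xlsx", ".xlsm")


def csv_object_name(xlsx_name: str) -> str:
    """Derive the output object name by swapping the Excel extension for .csv."""
    root, ext = posixpath.splitext(xlsx_name)
    if ext.lower() not in EXCEL_SUFFIXES:
        raise ValueError(f"not an Excel object: {xlsx_name!r}")
    return root + ".csv"


def xlsx_bytes_to_csv(data: bytes) -> str:
    """Convert the active sheet of an in-memory workbook to CSV text."""
    # Imported lazily so the helpers around it work without openpyxl installed.
    from openpyxl import load_workbook

    workbook = load_workbook(io.BytesIO(data), read_only=True, data_only=True)
    sheet = workbook.active
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in sheet.iter_rows(values_only=True):
        # Empty cells come back as None; write them as empty strings.
        writer.writerow(["" if value is None else value for value in row])
    workbook.close()
    return out.getvalue()


def handle_request(payload: dict) -> tuple[dict, int]:
    """Validate a hypothetical {"bucket": ..., "name": ...} JSON body."""
    bucket = payload.get("bucket")
    name = payload.get("name")
    if not bucket or not name:
        return {"error": "both 'bucket' and 'name' are required"}, 400
    try:
        target = csv_object_name(name)
    except ValueError as exc:
        return {"error": str(exc)}, 400
    # A real function would download the source object here, pass its bytes
    # to xlsx_bytes_to_csv, and upload the result under `target`.
    return {"bucket": bucket, "source": name, "target": target}, 200
```

Keeping the conversion in a function that takes bytes and returns text means the same logic could sit behind any of the triggers used in the pipelines below, with only the surrounding I/O changing.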

GCS to BigQuery via Cloud Functions

  1. Overview, Setup, and Deployment

GCS to BigQuery via Workflows

  1. Overview, Setup, and Deployment

GCS to BigQuery via Dataflow Google Provided Template

  1. Overview
  2. Dataflow Job Submit Cloud Function (Cloud Event Trigger)
    a. Function Setup
    b. Terraform Setup
    c. Deployment via GitHub Actions
  3. Troubleshooting

GCS to BigQuery via Dataflow Flex Template

  1. Overview
  2. Flex Template
    a. Pipeline Setup
    b. Terraform Setup
    c. Docker Setup
    d. Deployment via GitHub Actions
  3. Dataflow Job Submit Cloud Function (Cloud Event Trigger)
    a. Function Setup
    b. Terraform Setup
    c. Deployment via GitHub Actions
  4. Troubleshooting

There are more services I’m going to explore, and I will keep this list updated for your reference. If there are other GCP services you think I should use, please let me know in the comments below.

Okay. Let’s start this journey ;)
