Keeping Track of Hundreds of Thousands of Google Cloud Storage (GCS) Objects with a Go-based Cloud Function, Provisioned with Terraform

Muhammad Ilham H
EDTS
Published in
6 min read · Aug 16, 2024

Background

As data engineers, we commonly receive files in different formats, such as CSV or TXT. These files are then processed through an ETL pipeline using services like Spark or GCP Dataflow, or simply loaded with a data warehouse connector and transformed later using native SQL or a data modelling framework like dbt or Dataform.

Since other services rely on this process, its performance has to be considered.

Introduction

We were using Google Cloud Storage (GCS) as our data lake, with bucket notifications enabled to track all object activity within it. Each notification sends a JSON-formatted message to GCP Pub/Sub, acting as the message broker, before the message is forwarded to a Python-based Cloud Function (CF) and stored in Google BigQuery (BQ). The architecture is shown in the Architecture section.
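To give a feel for what the Cloud Function receives, the sketch below decodes the JSON payload of a GCS bucket notification. Only a subset of the documented fields is modelled here, and the sample values are made up for illustration; the real payload carries more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GCSEvent mirrors a subset of the JSON payload that GCS bucket
// notifications publish to Pub/Sub for each object event.
type GCSEvent struct {
	Bucket      string `json:"bucket"`
	Name        string `json:"name"`
	Generation  string `json:"generation"`
	TimeCreated string `json:"timeCreated"`
	Size        string `json:"size"` // size is sent as a string, not a number
}

// parseGCSEvent unmarshals a notification payload into a GCSEvent.
func parseGCSEvent(data []byte) (GCSEvent, error) {
	var ev GCSEvent
	err := json.Unmarshal(data, &ev)
	return ev, err
}

func main() {
	// Abbreviated sample payload; field values are illustrative only.
	sample := []byte(`{"bucket":"my-data-lake","name":"incoming/file_0001.csv",` +
		`"generation":"1723790000000000","timeCreated":"2024-08-16T07:00:00.000Z","size":"1024"}`)
	ev, err := parseGCSEvent(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s/%s (%s bytes)\n", ev.Bucket, ev.Name, ev.Size)
}
```

Since BigQuery is the final destination, a struct like this can double as the row schema when inserting via the BQ client.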

Our current daily trend of new files ingested into GCS is shown below.

Our Cloud Function was able to track this activity with the specification and performance below.

  • Code language: Python
  • Memory allocated: 256 MiB
  • Max instance count: 50

As visualised in the picture, the Cloud Function delivered the following mean (50th percentile) performance.

  • Requests per second: ~15 requests/second
  • Execution time per request: 620 ms
  • Memory usage: 105 MiB
  • Instance count used (maximum): 20

What Is This Article’s Goal?

The goal of this article is to reduce the execution time per request by migrating the code from Python to Golang, because the information logged by this Cloud Function is consumed by a SensorOperator in Apache Airflow. The longer a request takes to process, the longer Airflow has to wait, which blocks the worker pool from doing other tasks. There is no specific number for this goal (since it also depends on the BigQuery API), but this article sets the target at under 100 ms.

The Cloud Function will be retained, but this article adds Terraform as Infrastructure as Code (IaC) to provision all of the necessary services.

*This article is intended as a research report.

*For production use, some tweaks have to be made, which are mentioned below.

Architecture

This figure shows the complete architecture of our research. Overall, no services were replaced.

All of the services will be deployed and documented by Terraform, with the service variables and resource stack available under the /build folder.
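As a rough sketch of what such a stack contains, a Go-runtime Cloud Function resource in Terraform looks something like the following. The resource names, entry point, and the referenced bucket/topic here are illustrative assumptions, not the actual contents of the repo's /build folder.

```hcl
# Illustrative only: a 2nd-gen Cloud Function on the Go runtime,
# triggered by the Pub/Sub topic that receives GCS notifications.
resource "google_cloudfunctions2_function" "gcs_log" {
  name     = "gcs-bucket-notif-log-bq" # assumed name
  location = var.region
  project  = var.project_id

  build_config {
    runtime     = "go122"
    entry_point = "HandleGCSNotification" # assumed entry point
    source {
      storage_source {
        bucket = google_storage_bucket.source.name
        object = google_storage_bucket_object.source_zip.name
      }
    }
  }

  service_config {
    available_memory   = "128Mi"
    max_instance_count = 50
  }

  event_trigger {
    event_type   = "google.cloud.pubsub.topic.v1.messagePublished"
    pubsub_topic = google_pubsub_topic.gcs_notifications.id
  }
}
```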

Build

  1. Clone the source code from the GitHub repository.
git clone https://github.com/ilhamhanif/gcs-bucket-notif-log-bq.git
cd gcs-bucket-notif-log-bq
  2. Deploy all of the services with Terraform.

Make sure to create a project in GCP by following the official documentation. All of the required APIs have been declared and will be enabled automatically by Terraform.

Make sure to change the project_id variable in the /build/variable-dev.tfvars file.

Then deploy all the services with the following commands.

bash terraform-run.sh init -upgrade
bash terraform-run.sh build deploy variable-dev

Demonstration

The following steps will be used during the demonstration.

  1. Running the object sequencer
  2. Check the data in BigQuery
  3. Check Cloud Function performance
  4. Cleaning Up

1. Running the Object Sequencer

A script, main.go, under the /sequencer folder, orchestrates the GCS objects. This sequencer creates 10,000 files in GCS and deletes them immediately afterwards, producing 20,000 events in total to be processed by the Cloud Function.
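The fan-out shape of such a sequencer can be sketched as below. The worker-pool structure and object naming are illustrative assumptions; the real main.go performs actual GCS writes and deletes (e.g. via cloud.google.com/go/storage), which are injected here as a callback so the sketch runs offline.

```go
package main

import (
	"fmt"
	"sync"
)

// runSequencer fans out create-then-delete work for n objects across
// `workers` goroutines. The createDelete callback performs the actual
// GCS calls (object write + delete); injecting it keeps this sketch
// testable without cloud credentials.
func runSequencer(n, workers int, createDelete func(name string) error) int {
	names := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				if err := createDelete(name); err == nil {
					mu.Lock()
					done++
					mu.Unlock()
				}
			}
		}()
	}

	for i := 0; i < n; i++ {
		names <- fmt.Sprintf("sequencer/file_%05d.txt", i)
	}
	close(names)
	wg.Wait()
	return done
}

func main() {
	// Offline stand-in for the real GCS create+delete; each successful
	// object yields two bucket-notification events (create and delete).
	processed := runSequencer(10000, 16, func(name string) error { return nil })
	fmt.Println("objects processed:", processed, "events emitted:", processed*2)
}
```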

Make sure to change the project_id variable to match the GCP project ID you created earlier.

Run the sequencer with this command.

cd sequencer
go run main.go

2. Check the data in BigQuery

A test script, sequencer_test.go, had been created under the /test folder. It checks that all events were successfully processed by the Cloud Function and stored in BQ. Each event type should have 10,000 rows, matching the number of files orchestrated in demonstration step 1.
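Conceptually, the check boils down to a per-event-type row count for one sequencer run. The snippet below only builds such a query string; the table and column names (gcs_object_log, job_id, event_type) are assumptions for illustration, not the repo's actual schema, and the real test executes the query through the BigQuery client.

```go
package main

import "fmt"

// buildCountQuery sketches the kind of verification query the test
// script runs: count stored rows per event type for one sequencer job.
// Dataset, table, and column names are hypothetical.
func buildCountQuery(projectID, jobID string) string {
	return fmt.Sprintf(
		"SELECT event_type, COUNT(*) AS n FROM `%s.dataset.gcs_object_log` "+
			"WHERE job_id = '%s' GROUP BY event_type",
		projectID, jobID)
}

func main() {
	// Each event type is expected to return n = 10000 for a full run.
	fmt.Println(buildCountQuery("my-project", "job-123"))
}
```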

Make sure to change the project_id and the job_id, which can be found in the sequencer's return value printed to the console.

Run the test with the following commands.

cd test
go test -v sequencer_test.go

3. Check Cloud Function Performance

Go to Cloud Function page to check the performance.

As visualised in the picture, the Cloud Function delivered the following mean (50th percentile) performance.

  • Requests per second: ~13 requests/second
  • Execution time per request: 80 ms
  • Memory usage: 70 MiB
  • Instance count used (maximum): 5

To summarize, the execution time per request was reduced by 87%, from 620 ms to 80 ms. As a result, the maximum number of instances used dropped by 15, from 20 to 5, and memory usage fell by roughly a third, from 105 MiB to 70 MiB. The lower memory usage allows us to reduce the allocated memory from 256 MiB to 128 MiB.
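The reduction percentages follow from simple ratios over the measured values:

```go
package main

import "fmt"

// percentReduction returns how much `after` is reduced
// relative to `before`, in percent.
func percentReduction(before, after float64) float64 {
	return (1 - after/before) * 100
}

func main() {
	fmt.Printf("execution time: %.0f%%\n", percentReduction(620, 80)) // ~87%
	fmt.Printf("memory usage:   %.0f%%\n", percentReduction(105, 70)) // ~33%
	fmt.Printf("monthly cost:   %.0f%%\n", percentReduction(87, 12))  // ~86%
}
```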

Despite fewer instances and less allocated memory, the function is still able to handle the same rate of requests per second.

If this system is scaled to receive 20 million requests per month, it will cut the cost from $87 to $12, roughly 86% (calculated with the Google Cloud Pricing Calculator).

4. Cleaning Up

Use the command below to clean up everything.

bash terraform-run.sh destroy variable-dev

Improvement

As mentioned above, this article is intended as a research report. Several tweaks and improvements are needed to use it in production, as mentioned below.

  1. Consider creating a CI/CD process that ties the Terraform stack to code changes in the Git repository.
  2. Consider using a Terraform backend other than the local backend used in this article, such as GCS, S3, or any other backend from the Terraform backend documentation.
  3. Collaborate with your team to define the best folder structure for all the infrastructure (consider using Terraform Modules).
