Google Cloud Function Beyond the Limits
Effective utilization of Cloud Function using BigQuery computing power
Google Cloud Functions is a serverless execution environment for building and connecting cloud services. The infrastructure is fully managed, so users can focus on application code without worrying about the underlying hardware and software.
Use Case
- Design a data processing pipeline to load data from Cloud Storage to BigQuery
- Extract and load data with basic transformations as soon as new file(s) arrive in a Cloud Storage bucket
- Input file size can vary from 1 MB to 1000 MB
Challenges
Several cloud services could serve this purpose, but the fine-grained, on-demand nature of Cloud Functions makes it an easy choice. That said, a few restrictions make it challenging to arrive at an optimal solution. A few examples are depicted below:
Solution
In this blog, the implemented solution addresses the challenge of loading large files that do not fit within Cloud Function limits. The hallmark of this approach is to use the essential features of Cloud Function while capitalizing on the compute power of BigQuery. The following steps explain how this is accomplished:
- The Cloud Function is invoked when new files arrive in the Cloud Storage bucket
- A BigQuery external table is created using ExternalConfig(), pointing to the file(s) in the Cloud Storage bucket
- A BigQuery INSERT statement reads data from the external table and loads it into the BigQuery transaction/master table
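The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository. It assumes a CSV input with a header row, a background-style Cloud Storage trigger whose event payload carries `bucket` and `name`, and a temporary external table definition passed through `QueryJobConfig(table_definitions=...)`. The entry point name `load_file`, the alias `ext_table`, the column names, and the target table `my_project.my_dataset.master_table` are all hypothetical.

```python
def gcs_uri(event):
    """Build the gs:// URI from a Cloud Storage trigger event payload."""
    return f"gs://{event['bucket']}/{event['name']}"


def build_insert_sql(target_table, external_alias, columns):
    """Compose the INSERT ... SELECT that copies rows out of the external table."""
    cols = ", ".join(columns)
    return f"INSERT INTO `{target_table}` ({cols}) SELECT {cols} FROM {external_alias}"


def load_file(event, context):
    """Hypothetical entry point: runs when a new object lands in the bucket."""
    # Imported here so the pure helpers above work without the client library.
    from google.cloud import bigquery

    # Hypothetical schema; in the article's layout this would come from schema.json.
    schema = [
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("name", "STRING"),
    ]

    # Describe the file in Cloud Storage as an external (federated) data source.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = [gcs_uri(event)]
    external_config.schema = schema
    external_config.options.skip_leading_rows = 1  # assumption: file has a header row

    # The external table definition exists only for the duration of this query.
    job_config = bigquery.QueryJobConfig(table_definitions={"ext_table": external_config})
    sql = build_insert_sql(
        "my_project.my_dataset.master_table",  # hypothetical target table
        "ext_table",
        ["id", "name"],
    )

    # BigQuery reads the file and performs the insert; the function only waits.
    client = bigquery.Client()
    client.query(sql, job_config=job_config).result()
```

Because BigQuery reads the file directly from Cloud Storage, the function itself never holds the file contents in memory. It only submits the query and waits for it to finish.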
Cloud Function Source Code
The Cloud Function source code is split into multiple files:
- Schema File (schema.json): This file contains BigQuery external table column names and data types.
- Configuration File (config.json): This file contains BigQuery external and internal table input parameters.
- Library File (bqlib.py): This file contains user-defined library functions to perform data extractions and data loading.
- Entry Point/Main File (bqDataLoad.py): This file contains the Cloud Function entry point code, which invokes the library functions.
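To give a feel for how the schema and configuration files might be consumed, here is a small sketch. The JSON contents below are hypothetical stand-ins, since the actual files live in the repository and may use different keys. The helper names `column_names` and `load_config` are illustrative only.

```python
import json

# Hypothetical contents of schema.json: column names and data types.
SCHEMA_JSON = """
[
  {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "name", "type": "STRING", "mode": "NULLABLE"}
]
"""

# Hypothetical contents of config.json: external and target table parameters.
CONFIG_JSON = """
{
  "external_table": {"alias": "ext_table", "source_format": "CSV", "skip_leading_rows": 1},
  "target_table": "my_project.my_dataset.master_table"
}
"""


def column_names(schema_text):
    """Return the column names listed in the schema, in file order."""
    return [field["name"] for field in json.loads(schema_text)]


def load_config(config_text):
    """Parse the configuration and fail early if a required key is absent."""
    config = json.loads(config_text)
    missing = {"external_table", "target_table"} - config.keys()
    if missing:
        raise ValueError(f"config is missing keys: {sorted(missing)}")
    return config


if __name__ == "__main__":
    print(column_names(SCHEMA_JSON))
    print(load_config(CONFIG_JSON)["target_table"])
```

Schema entries in this name/type/mode shape can be turned into BigQuery schema fields with `bigquery.SchemaField.from_api_repr`, which keeps the table definition out of the code and in a file that can change independently.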
Test Result
The test result illustrated below demonstrates the successful loading of a 681.2 MB file using Cloud Function.
Source Code Repository
https://github.com/soumendra-mishra/bigquery-data-loader.git