Automating BigQuery Table Creation from Google Cloud Storage Using Dataflow and Cloud Functions

Sushithks
4 min read · May 13, 2024


Introduction

In today’s data-centric landscape, organizations frequently need to process substantial datasets quickly and efficiently. Google Cloud Platform (GCP) provides robust tools for this: Google Cloud Storage (GCS) for data storage, BigQuery for analytics, Dataflow for data processing, and Cloud Functions for serverless computing. This article walks through automating the creation of BigQuery tables whenever a file is uploaded to a GCS bucket, using Dataflow and Cloud Functions.

Steps

1. Setting up the Environment: Ensure you have a GCP account and project set up. Enable the necessary APIs for Cloud Storage, BigQuery, Dataflow, and Cloud Functions in your project. Additionally, install the Python libraries needed to interact with GCP services.
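
For reference, this setup can be done from the command line. The service names below are the standard GCP API identifiers, and the Python packages listed are the ones assumed by the Cloud Function sketched later in this article:

# Enable the APIs used in this walkthrough (run once per project)
gcloud services enable \
    storage.googleapis.com \
    bigquery.googleapis.com \
    dataflow.googleapis.com \
    cloudfunctions.googleapis.com

# Python client libraries used by the Cloud Function sketched later
pip install google-api-python-client google-auth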

2. Writing the Cloud Function: Develop a Cloud Function triggered by file uploads to a specific GCS bucket. This function should extract metadata from the uploaded file, such as its name and location, and trigger a Dataflow job.
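
As a rough sketch (the function name matches the one used later in this article; everything else is illustrative), the event payload delivered to the function already carries the metadata we need:

def gcs_to_bigquery_trigger_job(event, context):
    # 'event' carries the metadata of the uploaded object
    bucket = event["bucket"]      # bucket that received the file
    file_name = event["name"]     # object path within the bucket
    print(f"New file gs://{bucket}/{file_name} - launching Dataflow job")
    # The full version, which actually launches the Dataflow job, appears later.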

3. Developing the Dataflow Pipeline: Rather than writing a pipeline from scratch, use the existing Dataflow template for text-files-to-BigQuery ingestion that Google provides in Cloud Storage. The template extracts data from text files in GCS and loads it into BigQuery, and the job can be configured to create the destination table dynamically from the supplied schema.
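
The parameters listed later in this article correspond to the Google-provided GCS_Text_to_BigQuery template. Before automating anything, the template can be exercised manually; the bucket, dataset, and region values below are placeholders:

gcloud dataflow jobs run manual-test-job \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --region us-central1 \
    --parameters \
inputFilePattern=gs://my-bucket/input/sample.csv,\
JSONPath=gs://my-config-bucket/schema.json,\
javascriptTextTransformGcsPath=gs://my-config-bucket/schema_function.js,\
javascriptTextTransformFunctionName=schema_function,\
outputTable=my-project:my_dataset.my_table,\
bigQueryLoadingTemporaryDirectory=gs://my-config-bucket/tmp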

Dataflow pipeline

4. Configuring the Cloud Function Trigger: Deploy the Cloud Function with a trigger on the GCS bucket so that every file upload event invokes it and, in turn, launches the Dataflow job. Pass the metadata of the uploaded file as parameters to the Dataflow job.
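
One way to wire up the trigger is shown below. This is a sketch assuming a 1st-gen Cloud Function whose code lives in main.py in the current directory; the bucket and region names are placeholders. The google.storage.object.finalize event fires whenever a new object is created in the bucket:

gcloud functions deploy gcs_to_bigquery_trigger_job \
    --runtime python310 \
    --region us-central1 \
    --entry-point gcs_to_bigquery_trigger_job \
    --trigger-resource my-input-bucket \
    --trigger-event google.storage.object.finalize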

Sample CSV file

JavaScript File for Dataflow:

  • The JavaScript file contains a function named schema_function that takes a line of text as input, splits it on commas, builds a JSON object with the expected fields, converts the object to a JSON string, and returns it. The Dataflow pipeline uses this function as its transform (UDF) to convert each input line into JSON.
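
The exact fields depend on the uploaded CSV; the sketch below assumes a file with three columns (id, name, score), so the field names are illustrative rather than taken from the sample file above:

/**
 * UDF for the Dataflow text-to-BigQuery template.
 * Receives one line of the input file, splits it on commas,
 * maps the values to named fields, and returns a JSON string.
 * (ES5-style var is used because the template's JavaScript runtime is conservative.)
 */
function schema_function(line) {
  var values = line.split(',');
  var obj = {
    id: values[0],
    name: values[1],
    score: values[2]
  };
  return JSON.stringify(obj);
}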

JSON File for Data Schema:

  • Create a JSON file that defines the schema of the data to be processed by the Dataflow job. This schema should match the fields generated by the schema_function in the JavaScript file.
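
For this template the schema file is expected to use a top-level "BigQuery Schema" key listing the columns. Matching the illustrative fields from the JavaScript sketch above:

{
  "BigQuery Schema": [
    { "name": "id",    "type": "STRING" },
    { "name": "name",  "type": "STRING" },
    { "name": "score", "type": "STRING" }
  ]
}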

Python File for Cloud Function:

  • The Python file defines a Cloud Function named gcs_to_bigquery_trigger_job that is triggered by events from Google Cloud Storage. The function uses the Dataflow API to launch a Dataflow job from a template stored in Cloud Storage.
  • It specifies the parameters required for the Dataflow job, including the input file pattern (inputFilePattern), the JSON schema file path (JSONPath), the output BigQuery table (outputTable), the temporary directory for loading data (bigQueryLoadingTemporaryDirectory), the path to the JavaScript schema function (javascriptTextTransformGcsPath), and the name of the JavaScript function (javascriptTextTransformFunctionName).
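
Putting the pieces together, a sketch of such a function might look like the following. It calls the Dataflow REST API through the Google API client; the project ID, region, bucket names, and table name are placeholders to replace with your own values:

from googleapiclient.discovery import build

# Placeholder values - replace with your own project, region, and buckets.
PROJECT = "my-project"
REGION = "us-central1"
TEMPLATE_PATH = "gs://dataflow-templates/latest/GCS_Text_to_BigQuery"
CONFIG_BUCKET = "gs://my-config-bucket"  # holds schema_function.js and schema.json


def gcs_to_bigquery_trigger_job(event, context):
    """Triggered by a new object in the input bucket; launches a Dataflow job."""
    bucket = event["bucket"]
    file_name = event["name"]

    # Dataflow job names may only contain lowercase letters, digits, and hyphens.
    job_name = "gcs-to-bq-" + file_name.lower().replace("/", "-").replace(".", "-").replace("_", "-")

    dataflow = build("dataflow", "v1b3")
    body = {
        "jobName": job_name,
        "parameters": {
            "inputFilePattern": f"gs://{bucket}/{file_name}",
            "JSONPath": f"{CONFIG_BUCKET}/schema.json",
            "javascriptTextTransformGcsPath": f"{CONFIG_BUCKET}/schema_function.js",
            "javascriptTextTransformFunctionName": "schema_function",
            "outputTable": f"{PROJECT}:my_dataset.my_table",
            "bigQueryLoadingTemporaryDirectory": f"{CONFIG_BUCKET}/tmp",
        },
    }

    response = (
        dataflow.projects()
        .locations()
        .templates()
        .launch(projectId=PROJECT, location=REGION, gcsPath=TEMPLATE_PATH, body=body)
        .execute()
    )
    print("Launched Dataflow job:", response.get("job", {}).get("id"))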

Final result

After configuring the Dataflow pipeline and Cloud Function as outlined, the process of uploading a file to Google Cloud Storage triggers the automated creation of a Dataflow job. This job applies the JavaScript schema function to transform the raw data into JSON format and loads it into the specified BigQuery table. As a result, the data is seamlessly ingested into BigQuery, where it can be easily queried and analyzed.
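
Once the load finishes, the table can be previewed with the BigQuery client library (the project, dataset, and table names below are the same placeholders used in the sketches above):

from google.cloud import bigquery

client = bigquery.Client()

# Preview a few rows from the newly created table
query = "SELECT * FROM `my-project.my_dataset.my_table` LIMIT 10"
for row in client.query(query).result():
    print(dict(row))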

Conclusion

By combining the capabilities of GCP’s serverless offerings such as Cloud Functions and Dataflow with existing Dataflow templates, organizations can automate the creation of BigQuery tables from data stored in GCS. This automation streamlines data ingestion and analysis workflows, enabling faster insights and decision-making. With proper configuration and monitoring, the solution scales to handle large volumes of data efficiently. Start automating your data pipeline today to unlock the full potential of your data analytics platform on Google Cloud Platform.
