Data Mesh Self Service — Ingestion Pattern from GCS (Google Cloud Storage) to BigQuery (Data Store for Mesh)

Atul Guleria
Google Cloud - Community
4 min read · Dec 31, 2022

One of the most frequently used ingestion use cases in Google Cloud is loading a file from GCS (Google Cloud Storage) to BigQuery. Many templates have already been built around this, each serving a different purpose. In this blog, we will discuss the various use cases and the specific tools/templates to apply to each. In addition, we will see how a single template can cater to multiple file formats.

Types of load:

  1. Schedule-Based: If there is a fixed time of arrival of the file in GCS, or the load is expected to run at a particular time, the schedule-based approach can be used. Various factors such as file size, cost, and performance contribute to the selection of tools/services.
  • Cloud Composer: Whenever we think of a schedule-based load in GCP, Composer is the first thing that comes to mind. Composer is a fully managed workflow orchestration service built on Apache Airflow. It has various built-in operators, such as the GCSToBigQueryOperator, which can load a file directly from GCS to BigQuery (see the DAG sketch below this list). If we need any custom logic or transformation, it can be scripted in Python and run using the PythonOperator. Composer provides a very good UI that shows the loads for each day along with their logs, and it lets us rerun a job for a particular load directly from the UI. With all these features, however, Composer is quite costly, and that is a major factor in the selection of this tool.
  • Cloud Workflows: The other, more lightweight orchestration service in GCP is Cloud Workflows. It is considerably cheaper than Composer but comes with some limitations as well. For example, we can rerun a DAG in Composer from the UI for a particular date, but Workflows does not provide that flexibility. It also lacks the built-in operators that Composer offers.
  • In addition to these, there is Cloud Scheduler, which can be used in combination with other GCP services to load files from GCS to BigQuery.
Schedule-based load using an orchestration tool
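
To make this concrete, here is a minimal sketch of a Composer (Airflow) DAG using the GCSToBigQueryOperator. The bucket, object path, schedule, and table names are hypothetical placeholders, not from any real setup:

```python
# Minimal Composer (Airflow) DAG sketch: load a CSV from GCS into BigQuery.
# Bucket, object path, and table names below are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bq_daily_load",
    schedule_interval="0 6 * * *",  # daily at 06:00, after the file is expected to land
    start_date=datetime(2022, 12, 1),
    catchup=False,
) as dag:
    load_file = GCSToBigQueryOperator(
        task_id="load_sales_csv",
        bucket="my-ingest-bucket",              # hypothetical bucket
        source_objects=["incoming/sales.csv"],  # hypothetical object path
        destination_project_dataset_table="my_project.my_dataset.sales",
        source_format="CSV",
        skip_leading_rows=1,                    # skip the CSV header row
        write_disposition="WRITE_APPEND",
    )
```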

2. Event-Based: If the arrival time of files is not fixed and the files need to be loaded as and when they arrive, the event-based approach can be used. Cloud Functions is the service that can perform the task on the file-arrival event. We can write a custom wrapper to load the file to BigQuery, and the Cloud Function can call other GCP services under the hood to perform the file-loading task efficiently. A minimal sketch follows the caption below.

Event-based load using a Cloud Function
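
As an illustration, here is a minimal sketch of a Python Cloud Function triggered by a GCS object-finalize event that hands the new file straight to a BigQuery load job. The destination table name is a hypothetical placeholder:

```python
# Minimal sketch of an event-driven Cloud Function (1st gen, Python runtime)
# triggered when a new object is finalized in a GCS bucket.
# The destination table name is a hypothetical placeholder.
from google.cloud import bigquery

def load_to_bigquery(event, context):
    """Background function: load the newly arrived GCS file into BigQuery."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # schema autodetection; use an explicit schema in production
    )
    load_job = client.load_table_from_uri(
        uri, "my_project.my_dataset.events", job_config=job_config
    )
    load_job.result()  # block until the load job finishes
    print(f"Loaded {uri} into BigQuery")
```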

Catering to multiple file formats:

One of the use cases could be having multiple file formats for the same load. Suppose a table expects both a CSV and an Avro file to be loaded. Since the formats differ, validations such as schema checks and header checks must differ as well, and the load job that finally writes the file into BigQuery must also be configured for the specific file format.

For one template to be capable of loading various file formats, we can write a custom wrapper (say, in Python). It can check the file extension and perform the validation and load corresponding to that extension, so the same code can be extended with logic for additional file formats. The entry point stays the same, and the wrapper then performs a format-specific load, as sketched after the caption below.

Loading multiple file formats with a single template
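
Here is a minimal sketch of such a wrapper, assuming CSV and Avro as the supported formats. The table name and load configurations are hypothetical placeholders:

```python
# Minimal sketch of a format-dispatching load wrapper.
# Supported formats and the destination table are hypothetical placeholders.
from google.cloud import bigquery

FORMAT_CONFIGS = {
    ".csv": bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # CSV-specific: skip the header row
        autodetect=True,
    ),
    ".avro": bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,  # Avro files embed their own schema
    ),
}

def load_any_format(uri: str, table: str) -> None:
    """Single entry point: choose the load config from the file extension."""
    ext = "." + uri.rsplit(".", 1)[-1].lower()
    job_config = FORMAT_CONFIGS.get(ext)
    if job_config is None:
        raise ValueError(f"Unsupported file format: {uri}")
    client = bigquery.Client()
    client.load_table_from_uri(uri, table, job_config=job_config).result()

# Example usage (hypothetical names):
# load_any_format("gs://my-bucket/incoming/sales.avro", "my_project.my_dataset.sales")
```

Format-specific validations (for example, a header check for CSV) can be slotted into the same dispatch before the load job is submitted.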

Loading with transformations:

In some cases, the data in the file needs to be transformed before it gets loaded into BigQuery. The transformations may require some amount of compute and memory from the processing tool/service.

One such tool provided in GCP is Dataflow, a serverless, fast, and cost-effective unified stream and batch data processing service. It can be used to transform the data using Beam transformations and load it into BigQuery, and it can autoscale as per the file size and processing requirements. A sketch follows the caption below.

Cloud Dataflow to load a file with transformations
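
For illustration, here is a minimal sketch of a Beam pipeline (runnable on Dataflow) that reads a CSV from GCS, applies a simple transformation, and writes the rows to BigQuery. The file path, column names, and destination table are hypothetical:

```python
# Minimal Apache Beam pipeline sketch: GCS -> transform -> BigQuery.
# File path, column names, and destination table are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_and_transform(line: str) -> dict:
    """Parse a CSV line and apply a trivial transformation (uppercase country)."""
    user_id, country = line.split(",")
    return {"user_id": user_id, "country": country.strip().upper()}

# In practice the options also need project, region, and temp_location.
options = PipelineOptions(runner="DataflowRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText(
            "gs://my-bucket/incoming/users.csv", skip_header_lines=1
        )
        | "Transform" >> beam.Map(parse_and_transform)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.users",
            schema="user_id:STRING,country:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```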

Summary:

In the previous part of the series, Data Mesh Self Service — Ingestion Pattern from Spanner to BigQuery (Data Store for Mesh), we discussed the Spanner-to-BigQuery pattern and how to handle Automatic Schema Evolution in BigQuery for any schema change in Spanner.

In this part, we discussed the various approaches to loading files from GCS to BigQuery (schedule-based and event-based) and the various GCP service offerings for doing so. We also touched upon how we can cater to various file formats in one template by creating a custom wrapper. Then we discussed using Dataflow to load files that require data transformations.

In the coming part of the series, we will cover how to tackle Automatic Schema Evolution in BigQuery when additional columns arrive in the files, without any manual intervention.
