Data Mesh Self Service — Automatic Schema Evolution in BigQuery for Any Change in Spanner Using Spanner Change Streams

Atul Guleria
Google Cloud - Community
5 min read · Dec 23, 2022

In the previous blog of this series, ‘Data Mesh Self Service — Ingestion Pattern from Spanner to BigQuery (Data Store for Mesh)’, we discussed the various approaches to fetch data from Spanner to BigQuery. One of the preferred approaches for streaming Spanner changes to BigQuery used Spanner Change Streams and Dataflow. In this blog, we will discuss the schema evolution challenge at Spanner and the ways it can be resolved.

About Spanner Change Streams:

A change stream watches and streams out a Cloud Spanner database’s data changes — inserts, updates, and deletes — in near real-time. It can be used to replicate Spanner data changes to a data warehouse, such as BigQuery, for analytics. A Dataflow streaming job can be used to fetch the data from a change stream. Google also provides templates that let you rapidly build Dataflow pipelines for common change stream use cases, including sending all of a stream’s data changes to a BigQuery dataset.

Spanner to BigQuery using Spanner Change Stream and Dataflow job
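For reference, a change stream is created with plain Spanner DDL. Below is a minimal sketch using the Python client for Spanner; the project, instance, database, and stream names are placeholders:

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")  # assumed project ID
database = client.instance("my-instance").database("my-database")

# CREATE CHANGE STREAM is standard Spanner DDL; FOR ALL watches every
# table and column in the database.
operation = database.update_ddl(
    ["CREATE CHANGE STREAM EverythingStream FOR ALL"]
)
operation.result()  # update_ddl is asynchronous; wait for it to finish
```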

Objective:

Currently, if there is a schema change (column addition) at the Spanner database, the Dataflow job using the change stream fails and is not able to automatically propagate the change to BigQuery. The objective of this blog is to achieve automatic schema evolution at BigQuery whenever a column is added, and data is inserted or updated, in a Spanner table watched by a change stream.

Dataflow template ‘Cloud Spanner change streams to BigQuery’:

The Cloud Spanner change streams to BigQuery template is a streaming pipeline that streams Cloud Spanner data change records and writes them into BigQuery tables using Dataflow Runner V2.

If the necessary BigQuery tables don’t exist, the pipeline creates them. Otherwise, existing BigQuery tables are used. The schema of existing BigQuery tables must contain the corresponding tracked columns of the Cloud Spanner tables and the additional metadata columns that are not explicitly ignored by the ‘ignoreFields’ option. Each new BigQuery row includes all columns watched by the change stream from its corresponding row in your Cloud Spanner table at the change record’s timestamp.

All change stream watched columns are included in each BigQuery table row, regardless of whether they are modified by a Cloud Spanner transaction. Columns not watched are not included in the BigQuery row. Any Cloud Spanner changes below the Dataflow watermark are either successfully applied to the BigQuery tables or stored in the dead letter queue for retry. BigQuery rows are inserted out of order compared to the original Cloud Spanner commit timestamp ordering.
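The template can be launched from the console, with gcloud, or programmatically. The sketch below uses the Dataflow Python client (google-cloud-dataflow-client); all identifiers are placeholders, and the parameter names should be verified against the current template documentation:

```python
from google.cloud import dataflow_v1beta3

client = dataflow_v1beta3.FlexTemplatesServiceClient()

request = dataflow_v1beta3.LaunchFlexTemplateRequest(
    project_id="my-project",
    location="us-central1",
    launch_parameter=dataflow_v1beta3.LaunchFlexTemplateParameter(
        job_name="spanner-change-streams-to-bq",
        # Published flex template image spec for this pipeline.
        container_spec_gcs_path=(
            "gs://dataflow-templates-us-central1/latest/flex/"
            "Spanner_Change_Streams_to_BigQuery"
        ),
        parameters={
            "spannerInstanceId": "my-instance",
            "spannerDatabase": "my-database",
            "spannerMetadataInstanceId": "my-instance",
            "spannerMetadataDatabase": "my-metadata-db",
            "spannerChangeStreamName": "EverythingStream",
            "bigQueryDataset": "spanner_replica",
            # Optional: where DLQ error files are written (discussed below).
            "deadLetterQueueDirectory": "gs://my-dlq-bucket/dlq",
        },
    ),
)

response = client.launch_flex_template(request=request)
print(f"Launched job: {response.job.id}")
```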

Limitations of Dataflow Template:

Right now, if there is a schema change at the Spanner database (column addition) followed by any data change to the new column (insert or update), the Dataflow job fails because it is not able to add the column in BigQuery and move the data. It generates an error and writes it to a DLQ (Dead Letter Queue) file in Google Cloud Storage.
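To make the failure concrete, the sketch below reproduces it with the Python client for Spanner; the Singers table, Email column, and all identifiers are hypothetical:

```python
from google.cloud import spanner

database = (
    spanner.Client(project="my-project")
    .instance("my-instance")
    .database("my-database")
)

# 1. Schema change: add a column to a table watched by the change stream.
database.update_ddl(
    ["ALTER TABLE Singers ADD COLUMN Email STRING(MAX)"]
).result()

# 2. Data change touching the new column. This is the record the Dataflow
#    job cannot apply, since the Email column does not yet exist in
#    BigQuery; the failure lands in the DLQ instead.
def update_email(transaction):
    transaction.execute_update(
        "UPDATE Singers SET Email = 'singer@example.com' WHERE SingerId = 1"
    )

database.run_in_transaction(update_email)
```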

Dead Letter Queue in Dataflow Template:

Let us now discuss the DLQ (Dead Letter Queue) functionality in the Dataflow streaming template provided by Google to stream changes from Cloud Spanner change streams to BigQuery.

Whenever there is an issue/error while streaming data from Spanner change streams to BigQuery using the Dataflow job, the log gets written to an error file inside a DLQ bucket (the default is a directory under the Dataflow job’s temp location). Dataflow retries the DML operation five times and then moves the error file to a ‘severe’ folder inside the bucket.

One such error scenario is a column addition in a Spanner table that is being watched by the change stream. Whenever data is inserted or updated in the new column, it is picked up by the streaming Dataflow job. Since the column does not exist in BigQuery, the job generates a DLQ error file at the bucket location mentioned above.

Possible Solutions for Schema Evolution:

Now, to automatically perform the column addition and data insertion in BigQuery, the following approaches can be used:

Using Sensors in Cloud Composer:

If schema evolution is frequent and Composer is already being used in some other part of the project, we can utilise sensors in Composer to detect DLQ files in the bucket and perform the necessary tasks to evolve the BigQuery schema and load the data (a minimal DAG sketch follows the figure below).

Schema Evolution in BigQuery using Sensors in Cloud Composer
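The sketch below shows such a DAG. The GCSObjectsWithPrefixExistenceSensor comes from the Google provider for Airflow; the bucket, prefix, table, and column names are hypothetical, and evolve_schema() is a stand-in for your own parsing logic:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.sensors.gcs import (
    GCSObjectsWithPrefixExistenceSensor,
)
from google.cloud import bigquery


def evolve_schema(**context):
    """Add the missing column to the BigQuery table (all names assumed)."""
    client = bigquery.Client()
    table = client.get_table("my-project.spanner_replica.Singers")
    schema = table.schema[:]  # copy; the schema list is read-only
    schema.append(bigquery.SchemaField("Email", "STRING", mode="NULLABLE"))
    table.schema = schema
    client.update_table(table, ["schema"])  # column additions only


with DAG(
    dag_id="dlq_schema_evolution",
    start_date=datetime(2022, 12, 1),
    schedule_interval="*/10 * * * *",  # poll for DLQ files every 10 minutes
    catchup=False,
) as dag:
    wait_for_dlq_file = GCSObjectsWithPrefixExistenceSensor(
        task_id="wait_for_dlq_file",
        bucket="my-dlq-bucket",  # assumed DLQ bucket
        prefix="dlq/severe/",    # assumed 'severe' folder prefix
    )

    update_bigquery_schema = PythonOperator(
        task_id="update_bigquery_schema",
        python_callable=evolve_schema,
    )

    wait_for_dlq_file >> update_bigquery_schema
```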

Using Cloud Function:

Cloud Functions is a lightweight, serverless GCP service that can be invoked on various events. In our case, we can invoke it on the DLQ file creation event in the bucket and write custom code that checks the file content and performs the schema evolution in BigQuery (a sketch follows the figure below). If the column is added on the first try, a subsequent retry of the DLQ logic loads the data into BigQuery automatically, without any data loss.

Schema Evolution in BigQuery using Cloud Function
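The sketch below is written as a 1st gen Cloud Function triggered by object finalization in the DLQ bucket. The target table constant, the DLQ record layout, and the default STRING type are assumptions; inspect a real error file and adapt the parsing and type inference accordingly:

```python
import json

from google.cloud import bigquery, storage

BQ_TABLE = "my-project.spanner_replica.Singers"  # assumed target table


def on_dlq_file(event, context):
    """Triggered by google.storage.object.finalize on the DLQ bucket."""
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    records = [
        json.loads(line) for line in blob.download_as_text().splitlines()
    ]

    client = bigquery.Client()
    table = client.get_table(BQ_TABLE)
    existing = {field.name for field in table.schema}

    # Collect column names referenced by the failed records but missing in
    # BigQuery. The exact DLQ record layout is an assumption; inspect a
    # real error file and adjust the extraction accordingly.
    missing = set()
    for record in records:
        payload = record.get("message", record)
        if isinstance(payload, dict):
            missing.update(set(payload) - existing)

    if missing:
        schema = table.schema[:]  # copy; the schema list is read-only
        for name in sorted(missing):
            # Type inference is out of scope; default to STRING as a sketch.
            schema.append(
                bigquery.SchemaField(name, "STRING", mode="NULLABLE")
            )
        table.schema = schema
        client.update_table(table, ["schema"])  # column additions only
```

Once the column exists in BigQuery, the template’s built-in DLQ retry re-applies the queued rows, so no manual data load is required.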

In addition to these services, others such as Pub/Sub and Cloud Logging/Monitoring alerts can be used to notify the user that the schema of a table has changed.
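For example, a Pub/Sub notification can be attached to the DLQ bucket so that every new error file publishes a message that alerting or downstream automation can consume. A minimal sketch with the Cloud Storage Python client, using placeholder bucket and topic names:

```python
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-dlq-bucket")  # assumed DLQ bucket

# Publish a Pub/Sub message for every new object in the bucket.
notification = bucket.notification(
    topic_name="dlq-file-created",  # assumed, pre-created topic
    payload_format="JSON_API_V1",
    event_types=["OBJECT_FINALIZE"],
)
notification.create()  # bucket's service agent needs pubsub.topics.publish
```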

Summary:

In the previous part of the series, ‘Data Mesh Self Service — Ingestion Pattern from Spanner to BigQuery (Data Store for Mesh)’, we discussed the Spanner to BigQuery pattern and different data loading approaches, including streaming inserts using Spanner Change Streams and Dataflow.

In this part, we discussed how to tackle schema evolution automatically using various approaches. One approach is a Cloud Function that updates the BigQuery schema on the DLQ file arrival event. Another is using sensors in Cloud Composer, which check for the arrival of DLQ files in storage at regular intervals and update the BigQuery schema once a file is found in GCS.

In the coming parts of the series, we will cover other ingestion patterns such as File to BigQuery, Pub/Sub to BigQuery, and so on.
