Change Streams in Cloud Spanner | Replication to BigQuery

Palak Miharia
Google Cloud - Community
7 min read · Jun 29, 2022


Cloud Spanner is Google's highly scalable, fully managed relational database service. It offers the structure and SQL semantics of a traditional relational database while adding horizontal scaling and ACID transactions, and it provides strong consistency across multiple regions while still serving traffic at low latency. Its availability SLA of 99.999% works out to roughly 26 seconds of allowed downtime per month. In short, it is a no-compromise database.

Initially, Cloud Spanner had no built-in functionality to capture data changes. To overcome this drawback, Google has now introduced Cloud Spanner Change Streams. With this new feature, users can watch and stream out near-real-time data changes from Spanner to different cloud services. Here, we will specifically discuss replicating Spanner data changes to BigQuery for analytical purposes.

Methods to read change streams

  • Through Dataflow — using the Apache Beam SpannerIO connector, which is the recommended way to read change streams. Google has also provided Dataflow templates to build pipelines for common use cases, like streaming data changes to BigQuery and Google Cloud Storage, and we can expect more such templates in the near future to integrate with other services. This is the method we will discuss in this article.
  • Through the API — using the Change Streams Query API. This method allows the flexibility of writing custom code/queries that use the Spanner API to read change streams. The query must be executed using the ExecuteStreamingSql API. A special table-valued function (TVF) is created automatically along with the change stream; it provides access to change stream records and has no limit on the size of the returned result set. A sketch of such a query follows this list.
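As a minimal sketch, assuming a change stream named everything (the one created in the next section), the auto-generated TVF READ_everything could be queried like this; the timestamp and heartbeat values are illustrative:

    -- Read change records through the query API; this statement must be
    -- executed via the ExecuteStreamingSql API, not a regular query call.
    SELECT ChangeRecord
    FROM READ_everything (
      start_timestamp => "2022-06-29T00:00:00Z",
      end_timestamp => NULL,        -- NULL keeps the query open-ended
      partition_token => NULL,      -- NULL starts from the root partition
      heartbeat_milliseconds => 10000
    );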

Technical Architecture

Fig 1 : Architecture flow

Prerequisites

  • An instance created in Cloud Spanner with databases and tables
  • A change stream created to be read from
  • A dataset created in BigQuery
  • A Service Account must be created and given the following IAM permissions:
  1. to access databases, tables and indexes (spanner.databases.select)
  2. for schema changes (spanner.databases.updateDdl)
  3. read/write permissions to the GCS bucket (Storage Admin role)
  4. to initiate and execute the Dataflow job (Dataflow Worker role)
  5. read/write/query permissions to the BigQuery dataset (bigquery.dataEditor)

Create Change Streams

The first step is to create a change stream, assuming that the databases and tables already exist in Cloud Spanner. Spanner treats change streams just like other schema objects, so operations like create, modify and delete are performed on them using DDL statements. When creating a change stream we also get flexibility in what to track: specific tables, specific columns, or every object inside the database, and a single database can contain multiple change streams. We can also define the retention_period for the data being tracked by a particular change stream; the value can range from a minimum of 24 hours (1d) to a maximum of 7 days, with 24 hours as the default. Once the change stream is created, you can go back to the database overview and check which objects are being tracked by it.
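For example, the following DDL is a minimal sketch; the stream names and the student columns are illustrative:

    -- Watch every table and column in the database, retaining 7 days of changes.
    CREATE CHANGE STREAM everything
    FOR ALL
    OPTIONS (retention_period = '7d');

    -- Watch only specific columns of one table (key columns are always included).
    CREATE CHANGE STREAM student_stream
    FOR student(FirstName, LastName);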

Fig 2 : Create change stream
Fig 3 : Overview of spanner database

Create a Dataflow pipeline

The second step is to create a Dataflow job from an existing template. For now, there are only two templates for change streams; here we use the one that streams change records to BigQuery.
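Besides the console, the job can be launched from the command line. The sketch below assumes the Flex Template published by Google; the job name, region, instance, database, stream and dataset values are all placeholders, and the parameter names follow the template documentation at the time of writing:

    # Launch the "Cloud Spanner change streams to BigQuery" Dataflow template.
    gcloud dataflow flex-template run spanner-cs-to-bq \
      --region=us-central1 \
      --template-file-gcs-location=gs://dataflow-templates/latest/flex/Spanner_Change_Streams_to_BigQuery \
      --parameters spannerInstanceId=my-instance,spannerDatabase=my-database,spannerMetadataInstanceId=my-instance,spannerMetadataDatabase=my-database,spannerChangeStreamName=everything,bigQueryDataset=spanner_replica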

Fig 4 : Create Dataflow job
Fig 5 : Job template parameters

Once the job is created, it takes some time to reach the running state, since the Dataflow job runs on Apache Beam and its resources are spun up on demand from scratch; they are deleted once the job is stopped.

Fig 6 : Job Overview

As soon as the job starts running, we can trace the job graph and execution details from the job page.

Fig 7 : Job Graph
Fig 8 : Job Graph

Data Change — INSERT

All kinds of DML operations can be executed through the console. We start with an insert operation, which pushes data into the existing table (student); a hypothetical example follows.
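A minimal sketch of such an insert, assuming an illustrative student schema (the actual column names depend on your table):

    -- Insert a row into the watched table (illustrative columns).
    INSERT INTO student (StudentId, FirstName, LastName)
    VALUES (1, 'Jane', 'Doe');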

Fig 9 : Insert Operation in spanner

After the data is processed, the corresponding table is created automatically in BigQuery with the naming convention tablename_changelog — in this case, student_changelog. The pipeline maps the data types to their BigQuery equivalents and then replicates the schema as defined in Cloud Spanner. A few additional metadata fields are also added to the changelog table automatically as the data is processed.

Fig 10 : Bigquery Table

The inserted data in Cloud Spanner gets captured and shows up in the BigQuery table. To validate the type of operation captured, we can check the _metadata_spanner_mod_type column.
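For example, this BigQuery query (the project, dataset and non-metadata column names are illustrative) filters the changelog down to insert operations:

    -- Run in BigQuery against the auto-created changelog table.
    SELECT StudentId, FirstName, LastName, _metadata_spanner_mod_type
    FROM `my-project.spanner_replica.student_changelog`
    WHERE _metadata_spanner_mod_type = 'INSERT';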

Fig 11 : Inserted record in Bigquery Table

Data Change — UPDATE

Next, update one or more fields of an existing record in Cloud Spanner.
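A sketch of such an update, against the same illustrative schema as above:

    -- Change one column of the previously inserted row.
    UPDATE student
    SET LastName = 'Smith'
    WHERE StudentId = 1;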

Fig 12 : Update record in spanner

The updated fields can be seen in the BigQuery table, with other column values left unchanged. However, even though updates can be tracked using change streams, we must be aware that in BigQuery the existing record does not get updated in place. Instead, a new record is appended to the table carrying the updated values, with a mod_type value of UPDATE.

Fig 13 : Updated record in Bigquery table
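Because the changelog is append-only, reconstructing the current state of a row means picking its most recent entry. The following is a sketch, assuming the illustrative StudentId key and the _metadata_spanner_commit_timestamp metadata column added by the template:

    -- Keep only the latest changelog entry per key.
    SELECT * EXCEPT (row_rank)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (
               PARTITION BY StudentId
               ORDER BY _metadata_spanner_commit_timestamp DESC
             ) AS row_rank
      FROM `my-project.spanner_replica.student_changelog`
    )
    WHERE row_rank = 1;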

Data Change — DELETE

Now, delete one or more records in Cloud Spanner.
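A sketch of the delete, again against the illustrative schema:

    -- Remove the previously inserted row.
    DELETE FROM student
    WHERE StudentId = 1;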

Fig 14 : Delete record in spanner

The data gets deleted in Spanner, and the deletion shows up in the BigQuery table as a new changelog record whose non-key columns are null, with a mod_type value of DELETE.

Fig 15 : Deleted record entered as null in Bigquery Table

Data Change — ADD TABLE

Add a table while the streaming job is still running. The table shows up in Spanner, and since the change stream was defined FOR ALL, it automatically starts watching the newly created table.
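A sketch of such a table; the teacher schema is illustrative:

    -- New table created while the pipeline is already running.
    CREATE TABLE teacher (
      TeacherId INT64 NOT NULL,
      FirstName STRING(100),
      LastName  STRING(100),
    ) PRIMARY KEY (TeacherId);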

Fig 16 : Add new table in spanner
Fig 17 : Added table being watched by change stream

We can also run DML operations, such as inserts, against the newly created table.

Fig 18 : Perform DDL on new table

Even though the data gets inserted into the table in Cloud Spanner, it fails to reach BigQuery and is instead copied into the Cloud Storage bucket as a failed record. This happens because the template cannot capture changes happening at the schema or table level while the pipeline is running; for any such additional changes to be captured, a new pipeline must be created.

Note: The storage bucket is created automatically to store temp data, staging data and failed records as soon as a pipeline is established.

Fig 19 : GCS bucket capturing failed record
Fig 20 : Failed record in json format

Data Change — DROP COLUMN FROM EXISTING TABLE

Modify the schema of the existing table in Cloud Spanner.
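A sketch of the schema change, dropping an illustrative MiddleName column from the student table:

    -- DDL change on a table that is already being watched.
    ALTER TABLE student DROP COLUMN MiddleName;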

Fig 21 : DDL operation on table column
Fig 22 : Drop a column

The modified schema is reflected in Spanner, but the change stream cannot capture it: the table schema in BigQuery is not altered, nor is a new table with the changed schema created.

Fig 23 : Updated schema after dropping a column

Limitations of Change Streams

  • Only 10 change streams can be created per database.
  • Field types like JSON and STRUCT fail to get replicated into BigQuery.
  • A change stream watches all tables and columns only if it is defined FOR ALL; otherwise it watches only the objects it was defined for.
  • It can only capture changes resulting from insert, update and delete operations on objects defined before the change stream was created.
  • It can watch data changes only for user-created columns and tables.
  • It does not capture schema-level (DDL) modifications, for example adding/dropping tables or adding/dropping columns on existing tables.
  • A change stream does not watch other change streams.
  • As of now, there are only two Dataflow templates for change streams.

Conclusion

Change streams enable seamless streaming of data into BigQuery to ensure that the most recent data from Cloud Spanner is available for analytics and other downstream use cases.

Watch this space for more such articles. Any comments or feedback are more than welcome.

I would like to give special thanks to Sekhar Mandapati and Sagar Chavan for encouraging me to work on this piece, and also to Sushant Nigudkar for backing me up whenever I ran into hindrances.

References: https://cloud.google.com/blog/products/databases/track-and-integrate-change-data-with-spanner-change-streams
