Data Mesh Self Service — Ingestion Pattern from Spanner to BigQuery (Data Store for Mesh)

Sanchit Malhotra
Google Cloud - Community
Dec 14, 2022

A centralized-everything approach to creating and managing data sources is often considered the bottleneck for integrating data into a business’s analytical stack. With recent advancements in technology, it has become far easier to ingest, store, process, and access large volumes of data across ecosystems without costing a fortune.

Such advancement of technology needs to be complemented with equivalent changes in the organizational approach to manage data.

Data mesh is an architecture that addresses the failures of the centralized-everything approach, which is hard to scale and sustain from an organization’s point of view.

Principles of Data Mesh

  1. Decentralized Domain-Oriented Ownership
  2. Data as a Product
  3. Self-serve Data Platform
  4. Federated Governance

In the previous part of this series, ‘Build Data Product in Data Mesh using Templates’, we discussed using templates as one aspect of self-service for building and publishing data products in the mesh.

Data Store for Mesh

The application’s transactional storage could be any storage system, such as a SQL database, a NoSQL database, file storage, or object storage. To use this data for analytical purposes, it has to be ingested into storage suitable for running analytics workloads, such as a data warehouse or a data lake.

Data products in a mesh architecture are built on top of a data store suitable for analytical workloads. When building a mesh on Google Cloud Platform, we typically use BigQuery as the Data Store for the Mesh.

In this part of the series, we’ll look at one such scenario: the Spanner to BigQuery data ingestion pattern, and a self-service template for this pattern that different domains could use to build data products in BigQuery.

Ingestion Pattern — Spanner to BigQuery

Data ingestion from Spanner to BigQuery can be performed in one of three ways:

  1. Batch Mode — Bulk load of data from Spanner to BigQuery at regular intervals
  2. Streaming Mode — Streaming change data from Spanner to BigQuery in near real time
  3. Parallel Mode — Combination of bulk load and streaming CDC load

Let’s explore the pros and cons of the above options and identify the best approach to templatize the Spanner to BigQuery ingestion pattern.

Batch Mode — Bulk Load of Data from Spanner to BigQuery

One very easy but inefficient approach to loading data from tables in a Spanner database into a BigQuery dataset is to perform a full-copy bulk transfer in a single load operation.

This can be achieved using Cloud Spanner federated queries for BigQuery. Its implementation requires a two-step approach:

  1. Query the Spanner information_schema views to read the table schema and create a corresponding table in BigQuery.
  2. Read the Spanner table data using a federated query and insert it into the BigQuery table.

The following example returns information about the columns in the table MyTable:

SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.example-db',
  '''SELECT t.column_name, t.spanner_type, t.is_nullable
  FROM information_schema.columns AS t
  WHERE
    t.table_catalog = ''
    AND t.table_schema = ''
    AND t.table_name = 'MyTable'
  ORDER BY t.ordinal_position
  ''');

The following example makes a federated query to a Spanner database named orders and joins the results with a BigQuery table named mydataset.customers.

SELECT c.customer_id, c.name, rq.first_order_date
FROM mydataset.customers AS c
LEFT OUTER JOIN EXTERNAL_QUERY(
  'my-project.us.example-db',
  '''SELECT customer_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY customer_id''') AS rq
ON rq.customer_id = c.customer_id
GROUP BY c.customer_id, c.name, rq.first_order_date;

This approach would require one of the following resources to be deployed through an IaC template:

  • Cloud Function with a script leveraging the BigQuery and external connection APIs and SQL queries for the data load (a minimal sketch of this option follows the list)
  • Cloud Run with a containerized web application internally leveraging the BigQuery and external connection APIs and SQL queries for the data load
  • Cloud Composer or Cloud Workflows for orchestrating the BigQuery and external connection API calls and the data load tasks
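
To make the first option concrete, here is a minimal sketch of such a Cloud Function, assuming an HTTP-triggered function using the google-cloud-bigquery client. The project, connection, and dataset names are the illustrative ones from the queries above; the function name and request payload shape are assumptions:

# main.py — minimal sketch of an HTTP-triggered Cloud Function that bulk-copies
# one Spanner table into BigQuery via a federated query. Project, connection,
# and dataset names are illustrative placeholders.
from google.cloud import bigquery

def load_spanner_table(request):
    table = request.get_json()["table"]  # e.g. "MyTable"
    client = bigquery.Client(project="my-project")
    # CREATE OR REPLACE keeps repeated full loads idempotent.
    sql = f"""
    CREATE OR REPLACE TABLE `my-project.mydataset.{table}` AS
    SELECT *
    FROM EXTERNAL_QUERY(
      'my-project.us.example-db',
      'SELECT * FROM {table}');
    """
    client.query(sql).result()  # block until the load job finishes
    return f"Loaded {table}", 200

The same query could equally be issued from Cloud Run or as a task in a Composer DAG; only the deployment wrapper changes.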

The IaC template for this approach would vary based on the choice of service from above options.

Pros:

  • Loading full data every time is very simple to implement.
  • This approach does not have any dependency on a specific timestamp field for incremental data identification.

Cons:

  • For large tables in Cloud Spanner, loading full data every time can be slow and costly.
  • Writes to Spanner would need to be paused while a full load is in progress; otherwise, writes made during the load could be missed.

Streaming Mode — Streaming Spanner CDC to BigQuery

Another way to move Cloud Spanner data to BigQuery is to capture the change data from write operations, i.e., the INSERTs, UPDATEs, and DELETEs on the tables in the Spanner database, and stream them to BigQuery tables.

The change data from Spanner can be captured using Spanner change streams. A change stream watches a Cloud Spanner database and streams out its data changes (inserts, updates, and deletes) in near real time. The setup and configuration procedure is covered in the Spanner change streams documentation.
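
As a concrete example, a change stream is created with plain Spanner DDL. Below is a minimal sketch using the google-cloud-spanner client; the instance, database, table, and stream names are illustrative assumptions:

# Hypothetical setup: create a change stream over the orders table.
# Instance, database, and stream names are assumptions for illustration.
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("example-instance").database("example-db")

# CREATE CHANGE STREAM is ordinary Spanner DDL; FOR ALL would watch every table.
operation = database.update_ddl(
    ["CREATE CHANGE STREAM orders_stream FOR orders"]
)
operation.result()  # wait for the schema change to complete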

Once the change stream is configured, it captures data change records. These records can be streamed to BigQuery to load and synchronize the data between the Spanner and BigQuery tables. This can be achieved using Dataflow with the Google-provided template ‘Cloud Spanner change streams to BigQuery’.

The Cloud Spanner change streams to BigQuery template is a streaming pipeline that streams Cloud Spanner data change records and writes them into BigQuery tables using Dataflow Runner V2.

This approach would require all of the following resources to be deployed through an IaC template:

  • Cloud Spanner instance, database, tables, and change streams, specified as DDL statements.
  • Cloud Dataflow streaming job using the Google-provided template ‘Cloud Spanner change streams to BigQuery’ (see the launch sketch after this list).
  • BigQuery dataset to be used as the target for the data change records streamed by the Dataflow job. The tables are auto-created by the Dataflow job (implemented within the template).
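
The streaming job can be launched programmatically via the Dataflow Flex Template launch API, which is also what an IaC wrapper would call under the hood. The sketch below is hedged: the template path and parameter names follow the public documentation for the ‘Spanner_Change_Streams_to_BigQuery’ template but should be verified against the current template reference, and all resource names are the illustrative ones used earlier:

# Hypothetical launcher for the Google-provided Flex Template
# "Cloud Spanner change streams to BigQuery"; all names are assumptions.
from googleapiclient.discovery import build

def launch_cdc_pipeline(project="my-project", region="us-central1"):
    dataflow = build("dataflow", "v1b3")
    body = {
        "launchParameter": {
            "jobName": "spanner-cdc-to-bq",
            "containerSpecGcsPath": (
                f"gs://dataflow-templates-{region}/latest/flex/"
                "Spanner_Change_Streams_to_BigQuery"
            ),
            # Parameter names as documented for the template; verify before use.
            "parameters": {
                "spannerInstanceId": "example-instance",
                "spannerDatabase": "example-db",
                "spannerMetadataInstanceId": "example-instance",
                "spannerMetadataDatabase": "metadata-db",
                "spannerChangeStreamName": "orders_stream",
                "bigQueryDataset": "mydataset",
            },
        }
    }
    request = dataflow.projects().locations().flexTemplates().launch(
        projectId=project, location=region, body=body
    )
    return request.execute()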

Pros:

  • This does not require any query federation or external connections.
  • The data change records from Spanner are streamed to BigQuery in near real time.
  • This approach is independent of the number of records and the size of the table in Spanner, and is easily scalable.

Cons:

  • This would only capture changes from the point in time when the Spanner change streams are enabled. Historical data cannot be streamed to BigQuery.

Parallel Mode — Combination of Bulk Load and Streaming CDC Load

Considering the pros and cons of the above two approaches, we can eliminate the cons of each by combining them, i.e., a combination of the batch full load and the streaming CDC load to fetch data from Spanner and load it into BigQuery. This pairs a one-time full data load from Spanner to BigQuery using federated queries with a streaming CDC load running in parallel.

Addressing the Cons:

  • Since this is a one-time load rather than a recurring full load, it incurs the time and cost only once, during the first full load cycle.
  • By enabling the change stream before the history load starts, writes that occur during the history load are captured in parallel, addressing the issue of records being missed during a full load.

The IaC template for this approach would include two simultaneous sub-pipelines: batch full load and streaming CDC load. This is essentially the combination of the IaC for the first and second approaches, driven by one common master configuration file.
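
Such a master configuration might look like the following sketch, written here as a plain Python dict for illustration (in practice it could be a YAML file or Terraform variables); every value is a placeholder:

# Hypothetical master configuration shared by the batch and streaming
# sub-pipelines; all values are illustrative placeholders.
MESH_INGESTION_CONFIG = {
    "project": "my-project",
    "region": "us-central1",
    "spanner": {
        "instance": "example-instance",
        "database": "example-db",
        "tables": ["orders"],
        "change_stream": "orders_stream",
    },
    "bigquery": {
        "dataset": "mydataset",
        "connection": "my-project.us.example-db",  # federated query connection
    },
}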

The order of deployment and execution could be the following (a condensed end-to-end sketch follows the list):

  1. Enable Spanner change streams.
  2. Deploy the batch pipeline for the full load.
  3. Deploy the streaming pipeline for the CDC load. The CDC load is auto-initiated as soon as the Dataflow job is provisioned.
  4. Initiate the batch pipeline for the history load (so that changes are captured and loaded in parallel with the full load).
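
Condensed into a single script that reuses the launch_cdc_pipeline helper from the streaming sketch above (all names remain illustrative), the sequence looks like this:

# Hypothetical end-to-end sequence for the parallel mode; resource names are
# the placeholders used throughout this post.
from google.cloud import bigquery, spanner

PROJECT = "my-project"

# Enable the change stream first, so no write can slip past unobserved.
client = spanner.Client(project=PROJECT)
db = client.instance("example-instance").database("example-db")
db.update_ddl(["CREATE CHANGE STREAM orders_stream FOR orders"]).result()

# Start the streaming CDC pipeline; it begins consuming change records
# as soon as the Dataflow job is provisioned.
launch_cdc_pipeline(project=PROJECT, region="us-central1")

# Only now run the one-time federated backfill; writes arriving while it
# runs are already being captured by the change stream.
bigquery.Client(project=PROJECT).query(
    "CREATE OR REPLACE TABLE `my-project.mydataset.orders` AS "
    "SELECT * FROM EXTERNAL_QUERY('my-project.us.example-db', "
    "'SELECT * FROM orders')"
).result()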

Pros:

  • Efficient approach: extract only new changes and merge them with existing data in BigQuery.
  • Easy to replicate large tables with high throughput (many inserts or updates written per second), with low-to-medium replication lag and near real-time behavior in the long run.

Summary

In the previous part of this series, ‘Build Data Product in Data Mesh using Templates’, we discussed using templates as one aspect of self-service for building and publishing data products in the mesh.

In this part, we discussed one such requirement: an ingestion pipeline template for the Spanner to BigQuery pattern, the different data loading approaches, and a few script snippets and IaC template sketches that can serve as a base for building ingestion capabilities for the mesh.

In further parts of this series, we will cover a few more ingestion pipelines, such as moving data from Cloud SQL to BigQuery, text files in Cloud Storage to BigQuery, Pub/Sub to BigQuery, and/or Apache Kafka to BigQuery.
