A review of the Google Datastream Service - Cloud SQL (PostgreSQL) to BigQuery

Keven Pinto
Appsbroker CTS Google Cloud Tech Blog
8 min read · Mar 23, 2023

Datastream introduced the ability to seamlessly replicate data from Cloud SQL to BigQuery late in 2022. Given that (near) real-time replication of operational databases is one of the most common use cases we observe as part of our client requirements, we decided to review this service and put it through its paces. We were quite interested in observing how the stream would behave in the following scenarios:

  • DMLs on Tables without Primary Keys
  • Zonal Failover
  • Regional Failovers
  • DMLs on Partitioned Tables
Datastream

For the purpose of this test, we are looking at the Cloud SQL (PostgreSQL 14) to BigQuery integration via Datastream.

Before we proceed further, we’d like to state that this service is a Pre-GA offering — Beta in layperson speak. This means that any features or quirks we may point out in this article may or may not exist in the end product. We also strongly advise a review of the product documentation closer to the time of implementation as features may have changed, evolved or been dropped.

Our Test Setup

Architecture

Our setup comprised the following services:

  • An HA (two-zone) Cloud SQL instance using the PostgreSQL 14 database engine in Region 1
  • A read-replica Cloud SQL instance in Region 2
  • A Datastream stream in Region 1 reading from our Cloud SQL instance in Region 1 and writing to a single-region BigQuery dataset in the same region, with the staleness limit for this replication set to 0 seconds
  • A second BigQuery dataset in Region 2 being refreshed at regular intervals using the BigQuery dataset copy service… please refer to my other blog for some of my observations on this service before committing to it
  • A VM hosting our Cloud SQL Auth Proxy
  • Private Google Access enabled on our VPC subnet, as our Cloud SQL instance has a private IP only
  • A Python function running from my desktop that writes to our database every n seconds via a tunnel through to the Cloud SQL Auth Proxy

For details on how to set up your Datastream, kindly refer to my other blog; here is a link to the sample repo from that post.
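As a reminder of what the source side needs, here is a minimal sketch of the PostgreSQL objects Datastream expects on the Cloud SQL instance. All names (publication, slot, user, password) are placeholders, and the cloudsql.logical_decoding flag must also be enabled on the instance; the linked blog and the product documentation cover the full steps.

-- Minimal source-side setup for Datastream (all names are placeholders)
-- The Cloud SQL flag cloudsql.logical_decoding must also be set to "on"

-- Publication and logical replication slot that the stream will read from
CREATE PUBLICATION datastream_publication FOR ALL TABLES;
SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('datastream_slot', 'pgoutput');

-- A dedicated user for Datastream with replication and read access
CREATE USER datastream_user WITH REPLICATION LOGIN PASSWORD 'a-strong-password';
GRANT USAGE ON SCHEMA data_schema TO datastream_user;
GRANT SELECT ON ALL TABLES IN SCHEMA data_schema TO datastream_user;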

Test 1 — Write to a table with a Primary Key defined in PostgreSQL

create table data_schema.test_datastream(
  myid SERIAL PRIMARY KEY,
  somecol varchar(100),
  log_date TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

We started by simply writing some records to Postgres and waiting for these to appear in BigQuery.
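The writes themselves were nothing fancy; something along these lines (the values are purely illustrative):

-- simple test writes; myid and log_date are populated by their defaults
INSERT INTO data_schema.test_datastream (somecol) VALUES ('hello datastream');
INSERT INTO data_schema.test_datastream (somecol) VALUES ('another row');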

This works a treat; our data is streaming to BigQuery!

Test 2 — Simulate Failover of PostgreSQL in Region 1

For this test, we manually initiated a failover on the instance from the console; kindly refer here for details on how to achieve this. Our Python routine kept attempting to write to Cloud SQL throughout the failover.

Observations:

  1. Our instance seamlessly moves from europe-west2-c to europe-west2-a
  2. Writes from the Python routine time out during the failover but then begin to get committed to the database once the failover completes
  3. Datastream complains for a bit and then automatically resets the connection once the failover completes
  4. Records in BigQuery and Cloud SQL match

We ran the failover a few times back and forth, and each time the stream came back by itself without the need for any manual intervention. Zonal failovers for Cloud SQL and Datastream are seamless. All going well so far!
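After each failover we ran a quick row-count comparison on both sides; a sketch of the checks (the BigQuery project and dataset names are placeholders and will differ in your environment):

-- PostgreSQL side
SELECT COUNT(*) FROM data_schema.test_datastream;

-- BigQuery side (project and dataset names are placeholders)
SELECT COUNT(*) FROM `my-project.datastream_dataset.test_datastream`;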

Test 3: Add a Table with no PK and NO Replica Identity

create table data_schema.test_datastream_nopk(
  myid SERIAL,
  somecol varchar(100),
  log_date TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

Observations:

  • This table is not replicated to BigQuery; this is in line with the PostgreSQL requirements stated here

Test 4: Alter table with No PK and add a Replica Identity to it

ALTER TABLE data_schema.test_datastream_nopk REPLICA IDENTITY FULL;

The purpose of a Replica Identity as per the PostgreSQL docs…

“This form changes the information which is written to the write-ahead log to identify rows which are updated or deleted. In most cases, the old value of each column is only logged if it differs from the new value; however, if the old value is stored externally, it is always logged regardless of whether it changed. This option has no effect except when logical replication is in use.”
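To confirm the change has taken effect, the table's replica identity can be read back from the catalog; a quick check, not part of the Datastream setup itself:

-- d = default (primary key), n = nothing, f = full, i = index
SELECT relname, relreplident
FROM pg_class
WHERE relname = 'test_datastream_nopk';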

Observations:

  • This table is now replicated to BigQuery

Test 5: Test the delete feature

delete from data_schema.test_datastream where myid=3;
delete from data_schema.test_datastream_nopk where myid=3;

Observations:

  • These records are correctly dropped from BigQuery as well

Test 6: Test the Update feature


update data_schema.test_datastream set somecol='newval' where myid=1;
update data_schema.test_datastream_nopk set somecol='newval' where myid=1;

Observations:

  • The PK table in PostgreSQL and BigQuery remains in sync
  • The update to the non-PK table shows up in BigQuery as a new row. Please note that setting a table to a REPLICA IDENTITY of FULL is discouraged… here is an excerpt from the PostgreSQL docs:

“If the table does not have any suitable key, then it can be set to replica identity “full”, which means the entire row becomes the key. This, however, is very inefficient and should only be used as a fallback if no other solution is possible”

New Row in BQ
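A quick way to spot these duplicates on the BigQuery side (the project and dataset names below are placeholders):

-- rows whose myid now appears more than once after the update
SELECT myid, COUNT(*) AS row_count
FROM `my-project.datastream_dataset.test_datastream_nopk`
GROUP BY myid
HAVING COUNT(*) > 1;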

Test 7: DML on Partitioned Tables

The purpose of database partitioning is to provide support for very large tables and indexes by splitting them into smaller pieces.

Partitioning in PostgreSQL is declarative; this means the partitioned table itself is virtual and has no storage of its own. The data is stored in the individual tables attached to it. This behaviour is quite different from other popular databases like Oracle, which support horizontal partitioning wherein the individual partitions are managed by the table itself.

It is important to understand this difference in implementation because it has an impact on what you see at the other end of your stream.

Let’s take a look at the definitions of a partitioned table in Oracle vs PostgreSQL to better understand this.

-- ORACLE
CREATE TABLE data_schema.test_datastream_part
(
  myid NUMBER GENERATED BY DEFAULT AS IDENTITY,
  somecol varchar(100),
  log_date TIMESTAMP (2) NOT NULL,
  CONSTRAINT pk_test_datastream_part PRIMARY KEY(myid, log_date)
)
PARTITION BY RANGE(log_date)
(
  PARTITION p2021 VALUES LESS THAN (TO_DATE('01/01/2022','DD/MM/YYYY')) TABLESPACE TB1,
  PARTITION p2022 VALUES LESS THAN (TO_DATE('01/01/2023','DD/MM/YYYY')) TABLESPACE TB2,
  PARTITION p2023 VALUES LESS THAN (TO_DATE('01/01/2024','DD/MM/YYYY')) TABLESPACE TB3
);
-- POSTGRESQL
CREATE TABLE data_schema.test_datastream_part
(
  myid SERIAL,
  somecol varchar(100),
  log_date timestamp(0) without time zone NULL
)
PARTITION BY RANGE(log_date);

ALTER TABLE data_schema.test_datastream_part ADD CONSTRAINT pk_test_datastream_part PRIMARY KEY(myid, log_date);

CREATE TABLE data_schema.test_datastream_part_2021 PARTITION OF data_schema.test_datastream_part FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE data_schema.test_datastream_part_2022 PARTITION OF data_schema.test_datastream_part FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE data_schema.test_datastream_part_2023 PARTITION OF data_schema.test_datastream_part FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

In Oracle, the partition table holds the partition definition as well as the individual partitions. There is no need to define individual tables separately and attach these to the main partition table.

In PostgreSQL, the partitioned table is virtual; data is held in the individual tables attached to it. Each of these tables needs to be defined separately, i.e. data_schema.test_datastream_part_2021, data_schema.test_datastream_part_2022 and data_schema.test_datastream_part_2023.
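For our test, each yearly partition was loaded with 10 rows; something along these lines (a sketch, the values are illustrative):

-- 10 rows per year; PostgreSQL routes each row to the matching partition
INSERT INTO data_schema.test_datastream_part (somecol, log_date)
SELECT 'row ' || g, make_timestamp(yr, 6, 1, 0, 0, 0) + (g || ' hours')::interval
FROM generate_series(2021, 2023) AS yr,
     generate_series(1, 10) AS g;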

So, why am I telling you all this?

The reason is that this behaviour of PostgreSQL has an impact on the stream. If we consider the example from above, where each PostgreSQL partition table holds 10 records, one will see 4 tables in BigQuery rather than just 1. These 4 tables will be as follows:

  • test_datastream_part with 30 records
  • test_datastream_part_2021 with 10 records
  • test_datastream_part_2022 with 10 records
  • test_datastream_part_2023 with 10 records

In short, we now have data being duplicated in BigQuery.

This situation can be remedied by setting publish_via_partition_root=true on your PostgreSQL publication, as shown below.
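In practice this is a one-liner on the publication that Datastream reads from (the publication name below is a placeholder; yours was created as part of the stream setup):

-- publish changes via the partition root so BigQuery only sees the parent table
ALTER PUBLICATION datastream_publication SET (publish_via_partition_root = true);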

For more details on this subject, I’d like to signpost you to this excellent blog.

Test 8: Simulate a Regional Failover

Promote Replica

This step involves the following:

  • Promoting the read replica in Region 2 to primary
  • Shutting down the original primary SQL instance to simulate a service outage
  • Pointing the Auth Proxy container to the newly promoted replica

docker run -d -p 5432:5432 gcr.io/cloudsql-docker/gce-proxy:1.33.1 /cloud_sql_proxy -instances=cloudsqlpoc-demo:europe-west2:demo-europe-west1=tcp:0.0.0.0:5432 -ip_address_types=PRIVATE

Kindly refer to this document for more details on this process.

Please note that once a read replica is promoted, it no longer has any connection to the original instance. It is now a separate instance.

Observations:

  • The promoted replica does not contain a replication slot; this behaviour is explained in detail here
  • The stream is broken and will not self-restore as we observed in the zonal failovers. The reason is that this is a completely different instance without a replication slot, and the stream does not have a connection to it; the source-side objects have to be recreated, as sketched below
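On the newly promoted primary this is easy to verify, and the objects from the source-side setup sketch earlier have to be recreated before a new stream can attach (names are placeholders):

-- the promoted replica carries no logical replication slot across
SELECT slot_name, plugin, active FROM pg_replication_slots;

-- recreate the publication and slot before configuring a new stream
CREATE PUBLICATION datastream_publication FOR ALL TABLES;
SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('datastream_slot', 'pgoutput');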

Summary

  • Streams are a great way to implement change data capture between a transactional datastore (Postgres) and a warehouse without writing a line of code
  • Datastream accurately tracks inserts, updates and deletes and replicates these across to BigQuery
  • Source tables that do not have a Primary Key or a Replica Identity are ignored
  • Tables with a FULL Replica Identity have a limitation with tracking updates: updated records manifest themselves as new records in BigQuery
  • Datastream works seamlessly in the case of zonal failovers in a region; it restarts itself as soon as the instance fails over to a new zone
  • A stream must be paused in a regional failover, as the promoted read replica is a new instance and has no more ties to the old instance
  • In a regional failover, the stream being set up is a new stream and has no connection to the old stream. Make sure to set your backfill strategy accordingly, especially if data from BigQuery is being regularly synced to site B

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!
