Using Datastream as a Data Sync and DR tool

Keven Pinto
Appsbroker CTS Google Cloud Tech Blog
5 min read · Mar 14, 2023

In my last blog we looked at why near-real-time BigQuery data sync across two regions could not be achieved using Google's dataset copy service, and why we had to abandon that approach. In this blog, we cover the solution we eventually adopted to meet our objective.

Initial Architecture


Before we explore the final solution, let’s take a quick journey through the evolution of our architecture.

Our objective was a multi-region active-passive setup, with Region 2 acting as our DR failover site.

Region 1 Components:

  • An HA Cloud SQL PostgreSQL 14 database that is CMEK encrypted
  • Logical replication enabled on the PostgreSQL database (cloudsql.logical_decoding=on), complete with a replication slot and a publication created (a quick way to verify this is shown after this list)
  • A BigQuery dataset to hold data from our primary Cloud SQL database
  • A Datastream Stream that hydrates the BigQuery dataset with all changes happening on the Cloud SQL database (CDC, in short)
  • A Cloud SQL HA proxy to bridge the connection from Datastream to our Cloud SQL instance; more details can be found here
  • A BigQuery dataset copy service that refreshes our dataset in Region 2
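
As an aside, a quick way to sanity-check this part of the setup is to query the primary instance directly. This is a minimal sketch, assuming the cloudsql.logical_decoding flag has already been applied and that the slot and publication mentioned above have been created; the catalog views are standard PostgreSQL.

-- Confirm logical decoding is enabled (errors if the flag was never set)
SHOW cloudsql.logical_decoding;

-- Confirm the replication slot and publication the Stream will use exist
SELECT slot_name, plugin, active FROM pg_replication_slots;
SELECT pubname, puballtables FROM pg_publication;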

Region 2 Components:

  • A read replica of our primary Cloud SQL instance; the replica is encrypted with a different CMEK key to the primary, as it lives in a different region (a quick lag check for the replica is shown after this list)
  • A replica BigQuery dataset that acts as a failover for our Region 1 dataset
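
Not strictly part of the Datastream setup, but a lightweight way to keep an eye on the Region 2 read replica is to check its replay lag from a session on the replica itself. A minimal sketch using standard PostgreSQL recovery functions:

-- Run on the Region 2 read replica: how far behind the primary are we?
-- Returns NULL until some WAL has been replayed.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;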

As mentioned in my last blog, this approach was abandoned due to the limitations of that service. Please refer to that blog for more details.

Attempt 2


As part of attempt 2 we did the following:

  • We dropped the dataset copy service
  • We attempted to set up a second Stream from the read replica in Region 2 to the dataset in Region 2

This attempt failed due to the limitation documented here. Now, if you are wondering why we didn't read the docs carefully: well, we did. This update to the doc was kindly added by Google as soon as we raised the concern with them.

Final Architecture

I would love to say that the final solution appeared to me in a fever dream… a pure architectural epiphany. Sadly, that was not the case. It was sheer perseverance and belief from the team, along with support from the client, that got us to the final solution.

The Solution

The solution was actually quite simple:

  • Create two replication slots on the database
  • Create two corresponding publications on the database
  • Set up a Stream in Region 1 that reads from the first replication slot and pushes changes to the dataset in Region 1
  • Set up a Stream in Region 2 that reads from the second replication slot and pushes changes to the dataset in Region 2

Below is a copy of our SQL setup script, which creates the replication slots and publications. The script needs to be run by a user with superuser-level privileges; for Cloud SQL this is the postgres user that comes pre-installed on the instance (it holds the cloudsqlsuperuser role). We store all our tables in a schema called data_schema, and usage on this schema needs to be granted to the user used for replication (postgres).

DO
\$do$
DECLARE
  rs1 character varying(255);
  rs2 character varying(255);
BEGIN
  -- The user Datastream connects as needs the REPLICATION attribute
  ALTER USER ${user} WITH REPLICATION;

  -- One publication per Stream (Region 1 and Region 2)
  IF NOT EXISTS (SELECT 1 FROM pg_publication WHERE pubname = 'pub1') THEN
    RAISE NOTICE 'Creating Publication pub1';
    CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish = 'insert,update,delete');
  END IF;

  IF NOT EXISTS (SELECT 1 FROM pg_publication WHERE pubname = 'pub2') THEN
    RAISE NOTICE 'Creating Publication pub2';
    CREATE PUBLICATION pub2 FOR ALL TABLES WITH (publish = 'insert,update,delete');
  END IF;

  COMMIT;

  -- One logical replication slot per Stream, using the pgoutput plugin
  IF NOT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'rs1') THEN
    RAISE NOTICE 'Creating Replication Slot rs1';
    rs1 := (SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('rs1', 'pgoutput'));
  END IF;
  COMMIT;

  IF NOT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'rs2') THEN
    RAISE NOTICE 'Creating Replication Slot rs2';
    rs2 := (SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('rs2', 'pgoutput'));
  END IF;

  -- Give the replication user read access to the application schema
  CREATE SCHEMA IF NOT EXISTS ${schema};
  GRANT USAGE ON SCHEMA ${schema} TO ${user};
  GRANT SELECT ON ALL TABLES IN SCHEMA ${schema} TO ${user};
  ALTER DEFAULT PRIVILEGES IN SCHEMA ${schema} GRANT SELECT ON TABLES TO ${user};

END
\$do$;
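
One thing worth calling out with two slots on the same primary: each slot retains WAL until its Stream has consumed it, so a paused or failed Stream in either region will cause WAL to build up. A minimal monitoring sketch, assuming the slot names rs1 and rs2 from the script above:

-- How much WAL each slot is holding back on the primary
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name IN ('rs1', 'rs2');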

Accompanying Code

Here is a link to a sample repo that will help you set up all the assets shown in the final architecture; the code is written in Terraform using Google Cloud Fabric Modules. Please read the README in the repo for more details on how to set this up.

Start the Streams

The Streams in the accompanying code repo are deployed in the NOT_STARTED state. Start them manually once you have created your tables (each must have a primary key) and added some data to them.
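
For reference, here is a minimal sketch of what such a table could look like; the orders table and its columns are purely illustrative, the only real requirements being that it lives in data_schema and has a primary key:

-- Hypothetical table in the replicated schema, with a primary key
CREATE TABLE IF NOT EXISTS data_schema.orders (
    order_id   bigint PRIMARY KEY,
    customer   text NOT NULL,
    amount     numeric(10,2) NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- A couple of rows for the initial backfill to pick up
INSERT INTO data_schema.orders (order_id, customer, amount)
VALUES (1, 'acme-ltd', 100.00),
       (2, 'globex', 250.50);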

Streams Not Started
Stream Start warning

Final State

Running Streams

Once the Streams start, each runs a backfill process, and data is written to our BigQuery datasets in both regions.

Regional BigQuery Datasets
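
A simple way to compare the two regions is to run the same query against each regional dataset (they cannot be joined in a single query, as BigQuery queries do not span regions). The sketch below uses an illustrative project, dataset and table name, and assumes the default datastream_metadata column that Datastream adds to its BigQuery target tables:

-- Run once per regional dataset and compare the results
SELECT COUNT(*)                                  AS row_count,
       MAX(datastream_metadata.source_timestamp) AS latest_source_change
FROM `my-project.cdc_dataset_region1.orders`;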

That’s it… have a go!

Note: Given the business requirement outlined here, it might not be long before Google updates their offerings around cross-region dataset replication; until then, this approach works just as well.

Finally, thanks to my colleague @iamtheone.ankur for helping set up the VPC private connectivity and for acting as a great sounding board for all my ideas.


CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!
