Configure Datastream to capture only certain DMLs from PostgreSQL as a source.

This is part 1 of a series on solving real-world scenarios with Datastream for PostgreSQL and BigQuery. If you are interested in the full set of curated scenarios, check them out here.

PostgreSQL is one of the most widely adopted managed databases, and many customers need to stream transaction changes directly to BigQuery. On Google Cloud, continuous streams from Cloud SQL for PostgreSQL have been a common ask from customers looking to consolidate analytics on a downstream platform like BigQuery.
With its recent releases, Datastream for BigQuery provides initial backfill and continuous replication from PostgreSQL-compatible sources using logical replication. With Datastream, integration between Cloud SQL and BigQuery becomes a simple, serverless, and scalable process. It is currently in public preview, ready to try out for different use cases that stream changes from transactional workloads to BigQuery.

Insert, update, and delete are the common changes on transactional tables that need to be included in streams. But we might have a functional requirement to stream only inserts, or only inserts and updates, for a specific list of tables. Currently, Datastream does not offer a built-in feature to selectively capture data manipulation events and publish them to BigQuery.

In this blog we will walk through how to configure a stream for specific events, such as insert only, or insert and update only, on a specific list of tables using PostgreSQL's logical replication features.

Let's configure Cloud SQL for PostgreSQL as the source with a sample set of tables.

Below is a summary of the tables and the DML events that need to be published for each.
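As a minimal sketch, assume three sample tables, one per publish pattern (the table names and columns here are hypothetical, for illustration only):

```sql
-- Hypothetical sample tables; names and columns are assumptions for illustration.
CREATE TABLE orders    (id serial PRIMARY KEY, item text, qty int);      -- publish: insert only
CREATE TABLE payments  (id serial PRIMARY KEY, amount numeric);          -- publish: insert + update
CREATE TABLE customers (id serial PRIMARY KEY, name text, email text);  -- publish: all DML
```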

The Datastream documentation details how to configure sources and enable logical decoding for managed instances.

To enable selective publishing, we will create three different publications in the source PostgreSQL database and add tables to each according to the publish behavior required, i.e. insert only, or insert and update only.

Create Publication and configure publish properties

Below is a summary of the publications created and the tables added to them from the psql console. We can add multiple tables to the same publication, and their changes will be published according to the publication's publish properties.
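The publications can be created with PostgreSQL's `publish` parameter controlling which operations are emitted. A sketch, using the hypothetical table names from the sample setup (publication and table names are assumptions):

```sql
-- Publication that emits inserts only
CREATE PUBLICATION insertonly_pub FOR TABLE orders
  WITH (publish = 'insert');

-- Publication that emits inserts and updates only
CREATE PUBLICATION insertupdateonly_pub FOR TABLE payments
  WITH (publish = 'insert, update');

-- Publication that emits all DML (the default: insert, update, delete, truncate)
CREATE PUBLICATION alldml_pub FOR TABLE customers;

-- Additional tables can be added to an existing publication later:
-- ALTER PUBLICATION insertonly_pub ADD TABLE another_table;
```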

Each stream created by Datastream fetches changes from an active replication slot using the pgoutput plugin. Next, we will create three different replication slots.

Create Replication slot for each publication
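Each slot is created with the `pgoutput` logical decoding plugin that Datastream consumes from. A sketch from the psql console (slot names are assumptions, chosen to match the publications above):

```sql
-- One logical replication slot per stream, all using the pgoutput plugin
SELECT pg_create_logical_replication_slot('insertonly_slot', 'pgoutput');
SELECT pg_create_logical_replication_slot('insertupdateonly_slot', 'pgoutput');
SELECT pg_create_logical_replication_slot('alldml_slot', 'pgoutput');
```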

We have already configured the necessary connection profiles and enabled the Cloud SQL Auth Proxy on GCE (Google Compute Engine) for connectivity to Cloud SQL as the source.

Next, we will create three different streams, one per publish pattern, and complete the initial backfill of data.

Create Streams in Datastream with PostgreSQL as source and BigQuery as target
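Before creating each stream, it can be useful to verify from psql that the slots and publications are in place, since Datastream asks for the replication slot and publication names during stream creation. A quick check against the system catalogs:

```sql
-- Slots should list the pgoutput plugin; 'active' becomes true once a stream attaches
SELECT slot_name, plugin, active FROM pg_replication_slots;

-- Confirm which operations each publication emits
SELECT pubname, pubinsert, pubupdate, pubdelete FROM pg_publication;
```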

Let's check how the different streams push changes to BigQuery based on the publish events configured for each table.
Stream 1 (insertonly-stream) is subscribed to inserts only. Let's apply all DMLs on the source side and check what is consumed on the target side.
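For example, a round of all three DMLs applied at the source might look like this (table and column names are the hypothetical ones from the sample setup):

```sql
-- All three DML types fired at the source; only the insert should reach stream 1
INSERT INTO orders (item, qty) VALUES ('book', 2);
UPDATE orders SET qty = 3 WHERE item = 'book';
DELETE FROM orders WHERE item = 'book';
```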

Though we pushed all DMLs, only insert events were delivered to Datastream and the BigQuery target for stream 1.

Stream 2 (insertupdateonly-stream) is subscribed to inserts and updates only. Let's apply all DMLs on the source side and check what is consumed on the target side.

Though we pushed all DMLs, only insert and update events were delivered to Datastream and the BigQuery target for stream 2.

The last stream, stream 3, is subscribed to all events and will capture deletes as well as inserts and updates.

Learning

With PostgreSQL publications, we can alter which events are published based on the functional requirement, with BigQuery as the target via Datastream. A single publication can include multiple tables, publishing specific DMLs based on its configuration.
Since we created multiple replication slots for these custom changes, we should test for any overhead or impact on the source, with the worker settings for logical replication publishers (such as max_replication_slots and max_wal_senders) tuned as needed.
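If the functional requirement changes later, the publish properties of an existing publication can be altered in place rather than recreating it (publication name here is the hypothetical one from the earlier sketch):

```sql
-- Broaden an insert-only publication to also emit updates
ALTER PUBLICATION insertonly_pub SET (publish = 'insert, update');
```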

Further Reading

Check out part 2 of Solving real-world scenarios on Datastream for PostgreSQL and BigQuery, which covers "Configure Streams from Partition table in PostgreSQL to non partition in BigQuery using Datastream."


A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.
