Configure Streams from a Partitioned table in PostgreSQL to a non-partitioned table in BigQuery using Datastream

Deepak Mahto
Google Cloud - Community
5 min read · Nov 14, 2022

This is part 2 of a series on solving real-world scenarios with Datastream for PostgreSQL and BigQuery; to check out all the curated scenarios, click here.

PostgreSQL offers enterprise-grade database features like partitioning. When streaming changes from partitioned tables using logical replication, configurations like publish_via_partition_root influence how changes are propagated to a subscriber. PostgreSQL supports declarative partitioning, where the root table is a virtual table that stores no data itself; all data resides in the individual partitions.
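By default the option is false and changes are published using the identity and schema of each individual leaf partition; when set to true, changes are published as if they were made to the root table. As a minimal sketch (pub_root is a hypothetical publication name; the walkthrough below creates the publication first and then enables the option with ALTER PUBLICATION), the setting can also be supplied at creation time:

-- Hypothetical publication: enable the option directly at creation time
CREATE PUBLICATION pub_root FOR ALL TABLES
  WITH (publish_via_partition_root = true);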

A common customer scenario is to merge data from all partitions of a PostgreSQL source table into a single table in BigQuery.

In today’s blog, we will walk through an approach to merge all changes from a partitioned table in PostgreSQL into a single table in BigQuery using the publish_via_partition_root publication option, and see how it captures data from newly added partitions on the source as well.

Let’s build a sample partitioned table in PostgreSQL and create the necessary publication and replication slot.

PostgreSQL partitioned table and Publication configuration

CREATE TABLE order_partition_pk(
  order_id bigint NULL,
  ordered_at timestamp(0) without time zone NULL,
  shipped_at timestamp(0) without time zone NULL,
  order_amount double precision NULL,
  customer_id varchar(255) NULL
) PARTITION BY RANGE (ordered_at);

-- A primary key on a partitioned table must include the partition key (ordered_at)
alter table order_partition_pk add constraint pk_order_partition_pk primary key(order_id, ordered_at);

CREATE TABLE order_partition_pk_2020 PARTITION OF order_partition_pk FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
CREATE TABLE order_partition_pk_2021 PARTITION OF order_partition_pk FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE order_partition_pk_2022 PARTITION OF order_partition_pk FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');

-- Populate 100,000 mock orders; ordered_at is spread across the previous ~700 days,
-- so rows land in all three partitions
insert into order_partition_pk
select col1,
  date_trunc('day', clock_timestamp()::timestamp(0) without time zone - interval '1 day' * mod(col1, 700)),
  clock_timestamp()::timestamp(0) without time zone + interval '1 day' - interval '1 day' * mod(col1, 700),
  col1 * 0.33,
  col1::text
from generate_series(1, 100000) as col1;

create publication pub_partition for all tables;

SELECT pg_create_logical_replication_slot('rep_slot_partition', 'pgoutput');
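We can confirm the slot exists by querying the pg_replication_slots view:

SELECT slot_name, plugin, slot_type, active
FROM pg_replication_slots;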

We will alter the publication to set publish_via_partition_root to true.

alter publication pub_partition set (publish_via_partition_root=true);

ds_sample=> \dRp
                                  List of publications
     Name      |  Owner   | All tables | Inserts | Updates | Deletes | Truncates | Via root
---------------+----------+------------+---------+---------+---------+-----------+----------
 pub_partition | postgres | t          | t       | t       | t       | t         | t
(1 row)
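The same flag is exposed as pubviaroot in the pg_publication catalog, so it can also be verified with plain SQL:

SELECT pubname, puballtables, pubviaroot
FROM pg_publication;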

The partitioned table is now created and populated across its three partitions.
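To see how the mock data is distributed, tableoid::regclass reports the partition each row is physically stored in:

-- Row counts per physical partition of order_partition_pk
SELECT tableoid::regclass AS partition_name, count(*) AS row_count
FROM order_partition_pk
GROUP BY 1
ORDER BY 1;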

Create a Stream for the partitioned table in Datastream with BigQuery as target.

We have already configured the necessary connection profiles, and will now create a stream in Datastream with the partitioned PostgreSQL table as source and BigQuery as target.

Datastream public preview — PostgreSQL to BigQuery

While selecting tables for the stream, we intentionally ignore the partitions and select only the parent (root) table. The partitions should not be selected, to avoid discrepancies at the BigQuery target between the initial backfill and continuous changes caused by the publish_via_partition_root setting: the backfill would load each selected partition as its own table, while CDC changes arrive under the root table's name.

After creating and starting the stream, we can see that the initial backfill has completed.

Datastream Backfill completed

Apply DML on the partitioned table in PostgreSQL and validate in BigQuery.

We will apply DML on the partitioned table and, later, add a new partition with mock data. First, let's apply an INSERT and an UPDATE on the PostgreSQL source.

-- Insert 1,000 additional orders
insert into order_partition_pk
select col1,
  date_trunc('day', clock_timestamp()::timestamp(0) without time zone - interval '1 day' * mod(col1, 700)),
  clock_timestamp()::timestamp(0) without time zone + interval '1 day' - interval '1 day' * mod(col1, 700),
  col1 * 0.33,
  col1::text
from generate_series(100001, 101000) as col1;

-- Update the amount on the newly inserted orders
update order_partition_pk
set order_amount = order_amount + 0.33
where order_id in (select generate_series(100001, 101000));
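A quick source-side count confirms the 1,000 new rows before comparing with the target:

SELECT count(*) FROM order_partition_pk WHERE order_id > 100000; -- expect 1000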

We can verify from Datastream console that new DML events were captured.

Datastream events

A quick check in the BigQuery console can confirm it.

BigQuery — SQL Checks
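For reference, a validation query along the lines of the check above (using this walkthrough's dataset path) would be:

SELECT count(*) AS replicated_rows
FROM `databaseproj-366912.ds_public_partitionpublic.order_partition_pk`
WHERE order_id > 100000;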

Next, we will create a new partition on the published partitioned table and apply DML to it.

ds_sample=> CREATE TABLE order_partition_pk_2019 PARTITION OF order_partition_pk FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
CREATE TABLE
ds_sample=> insert into order_partition_pk
select col1,
dd,
dd + interval '1 day' ,
col1*0.33,
col1::text
from generate_series(101000,101005) as col1 , generate_series
( '2019-01-01'::timestamp
, '2019-12-31'::timestamp
, '1 day'::interval) dd;
INSERT 0 2190
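On the source side, we can confirm that all 2,190 rows landed in the new partition:

SELECT count(*) FROM order_partition_pk_2019; -- expect 2190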

Once again, we can verify from the Datastream console that the new DML events were captured.

A quick check in the BigQuery console confirms it.

SELECT * FROM `databaseproj-366912.ds_public_partitionpublic.order_partition_pk` where order_id > 100999
BigQuery SQL Checks.

Conclusion

If we have a scenario where a partitioned table in PostgreSQL as source must be consolidated into a single table in BigQuery as target, we can use the publication option publish_via_partition_root to achieve it. To avoid any discrepancy, it is advisable not to select the partitions of the concerned table while creating the stream.
All scenarios were tested on the public preview release of Datastream with support for PostgreSQL as source and BigQuery as target.

Further Reading

Check out part 3 of Solving real world scenarios on Datastream for PostgreSQL and BigQuery, which covers “Configure Streams in DataStream with predefined tables in BigQuery for PostgreSQL as source”.


Deepak Mahto
Google Cloud - Community

Database Migration Expert - Enabling success with PostgreSQL on Cloud.