Configure Streams in Datastream with predefined tables in BigQuery for PostgreSQL as source.

Deepak Mahto
Google Cloud - Community
4 min read · Nov 18, 2022

This is part 3 of a series on solving real-world scenarios with Datastream for PostgreSQL and BigQuery. If you are interested in all the curated scenarios, check out the full series.

I did not anticipate that playing around with the public preview of streaming from PostgreSQL to BigQuery using Datastream would be so much fun. This is my third blog on the topic; if you want to join the ride, have a read of the first and second blogs on Datastream with PostgreSQL as a source.

What if we have a customer scenario where we need to replicate non-partitioned PostgreSQL tables into partitioned BigQuery tables, or alter column data types at the target? We will explore approaches to achieve this, but first let's build some more understanding of the tables created in BigQuery as part of streaming from PostgreSQL using Datastream.

The primary key is an important constraint for replication: it is used to capture changes and apply them at the target. In BigQuery, the primary key is transformed into clustering as part of the table definition. Let's take a sample table for reference and map how Datastream transforms a PostgreSQL table in BigQuery.

The PostgreSQL table has a primary key constraint defined on the order_id column.
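In case the screenshot does not render, the source table can be sketched as below. The column names are taken from the BigQuery definition shown later in this post; the exact PostgreSQL types are assumptions.

```sql
-- Assumed PostgreSQL source table (types inferred from the BigQuery
-- definition later in this post).
CREATE TABLE public.order_non_partition_onepkcolumn (
    order_id     bigint PRIMARY KEY,  -- primary key used by Datastream for change capture
    ordered_at   timestamp,
    shipped_at   timestamp,
    order_amount double precision,
    customer_id  varchar(255)
);
```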

Datastream transforms the table as shown below.

Take a look at both screenshots and map the details between the PostgreSQL and BigQuery tables. They include multiple data points; let's discuss each, following the numbering.

1. Primary key in PostgreSQL Table.

The primary key is transformed into clustering within the BigQuery table. Clustering defines the sort order of data within a table based on the specified columns; it improves query performance and reduces cost in BigQuery. A primary key on the same set of columns is also applied in BigQuery, but it is NOT ENFORCED.

2. Datastream Metadata.

Datastream adds additional metadata to events, i.e. changes pushed to BigQuery as part of the initial backfill and continuous replication. It includes a unique event ID and the timestamp of the event at the source.
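These metadata columns can be queried like any other field. A hypothetical query, assuming the table name from the DDL later in this post and that source_timestamp is stored as epoch milliseconds:

```sql
-- Inspect Datastream's per-event metadata alongside the row data.
SELECT
  order_id,
  datastream_metadata.uuid AS event_uuid,
  -- assumption: source_timestamp is epoch milliseconds
  TIMESTAMP_MILLIS(datastream_metadata.source_timestamp) AS source_ts
FROM `projectname.partition_test.public_order_non_partition_onepkcolumn`
LIMIT 10;
```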

3. Max Staleness

While creating a stream, we have the option to configure a staleness limit. It defines data freshness and has a direct impact on cost. Based on the staleness limit, Datastream creates the table and sets the max_staleness option on the BigQuery table.
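The staleness limit can also be adjusted after the table exists. For example (table name assumed from the DDL later in this post):

```sql
-- Relax data freshness to 15 minutes to reduce query cost.
ALTER TABLE `projectname.partition_test.public_order_non_partition_onepkcolumn`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```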

4. Table Name in BigQuery.

Datastream creates tables within a single dataset or in a dataset matching the schema of the source table. The table name is usually prefixed with the schema name, i.e. <<schemaname_table_name>>, within BigQuery, or follows the prefix set while creating the stream.

The reason to share all these details is to build an understanding of the BigQuery table created by Datastream, and to use it as a reference for overwriting the table definition if such a scenario arises.

If we need to define partitioning on BigQuery tables being replicated from a source using Datastream, we can achieve it by defining our own table definition in accordance with the BigQuery writer used by Datastream. In some cases we can overwrite column data types in BigQuery as well.

SQL snippet for reference, as a customized BigQuery table:

CREATE OR REPLACE TABLE `projectname.partition_test.public_order_non_partition_onepkcolumn`
(
  order_id INT64,
  ordered_at TIMESTAMP,
  shipped_at TIMESTAMP,
  order_amount FLOAT64,
  customer_id STRING(255),
  datastream_metadata STRUCT<uuid STRING, source_timestamp INT64>,
  PRIMARY KEY (order_id) NOT ENFORCED
)
PARTITION BY DATE(ordered_at)
CLUSTER BY order_id
OPTIONS(max_staleness = INTERVAL 5 MINUTE);

Overwriting the table definition might cause an exception and needs to be tested. The sample table DDL above was overwritten in BigQuery before starting the stream with PostgreSQL as the source, and it worked fine during my tests.

As an additional note, please be aware of Datastream failures that you might encounter while trying a custom BigQuery definition:

BIGQUERY_WRITER_WRITE_FAILED

BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE

These can be caused by incorrect table definitions or column data types overwritten in BigQuery. Refer to the table definition shared in this blog, and if issues persist, contact Google Cloud support.

Conclusion

Adding partitioning or altering column data types is a very common scenario when replicating to BigQuery using Datastream. With the learnings shared in this blog, we can overwrite the target table definition before starting the stream job and test the replication.

Further Reading

Check out the additional blogs in the series on solving real-world scenarios with Datastream for PostgreSQL and BigQuery.

Deepak Mahto
Google Cloud - Community

Database Migration Expert - Enabling success with PostgreSQL on Cloud.