BigQuery MERGE Optimization through Partition Pruning

Anas Aslam
Google Cloud - Community
3 min read · Jan 10, 2023

Let's understand partition pruning using the MERGE statement.

A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
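
For illustration, here is a generic MERGE skeleton (the table and column names are placeholders of my own, not part of the example that follows) showing how the update, insert, and delete branches can sit in a single statement:

MERGE target_table T
USING source_table S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET T.value = S.value                 -- update rows that exist in both tables
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (S.id, S.value)    -- insert rows that exist only in the source
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;                                      -- delete target rows missing from the source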

BigQuery MERGE

Example:

Consider a product table with product_id as the primary key, partitioned by launch_date.
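
For context, the two tables in this example could be created along these lines (the dataset name and column types are assumptions on my part, not taken from the original setup):

-- product_id is the logical primary key; BigQuery does not enforce key constraints
CREATE TABLE mydataset.product_tgt (
  product_id INT64,
  product_price FLOAT64,
  launch_date DATE
)
PARTITION BY launch_date;

CREATE TABLE mydataset.product_stg (
  product_id INT64,
  product_price FLOAT64,
  launch_date DATE
)
PARTITION BY launch_date;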

The example query below merges products from the product_stg (staging) table into the product_tgt (target) table. If a product is already present in the target table, the query updates its product_price field; otherwise, the query inserts the new product into the target table.

Assume that the product_stg table contains products whose launch_dates fall into several partitions.

Original SQL:

MERGE product_tgt T
USING product_stg S
ON T.product_id = S.product_id
   AND T.launch_date = S.launch_date
WHEN MATCHED THEN
  UPDATE SET T.product_price = S.product_price
WHEN NOT MATCHED THEN
  INSERT (product_id, product_price, launch_date) VALUES (S.product_id, S.product_price, S.launch_date);

Here, the partition filter is dynamic, which means that BigQuery would need to compute the relevant partitions at runtime from the launch_date values in the staging table.

However, BigQuery cannot prune partitions based on dynamic predicates, i.e. values computed during query execution, so the MERGE ends up scanning every partition of the target table.

The query can be optimized as shown below:

Optimized SQL:

DECLARE src_range STRUCT<date_min DATE, date_max DATE>;

SET src_range = (SELECT STRUCT(MIN(launch_date) AS date_min, MAX(launch_date) AS date_max) FROM product_stg);

MERGE product_tgt T
USING product_stg S
ON T.product_id = S.product_id
   AND T.launch_date >= src_range.date_min
   AND T.launch_date <= src_range.date_max
WHEN MATCHED THEN
  UPDATE SET T.product_price = S.product_price
WHEN NOT MATCHED THEN
  INSERT (product_id, product_price, launch_date) VALUES (S.product_id, S.product_price, S.launch_date);

BigQuery prunes partitions based on static predicates. Hence, we precompute date_min and date_max in a separate query and provide them as static values in the MERGE SQL.

The min/max technique can be applied to both date-based and integer-range partition columns.
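
As a sketch of the integer case, assuming a hypothetical pair of tables that are integer-range partitioned on product_id (for example, PARTITION BY RANGE_BUCKET(product_id, GENERATE_ARRAY(0, 10000000, 100000))), the same pattern would look like this:

DECLARE src_id_range STRUCT<id_min INT64, id_max INT64>;

SET src_id_range = (SELECT STRUCT(MIN(product_id) AS id_min, MAX(product_id) AS id_max) FROM product_stg);

MERGE product_tgt T
USING product_stg S
ON T.product_id = S.product_id
   AND T.product_id >= src_id_range.id_min    -- static bounds allow pruning of integer-range partitions
   AND T.product_id <= src_id_range.id_max
WHEN MATCHED THEN
  UPDATE SET T.product_price = S.product_price
WHEN NOT MATCHED THEN
  INSERT (product_id, product_price, launch_date) VALUES (S.product_id, S.product_price, S.launch_date);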

Performance Results:

Let's take some sample data for the staging and target tables and compare the query scan bytes for both approaches.

Consider a staging table containing 1 million records that need to be merged into a target table containing 7 million records. Both tables are partitioned by launch_date with product_id as the primary key.
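
If you would like to reproduce a similar setup, test data along these lines could work (the row counts match, but the date spread and pricing are illustrative assumptions of mine, not the exact data behind the screenshots below):

-- ~7 million target rows spread across 70 daily launch_date partitions
CREATE OR REPLACE TABLE mydataset.product_tgt
PARTITION BY launch_date AS
SELECT
  product_id,
  ROUND(RAND() * 100, 2) AS product_price,
  DATE_ADD(DATE '2022-01-01', INTERVAL MOD(product_id, 70) DAY) AS launch_date
FROM (
  SELECT a * 1000 + b AS product_id
  FROM UNNEST(GENERATE_ARRAY(0, 6999)) AS a,
       UNNEST(GENERATE_ARRAY(1, 1000)) AS b
);

-- ~1 million staging rows: the products from a 10-day slice of launch_dates, with new prices
CREATE OR REPLACE TABLE mydataset.product_stg
PARTITION BY launch_date AS
SELECT
  product_id,
  ROUND(RAND() * 100, 2) AS product_price,
  launch_date
FROM mydataset.product_tgt
WHERE launch_date < DATE '2022-01-11';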

staging table count before merge
target table count before merge

Approach 1: Unoptimized Merge

Approach 1: Scanned Bytes (183 MB)
Approach 1: Execution Details

Approach 2: Optimized Merge

Approach 2: Scanned Bytes (30 MB)
Approach 2: Execution Details
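
If you prefer to pull these numbers programmatically rather than reading them off the console, a query along these lines against INFORMATION_SCHEMA should work (assuming the jobs ran in the US region within the last day):

SELECT
  job_id,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE statement_type = 'MERGE'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;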

Conclusion:

As we can clearly see from the two approaches, the bytes scanned, bytes shuffled, slot time consumed, and elapsed time for Approach 1 are all higher than for Approach 2.

This is because BigQuery prunes partitions based on static predicates rather than dynamic predicates.

This concept applies not only to MERGE operations but also to other operations such as joins and filters, as sketched below.
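
Here is a minimal sketch of that idea for a plain join, reusing the same two tables (my own illustration, not from the benchmark above): precompute the staging date range, then filter the partitioned side of the join with it.

DECLARE src_range STRUCT<date_min DATE, date_max DATE>;

SET src_range = (SELECT STRUCT(MIN(launch_date) AS date_min, MAX(launch_date) AS date_max) FROM product_stg);

-- The static range on the partition column lets BigQuery prune product_tgt partitions
SELECT T.product_id, T.product_price, S.product_price AS staged_price
FROM product_tgt T
JOIN product_stg S
  ON T.product_id = S.product_id
 AND T.launch_date = S.launch_date
WHERE T.launch_date BETWEEN src_range.date_min AND src_range.date_max;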

I frequently write about Google Cloud technologies and optimization techniques. Feel free to follow me for future articles.

Appendix:

My other articles on BigQuery, for your reference:
