Creating In-Place Delta Loads in BigQuery: A Step-by-Step Guide

Ahmed Shaaban
Google Cloud - Community
3 min read · Jul 24, 2024

Introduction

Delta loads are a common data engineering pattern where only changes to a dataset are loaded into a target system. This approach is efficient and can significantly improve performance compared to full data loads. In this post, we’ll explore how to implement delta loads in BigQuery using a combination of temporary tables, MERGE statements, and materialized views.

A key advantage of this approach is that it eliminates the need for separate delta load and complete data tables, simplifying the data management process.

Understanding the Code

Before diving into the code, let’s break down the logic:

1. Create a temporary table:

  • A temporary table tmp is created to hold the current state of the data.
  • Each row in tmp is serialized to a JSON string of the entire row (the hashed column), which serves as an efficient comparison key (see the short illustration after this outline).
  • A del_flag column is added to track deleted records.

2. Create or update the target table:

  • If the target_table doesn't exist, it's created with the same schema as tmp.
  • Existing records in target_table are marked as deleted by setting del_flag to 1.

3. Merge data into the target table:

  • A MERGE statement is used to reconcile the data in tmp with the target_table.
  • New records from tmp are inserted into target_table.
  • Records in target_table that are still present (unchanged) in tmp remain flagged with del_flag = 1, so they are excluded from the delta.

4. Create a materialized view for delta output:

  • A materialized view delta_mv is created to select only the changed records (where del_flag is not 1).
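
To make the hashed column concrete, here is a minimal, hypothetical illustration of the serialization that step 1 applies to every row (the id and name columns are invented for the example):

-- Each row is serialized to a compact JSON string that acts as its comparison key
SELECT TO_JSON_STRING(t) AS hashed, 0 AS del_flag, t.*
FROM (SELECT 1 AS id, 'Alice' AS name) AS t;
-- Result: hashed = '{"id":1,"name":"Alice"}', del_flag = 0, id = 1, name = 'Alice'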

Code Breakdown

SQL

-- Create a temporary table holding the current snapshot, with hashed and del_flag columns
CREATE TEMPORARY TABLE tmp AS
SELECT TO_JSON_STRING(t) AS hashed, 0 AS del_flag, t.*
FROM (SELECT * FROM logictable) AS t;

-- Create the target table if it doesn't exist, then flag all existing rows
CREATE TABLE IF NOT EXISTS target_table LIKE tmp;
UPDATE target_table SET del_flag = 1 WHERE del_flag <> 1;

-- Merge the snapshot into the target table:
-- new rows are inserted with del_flag = 0, matched rows keep del_flag = 1
MERGE target_table T
USING tmp S
ON FARM_FINGERPRINT(S.hashed) = FARM_FINGERPRINT(T.hashed)
WHEN NOT MATCHED THEN INSERT ROW
WHEN MATCHED THEN UPDATE SET del_flag = 1;

-- Create a materialized view exposing only the delta (rows with del_flag = 0)
CREATE MATERIALIZED VIEW IF NOT EXISTS delta_mv AS
SELECT * EXCEPT(del_flag, hashed) FROM target_table WHERE del_flag <> 1;

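After the script runs, downstream consumers simply read the materialized view, and the whole script can be re-run whenever a fresh snapshot of logictable is available, for example as a BigQuery scheduled query. A minimal read, assuming the objects created above:

-- Fetch only the rows that changed since the previous load
SELECT * FROM delta_mv;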

Key Points

  • The hashed column is crucial for efficiently comparing and updating records.
  • The del_flag column is used to track deleted records and filter them out in the materialized view.
  • The MERGE statement is a powerful tool for handling updates and inserts in a single operation.
  • Materialized views provide an efficient way to access the delta data.
  • The FARM_FINGERPRINT function: hashing the JSON string down to a 64-bit integer makes the join comparison cheaper than comparing long strings, which matters most for wide rows with large fields (see the example after this list).
  • No separate delta load or complete data tables are required: target_table always contains the complete data, while the materialized view exposes only the changes since the last load.
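
As a quick illustration of the hashing point above (the JSON string is invented for the example), FARM_FINGERPRINT collapses each serialized row to a single INT64, so the MERGE's join condition compares fixed-size integers instead of potentially very long strings:

-- A long JSON string becomes one 64-bit integer;
-- two rows match in the MERGE only if their fingerprints are equal
SELECT FARM_FINGERPRINT('{"id":1,"name":"Alice"}') AS comparison_key;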

Additional Considerations

  • Performance: For large datasets, consider partitioning the target table and materialized view to improve query performance (a rough sketch follows this list).
  • Data retention: Define a retention policy for the target table to avoid excessive data growth.
  • Error handling: Implement error handling mechanisms to ensure data integrity.
  • Incremental loads: For continuous data ingestion, you can schedule the delta load process to run periodically.
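
As a rough sketch of the partitioning suggestion, not part of the original script: a load_date column (an assumed addition) could be carried alongside the data so the target table can be date-partitioned instead of created via LIKE. Because load_date sits outside TO_JSON_STRING(t), it does not affect how rows match in the MERGE.

-- Hypothetical variation: carry a load_date column so the target can be date-partitioned
CREATE TEMPORARY TABLE tmp AS
SELECT TO_JSON_STRING(t) AS hashed, 0 AS del_flag, CURRENT_DATE() AS load_date, t.*
FROM (SELECT * FROM logictable) AS t;

-- Create the target partitioned by load_date; copy the schema only, the MERGE fills the data
CREATE TABLE IF NOT EXISTS target_table
PARTITION BY load_date
AS SELECT * FROM tmp WHERE FALSE;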

Conclusion

By following these steps and considering the additional points, you can effectively implement delta loads in BigQuery to improve data processing efficiency and reduce costs. This approach is particularly useful when dealing with large datasets that experience frequent changes. The elimination of separate delta load and complete data tables simplifies data management and reduces storage overhead.
