Looking back in time with DBT Snapshots

Daniël Mast
6 min read · Apr 14, 2023


Every data professional knows that source data is seldom ready for use without some form of preprocessing. One of our clients faced a similar situation when incorporating customer data from another part of the organization. The data was available in the central data warehouse and was updated every day to represent the current customer state, but the updated_at timestamp column didn’t accurately reflect when a record had been modified. Correspondence with the table’s data owner confirmed that the issue couldn’t be resolved at the source, so the consuming team needed a way to detect modified and deleted records in a mutable table. Within the existing tool stack, of which DBT was part, a perfect fit was available: DBT Snapshots. This post details how DBT Snapshots was used to detect and process changes in source tables.

What is DBT, and what does the Snapshots feature do?

DBT, or Data Build Tool, is an open-source command-line tool for building data transformation pipelines. It is becoming increasingly popular among data engineers and data analysts for its ability to transform and model data efficiently, while maintaining strong data governance and testing capabilities.

One of the key features of DBT is its ability to create Snapshots. DBT Snapshots are a way to capture a point-in-time view of your data and store it as a table in your data warehouse. This can be incredibly useful for creating data reports and analytics, as it allows you to analyze how data has changed over time and compare different versions of the same data. Snapshots can also be used to create historical records of data, which can be useful for compliance and audit purposes. We will explore the various use cases for Snapshots and provide step-by-step instructions on how to implement them in your own DBT project.

Type-2 Slowly Changing Dimensions

DBT Snapshots implement Type-2 slowly changing dimensions over mutable source tables. Type-2 slowly changing dimensions, also known as Type-2 SCDs, are a method of tracking changes to dimensional data over time. In data warehousing and data modeling, dimensions are attributes used to describe the characteristics of data. For example, in a sales data warehouse, product, customer, and time are all common dimensions.

A slowly changing dimension is one that — you guessed it — changes slowly over time. For example, a customer’s address might change occasionally, but not every day. Type-2 SCD is used to track these types of changes in the dimensional data.

In Type-2 SCD, a new record is created every time a change is made to a slowly changing dimension. Each record includes a start date and an end date that mark the period during which its values were valid. This enables historical analysis: you can see the value of an attribute at any point in time.

For example, let’s say a customer moved from Amsterdam to Rotterdam. Instead of updating the original customer record, a new record is created with a start date of when the customer moved to Rotterdam, and the old record receives an end date marking when the Amsterdam address stopped being valid. This allows for reporting on the customer’s address over time, and enables analysis of customer behavior based on their location.

Example of a Type-2 SCD methodology
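The figure above can be sketched as a small table. With hypothetical values, the move from Amsterdam to Rotterdam produces two records for the same customer:

```
customer_id | customer_city | valid_from | valid_to
------------+---------------+------------+-----------
1001        | Amsterdam     | 2020-01-01 | 2022-06-15
1001        | Rotterdam     | 2022-06-15 | NULL
```

The open-ended NULL in valid_to marks the currently active record.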

Type-2 SCDs are a powerful tool for data warehousing and data modeling, as they allow for accurate reporting and analysis of data that changes over time.

Let’s see how DBT Snapshots implement Type-2 SCDs.

snapshots/customers.sql

{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('cms', 'customers') }}

{% endsnapshot %}

This example expects the customers table to contain an updated_at column holding a timestamp that is updated whenever the record changes. This is the simplest change detection strategy.
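The snapshot references the table through source('cms', 'customers'), so the source must be declared somewhere in the project. A minimal sketch, with schema and file names assumed from the snippet:

```yaml
# models/sources.yml — minimal declaration backing the snapshot's
# source('cms', 'customers') call; the schema name is hypothetical.
version: 2

sources:
  - name: cms
    schema: cms
    tables:
      - name: customers
```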

To create the snapshot table, run:

dbt snapshot --select customers_snapshot

But what if the source table does not have such a column, or the timestamps in the column are unreliable? In that case, the check strategy can be used:

snapshots/customers_check.sql

{% snapshot customers_check_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['customer_name', 'customer_city']
    )
}}

select * from {{ source('cms', 'customers') }}

{% endsnapshot %}

For each record, the check strategy compares the values in check_cols against the most recent snapshot record, and creates a new snapshot record whenever any of them has changed.

Detecting deleted records

By default, Snapshots does not detect records that have been deleted from the source table. So if a customer is completely removed, the snapshot table will not update that customer’s records. This behavior can, however, be enabled with the invalidate_hard_deletes config:

snapshots/customers_hard_deletes.sql

{% snapshot customers_hard_deletes_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['customer_name', 'customer_city'],
        invalidate_hard_deletes=True,
    )
}}

select * from {{ source('cms', 'customers') }}

{% endsnapshot %}

When our customer is removed, the snapshot table will look as follows:

Example of the snapshot table after a hard delete
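As a hypothetical reconstruction, building on the earlier Amsterdam-to-Rotterdam example, the snapshot table would now hold only closed records for this customer:

```
customer_id | customer_city | dbt_valid_from | dbt_valid_to
------------+---------------+----------------+--------------
1001        | Amsterdam     | 2022-01-01     | 2022-06-15
1001        | Rotterdam     | 2022-06-15     | 2023-04-01
```

The Rotterdam record, previously the current one with dbt_valid_to = NULL, is closed at the time of the snapshot run that noticed the deletion.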

The dbt_valid_to column of the customer’s most recent snapshot record was NULL, and is updated to the timestamp of the snapshot run. If you want to search for deleted customers in this snapshot table, use the following query:

with latest as (
    select *
    from customers_snapshot
    qualify row_number() over (
        partition by customer_id
        order by dbt_valid_from desc
    ) = 1
)
select *
from latest
where dbt_valid_to is not null;

This query picks the most recent record per customer, and checks if it has a dbt_valid_to date. If it does, it means that the customer is deleted.
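The inverse filter is just as useful: a downstream model that exposes only the current state of each customer. A minimal sketch, with a hypothetical model name:

```sql
-- models/customers_current.sql (hypothetical model name).
-- Records that have not been invalidated are the current ones;
-- hard-deleted customers drop out, since their last record is closed.
select *
from {{ ref('customers_snapshot') }}
where dbt_valid_to is null
```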

Best practices and pitfalls

To successfully use DBT Snapshots, it is important to take a few guidelines into account.

Firstly, think about when the snapshots should be created. Most data pipelines use some form of scheduling to update the models periodically; for example, an Airflow job runs dbt run on all the models in the project. When those models use snapshots as a source, make sure to run dbt snapshot in the same scheduled job, right before dbt run.
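In the scheduled job (an Airflow BashOperator, a cron entry, or a dbt Cloud job), that ordering boils down to a sketch like this:

```shell
# Refresh the snapshot tables first, so the models built by
# `dbt run` read the latest change history.
dbt snapshot && dbt run
```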

Secondly, decide which columns to select from the source. select * may feel like bad practice, but the dbt docs actually encourage it, because you never know which column’s history you will need to access later. DBT Snapshots also handles schema changes in the source gracefully: new columns are added to the snapshot table automatically (already existing records get NULL for that column), and when a column is removed from the source table, it remains present in the snapshot table (future records get NULL for that column).

Thirdly, make sure that the updated_at column that you select contains a valid timestamp, and never contains NULL. Otherwise it will lead to unexpected results.
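If the column can be NULL, one workaround is to repair it in the snapshot’s select before the timestamp strategy sees it. A sketch, where created_at is a hypothetical fallback column:

```sql
-- Inside the snapshot block: fall back to the (hypothetical)
-- created_at column when updated_at is missing, so the timestamp
-- strategy always has a value to compare.
select
    customer_id,
    customer_name,
    customer_city,
    coalesce(updated_at, created_at) as updated_at
from {{ source('cms', 'customers') }}
```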

Fourthly, make sure to define a reliable unique_key in the config. If the source table does not have a unique key, construct one in the select statement, or make sure only one record in a group of duplicates is selected (e.g. using QUALIFY).
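For example, if the source table’s key is really the combination of two columns, a surrogate key can be built inside the snapshot’s select (table and column names below are hypothetical):

```sql
-- Inside the snapshot block, with unique_key='order_line_key' in the
-- config. Table and column names are hypothetical.
select
    order_id || '-' || line_number as order_line_key,
    *
from {{ source('cms', 'order_lines') }}
-- If true duplicates exist, keep only the most recently updated one:
qualify row_number() over (
    partition by order_id, line_number
    order by updated_at desc
) = 1
```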

The last best practice should not come as a surprise: write tests! Anything that can be expressed as a SQL query can be used as a test in DBT. For example, write a test to make sure the unique key in your source or snapshot table is actually unique.
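For the uniqueness example, dbt’s built-in schema tests already cover it without custom SQL. A sketch, declaring the tests on the source table’s key column:

```yaml
# Schema tests on the source's key column, run with `dbt test`.
version: 2

sources:
  - name: cms
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
```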

Conclusion

DBT Snapshots is a powerful feature for capturing changes in mutable source tables, and building up a history of those changes. Snapshots proved to be the perfect solution for the challenge at our client. Along with the other features of DBT, this can greatly benefit the transformations in your data warehouse.
