dbt Incremental: Choosing the Right Strategy — P1

Alice Bui
Joon Solutions Global
4 min readSep 13, 2023

Incremental materialization is an advanced and powerful feature of dbt. However, it’s important to note that you don’t need to use it in every single model of your project. Understanding when and how to leverage incremental materialization can be a game-changer for your data workflows.

This blog post will delve into the benefits and trade-offs of incremental and full refresh approaches. Our goal is to guide you through the decision-making process, helping you maximize the efficiency of your data models.

1. Incremental Strategies & When to Use

1.1. Incremental vs. Full Refresh

When using the full refresh strategy, dbt will discard the current destination table and create a new one from the entire source transformed data. On the other hand, with incremental models, you can transform and insert only the recent data into your tables.

When choosing between Incremental and Full Refresh, there are several factors to consider:

The tradeoff between complexity and processing cost/time:

  • Full Refresh has low complexity; you don’t need to worry about any incremental rules and configurations, but rebuilding the whole table can take more time and cost more money. However, the transformations are not very costly if the table is not large (a few million rows or less). There is no problem in sticking to full refresh.
  • Incremental is helpful if data tables have millions, or even billions, of rows or the transformations on the data are computationally expensive (that is, take a long time to execute), for example, complex Regex functions, or UDFs are being used to transform data.

The data change frequency:

  • If the data in the model doesn’t change that much, you don’t take much advantage of the incremental strategies, you can make periodic full refreshes.
  • But suppose the historical data of the model changes a lot. The incremental strategy will not help capture the change because it only worries about recent data. In that case, a full refresh could be a better option.

1.2. Incremental Strategies

1.2.1. Append

The append strategy is very straightforward; it just inserts the selected records into the destination table. It can’t update or delete records, just insert. This strategy is suitable when duplicates are not a concern.

[Source: Understanding dbt Incremental Strategies part 1/2] Append

1.2.1. Merge & Merge with clustered

The merge strategy solves the problem of duplicate records. If the unique key already exists in the destination table, the merge will update the record. And if the records don’t exist, merge will insert them.

[Source: Understanding dbt Incremental Strategies part 1/2] Merge

To check if the unique key of both tables matches, merge has to scan the whole destination table. Performing this full scan can be very costly. The destination table can be clustered to increase the merge performance and reduce costs. In this way, performing a full scan on the destination table will not be necessary.

[Source: Understanding dbt Incremental Strategies part 1/2] Merge with clustered

1.2.3. Delete+insert

The delete+insert strategy is very similar to merge, but instead of updating existing records and inserting new records, it deletes existing records and inserts both new and existing records.

[Source: Understanding dbt Incremental Strategies part 1/2] Delete+insert

1.2.4. Insert+overwrite (with partitioned)

This strategy solves the problem of a full scan. The solution used by insert overwrite is to work with partitions. For insert overwrite to perform the partition can be of the following types: date, datetime, timestamp, int64. But if the timestamp is skewed, Insert+overwrite is not an ideal solution.

The insert overwrite strategy deletes the selected partitions from the current destination table and inserts the selected transformed partitions into it.

[Source: Understanding dbt Incremental Strategies part 1/2] Insert+overwrite

However, it can generate duplicates if you do not set it right. A periodic full refresh solves this problem, but if you can’t wait for the entire refresh to run, you should use another column or consider using the merge strategy.

1.3. Decision Tree: Choosing the Right Strategy

Based on the pros and cons mentioned above and the cons of each strategy, I make the decision tree to help you decide the most optimal materialization.

Decision Tree for Choosing the Right Strategy

In practice, I think the 2 most optimal (avoiding full scan & duplicates) incremental strategies are:

- Merge with clustered

- Insert+overwrite with partitioned (only use when timestamp is not skewed)

Thanks for reading! In the next blog, I will share how I implement and test incremental in practice: dbt Incremental: Implementing & Testing — P2

--

--

Alice Bui
Joon Solutions Global

Analytics Engineer @ Joon Solutions | GDE, dbt, Looker, Airflow Certified