Understanding dbt Incremental Strategies part 1/2

Bruno Souza de Lima
Indicium Engineering
10 min read · Aug 31, 2022


In this post I will talk about dbt incremental strategies: how they work and what the pros and cons of each one are. In the next article, I will explain how to implement incremental strategies in your models in practice.

Using incremental models you can transform and insert into your tables only recent data, reducing (massively, depending on the size of the table) processing costs and time. However, before we dive in, you have to ask yourself if you really need to use incremental models. Incremental materialization is an advanced and very powerful feature of dbt, but you don’t need to use it in each and every model of your project. There are some points to take into account so you don’t change your model configuration and add a bunch of Jinja conditions to your code for nothing.

Should I use incremental?

As I said at the beginning, you don’t need to use incremental in each and every model. In many cases, the full refresh strategy can be a better option, so let’s take a few lines to talk about it. Full refresh is the default, and most common, process for transforming data in your DW. If you do not define your materialization as incremental, your table will be rebuilt entirely on each run.
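
For comparison, a full-refresh model needs no special logic at all. A minimal sketch, assuming a hypothetical 'shop.raw_orders' source and made-up column names, could look like this:

```sql
-- models/orders.sql
-- Full refresh: the table is dropped and rebuilt from scratch on every run
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('shop', 'raw_orders') }}
```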

Let’s assume we have the source table as shown below, and it holds data up to 2022–02–02. We want to update our destination table, which has data up to 2022–01–04. Using the full refresh strategy, dbt will first discard the current destination table (1) and then create a new destination table from all of the transformed source data (2).

Figure 1 — Full refresh example

It might sound like an inefficient transformation process, since it can spend a lot of time and processing power rebuilding the table entirely, but it has its advantages.

First of all, in full refresh mode, there is a clear tradeoff between complexity and processing cost/time. It has low complexity: you don’t need to worry about any incremental rules or configurations, and you can be sure the destination table will receive all transformed data. On the other hand, you spend more time and money rebuilding the whole table. If your table is not very large (let’s say a few million rows or less), the transformations are not very costly, and you don’t have to worry about cost and time, there is no problem in sticking to full refresh.

Regarding the change frequency of your data: if the data in the model doesn’t change that much, you don’t gain much from the incremental strategies and can just run periodic full refreshes. And if old data in your model changes a lot, incremental strategies will struggle to capture it, because they only care about recent data, so a full refresh could be a better option.

Returning to the low complexity point, and adopting a production point of view, full refresh can be used to make the destination table available to the final user more quickly, generating value as soon as possible. That way the final user is free to validate the transformation and give feedback, asking for changes or corrections. Besides that, having a transformed table in production gives the Analytics Engineer a more precise understanding of the DB: how and how frequently its data is updated, and how much data volume is processed in each run. These are very important points to consider in an optimization step of the model. In the optimization step, strategies are evaluated to reduce the cost, processed volume, or processing time of the transformations. This can be done in numerous ways, such as removing columns that don’t generate value or changing the model materialization.

We can summarize the key points of full refresh as:

  • Main advantages:
    - Implementation simplicity
    - Quicker value generation
    - A guarantee that all data will be inserted into the destination table, no matter what the DB update rules are
  • Main disadvantages:
    - High processing cost
    - High processing time
  • When to use:
    - When you don’t need to worry about cost and time (small tables or tables with simple transformations)
    - When data is more static than dynamic
    - When historical data changes frequently

But… if you need to reduce costs or make your transformation faster, or if your table receives new data frequently and you feel comfortable using dbt, you should try incremental models.

If you are still a dbt beginner or do not know the dbt commands very well, check out my article about dbt commands: https://medium.com/indiciumtech/17-dbt-commands-you-should-start-using-today-581998dbf8f0.

Incremental Models

The incremental materializations aim to reduce processing time and cost by transforming and inserting only the most recent data. For dbt to know which data is recent, your table should have a column of one of the following types: date, datetime, timestamp, or int64. The model will use the columns of the source and destination tables to filter which data to transform and insert, update, or delete.

A very important point to remember is that incremental strategies will not capture modifications made to old records. So, a good practice is to periodically run your table in full refresh mode to update historical data.

It is also good to mention that your table will run incrementally if three conditions are met: (1) the destination table already exists in the database, (2) dbt is not running in full-refresh mode (you are not using the --full-refresh flag), and (3) the running model is configured with materialized='incremental'.
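
To make this concrete, here is a minimal sketch of an incremental model (the 'shop.raw_orders' source and the column names are hypothetical). The is_incremental() macro returns true only when the three conditions above are met, so the filter is applied only on incremental runs:

```sql
-- models/orders_incremental.sql
{{ config(materialized='incremental') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('shop', 'raw_orders') }}

{% if is_incremental() %}
-- on incremental runs, only transform rows newer than what is already in the destination table
where order_date > (select max(order_date) from {{ this }})
{% endif %}
```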

dbt offers you four types of incremental strategies:

  • append
  • merge
  • delete+insert
  • insert_overwrite

The availability of each strategy depends on the adapter you are using. The dbt documentation lists three adapters along with the strategies they support (a sketch of how the strategy is selected follows the list).

  • Snowflake: merge (default), delete+insert (optional)
  • BigQuery: merge (default), insert_overwrite (optional)
  • Spark: append (default), insert_overwrite (optional), merge (optional, Delta-only)
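
For reference, the strategy is selected with the incremental_strategy config inside the model; if it is omitted, the adapter’s default from the list above is used. A minimal sketch:

```sql
-- explicitly choosing a strategy; the value must be one your adapter supports
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge'
) }}
```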

Let’s talk about each one of them.

Append

The append strategy is very straightforward: it just takes the selected records and inserts them into the destination table. It can’t update or delete records, only insert. You should only use this strategy if duplicates aren’t a problem for you. Append does not worry about duplicates: it will not check whether a record already exists in the destination table, it will just insert the duplicated record.
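
A minimal sketch of an append model, assuming an adapter that supports the append strategy (for example Spark, per the list above) and a hypothetical 'shop.raw_events' source:

```sql
-- models/events_append.sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'append'
) }}

select
    event_id,
    event_type,
    created_at
from {{ source('shop', 'raw_events') }}

{% if is_incremental() %}
-- no duplicate check is performed: every selected row is simply inserted,
-- so rows at the boundary can end up duplicated in the destination table
where created_at >= (select max(created_at) from {{ this }})
{% endif %}
```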

So, let’s imagine we have the same tables as in the full refresh example, and suppose today is 2022–02–02. If we select the records from today and yesterday (1), the rows will be inserted into the destination table (2) and we will end up with 2 duplicated rows.

Figure 2 — Append example

The key points of append are:

  • Main advantages:
    - Simple and straightforward
    - Low processing cost as it does not require scanning the destination table
    - Low processing time
  • Main disadvantages:
    - It can’t update or delete rows, just insert
    - Duplicates!!
  • When to use:
    - When you just want to insert rows and duplicates are not a problem
    - When you are working with append-only tables

Merge

The merge strategy solves the duplicated records problem. It can handle duplicated records if you specify a unique key (which may be composed of one or more columns). If the unique key already exists in the destination table, merge will update the record, so you will not have duplicates. And if the records don’t exist, merge will insert them.

To check whether the unique keys of both tables match, merge has to scan the whole destination table, as well as the selected portion of the source table. Performing this full scan makes this strategy very costly.

If you use the merge strategy without specifying a unique key, you are basically doing an append. However, if you use BigQuery, for example, it is mandatory to use a unique key with merge.
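
A sketch of how a merge model is typically configured (the order_id key, the 'shop.raw_orders' source, and the columns are hypothetical):

```sql
-- models/orders_merge.sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'order_id'   -- records with a matching order_id are updated instead of duplicated
) }}

select
    order_id,
    customer_id,
    status,
    updated_at
from {{ source('shop', 'raw_orders') }}

{% if is_incremental() %}
-- select only recent records; matching keys are updated, new keys are inserted
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```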

Figure 3 — Merge example

Going back to our example, suppose today is 2022–02–02 and we want to merge the records from today and yesterday. dbt first gets the selected records specified by the where clause (1). Then it performs the scans (2); if the records exist in the destination they are updated, and if they don’t exist they are inserted (3).

To increase the performance of merge and reduce costs, the destination table can be clustered. In this way, it will not be necessary to perform a full scan on the destination table.
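
On BigQuery, for example, this can be done with the cluster_by config (a sketch; clustering on the merge key helps the engine prune the data scanned by the merge):

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'order_id',
    cluster_by = 'order_id'   -- cluster the destination table on the merge key
) }}
```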

Figure 4 — Merge clustered example

The key points of merge are:

  • Main advantages:
    - Can handle duplicates
    - Works well for small tables (less than a few million rows)
  • Main disadvantages:
    - High processing cost as it has to scan the whole destination table
  • When to use:
    - When you have small tables
    - When you don’t want to have duplicates
    - When you want to run incrementally and your records are constantly updated

Delete+insert

The delete+insert strategy is very similar to merge, but instead of updating existing records and inserting new ones, it deletes the existing records that match the selection and then inserts all selected records, both new and existing. Since it deletes the matching records from the destination table, you also don’t get duplicates.
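
Configuration-wise it looks just like merge, only with a different strategy name. A sketch for an adapter that supports it, such as Snowflake (the key name is hypothetical):

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'delete+insert',
    unique_key = 'order_id'   -- existing rows matching this key are deleted before the new batch is inserted
) }}
```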

Figure 5 — Delete+insert example

(Yes, I know, I am using this example for the whole article :), but I think this makes it easier to understand)

With delete+insert, dbt will get the selected records (1) and perform a scan on them and a full scan on the destination table to compare unique keys (2). Then it will delete (3) the existing records and insert the selected ones (4).

I can’t say much more about this strategy because I haven’t found much documentation or any benchmarks using it, but it will probably have a high cost because of the full scan.

Insert overwrite

The last incremental strategy I will talk about is insert overwrite. This strategy solves the full scan problem by working with partitions.

Partitioning a table means dividing it into segments. For insert overwrite to work, the partition column must be of one of the following types: date, datetime, timestamp, or int64, and the granularity can be hour, day, month, or year. If you partition by int64, you have to specify a range. You must specify the column whose values will be used to form the partitions; each partition holds the records that share the same granularity value.
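
On BigQuery, for example, the partition is declared with the partition_by config; for an int64 column the range is declared roughly like this (the column and the bounds are hypothetical):

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
        'field': 'customer_id',
        'data_type': 'int64',
        'range': {'start': 0, 'end': 100000, 'interval': 1000}   -- required when partitioning by int64
    }
) }}
```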

The insert overwrite strategy deletes the selected partitions from the current destination table and inserts the selected transformed partitions into it.

Figure 6 — Insert overwrite example

In our example above, the destination table is partitioned by day (to improve performance, the source table can also be partitioned). It is then necessary to indicate which partitions will be replaced in the destination table; they are defined by a config parameter called ‘partitions’ (we will see it in more detail in the next post). In this example we want to replace the partitions from today and yesterday, supposing today is 2022–02–02.

We can use a where clause in our model specifying that we want the records whose date is in our ‘partitions’ parameter, so the selected records will be from today and yesterday (1).
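
A sketch of how this example could be expressed on BigQuery, following the static-partitions pattern from the dbt documentation (the 'shop.raw_events' source and column names are hypothetical; the full implementation is for the next post):

```sql
-- models/events_insert_overwrite.sql
{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 1 day)'
] %}

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'event_date', 'data_type': 'date'},
    partitions = partitions_to_replace
) }}

select
    event_id,
    event_type,
    event_date
from {{ source('shop', 'raw_events') }}

{% if is_incremental() %}
-- (1) select only the records that belong to the partitions being replaced (today and yesterday)
where event_date in ({{ partitions_to_replace | join(',') }})
{% endif %}
```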

The partitions are then found in the destination table and deleted as part of a merge statement (2). As the table is partitioned, a full scan is not necessary. This is the main difference between insert overwrite and merge, and it drastically reduces the volume of processed data.

Finally, the selected records are inserted into the destination table as part of a merge statement (3).

Except for the use of partitions, this process looks like the delete+insert strategy; however, delete+insert uses two separate statements (delete and insert), while insert overwrite uses a single merge statement with a constant false predicate.
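
The generated statement has roughly this shape (a simplified sketch of the compiled SQL, not the exact output of the adapter):

```sql
merge into destination_table as dest
using (/* selected, transformed records */) as src
on false   -- never matches, so no row-by-row key comparison (and no full scan) is needed
when not matched by source and dest.event_date in (/* selected partitions */)
    then delete   -- (2) drop the current contents of the selected partitions
when not matched
    then insert (event_id, event_type, event_date)
         values (src.event_id, src.event_type, src.event_date)   -- (3) insert the selected records
```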

Insert overwrite is a high-complexity strategy, and if you do not set it up correctly it can generate duplicates. Let’s imagine we have an employee table and we use the insert overwrite strategy based on the updated_at column.

Figure 7 — Problems with insert overwrite

If employee 1 is updated one month later, its new updated_at value falls into a recent partition while the old record stays in an old partition that is not replaced, so this employee will be inserted again into the destination table. A periodic full refresh solves this problem, but if you can’t wait for the full refresh to run, you should use another column or consider using the merge strategy.

We can summarize the key points of insert overwrite as:

  • Main advantages:
    - Low processing cost
    - Low processing time
  • Main disadvantages:
    - High complexity
    - Can generate duplicates if you do not set it right
  • When to use:
    - When you have large dynamic datasets
    - When cost and processing time are becoming a problem

Strategies comparison

Here is a comparison of the different strategies; delete+insert is not included because I couldn’t find enough information about it. You can see a more detailed benchmark of the incremental strategies at this link: https://discourse.getdbt.com/t/benchmarking-incremental-strategies-on-bigquery/981.

Figure 8 — Strategies Comparison

Final words

This is the end of the first part of the post about incremental strategies in dbt. Now you should have a clearer idea of when to use incremental models and when not to. Moreover, you know how each strategy works and when to use each one. In the next article I will show you how to implement incremental strategies in practice.

I hope you liked this post! The best way to learn about something is by trying to explain it to someone else, and that is what I am trying to do. Comment if you have any questions, suggestions, or even corrections to the text!

Check out the official documentation about incremental models below:

References:

  • dbt documentation on incremental models: https://docs.getdbt.com/docs/build/incremental-models


https://www.linkedin.com/in/brunoszdl/ #dbt #sql #snowflake #bigquery #databricks #analytics #analyticsengineer #data #elt