Understanding dbt Incremental Strategies part 2/2

Bruno Souza de Lima
Indicium Engineering

--

In the last post, you were presented with the dbt incremental strategies, the advantages and disadvantages of each one, and when to use them. To summarize, we answered the “What?”, “Why?”, and “When?” questions.

This post assumes you have already read the first one, that you understand the incremental strategy concepts, and that you can answer the questions above by yourself. If you have not read it yet, you can check it out here.

In this second part of dbt incremental strategies, I will show you how to implement them in dbt. In other words, we will answer the “How?” question.

Let’s dive into it! 🏊

What is common for all strategies?

First of all, let’s see what is common for all incremental strategies, and which parts of the code you will use in all of them.

Remember what dbt checks to say a model is incremental. These three points are mandatory for all incremental strategies:

  • The table must already exist in the database. So the first time you run it, it will be in full refresh mode.
  • You are not running your model in full refresh mode. Simply do not use the --full-refresh flag in your dbt run command.
  • The materialized configuration of the model is set to ‘incremental’. Let’s see how to do it right now!

Configuring materialized = ‘incremental’

Configurations can be set in three different places: in a config() block in your model, in a .yml file, or in the dbt_project.yml file.

If you have conflicting configs, dbt will use the first one that appears in this order: config() block -> .yml file -> dbt_project.yml.

I suggest always configuring your incremental model in the config() block of your file. It allows you to set specific parameters per model, such as unique keys or partitions to replace.
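
For example, a minimal config() block for an incremental model could look like the sketch below (the file and source names are just placeholders):

```sql
-- models/orders_incremental.sql (hypothetical file name)
{{
    config(
        materialized='incremental'
    )
}}

select *
from {{ source('raw', 'orders') }}  -- hypothetical source
```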

What ELSE is common for all strategies?

There are a few more things relevant to all incremental strategies that are important to understand.

The ‘where’ condition

When you are working with any incremental strategy, you want to insert only a portion of the data.

You can filter this portion of the data inside the where condition. And there are two ways to do that:

  • Dynamically
  • Statically

The incremental model will significantly reduce the amount of data processed if you use the partitioning column of your source table in the where condition. That way, it is not necessary to perform a full scan of the source table to find the data that matches the condition.

Dynamically — Using the Jinja function {{ this }}

A dynamic ‘where’ condition means that you will look up the value for your ‘where’ condition in your existing table.

For example, let’s say you have an orders table in your database with a column named ‘order_datetime’.

And you have a source table with more recent data.

Today is 2022-09-26 and you want to incrementally insert data from your source table into your destination table. But you want to insert only the data in which the order_datetime is greater than the latest one in your destination table.

Let’s use values to simplify: in your destination table, the most recent data is from 2022-09-21 21:15:02. You want to run your model incrementally and insert only the orders with ‘order_datetime’ greater than 2022-09-21 21:15:02, in other words, only the orders with ‘order_datetime’ > max(‘order_datetime’) of your existing table.

So you have to reference the table itself in your model. You can do it using the jinja function {{ this }}.

In this way, dbt will select from the source table only the data that is fresher than the data in your existing (destination) table.
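
As a sketch, the model could look like this (the source name is a placeholder; what matters is the where clause):

```sql
{{ config(materialized='incremental') }}

select *
from {{ source('raw', 'orders') }}  -- hypothetical source

-- keep only rows newer than the newest row already in the destination table;
-- {{ this }} resolves to this model's own table in the database
where order_datetime > (select max(order_datetime) from {{ this }})
```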

But be cautious when choosing the columns you will check. If you have a late arrival in your table, like order_id = 7 in the table below, your model will not catch it.

So you had better change your incremental model to check the order_ingestion_datetime column instead.
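
Only the where clause changes:

```sql
-- compare on the ingestion timestamp so late-arriving records are still picked up
where order_ingestion_datetime > (select max(order_ingestion_datetime) from {{ this }})
```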

So, this way of passing the ‘where’ condition is called dynamic because you don’t know beforehand the value you are using for the condition; the SQL query will find it out for you dynamically.

Now let’s see the static way.

Statically — Passing a variable

To build your model with a static ‘where’ condition, you can use controlled values. Calling them static is a bit of a stretch, because they will actually change. But they don’t depend on the data in your destination table, and you can tell which value will be used just by looking at the code.

In this case, instead of comparing data from the source and destination tables, let’s say we want to insert the data from today.

When I say today, I mean the current date, the date when the model is being run. So it will not be the same value if you run it today or tomorrow.

Using the same table examples as before, if you want to insert data from today, you can write:
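
Something along these lines, in BigQuery SQL and assuming order_datetime is a timestamp column (the source name is a placeholder):

```sql
select *
from {{ source('raw', 'orders') }}  -- hypothetical source

-- keep only today's orders
where date(order_datetime) = current_date()
```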

If you want to insert data from today and yesterday, you can write:
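
Only the where clause changes:

```sql
-- today and yesterday
where date(order_datetime) >= date_sub(current_date(), interval 1 day)
```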

And if you want to insert data from today and the last week, you can write:
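
Following the same pattern:

```sql
-- today and the last 7 days
where date(order_datetime) >= date_sub(current_date(), interval 7 day)
```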

BUT WAIT…

This does not seem right: your code is getting too big, and if you want to use this condition in several models, you will have to write it all over again.

If you want to change the time window, you will have to change all your models…

So, a good practice is to put all of this inside a variable in your dbt_project.yml:
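
A sketch of what that could look like (the variable name incremental_filter is just an example):

```yaml
# dbt_project.yml
vars:
  # hypothetical variable holding the incremental filter used by the models
  incremental_filter: "date(order_datetime) >= date_sub(current_date(), interval 7 day)"
```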

And call the variable in your model using the var() Jinja function:
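
The var() function reads the value defined in dbt_project.yml:

```sql
select *
from {{ source('raw', 'orders') }}  -- hypothetical source

-- the condition now lives in dbt_project.yml and is injected here
where {{ var("incremental_filter") }}
```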

Now the code looks much cleaner!

Using the function is_incremental()

Using the examples above, if you try to run this model in full refresh mode, you CAN’T: it will still insert just the most recent data because of your ‘where’ clause.

And how can you write a model that can be used in both incremental and full refresh modes? For that, you should always put the incremental parts of your code inside a {% if is_incremental() %} statement.

dbt will enter the if statement if the function is_incremental() returns true. It will return true if your model passes the three checks mentioned at the beginning of the post. But it is never a bad idea to read them again:

  • The table must exist in the database already
  • You are not running your model in full refresh mode (be sure you are not using the --full-refresh flag)
  • The materialized configuration of the model is set to ‘incremental’

So we can write our last example as:
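
Putting the pieces together, a sketch of the model would be:

```sql
{{ config(materialized='incremental') }}

select *
from {{ source('raw', 'orders') }}  -- hypothetical source

-- the filter is applied only on incremental runs
{% if is_incremental() %}
where {{ var("incremental_filter") }}
{% endif %}
```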

Now if we run our model with the --full-refresh flag, dbt will skip the where clause inside the is_incremental() block and select all the rows of the table, doing a full refresh.

If we run our model without the --full-refresh flag, the where clause inside the is_incremental() block is applied, and dbt will select only the data we specified.

The if statement with the is_incremental() function should ALWAYS be used. I showed the example with a static where condition, but it works with dynamic conditions as well.

Incremental Strategies

Now let’s see the particularities of each strategy.

Strategies vary depending on the database, and each database can use a different version of SQL. In my examples I am always using the BigQuery version of SQL, so the code may not run if you don’t translate the functions to your SQL version.

Default strategy

If you don’t specify a strategy in your config file, dbt will use the default strategy for your adapter.

The default strategies and other possible strategies for each adapter can be seen here.

  • Snowflake: merge (default), delete+insert (optional)
  • BigQuery: merge (default), insert_overwrite (optional)
  • Spark: append (default), insert_overwrite (optional), merge (optional, Delta-only)

You can also check the documentation for adapter-specific configs, if your adapter was not listed above.

Append

The append strategy is very straightforward. There is no need for any additional configuration other than the incremental_strategy. If the append strategy is your default you don’t need to explicitly declare it.
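
If it is not your default, declaring it is a one-line config. A sketch, reusing the hypothetical source and variable from the previous examples:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='append'
    )
}}

select *
from {{ source('raw', 'orders') }}

{% if is_incremental() %}
where {{ var("incremental_filter") }}
{% endif %}
```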

You just have to be cautious about duplicates. Using the following destination and source tables as examples:

If you run the model incrementally, assuming today is 2022-09-26, the destination table would have duplicates:

Merge

The merge strategy is the default one for some databases. If this is your case, you don’t need to configure it explicitly.

For the merge strategy, besides the incremental_strategy, you can define a unique_key. Merge will check the unique keys of the data you selected and the data in your destination table, then it will update rows with the same unique key, and insert data with new unique keys.

In our case, the unique key would be the order_id.
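
A sketch of the config block:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='order_id'
    )
}}
```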

So let’s assume our existing table is:

And in our source table, for some reason, the order_status and order_payment_method of some rows changed.

If you run the model, this will be the result:

The 2nd and 3rd rows were updated, and from the 4th row onwards, the rows were inserted.

Note that the 1st row was not updated because it doesn’t match our where condition.

To use merge, you have to understand the behavior of your source table: whether old records are updated frequently, how old they can be, and whether the updated information is important.

If old records are updated, you can consider using a larger time window, not using a where condition at all, or even using full refresh. If the updated information is not crucial, you can use merge and run periodic full refreshes.

Another configuration we can specify is merge_update_columns. Let’s imagine that, for some reason, you want to update only the order_status and you don’t want to update any other column. You can pass this column inside a list:
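
A sketch of the config, passing only order_status in the list (matched rows keep the old values of every other column):

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='order_id',
        merge_update_columns=['order_status']
    )
}}
```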

Then, the result would be:

Delete+insert

The delete+insert strategy is similar to merge: you have to configure a unique_key as well.

The difference between delete+insert and merge is that merge will update existing rows, while delete+insert will, as the name suggests, delete them and insert them again. Rows with new unique keys will just be inserted.
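
The config sketch is analogous to merge; only the strategy name changes:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='order_id'
    )
}}
```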

Insert Overwrite

For the insert_overwrite strategy, you must partition your table with the partition_by config and specify which partitions you want to overwrite using the partitions config.

In partition_by we must specify the partition column (field) and the type of the column (data_type). In my last post I showed which types are allowed.

In partitions we specify the partitions we want to overwrite, and usually they are the same partitions we are selecting in our where condition. So you can use the same variable here.
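
One way to write it, following the static-partitions pattern from the dbt-bigquery docs (the column names and the 7-day window are assumptions; you could also keep the filter in a dbt_project.yml variable, as shown earlier):

```sql
-- partitions ingested today or in the last 7 days
{% set partitions_to_replace = ['timestamp(current_date)'] %}
{% for i in range(1, 8) %}
    {% do partitions_to_replace.append("timestamp(date_sub(current_date, interval " ~ i ~ " day))") %}
{% endfor %}

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'order_ingestion_datetime', 'data_type': 'timestamp', 'granularity': 'day'},
        partitions=partitions_to_replace
    )
}}

select *
from {{ source('raw', 'orders') }}

{% if is_incremental() %}
where timestamp_trunc(order_ingestion_datetime, day) in ({{ partitions_to_replace | join(', ') }})
{% endif %}
```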

So, assuming the following destination and source tables as examples:

dbt would delete from the destination table all data ingested today or in the last week and replace it. The destination table would then be:

What if I want to change the columns in my incremental model?

Sometimes your source tables need to change. You may have to remove a column, add a new column, or even change a column’s data type.

dbt already provides a feature to deal with this problem. It is actually a configuration you can put inside your config() block, and it is called on_schema_change (a config example follows the list of options below).

There are four options for on_schema_change:

  • ignore
  • fail
  • append_new_columns
  • sync_all_columns
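
As a sketch, keeping the merge config from the earlier examples, you would set one of the four values like this:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='order_id',
        on_schema_change='sync_all_columns'
    )
}}
```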

Let’s see the behavior of each one. For that, let’s assume this is our destination table:

And this one is our source table:

Note that we removed the order_payment_method column and added the order_credit_card column.

And we are also assuming we are using the merge strategy.

ignore

This is the default value for on_schema_change.

If we run it, dbt will ignore the changes; in other words, it will not include the new column, nor will it remove old columns.

fail

With the fail option, dbt will raise an error message if the columns change.

The fail message will be as follows:

The source and target schemas on this incremental model are out of sync!
They can be reconciled in several ways:
- set the `on_schema_change` config to either append_new_columns or sync_all_columns, depending on your situation.
- Re-run the incremental model with `full_refresh: True` to update the target schema.
- update the schema manually and re-run the process.
Additional troubleshooting context:
Source columns not in target: order_credit_card
Target columns not in source: order_payment_method

append_new_columns

Using append_new_columns, dbt will add new columns but not remove old columns.

The result will be:

sync_all_columns

The last option, sync_all_columns, will remove old columns and include new columns.

Check out the result:

Full refresh

Note that when we included the new column, not all values were backfilled. The rows outside our selection were not updated.

If you want to backfill all the values of the new column, you must run the model in full refresh mode.
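
With the dbt CLI, that just means passing the --full-refresh flag on a regular run (the model name is a placeholder):

```
dbt run --select orders_incremental --full-refresh
```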

Closing remarks

So, this was the second half of the Understanding dbt incremental strategies story and now you have a full picture of them!! 😄

In the course of these two posts, we answered in a very detailed way the questions:

  • WHAT is a dbt incremental strategy?
  • WHAT are the incremental strategies?
  • WHY use or not use incremental strategies?
  • WHEN to use and when not to use each incremental strategy?
  • HOW to use each incremental strategy?

I hope you are now able to implement incremental strategies efficiently and confidently in your project, whenever you judge it necessary.

If you have any questions, suggestions, something to add, or something to correct, please leave a comment below! 👇

And remember that the best way to learn something is by trying to explain it to someone else!


https://www.linkedin.com/in/brunoszdl/ #dbt #sql #snowflake #bigquery #databricks #analytics #analyticsengineer #data #elt