Difficulties of the dbt incremental model: how to eliminate environment differences in tables

Fumiaki Kobayashi
Towards Data Engineering
3 min read · Feb 22, 2024

Despite the convenience of the dbt incremental model, which lets us update large tables in a variety of ways, it comes with countless pitfalls. One of them is the difficulty of verifying the impact of modifications in the development environment.

While ordinary tables and views can be verified simply by running the dbt build command in the development environment, an incremental model often fails during the release process in production even though it passed tests in development.

In most cases, the error is caused by not testing with the same data as in production, and it could have been prevented if that data had been available. However, many incremental models are built periodically, so simply rebuilding once during development is unlikely to reproduce the same data. To keep the data consistent, you can copy it from the production environment, and this article shows how to do that.
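To see why, consider a typical incremental model: each run appends only the rows that are newer than what is already in the table, so the table's contents depend on every previous run, not just the latest one. A minimal sketch (the table and column names are hypothetical):

{{ config(materialized='incremental', unique_key='event_id') }}

SELECT
  event_id,
  user_id,
  event_timestamp
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already in this table
  WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}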

Photo by Ed Robertson on Unsplash

How to implement it with Python

We use Python this time because it is quicker to implement than writing a macro inside the dbt repository.

First, we write SQL to retrieve the list of target tables. We get the list of tables created by incremental models from the output of dbt_project_evaluator, a dbt package that collects metadata to check alignment with dbt best practices.

SELECT
  database AS project,
  schema AS dataset,
  COALESCE(alias, name) AS table_name
FROM
  `{project}.{dataset}.stg_nodes`
WHERE
  materialized = "incremental"
  AND package_name = "{your_package_name}"
  AND is_enabled

If you use another package called elementary, which is designed to monitor data pipelines, your query would look like this:

SELECT
  database_name AS project,
  schema_name AS dataset,
  alias AS table_name
FROM
  `{project}.{dataset}.dbt_models`
WHERE
  materialization = "incremental"
  AND package_name = "{your_package_name}"

Run the query and convert the result to a DataFrame. Next, specify the development datasets that the tables will be copied to.
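As a minimal sketch, assuming the metadata SQL above is stored in a string variable (here called tables_query, a hypothetical name), you can run it with the BigQuery client and convert the result like this (the conversion requires the pandas and db-dtypes packages):

from google.cloud import bigquery

# Client for the project you run the query in; the table reference itself is fully qualified
client = bigquery.Client(project=your_gcp_dev_project_id)

# Run the metadata query shown above and load the result into a pandas DataFrame
df = client.query(tables_query).to_dataframe()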

DESTINATION_DATASETS = ["tanuki", "tiger", "rabbit"]

Loop through each of the specified datasets as follows.

from google.cloud import bigquery

# Client for the development project that the tables will be cloned into
client = bigquery.Client(project=your_gcp_dev_project_id)

for destination_dataset in DESTINATION_DATASETS:
    print(f"copying tables/views to {destination_dataset}")

    # Loop over the DataFrame rows (one row per incremental model)
    for _, table in df.iterrows():

        # Fully qualified IDs of the source (production) and destination (development) tables
        source_table_id = f"{table['project']}.{table['dataset']}.{table['table_name']}"
        destination_table_id = f"{your_gcp_dev_project_id}.{destination_dataset}.{table['table_name']}"

        # Clone the production table into the destination dataset
        query = f"""
            CREATE OR REPLACE TABLE `{destination_table_id}`
            CLONE `{source_table_id}`
        """

        job_config = bigquery.job.QueryJobConfig()
        job = client.query(query, job_config=job_config)

        # Wait for the copy job to complete
        job.result()
        print(f"Table/view {source_table_id} copied to {destination_table_id}")

Note that we use a table clone, which recently became generally available in BigQuery. This feature makes environment synchronization cost-effective, since storage costs are only incurred for the data that differs from the original table.

No matter how small the storage costs of a cloud DWH are, if you copy TB-class tables as physical tables into many datasets, you may receive a message from your cloud administrator like this:

👮🏼‍♀️ “Has the storage cost in your project increased rapidly lately?”

Therefore, you need to be careful when duplicating large datasets.

Alternatives

If you want to copy the tables within the dbt repository rather than with Python, you can use the pre_hook configuration and call a macro before building. If you want this to run only in the development environment, you can also use dbt's environment-separation conditions, such as the target variable.

example code:

{% macro clone_table(source_table_id, destination_table_id) %}
  {% if target.name == 'dev' %}
    CREATE OR REPLACE TABLE `{{ destination_table_id }}`
    CLONE `{{ source_table_id }}`
  {% endif %}
{% endmacro %}
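As a rough sketch, the macro could then be called from a model's config via pre_hook, so the clone runs just before the model is built (the table IDs below are hypothetical):

{{ config(
    materialized='incremental',
    pre_hook="{{ clone_table('prod-project.analytics.my_model', target.project ~ '.' ~ target.schema ~ '.my_model') }}"
) }}

Note that outside the dev target the macro above renders an empty string, so you may want it to fall back to a harmless statement such as SELECT 1 to keep the hook valid in every environment.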

In addition, the dbt clone command has been available since dbt Core 1.6. As introduced in the link, this command should be sufficient for building a CI environment when creating pull requests.
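For reference, a typical invocation compares the pull request branch against production artifacts and clones only the affected models (the artifact path below is an assumption):

# Clone production relations for models modified in this branch into the CI schema
dbt clone --select "state:modified+" --state path/to/prod-run-artifacts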
