Optimize Your dbt Incremental Models with Teradata Vantage™ Temporal Tables

Daniel Herrera · Published in Teradata · 10 min read · Jul 10, 2023
The diagram depicts the behavior of standard models in dbt and highlights the differences compared to incremental models

What are Incremental Models in dbt?

Data build tool (dbt) is an open-source tool used to simplify the building of data transformation pipelines within a data warehouse/lake environment. dbt enables the modularization, orchestration, automatic documentation, and testing of each transformation by representing them as dbt models.

In essence, dbt models are Structured Query Language (SQL) scripts typically written as Common Table Expressions (CTEs). These models reference data sources, or other dbt models, to retrieve specific data, apply transformations to the acquired data, and materialize the result as a database artifact, such as a table or view. The materialization type is defined in the model configuration, which can be set at either the project or individual model level.
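
For illustration, a minimal dbt model (hypothetical, not part of the demo project below, and assuming a source named `teddy_retailers` has been declared) consists of a config block that sets the materialization plus the query to materialize:

-- models/orders_enriched.sql (hypothetical example)
{{ config(materialized='view') }}

WITH orders AS (
    SELECT * FROM {{ source('teddy_retailers', 'source_orders') }}
)
SELECT
    id AS order_id,
    customer_id,
    status AS order_status,
    order_date
FROM orders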

While tables and views serve as the primary materialization types in dbt, there are two additional special types: ephemeral materializations, which do not materialize as database artifacts, and incremental materializations, a variant of table materializations that we explore in detail in this article.

By default, dbt transforms all data within the source or ancestor model and fully rebuilds the model materialization in each run. Incremental materializations offer two ways to tailor this default behavior. Firstly, they allow filtering of the data ingested from the source after the initial materialization, typically by filtering only newly added or updated data. Secondly, they provide a mechanism to define a strategy for handling previously transformed data in the event of updates. Models that adopt incremental materialization are known as incremental models.

When Should You Implement Incremental Models?

The general behavior of dbt materializations might not be ideal when:

  • Working with large datasets, due to performance concerns.
  • The data in the source, or ancestor model, contains only the latest modifications or additions to a certain dataset. In this case, the materialized table will always overwrite data transformed in the previous materialization, leading to data loss.

Incremental models tackle these two problems by providing the ability to selectively filter the data to be transformed by the model. Additionally, they incorporate the newly transformed data into the existing model materialization if one already exists.

Below, we review the implementation of two incremental models in a dbt project. We leverage Teradata Vantage's support for temporal tables, specifically transaction-time columns, to easily filter for new additions or modifications in the source data and update our incremental models.

Example Walkthrough

Project Demo Repository: https://github.com/Teradata/teddy-retailers-dbt-temporal-tables

Prerequisites:

  • Access to a Teradata Vantage Instance: You can easily access a Teradata Vantage instance for free through ClearScape Analytics Experience.
  • Your preferred database client for connecting to your Teradata Vantage instance.
  • Python 3.9–3.11.

Business Context:

We will explore the Extract, Load, and Transform (ELT) process of a fictional store called Teddy Retailers, specifically focusing on ingesting into an incremental model only the data added or modified after the previous dbt run. Teradata Vantage Temporal Tables play a crucial role in achieving this objective.

The initial source data comprises the tables `customers`, `orders`, `products`, and `order_products`, as depicted in the Entity Relationship Diagram (ERD) of Teddy Retailers.

ERD of Teddy Retailers Source Data

After applying transformations using dbt, a star schema is created from the original sources. In the star schema, a fact table called `fct_order_details` holds the numerical data pertaining to orders (the facts) and incorporates foreign keys that connect it to dimensions such as orders, products, and customers, enabling analysis of the data.

ERD of Teddy Retailers Data Mart Star Schema

Regular updates occur in two data sources, specifically `source_orders` and `source_order_products`. These sources leverage Teradata Vantage's support for temporal tables, utilizing the transaction-time columns `order_duration` and `order_product_duration`. Transaction-time columns automatically record timestamps indicating the period during which the data was inserted or updated.

Transaction-time columns are defined by including “AS TRANSACTIONTIME” in the column definition and are associated with a period data type.
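
As a rough sketch of what such a definition could look like on the project's `source_orders` table (illustrative only; the exact DDL lives in the project's setup script covered below):

-- Illustrative sketch: adding a transaction-time column to an existing table.
-- Vantage then maintains the period values automatically on insert and update.
ALTER TABLE teddy_retailers.source_orders
    ADD order_duration PERIOD(TIMESTAMP(6) WITH TIME ZONE) NOT NULL AS TRANSACTIONTIME;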

Leveraging these timestamps in conjunction with dbt enables the ingestion of only newly inserted or updated data into our dbt models. Filtering the data in this way improves the performance of the model execution when working with big datasets.

We have designed two incremental models, `all_orders` and `all_order_products`, as an integral part of our data warehouse architecture. These models are the main sources for the fact table of the Teddy Retailers star schema.

Tables with Temporal Columns

Setting Up Development Environment

  • Clone the tutorial repository:
git clone https://github.com/Teradata/teddy-retailers-dbt-temporal-tables teddy-teddy-retailers
  • Navigate to the cloned repository directory:
cd teddy-teddy-retailers
  • Create a Python virtual environment:
python -m venv venv
  • Activate the virtual environment according to your Operating System.
  • Install dbt-teradata:
pip install dbt-teradata
  • Create the profile “teddy_retailers” in your `~/.dbt/profiles.yml` file as detailed below:

If the directory .dbt doesn’t exist in your home directory, you will need to create both the directory and the profiles.yml file.

teddy_retailers:
  outputs:
    dev:
      type: teradata
      host: <host of your Vantage Instance>
      user: <user name of your Vantage Instance>
      password: <password used in your Vantage Instance>
      logmech: TD2
      schema: teddy_retailers
      tmode: ANSI
      threads: 1
      timeout_seconds: 300
      priority: interactive
      retries: 1
  target: dev
  • Run dbt’s debug command to ensure the connection to your data warehouse is functioning correctly:
dbt debug

Loading Baseline Data to Sources

To simulate a production environment, we offer publicly available datasets along with corresponding scripts to import them into your test data warehouse. To load the data, you can copy and execute the SQL script located at `references/inserts/create_data.sql` using your preferred database client.

This script loads data from the provided datasets. It also alters the tables `source_orders` and `source_order_products` to include the temporal transaction-time fields `order_duration` and `order_product_duration`, respectively. These fields are used in the corresponding incremental dbt models to filter the data and transform only newly added or modified records.

The Incremental Models

To attain the objectives of incremental materialization, incremental models should define the following:

  • How to filter the data in the source or ancestor model.
  • How to incorporate newly transformed data into an already materialized model.

Filtering of data in the source or ancestor model:

When working with large datasets, the objective is usually to transform only data that has been modified or added to the source, or ancestor model, since the last transformation took place. To fulfill this requirement, the source, or ancestor model, should contain a column that tracks modifications or additions with a timestamp. Teradata Vantage temporal fields offer a very convenient way to implement such a timestamp.

The main sources in our scenario, `source_orders` and `source_order_products`, are implemented as temporal tables. Filtering for newly added or modified records is merely a matter of adding a `WHERE` clause to the `SELECT` statement that reads data from the source, or ancestor model, as seen below:

 WHERE BEGIN(order_duration) > (SELECT MAX(BEGIN(order_duration)) FROM {{ this }})

This clause, however, must be evaluated only when a previous materialization already exists in the data warehouse. The `is_incremental()` macro offered by dbt returns true when this is the case. The `SELECT` statement for acquiring the data from the source, or ancestor model, should be implemented as below:

SELECT * FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
-- this filter will only be applied on an incremental run
WHERE BEGIN(order_duration) > (SELECT MAX(BEGIN(order_duration)) FROM {{ this }})
{% endif %}

Incorporating newly transformed data into an already materialized model

The insertion and updating of data in an incremental model are driven by strategies defined in the model configuration. There are three main strategies:

  • Append: The most basic strategy; all new data is appended to the model. This can lead to duplicates when data is modified, and the model design should handle these occurrences according to the model's specific requirements.
  • Delete + Insert: A primary key must be defined to manage duplicates. If the primary key already exists in the target model, the existing record is deleted and the modified data from the source, or ancestor model, is inserted.
  • Merge: A primary key must be defined to manage duplicates. Duplicates are handled by updating the existing record in the target model with the modified data from the source, or ancestor model.

In the case of our example:

  • The `all_orders` model adopts a delete + insert materialization approach (a configuration sketch follows this list). The status of an order can change, and these changes are reflected in the source.
    A single `order_id` may appear multiple times in the source data, each with a different transaction timestamp. The most recent timestamp indicates the latest update, and this is the update that should be reflected in the `all_orders` model.
  • The `all_order_products` model keeps track of product item additions or subtractions from an order. The combination of `order_id` and `product_id` may appear multiple times in the source data with varying product quantities. This data is included in the transformation process without filtering.
    However, one step of the transformation involves aggregating the product quantities and returning a unique combination of `order_id`, `product_id`, timestamp, and the corresponding aggregated `product_quantity`.
    To achieve this, a simple append strategy is employed, which happens to be the default strategy in the dbt-teradata adapter, and thus is not specifically stated in the model configuration.
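
As a minimal sketch, assuming the standard dbt configuration keys supported by the dbt-teradata adapter (the model files in the repository are the authoritative version), the `all_orders` configuration and filter could be expressed as:

-- Hypothetical sketch of the all_orders incremental model.
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='order_id'
    )
}}

SELECT * FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
-- only rows added or updated since the last materialization
WHERE BEGIN(order_duration) > (SELECT MAX(BEGIN(order_duration)) FROM {{ this }})
{% endif %}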

Running Initial Transformations

Execute `dbt run` to perform the initial transformations on the baseline data.

As seen in the compiled code of the corresponding models, all data in the sources is included in the transformation. The compiled code of dbt models is found in the `./target` directory.

/* all_orders */
WITH add_new_orders AS (
    SELECT * FROM "teddy_retailers"."stg_orders"
)
select * from add_new_orders

/* all_order_products */
with add_new_order_products as (
    select * from "teddy_retailers"."stg_order_products"
),
final as (
    select
        order_id,
        product_id,
        order_product_duration,
        sum(product_quantity) as product_quantity
    from add_new_order_products
    group by 1, 2, 3
)
select * from final

Use your database client to check the statistics on both `all_orders` and `all_order_products` by running:

HELP STATS teddy_retailers.all_orders;
HELP STATS teddy_retailers.all_order_products;

At this point, the statistics will point to:

  • 858 unique `order_id` records in `all_orders`.
  • 8117 unique `order_id`, `product_id` combinations in `all_order_products`.

The `post-hooks` used in the definition of both `all_orders` and `all_order_products` enable the tracking of statistics. This is a topic that we will cover in future articles.
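
As a rough idea of what such a hook runs (a hedged sketch, not the project's exact hook), a `post_hook` attached to the model configuration could issue a Teradata COLLECT STATISTICS statement after each run, for example:

-- Hypothetical example of a statement a statistics-collecting post-hook could run.
COLLECT STATISTICS COLUMN (order_id) ON teddy_retailers.all_orders;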

Explore the fact data sliced by specific dimensions using the query scripts located in `references/query`:

/* Example of slicing by the orders dimension */
SELECT
    teddy_retailers.dim_orders.order_status,
    COUNT(teddy_retailers.dim_orders.order_id) AS currently_in
FROM teddy_retailers.dim_orders
GROUP BY 1
ORDER BY currently_in DESC;

Analyzing the orders by status:

  • delivered — 310
  • paid — 283
  • open — 265

Loading Updates on the Sources

To simulate the batch process of data loading (loading updates on the baseline data), execute the `update_data` script located in `references/inserts`:

INSERT INTO teddy_retailers.source_orders 
(id, customer_id, status, order_date)
VALUES
(855, 98, 'delivered', '2023-01-30');
INSERT INTO teddy_retailers.source_orders
(id, customer_id, status, order_date)
VALUES
(856, 583, 'delivered', '2023-01-30');
INSERT INTO teddy_retailers.source_orders
(id, customer_id, status, order_date)
VALUES
(859, 735, 'open', '2023-02-01');
INSERT INTO teddy_retailers.source_order_products
(order_id,product_id,product_quantity)
VALUES
(859,42,4);
INSERT INTO teddy_retailers.source_order_products
(order_id,product_id,product_quantity)
VALUES
(859,22,1);

In this script:

  • The status of orders 855 and 856 has been updated to delivered.
  • A new order, 859, has been added with the status of open.
  • 4 items of `product_id` 42 and 1 item of `product_id` 22 have been added to order 859.

Running Transformations on Updates

Execute `dbt run` again.

This time, only the recently added or modified data is part of the transformation, as seen in the compiled code of the corresponding models:

/* all_orders */
WITH add_new_orders AS (
    SELECT * FROM "teddy_retailers"."stg_orders"
    -- this filter will only be applied on an incremental run
    WHERE BEGIN(order_duration) > (SELECT MAX(BEGIN(order_duration)) FROM "teddy_retailers"."all_orders")
)
select * from add_new_orders

/* all_order_products */
with add_new_order_products as (
    select * from "teddy_retailers"."stg_order_products"
    -- this filter will only be applied on an incremental run
    WHERE BEGIN(order_product_duration) > (SELECT MAX(BEGIN(order_product_duration)) FROM "teddy_retailers"."all_order_products")
),
final as (
    select
        order_id,
        product_id,
        order_product_duration,
        sum(product_quantity) as product_quantity
    from add_new_order_products
    group by 1, 2, 3
)
select * from final

Use your database client to check the statistics on both all_orders and all_order_products as before:

HELP STATS teddy_retailers.all_orders;
HELP STATS teddy_retailers.all_order_products;

At this point, the statistics will point to:

  • 859 unique `order_id` records in `all_orders`. This reflects the newly added order.
  • 8119 unique `order_id`, `product_id` combinations in `all_order_products`. This reflects the two product items that are present in order 859.

Explore the fact data sliced by specific dimensions using the query scripts located in `references/query`:

/* Example of slicing by the orders dimension */
SELECT
    teddy_retailers.dim_orders.order_status,
    COUNT(teddy_retailers.dim_orders.order_id) AS currently_in
FROM teddy_retailers.dim_orders
GROUP BY 1
ORDER BY currently_in DESC;

Analyzing the orders by status at this point will return the following results:

  • delivered — 312. Reflects the two orders that changed status from paid to delivered.
  • paid — 281. Reflects the same two orders leaving the paid status.
  • open — 266. Reflects the new order with open status.

Conclusion

In this article, we explored the implementation of dbt incremental models with Teradata Vantage temporal tables, focusing on the sample ELT process of the fictional store Teddy Retailers.

By leveraging temporal tables and dbt’s capabilities, we achieved incremental ingestion of newly inserted or updated data into our dbt models. This advanced use case provides valuable insights for data engineers looking to optimize data pipelines with Teradata Vantage and dbt.

We would love to hear your insights and experience on this topic. Please don't hesitate to share your thoughts in the comments section.

For further questions or assistance with this project, reach out to our developer community.

Learn more

About implementing other advanced dbt topics with Teradata Vantage

About the Author

Daniel Herrera is a builder and a problem solver, constantly fueled by the opportunity to create tools that aid individuals in extracting valuable insights from data.

Daniel has held the position of Technical Product Manager, specializing in data ingestion and ETL (Extract, Transform, and Load) for enterprise applications. Over the past three years, he has actively contributed as a Developer, Developer Advocate, and Open-Source Contributor in the data engineering space for traditional and decentralized applications.

Daniel holds a certification as a Cloud Solutions Architect in Microsoft Azure. His proficiency extends to a range of programming languages including SQL, Python, JavaScript, and Solidity.

Connect with Daniel on LinkedIn
