Data transformation with dbt and Teradata Vantage
dbt (Data Build Tool) is a data transformation tool that is the cornerstone of the Modern Data Stack. In this post, we will explore how to use dbt to transform data in the Teradata Vantage data warehouse.
dbt takes care of the T in ELT (Extract Load Transform). The assumption is that some other process brings raw data into your data warehouse or lake. This data then needs to be transformed to the internal representation of your choice, be it the 3rd Normal Form, Data Vault, or any other implementation you chose. Finally, the internal representation needs to be translated to the consumable format required by tools used by data consumers. It could be the dimensional model for BI tools or some other custom format, like features for AI/ML processing. With dbt, you can do all these transformations within your data warehouse or lake, using SQL to manipulate the data, git to version the transformations, and dbt data tests to verify the results of your data pipelines.
dbt has a sample project called Jaffle Shop. The project shows how raw input data: customer set, order set, and payment set, are transformed into a dimensional model that can be consumed by a BI tool.
Here is the starting point:
And here is the output model that we need:
As you can see, the customers dimension requires that we do aggregation (first_order, most_recent_order, number_of_orders, total_order_amount) and the fact table requires a join between orders and payments tables.
The original Jaffle Shop project uses Google BigQuery. I’ve forked the project to https://github.com/Teradata/jaffle_shop-dev and made a couple of changes to adjust the SQL to the Teradata syntax. I’m now going to walk you through the models (subsequent transformations) that the sample project applies.
We first build staging tables based on the source tables. In our example, this step is trivial as the raw data sets are simple. In real life, this step would usually take care of normalizing data formats, aligning naming conventions, changing data units. In our case, we select from the raw tables, e.g.:
with source as (
{#-
Normally we would select from the table here, but we are
using seeds to load our data in this project
#}
select * from {{ ref('raw_customers') }}
),
renamed as (
select
id as customer_id,
first_name,
last_name,
email
from source
)
select * from renamed
Notice, how this transformation, or model in dbt parlance, relies on common table expressions (CTE) syntax. CTE offers a clean way to decompose complex SQL. If you need a refresher on CTE see the Teradata Vantage — SQL Data Manipulation Language documentation.
Now, that we have our input data sets staged, it’s time to do joins and aggregates.
We first aggregate order data by user (we will call this model customer_orders in later steps):
with orders as (
select * from {{ ref('stg_orders') }}
),
final as (
select
customer_id,
min(order_date) as first_order,
max(order_date) as most_recent_order,
count(order_id) as number_of_orders
from orders
group by 1
)
select * from final
Join orders with payments (customer_payments):
with payments as (
select * from {{ ref('stg_payments') }}
),
orders as (
select * from {{ ref('stg_orders') }}
),
final as (
select
orders.customer_id,
sum(amount) as total_amount
from payments
left join orders on payments.order_id = orders.order_id
group by 1
)
select * from final
And finally aggregate payments by order (order_payments):
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with payments as (
select * from {{ ref('stg_payments') }}
),
final as (
select
order_id,
{% for payment_method in payment_methods -%}
sum(case when payment_method = '{{payment_method}}' then amount else 0 end) as {{payment_method}}_amount,
{% endfor -%}
sum(amount) as total_amount
from payments
group by 1
)
select * from final
With these intermediate models, we begin constructing the output dimensional model.
We take customers, customer_orders, and customer_payments models and construct the customers dimension:
with customers as (
select * from {{ ref('stg_customers') }}
),
customer_orders as (
select * from {{ ref('customer_orders') }}
),
customer_payments as (
select * from {{ ref('customer_payments') }}
),
final as (
select
customers.customer_id,
customer_orders.first_order,
customer_orders.most_recent_order,
customer_orders.number_of_orders,
customer_payments.total_amount as customer_lifetime_value
from customers
left join customer_orders on customers.customer_id = customer_orders.customer_id
left join customer_payments on customers.customer_id = customer_payments.customer_id
)
select * from final
Finally, we take orders and order_payments models to construct the fact table:
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
),
order_payments as (
select * from {{ ref('order_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{payment_method}}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments on orders.order_id = order_payments.order_id
)
select * from final
With our models configured, let’s run dbt.
We start by loading sample data from CSV files to input tables:
dbt seed
We are now ready to execute the transformations:
dbt run
The sample project contains example tests that validate data at each step of the transformation. You can run the tests with:
dbt test
Since dbt knows about all data transformations, it can produce clean lineage documentation. Once you run these commands:
dbt docs generate
dbt docs serve
You will get documentation for your models with the following lineage graph:
In this post, we have explored what dbt is used for. We went through a sample project, that took raw input data and produced a dimensional model that is useful for BI tools.
Next steps:
- For instructions on installing dbt, clone the sample project and run the steps, see the dbt with Teradata Vantage tutorial.
- To learn more about the dbt Teradata adapter check out dbt-teradata plugin documentation.