Unit Testing on dbt Models Using a Static Test Dataset in Snowflake

How to validate your assumptions while protecting yourself from making future mistakes

Published in

Hashmap, an NTT DATA Company

7 min readJun 30, 2020

Testing on dbt built models is one of the core functionalities that is sometimes overlooked. Out of the box, dbt provides two different mechanisms for data validation schema tests and custom data tests. In my opinion, however, these are too basic; they do not clearly articulate how the data should be transformed or the expected output.

I also find it redundant in writing expressions both in the model as well as a Testing Expression. A coding error could happen in the model or in the expression definition itself. Another point of view is that if the model involved complex window-based operations like (first_value, last_value, etc.) these cannot be easily expressed in these test datasets.

Based on my experience, it was always easy for me, as a developer, to sit down with a business analyst, open up an excel sheet and fill in input values (from the input table) and expected output values (in the transformed table). This gave us both an understanding and confirmation of the exact logic that the analyst has asked me to implement.

While doing this exercise it was also easy for me to understand complex operations that the analyst wanted me to perform — be it a date difference calculation, window operations, or applying average across a specific group.

If you reconnect with your own experience, expressing these in the testing expression is simply not possible.

A discussion on the dbt discourse: Testing with fixed data set, gave direction on how we can perform actual data testing using dbt. I highly recommend reading the discourse link to get an initial understanding of the discussion and how it was solved. The prototype code demonstrated here is an implementation and walkthrough of how this can be done, based on the discussion. You can find the code for this on my GitLab.

NOTE 1: The model/transformation involved is meant for demonstration purposes; the business logic behind why the calculation is done and how it is done, should not be of concern. The underlying intent is to demonstrate how to perform the testing on a test data set.

NOTE 2: I am expecting the audience to have experience in dbt model building and understanding the functionality on:

Scenario

To keep things short and simple, I am using the “LINEITEM” table from Snowflake Sample Dataset (TPCH_SF1). The model behavior is as follows:

Copy a sample of the ‘snowflake_sample_data.tpcH_sf1.lineitem’ records into an ‘order_line’ table
Derive calculated field ‘line_price’ and ‘discounted_line_price’
For each order aggregate no of line items,line_price,discounted_line_price
Materialize the result into the ‘ORDER_LINEPRICE_SUM’ table.

The screenshot below demonstrates the input and the expected result after completion of transformation:

The following diagram depicts the data pipeline based on dbt docs:

Execution

The set of commands to run the test is as follows:

# Download dbt dependencies
dbt deps# Seed the test datasets
dbt seed# Execute the model
dbt run -m +ORDER_LINEPRICE_SUM --target=test --vars "{'ORDER_LINE_K': 'TEST_DB.PUBLIC.ORDER_LINE_INPUT_IDL'}"# run the test
dbt test

I will explain the details later in the article, but for now, take a look at the below screenshots which demonstrate the sample output of the execution run:

Successful execution

The data preview of ‘ORDER_LINEPRICE_SUM’:

The command line output of dbt run:

Failed execution

To demonstrate a failure scenario, I am replacing:

select 
    l_orderkey 
    ,count(l_linenumber) as no_of_lines
    ,sum(line_price) as total_line_price
    ,sum(discounted_line_price) as total_discounted_line_price
  from line
  group by l_orderkey
  order by l_orderkey desc

with:

select 
    l_orderkey 
    ,SUM(l_linenumber) as no_of_lines --purposefully done to demonstrate wrong calculation
    ,sum(line_price) as total_line_price
    ,sum(discounted_line_price) as total_discounted_line_price
  from line
  group by l_orderkey
  order by l_orderkey desc

The data preview of ‘ORDER_LINEPRICE_SUM’:

The output of the dbt run is as below:

The failure is due to the incorrect calculation done on the number of lines.

Implementation

The test data is stored in the “test-data” folder, which is configured as a “data-path” in the dbt_project.yml

I follow a convention that <<source_table>>_INPUT to indicate that this seeded data will be used as an input. And the <<target_table>>_EXPECTED to indicate that this is the expected result once the model transformation is completed.

The testing is defined in the TEST_DB_SCHEMA.yml file. The definition is as below:

seeds:
  - name: ORDER_LINEPRICE_SUM_EXPECTED 
    tests:
      - dbt_utils.equal_rowcount:
          compare_model:  ref('ORDER_LINEPRICE_SUM')
      - dbt_utils.equality:
          compare_model: ref('ORDER_LINEPRICE_SUM')
          compare_columns:
            - L_ORDERKEY
            - NO_OF_LINES

NOTE: You are testing against the expected seed table. Defining the test against the target model (ORDER_LINEPRICE_SUM) does not handle well in all situations, whereas defining testing against seeded data is easier.

If you have different variations of input, since the underlying table changes from initial data load to ongoing delta loads, you would also end up with different versions of the expected resultant records as well. So, defining the test against this variation with the same underlying table is possible.

You can also test against a specific seed like below:

dbt test --schema -m ORDER_LINEPRICE_SUM_EXPECTED

Pointing to the input seed table

If you followed the initial dbt discourse: Testing with fixed data set; it was discussed that the custom implementation of ‘ref’ macro might need to be developed. This is how I have it done.

Macro: tbref

I have the macro ‘tbref’ defined in [macros/utility_functions.sql](macros/utility_functions.sql. This function takes 2 parameters:

model_name_key => a key to look for in the var command line argument
model_name => the default model, incase the var is not supplied

A sample call of this macro is as follows ex: ORDER_LINEPRICE_SUM.sql

from {{ tbref('ORDER_LINE_K' ,'ORDER_LINE') }}

By default target, the model ‘ORDER_LINE’ will be taken into consideration. When we are performing a ‘test target’ and if the var argument is supplied with the ‘ORDER_LINE_K’, then the model present in the var argument will be considered. The ‘taget=test’ is done to indicate testing as well as a preventive measure.

Command-line

The below command line (mentioned earlier), indicates that dbt run is against the test target and the ORDER_LINE_K points to table ‘TEST_DB.PUBLIC.ORDER_LINE_INPUT_IDL’; which will be used as the input to the ‘ORDER_LINEPRICE_SUM’ model transformation:

# Execute the model
dbt run -m +ORDER_LINEPRICE_SUM --target=test --vars "{'ORDER_LINE_K': 'TEST_DB.PUBLIC.ORDER_LINE_INPUT_IDL'}"

If you want to run under the default target, the invocation will be as follows:

# Execute the model
dbt run -m +ORDER_LINEPRICE_SUM

Final Thoughts

What was demonstrated here is a simple and effective way to do unit data testing. The test data preparation would definitively take your development cycle times, but it protects you from future accidental mistakes as you develop future models or enhance the existing model.

I hope this short prototype helps you, at the very least, as an initial test to get started with.

Need Help with Your Cloud Initiatives?

If you are considering the cloud for migrating or modernizing data and analytics products and applications or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud, dbt, and Snowflake service offerings.

Unit Testing on dbt Models Using a Static Test Dataset in Snowflake

How to validate your assumptions while protecting yourself from making future mistakes

Scenario

Execution

Successful execution

Failed execution

Implementation

Pointing to the input seed table

Macro: tbref

Command-line

Final Thoughts

Need Help with Your Cloud Initiatives?

Other Tools and Content for You

Snowflake Utilities & Accelerators | Do more with Snowflake

Try out all the Snowflake utilities that Hashmap has available and do more with Snowflake: Snowflake Inspector…

Deploying and Running dbt on AWS Fargate

Cost-Effectively Automating Deployment and Transformations in your Cloud Data Warehouse

Housekeeping in Snowflake with SQL and DBT

Be Sure and Take Advantage of Account Usage History in Snowflake

Data Sync to Snowflake Using Confluent Kafka Connect: Part 1

Moving On-Prem Oracle Databases to Snowflake in Azure with Kafka Connect

Data Materialization in Snowflake Using Confluent Kafka Connect: Part 2

Making On-Prem Oracle Databases Useable with Snowflake in Azure with Kafka Connect

Hashmap on Tap | Hashmap Podcast

A rotating cast of Hashmap hosts and special guests explore data and cloud technologies from diverse perspectives while enjoying…

Written by Venkat Sekar