Unit Testing on dbt Models Using a Static Test Dataset in Snowflake

How to validate your assumptions while protecting yourself from making future mistakes

Venkat Sekar
Hashmap, an NTT DATA Company
7 min read · Jun 30, 2020


Testing dbt-built models is a core piece of functionality that is sometimes overlooked. Out of the box, dbt provides two mechanisms for data validation: schema tests and custom data tests. In my opinion, however, these are too basic; they do not clearly articulate how the data should be transformed or what the expected output is.

I also find it redundant to write expressions both in the model and in a test expression; a coding error could slip into the model or into the test definition itself. In addition, if the model involves complex window-based operations (first_value, last_value, etc.), these cannot easily be expressed as test expressions.

Based on my experience, it was always easy for me, as a developer, to sit down with a business analyst, open up an Excel sheet, and fill in input values (from the input table) and expected output values (in the transformed table). This gave us both an understanding and a confirmation of the exact logic the analyst had asked me to implement.

While doing this exercise, it was also easy for me to understand the complex operations the analyst wanted me to perform, be it a date-difference calculation, window operations, or applying an average across a specific group.

If you reflect on your own experience, you will recognize that expressing these requirements in a test expression is simply not possible.

A discussion on dbt Discourse, Testing with fixed data set, gave direction on how to perform actual data testing using dbt. I highly recommend reading that thread to get an initial understanding of the problem and how it was solved. The prototype code demonstrated here is an implementation and walkthrough of that approach. You can find the code on my GitLab.

NOTE 1: The model/transformation involved is meant for demonstration purposes; the business logic behind why and how the calculation is done should not be a concern. The underlying intent is to demonstrate how to perform testing against a test data set.

NOTE 2: I expect the audience to have experience building dbt models and an understanding of the following dbt functionality: seeds, macros, command-line vars, and targets.

Scenario

To keep things short and simple, I am using the “LINEITEM” table from the Snowflake sample dataset (TPCH_SF1). The model behavior is as follows:

  • Copy a sample of the ‘snowflake_sample_data.tpch_sf1.lineitem’ records into an ‘order_line’ table.
  • Derive the calculated fields ‘line_price’ and ‘discounted_line_price’.
  • For each order, aggregate the number of line items, line_price, and discounted_line_price.
  • Materialize the result into the ‘ORDER_LINEPRICE_SUM’ table (see the model sketch below).
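To make the scenario concrete, here is a minimal sketch of what the aggregation model could look like. It assumes the ‘ORDER_LINE’ model already carries the derived columns (for example, line_price taken from l_extendedprice and discounted_line_price as l_extendedprice * (1 - l_discount)); the column aliases are illustrative, and the actual model in the GitLab repo uses the tbref macro described later instead of a plain ref.

    -- models/ORDER_LINEPRICE_SUM.sql (simplified sketch)
    select
        l_orderkey                   as order_key,
        count(*)                     as num_line_items,
        sum(line_price)              as total_line_price,
        sum(discounted_line_price)   as total_discounted_line_price
    from {{ ref('ORDER_LINE') }}
    group by l_orderkey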

The screenshot below demonstrates the input and the expected result after completion of transformation:

The following diagram depicts the data pipeline based on dbt docs:

Execution

The set of commands to run the test is as follows:
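The original command listing did not carry over here, so the following is a sketch reconstructed from the description later in this article; the exact flags and var values in the repo may differ.

    # Load the static seed data, run the models against the test target,
    # then compare the model output with the expected seed.
    dbt seed --target test
    dbt run  --target test --vars '{"ORDER_LINE_K": "TEST_DB.PUBLIC.ORDER_LINE_INPUT_IDL"}'
    dbt test --target test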

I will explain the details later in the article, but for now, take a look at the screenshots below, which demonstrate the sample output of the execution run:

Successful execution

The data preview of ‘ORDER_LINEPRICE_SUM’:

The command line output of dbt run:

Failed execution

To demonstrate a failure scenario, I replace the correct line-count calculation in the model with an intentionally incorrect expression.
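The original before/after snippets did not carry over here; as a purely hypothetical illustration of that kind of change:

    -- Hypothetical illustration only; the actual change in the article may differ.
    -- Correct aggregation of the number of line items:
    count(*) as num_line_items
    -- Intentionally broken version, used to force the equality test to fail:
    count(*) + 1 as num_line_items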

The data preview of ‘ORDER_LINEPRICE_SUM’:

The output of the dbt run is as below:

The failure is caused by the incorrect calculation of the number of line items.

Implementation

The test data is stored in the “test-data” folder, which is configured as a data path in dbt_project.yml.
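A minimal sketch of the relevant dbt_project.yml entry, assuming dbt’s data-paths configuration; the repo’s actual file may differ.

    # dbt_project.yml (excerpt)
    # Seed CSVs holding the static test data live in the test-data folder,
    # alongside any regular seed data.
    data-paths:
      - "data"
      - "test-data"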

I follow a naming convention: <<source_table>>_INPUT indicates that the seeded data will be used as an input, and <<target_table>>_EXPECTED indicates the expected result once the model transformation is complete.

The testing is defined in the TEST_DB_SCHEMA.yml file. The definition is as below:
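The original YAML did not carry over here; the sketch below shows one way such a definition could look, assuming the dbt_utils equality test and the seed/model names used in this walkthrough. The actual file in the repo may differ.

    # TEST_DB_SCHEMA.yml (sketch)
    version: 2

    seeds:
      - name: ORDER_LINEPRICE_SUM_EXPECTED
        tests:
          # Compare the seeded expected rows with the rows produced by the model.
          - dbt_utils.equality:
              compare_model: ref('ORDER_LINEPRICE_SUM')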

NOTE: You are testing against the expected seed table. Defining the test against the target model (ORDER_LINEPRICE_SUM) does not work well in all situations, whereas defining the test against the seeded data is easier.

If you have different variations of input, for example because the underlying table changes from an initial data load to ongoing delta loads, you will also end up with different versions of the expected result records. Defining a test for each of these variations against the same underlying table is therefore possible.

You can also test against a specific seed like below:

Pointing to the input seed table

If you followed the initial dbt Discourse thread, Testing with fixed data set, it discussed that a custom implementation of the ‘ref’ macro might need to be developed. This is how I have done it.

Macro: tbref

I have the macro ‘tbref’ defined in macros/utility_functions.sql. The function takes two parameters (a sketch of the macro follows the list):

  • model_name_key => a key to look for in the var command line argument
  • model_name => the default model, in case the var is not supplied
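A minimal sketch of what such a macro could look like; the actual implementation in the repo may differ.

    {# macros/utility_functions.sql (sketch) #}
    {% macro tbref(model_name_key, model_name) %}
        {# Under the test target, allow the input relation to be overridden via --vars #}
        {% if target.name == 'test' and var(model_name_key, none) is not none %}
            {{ var(model_name_key) }}
        {% else %}
            {{ ref(model_name) }}
        {% endif %}
    {% endmacro %}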

A sample call of this macro, for example in ORDER_LINEPRICE_SUM.sql, is as follows:
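A hedged sketch of how the call could look inside the model; the surrounding SQL is illustrative.

    -- In ORDER_LINEPRICE_SUM.sql, the FROM clause resolves to either the real
    -- ORDER_LINE model or the seed table supplied via --vars:
    from {{ tbref('ORDER_LINE_K', 'ORDER_LINE') }}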

Under the default target, the model ‘ORDER_LINE’ will be used. When we run against the ‘test’ target and the var argument supplies ‘ORDER_LINE_K’, the relation given in the var argument will be used instead. Checking for ‘target=test’ both indicates that we are testing and acts as a preventive measure.

Command-line

The command line below (mentioned earlier) runs dbt against the test target, with ORDER_LINE_K pointing to the table ‘TEST_DB.PUBLIC.ORDER_LINE_INPUT_IDL’, which will be used as the input to the ‘ORDER_LINEPRICE_SUM’ model transformation:
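The original command did not carry over here; a sketch reconstructed from the description above (the exact quoting of the --vars value may differ):

    dbt run --target test --vars '{"ORDER_LINE_K": "TEST_DB.PUBLIC.ORDER_LINE_INPUT_IDL"}'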

If you want to run under the default target, the invocation will be as follows:
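Under the default target no var is supplied, so tbref falls back to the regular ORDER_LINE model; a sketch:

    dbt run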

Final Thoughts

What was demonstrated here is a simple and effective way to do unit data testing. Preparing the test data will certainly add to your development cycle time, but it protects you from accidental mistakes as you develop new models or enhance existing ones.

I hope this short prototype helps you, at the very least, as a starting point.

Need Help with Your Cloud Initiatives?

If you are considering the cloud for migrating or modernizing data and analytics products and applications or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud, dbt, and Snowflake service offerings.

Other Tools and Content for You

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast, Hashmap on Tap, on Spotify, Apple, Google, and other popular apps.

Venkat Sekar is Regional Director for Hashmap Canada and is an architect and consultant providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
