Unit Test SQL using dbt

Tianchen Wu
4 min read · Dec 15, 2022


After years of working in data science and engineering, I have found that data quality is the ghost haunting almost every project, quietly undermining business outcomes.

SQL is the de facto language of data. One way of improving data quality is to strengthen the SQL codebase with unit tests and data tests. This article is largely inspired by Mrs Gao’s post.

In this article, the fundamental logic of unit testing SQL with dbt is illustrated with a simple data set.

Basic Idea

The basic idea of unit testing SQL is exactly the same as unit testing Python code:

  1. mock a controllable input D
  2. take a testable module/function/algorithm, call it A
  3. calculate the expected result for input D, call it O_should
  4. compare the expected result (O_should) with the actual output from A (O_is)

If the expected result matches the actual output, we say the module/function/algorithm works as expected; otherwise the test fails.

Sometimes it does not have to be an exact match, e.g. matching only up to two digits after the decimal point.
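
As a minimal sketch of step 4 in plain SQL (assuming the actual and expected results are already materialised as two tables, here called actual and expected purely for illustration), the comparison can be expressed as a symmetric difference that must return zero rows:

-- the test passes when this query returns no rows
-- (actual and expected are placeholder table names, not part of the project below)
(SELECT * FROM actual EXCEPT SELECT * FROM expected)
UNION ALL
(SELECT * FROM expected EXCEPT SELECT * FROM actual);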

An Example

Let’s build a naive example to illustrate the idea above. We need the following dbt project folder structure:

-- dbt_project.yml

-- data/
------ iris.csv
------ selected_iris_expected.csv

-- models/
------ iris/
---------- selected_iris.sql
---------- schema.yml

dbt_project.yml is the configuration file defining seed paths, schemas, tagging, etc.

# example: dbt_project.yml

# take things under data/ as seeds
data-paths: ["data"]

# configure seeds: all go into the unittesting schema
seeds:
  schema: unittesting
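
Note: if you are on dbt 1.0 or later, the data-paths key has been renamed; the equivalent configuration would be

# dbt >= 1.0 renamed the key
seed-paths: ["data"]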

iris.csv is our mock data (presumably a subset of the real iris data), selected_iris_expected.csv is the expected result table based on the mock input iris.csv, and the model selected_iris.sql is the module/function/algorithm to be tested.

-- selected_iris.sql
{{ config(
    materialized='table',
    schema='unittesting',
    tags=['iris']
) }}

-- count the number of "special" iris ids (above average in all aspects)
-- not a very meaningful piece of logic, just for illustration purposes
SELECT
    count(distinct id) AS count
FROM {{ ref('iris') }}  -- reference the seed so dbt resolves the schema for us
WHERE sepallengthcm > 5.9
  AND sepalwidthcm > 3.1
  AND petallengthcm > 3.8
  AND petalwidthcm > 1.2

-- selected_iris_expected.csv
count
150

schema.yml defines the unit test case:

# schema.yml

version: 2

# the table built by selected_iris should be equal to selected_iris_expected
models:
  - name: selected_iris
    tests:
      - dbt_utils.equality:
          compare_model: ref('selected_iris_expected')
          tags: ['unit_testing']
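
One detail worth mentioning: dbt_utils.equality is not built into dbt itself; it comes from the dbt_utils package, which has to be declared in packages.yml and installed with dbt deps before the test can run. A minimal packages.yml could look like this (the version pinned below is only an example; use whatever matches your dbt installation):

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2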

After the heavy lifting of preparing the mock and expected data, configuring dbt, and defining the tests, there are two steps left.

Step 1: load the test data into the database

# load the mock data as seeds:
# iris.csv will be loaded into the unittesting.iris table
# selected_iris_expected.csv will be loaded into the unittesting.selected_iris_expected table
dbt seed

# build the selected_iris model into unittesting.selected_iris
dbt run

Now we have three tables in the database under the unittesting schema, and what is left is essentially the SQL equivalent of assert a == b.
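
If you want to double-check that the tables are there, a quick query against the information schema lists them (depending on your target and custom schema configuration, the schema may show up with a prefix such as <target_schema>_unittesting):

-- list the tables created for the test (schema name may differ in your setup)
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'unittesting';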

Step 2: compare

# all tests within the models/iris/ folder will be executed
# of course, we could also restrict the run to the unit tests only by using tags
dbt test --models iris
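
For example, using the unit_testing tag defined in schema.yml above, the run can be narrowed down to just the unit tests:

# run only the tests tagged as unit_testing
dbt test --models tag:unit_testing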

See the test result: PASS or FAIL.

Technicalities

Unit testing SQL looks simple, right? In reality, things can be more complex:

  • you probably need a more sophisticated folder structure (using nested subfolders) to separate and organise tests by project
  • you may need to turn unit testing on or off depending on the current environment, dev or prod (see the sketch after this list)
  • you may have a large input table that is hard to mock
  • you may have a complex model for which it is hard to calculate the expected result beforehand (the model is not testable)
  • the expected result may not match the actual output 100% (even though they are practically the same) due to floating-point precision, etc.
  • or you may simply have no time budget for this in the project, which is not uncommon; people do not unit test SQL a lot in 2022, and SQL statements are often considered correct once they have been written
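
For the dev/prod point above, one possible approach (a sketch, not the only way) is to combine dbt's built-in target variable with the enabled config, so the unit-test models simply do not get built outside of dev:

-- inside the model's config block: only build this model when the target is dev
{{ config(
    materialized='table',
    schema='unittesting',
    tags=['iris'],
    enabled=(target.name == 'dev')
) }}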

Again, I refer to Mrs Gao’s post for more details.

A Tip: Modularize SQL Statements

Despite all the challenges listed above, one thing does help with unit testing, for SQL as well as for more general programming languages like Python.

That is modularization. A well-modularized model/function/algorithm ensures testability and readability, even for SQL.

There are many ways to realise this with SQL and dbt: CTEs, splitting the logic into smaller models wired together with ref(), and reusable Jinja macros, as well as other methods like UDFs and stored procedures, which are commonly built into database engines.
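
As a sketch of what that could look like for the example above (stg_special_iris is a hypothetical model name, not part of the project we built earlier), the filtering logic could live in its own small model and be mocked and tested independently of the aggregation:

-- stg_special_iris.sql (hypothetical intermediate model)
-- isolates the filtering logic so it can be tested on its own
SELECT id
FROM {{ ref('iris') }}
WHERE sepallengthcm > 5.9
  AND sepalwidthcm > 3.1
  AND petallengthcm > 3.8
  AND petalwidthcm > 1.2

-- selected_iris.sql, rewritten on top of the smaller model
SELECT count(distinct id) AS count
FROM {{ ref('stg_special_iris') }}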

Conclusion

SQL is the native language of data. In this article, we demonstrated a way to do SQL unit testing with dbt.

Likewise, we could also do data integration tests, or use other tricks like delegating the creation of SQL to a more powerful language such as Python using rasgoQL.
