dbt Tests
A necessary assertion to ensure the reliability of the building blocks
Table of Contents
- Overview
- Types of Tests in dbt — Generic Tests & Singular Tests
- Additional Advanced Tests
- How to run tests
- Test “severity”
- Test Implementation Tactics
- Appendix
Overview
From the official dbt documentation (Add tests to your DAG | dbt Developer Hub):
Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds and snapshots). When you run `dbt test`, dbt will tell you if each test in your project passes or fails.
In short, there are 2 types of tests in dbt:
- Generic tests, which are set up by macros, and some of them are already available out of the box.
- Singular tests, which are `.sql` scripts that you expect to return zero rows; if any rows are returned, the test fails.
Generic Tests
Generic tests are put in place by calling a macro. They are defined in a `.yaml` file within the same folder as the model.
There are 4 out-of-the-box tests provided by dbt:
- not_null: checks a column to validate there are no NULL values.
- unique: checks that values in a column don't contain duplicates.
- accepted_values: defines a list of expected values a column can have.
- relationships: checks that a column on the current table references an existing column in another table, i.e. asserts referential integrity.
Under the hood, dbt constructs a `select` query for each test, using the parameterized query from the generic test block. These queries return the rows where your assertion is not true; if the test returns zero rows, your assertion passes.
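For instance, a not_null test compiles to roughly the following query (a simplified sketch; the table and column names are hypothetical, and the exact compiled SQL can be inspected under target/compiled/ after a run):

```sql
-- Simplified sketch of the query dbt generates for a not_null test
-- on a hypothetical orders.customer_id column.
select customer_id
from analytics.orders
where customer_id is null
-- zero rows returned => the assertion holds and the test passes
```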
not_null asserts that there are no NULL values in the selected column.
By default, this test will fail even if just 1 NULL record is found. This behavior can be adjusted via Severity (see the Test “severity” section below).
unique asserts that the column's values are unique across the whole table.
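As a minimal sketch (the model and column names are hypothetical), both of these tests are declared in the model's `.yaml` file:

```yaml
version: 2
models:
  - name: orders            # hypothetical model
    columns:
      - name: order_id      # hypothetical column
        tests:
          - not_null
          - unique
```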
accepted_values asserts that the values in the column fall within a defined list of expected values. For example:
```yaml
version: 2
models:
  - name: orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: status_id
        tests:
          - accepted_values:
              values: [1, 2, 3, 4]
              quote: false
```
relationships asserts a relationship between 2 models, based on the Entity Relationship (ER) diagram. A typical example would be that the id of the buyer in the Orders table should always be found in the Customer dimension table.
The following example tests that every order's `customer_id` maps back to a valid `customer`:
```yaml
version: 2
models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
```
Singular Tests
Singular tests are custom SQL scripts we can build to test specific values.
- You can think of one as a query from which you expect zero results: if any rows are returned, the test fails.
- These are defined in `.sql` files, typically in your `tests` directory.
- You can use Jinja (including `ref` and `source`) in the test definition, just like you can when creating models.
- Each `.sql` file contains one `select` statement, and it defines one test.
- For reference, you can check the `tests` folder in the root of the project; its folder structure should mirror the `models` folder as well. A minimal example is sketched below.
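For illustration, a singular test might look like the following (the model, column, and file name are hypothetical); it returns any orders with a negative amount, so any returned row makes the test fail:

```sql
-- tests/assert_no_negative_order_amounts.sql (hypothetical file name)
-- A singular test returns the offending rows, so zero rows = pass.
select
    order_id,
    order_amount
from {{ ref('orders') }}
where order_amount < 0
```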
Additional Advanced Tests
More advanced tests become available after installing dbt-utils and Elementary.
dbt-utils
The following tests come from the dbt-utils package (an example usage is sketched after the list):
- recency
- expression_is_true
- accepted_range
- not_constant
- not_empty_string
- not_null_proportion
- not_accepted_values
- relationships_where (the advanced version of `relationships`)
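For example, a sketch of `dbt_utils.accepted_range` applied to a numeric column (the model and column names are hypothetical):

```yaml
version: 2
models:
  - name: orders                  # hypothetical model
    columns:
      - name: order_amount        # hypothetical column
        tests:
          - dbt_utils.accepted_range:
              min_value: 0        # amounts should never be negative
              inclusive: true
```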
Elementary — Data Anomaly Detection
Elementary also adds extra dbt tests for anomaly detection, and they come in 3 different levels (a table-level example is sketched after the list):
- Table level monitors
- Column level monitors
- Dimension monitors
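As a rough sketch, assuming the Elementary package is installed and using its `elementary.volume_anomalies` test (check the Elementary documentation for the exact arguments; the model and timestamp column are hypothetical), a table-level monitor could be declared like this:

```yaml
version: 2
models:
  - name: orders                          # hypothetical model
    tests:
      - elementary.volume_anomalies:      # table-level monitor on row volume
          timestamp_column: created_at    # hypothetical timestamp column
```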
How to run tests
In dbt, tests are run via the `dbt test` command (About dbt test command | dbt Developer Hub). Examples:
Run tests on one specific model:
dbt test --select hourly_tmp_dim_vc_usr_customer
Run tests on a model and all of its downstream models:
dbt test --select hourly_tmp_dim_vc_usr_customer+
Run tests on a model and all upstream models:
dbt test --models +hourly_fct_vc_usr_client_stats
Run tests on a list of models (separated by whitespace only):
dbt test --models hourly_fct_vc_usr_client_stats hourly_fct_vc_usr_client_stats hourly_dim_vc_staging_product_staging
When it comes to model selection, the options in dbt are actually very flexible; please see Methods | dbt Developer Hub under the Node selection section.
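For example, the selection methods also allow running tests by tag or by test type (the tag name below is hypothetical):
dbt test --select tag:hourly
dbt test --select test_type:singular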
Test “severity”
Configuring test `severity` | dbt Developer Hub
dbt supports 2 kinds of severity for a failed dbt test: `error` and `warn`.
The relevant configs are:
- `severity`: `error` or `warn` (default: `error`)
- `error_if`: conditional expression (default: `!=0`)
- `warn_if`: conditional expression (default: `!=0`)
For example,
```yaml
version: 2
models:
  - name: large_table
    columns:
      - name: slightly_unreliable_column
        tests:
          - unique:
              config:
                severity: error
                error_if: ">1000"
                warn_if: ">10"
```
This `yml` file is saying a couple of things:
- for the test of the column `slightly_unreliable_column`, make sure there is no duplication (the `unique` test)
- the severity of this test is error (the default value), BUT:
- it will start generating `warn` messages in the logs (CLI) if there are more than 10 duplicated records; AND,
- the test will fail if there are more than 1,000 duplicated records found, which makes the `dbt test` run fail
Severity can also be configured in various locations in the dbt repo; please see Configuring test `severity` | dbt Developer Hub for details.
Note
- Only use error for business-critical base tables, as data quality issues like duplication there would have a big impact on a lot of tables downstream
- Most issues are minor and shouldn't break a whole pipeline, so warn is more than enough
- It would also be nice to have different levels of warning to help assess the impact
Test Implementation Tactics
Primary Key
Even though Snowflake and other DWH solutions do not enforce primary keys, it is still vital to define the primary key for all tables.
To assert the primary key of a table, it is suggested to use both of the following tests together: not_null and unique (see the example in the Generic Tests section above).
Surrogate Key
The ingested tables do not always come with a clear key. By working with stakeholders in Product, Tech, and Business, a surrogate key can be generated with the relevant domain knowledge (perfect example here).
The assertion of a surrogate key is similar to that of a primary key: not_null plus unique on the generated key.
It is also recommended to generate the surrogate key using the macro from dbt-utils for the sake of standardization, as sketched below. BUT note that it is critical to investigate enough to find out the hidden granularity of the table before implementation.
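A sketch, assuming a recent dbt-utils version where the macro is named `generate_surrogate_key` (older versions call it `surrogate_key`); the model and columns are hypothetical:

```sql
-- models/dim_customer_product.sql (hypothetical model)
select distinct
    -- hash the columns that together define the grain of the table
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'product_id']) }} as customer_product_key,
    customer_id,
    product_id
from {{ ref('stg_orders') }}
```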
NULLs sometimes
During data development, there are a lot of times when the BI team doesn't have in-depth domain knowledge yet.
It is always useful to start with a plain not_null test for a new column.
If the `not_null` test doesn't pass, we can fall back to a selective test using the `where` config, or use `not_null_proportion`. The rule of thumb is always to start very strictly, then fall back step by step.
Example:
```yaml
version: 2
models:
  - name: large_table
    columns:
      - name: sometimes_null_column
        tests:
          - not_null:
              config:
                where: "category = 'category that we know should not be NULL'"
```
Dimension columns with categorical values
Suggested to use these 2 together:
- not_null
- accepted_values
Metrics columns with continuous values
Suggested to use:
- not_null
- accepted_range
Future TIMESTAMPs
In general, future TIMESTAMPs are not a very critical data quality issue.
Suggested to use:
- not_null
- accepted_range OR expression_is_true (see the sketch after the note below)
Note:
- While comparing TIMESTAMPs, it is vital to keep in mind the timezones of the different TIMESTAMP values.
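A sketch using `dbt_utils.expression_is_true`, assuming Snowflake SQL and a hypothetical model and timestamp column; the comparison is done in UTC to avoid timezone surprises:

```yaml
version: 2
models:
  - name: orders        # hypothetical model
    tests:
      - dbt_utils.expression_is_true:
          # no created_at value should be in the future (compared in UTC)
          expression: "created_at <= convert_timezone('UTC', current_timestamp())"
```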
Referential Integrity issues
What is Referential Integrity?
A typical example of a referential integrity issue is a `customer_id` in the Sales table that cannot be found as a `customer_id` in the Customer table.
Suggested to use:
- relationships
- (If the previous one fails, fall back to) relationships_where, as sketched below
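A sketch of `dbt_utils.relationships_where` (the model, column, and filter condition are hypothetical); the `from_condition` lets you exclude rows that are legitimately allowed to have no match:

```yaml
version: 2
models:
  - name: orders                    # hypothetical model
    columns:
      - name: customer_id
        tests:
          - dbt_utils.relationships_where:
              to: ref('customers')
              field: id
              # hypothetical filter: ignore guest checkouts with no customer record
              from_condition: "customer_id is not null"
```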
Appendix
Compile
There is an additional check that can be run in dbt: compile.
dbt compile
According to dbt:
dbt compile generates executable SQL from source model, test, and analysis files. You can find these compiled SQL files in the target/ directory of your dbt project.
If there are files in the repository that cannot be compiled into executable SQL, this check will fail.