Creating a custom data quality check on DBT (data build tool)

Hengkywu · tiket.com · Apr 8, 2022

This article covers creating a custom data quality check, since the use cases for quality checks vary depending on the demands of a specific dataset.

Definition

In this context, quality data means data ingested into a database/data warehouse that is accurate, valid, complete, and clean enough to be represented in business insights such as reports that support decision-making.

First things first, you should assess the data, define your use cases, and decide what needs to be checked. This article will demonstrate general use cases you may encounter, particularly duplicate and null data checks.

Let’s get into how we can create a custom data quality check on DBT.

Disclaimer:
For the data environment, we use Google’s BigQuery.

Write a quality check query:

Given the following dummy data:
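As an illustration, assume a hypothetical BigQuery table `project.dataset.orders` with a key field `order_id` (both names are placeholders, not from the original article). A query that checks the key for both duplicates and nulls could look like:

```sql
-- Count rows whose key appears more than once
SELECT
  'duplicate' AS test_type,
  COUNT(*) AS failed_records
FROM (
  SELECT order_id
  FROM `project.dataset.orders`
  GROUP BY order_id
  HAVING COUNT(*) > 1
)
UNION ALL
-- Count rows whose key is null (COUNTIF is BigQuery-specific)
SELECT
  'null' AS test_type,
  COUNTIF(order_id IS NULL) AS failed_records
FROM `project.dataset.orders`
```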

The above query checks for any duplicate or null values in the field that we use as the key.

Utilizing macro on DBT

In this exercise, we will create two important macros: a data quality generator and a get-test-values macro.

Data quality generator

Firstly, we will need a macro that can automatically generate a query based on use cases that we defined earlier and then load the result back into the log table.

Next, we parameterize the values that will be received by this macro:

  • unique_keys = list of fields that will be checked
  • qa_target_table = target table where the check results will be stored
  • qa_source_table = the table that will be used as the source for checking
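A minimal sketch of such a generator macro is shown below. The log-table columns (`table_name`, `field_name`, `test_type`, `failed_records`, `checked_at`) are assumptions for illustration, not the article's actual schema; `run_query` is dbt's built-in macro for executing SQL at runtime.

```sql
{% macro data_quality_generator(unique_keys, qa_target_table, qa_source_table) %}
  {% for key in unique_keys %}
    {% set check_query %}
      -- Assumed log-table schema; adjust to your own
      INSERT INTO {{ qa_target_table }}
        (table_name, field_name, test_type, failed_records, checked_at)
      -- Duplicate check: keys that occur more than once
      SELECT '{{ qa_source_table }}', '{{ key }}', 'duplicate',
             COUNT(*), CURRENT_TIMESTAMP()
      FROM (
        SELECT {{ key }}
        FROM {{ qa_source_table }}
        GROUP BY {{ key }}
        HAVING COUNT(*) > 1
      )
      UNION ALL
      -- Null check: keys with no value (COUNTIF is BigQuery-specific)
      SELECT '{{ qa_source_table }}', '{{ key }}', 'null',
             COUNTIF({{ key }} IS NULL), CURRENT_TIMESTAMP()
      FROM {{ qa_source_table }}
    {% endset %}
    {% do run_query(check_query) %}
  {% endfor %}
{% endmacro %}
```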

As a result, you can call the macro anywhere, as simply as:

{% do data_quality_generator(unique_keys, qa_target_table, qa_source_table) %}

Sample of log table:
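Assuming the hypothetical log schema sketched earlier (the original article's actual columns are not shown here), a few rows might look like:

```
table_name             | field_name | test_type | failed_records | checked_at
-----------------------+------------+-----------+----------------+--------------------
project.dataset.orders | order_id   | duplicate | 0              | 2022-04-08 10:00:00
project.dataset.orders | order_id   | null      | 3              | 2022-04-08 10:00:00
```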

Get Test Values

Remember that in order to know if our data quality is good or not, we need to focus on the objective of our use cases. Hence, we need another macro to get a concrete measurement of data quality. In this exercise, we want to know if data points are either duplicated or null. Therefore, the purpose of this macro is simply to get the results (computed via the previous generator) in the log table by using the following two parameters:

  • model = refers to the table name field in the log table
  • test_type = the use case whose result we want to retrieve
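A sketch of this macro, under the same assumed log schema and a placeholder log-table name `project.dataset.dq_log`, could be:

```sql
{% macro get_test_values(model, test_type) %}
  {% set query %}
    -- Fetch the most recent result for this table and test type
    SELECT CAST(failed_records AS STRING) AS failed_records
    FROM `project.dataset.dq_log`
    WHERE table_name = '{{ model }}'
      AND test_type = '{{ test_type }}'
    ORDER BY checked_at DESC
    LIMIT 1
  {% endset %}
  {% if execute %}
    {% set results = run_query(query) %}
    {# Return the column values as a list, e.g. ["0"] #}
    {% do return(results.columns[0].values()) %}
  {% else %}
    {% do return([]) %}
  {% endif %}
{% endmacro %}
```

The `execute` guard is standard dbt practice: it keeps the query from running during parsing, when `run_query` would return nothing usable.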

To get a result, we can input the following command:

{% set duplicate_check_result = get_test_values(tablename, 'use case name') %}

And last but not least, run the validation based on our needs, for example:

{% if duplicate_check_result[0] != "0" %}
{% do exceptions.raise_compiler_error('duplicate records found') %}
{% endif %}

That’s it.

This might not be the best approach, but it is one I personally find comfortable.

Thank you for reading. I am open to any input, discussion, or improvements.
