Mastering data testing with dbt — part 2

Atheon Analytics Ltd
6 min read · Aug 16, 2024


By Dominik Golebiewski & Konrad Maliszewski

At Atheon we’re running thousands of data tests every day to monitor our data pipeline. This approach has delivered a number of useful benefits to us, as discussed in Part 1 of this series. However, running all tests after each run is very, very costly. At the time of deployment, it was forecast to cost us more than the data transformations themselves. Clearly, this was unacceptable. We quickly had to find a way of reducing these costs to something more manageable. We realised that many (most?) of our models are incremental, and even when they aren’t, we’re only processing a batch of new data every day. This means that most of the time most of the data isn’t changing. And yet we’re testing it again and again. Do we need to do that?

What dbt offers

dbt, of course, has thought about this problem. There is a where field in the data test config, which enables testing a subset of the data. The most commonly proposed use of this config is to filter on a date column being equal to the current date. The options for this additional configuration have developed quite a bit in recent releases, but when we were solving this problem (over 2 years ago) it was still quite limited.
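For reference, that usage looks something like this in a model’s YAML (the model and column names here are illustrative, not ours):

models:
  - name: fact_day_sales
    columns:
      - name: id
        tests:
          - unique:
              config:
                where: "day_date = current_date"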

Our specific challenge

What dbt offered out of the box at the time might have worked for some of our models. But if you remember our context (multi-source and multi-tenant tables, and the fact that we don’t really know what’s being processed when), we needed to get smarter about identifying a batch for testing. What is more, while we have models which simply load new data, we also have numerous models that aggregate data. This means that identifying which rows should be tested is not as straightforward as defining a single condition. It might be easier to illustrate with an example.

Example — daily to weekly grain aggregation

Imagine we’re collecting sales data for a particular client. When that data is at a daily grain, for a day-grained fact model we do a simple delete/insert of the daily sales values. When we want to retest that table for unique records after new data has been inserted, we could just test the inserted days to ensure there are no duplicates (or filter on the latest timestamp of the records). However, when those dates get aggregated into weekly figures, it’s easy to see that a few days scattered throughout the last few years won’t provide an immediate indication of which rows should be tested. We would need a mechanism to identify which dates fall in which week, which weeks were affected, and so on. It gets quite complex quite quickly.
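To make that concrete, the weekly model might be built along these lines (a simplified sketch with illustrative names, not our production code):

select
    date_trunc('week', day_date) as week_date,
    dim_one,
    sum(sales_value) as sales_value,
    max(runstartedtime) as runstartedtime
from {{ ref('fact_day_sales') }}
group by 1, 2

A single restated day changes just one weekly row, and nothing on that row tells us which load touched it, so a simple date = current_date filter no longer identifies what to test.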

Solution

This is when we decided to settle on a more scalable and flexible approach. To know what to test, we return the model’s minimum date for the most recently loaded timestamp (the timestamp determines when the whole batch, regardless of source and tenant, was loaded). Rather than doing potentially very complex operations to identify exactly which rows should be tested, we identify the minimum date within the batch of data and then test everything since that date. This means that we sometimes retest almost the whole table, e.g. when one day of data three years ago is restated. However, this happens infrequently enough that we can greatly reduce costs (and latency) without having to maintain very complex logic to identify what should be tested. It is a relatively simple improvement on the core functionality, but it offers a great saving and is very flexible. It applies equally well to simpler models where we just delete/insert data and to those which do more complex aggregations.
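In SQL terms, the rule boils down to this (column names follow our conventions; the model name is illustrative):

-- earliest business date touched by the most recent load
select min(day_date)
from {{ ref('fact_day_sales') }}
where runstartedtime >= (select max(runstartedtime) from {{ ref('fact_day_sales') }})

Every batch test then simply filters on day_date >= that value.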

Example of how macros work out what to test

These methods allow for selective aggregation or testing of records that meet the specified conditions. However, testing only the most recently loaded records (those carrying the latest timestamp) can lead to false negatives. If the logic for generating surrogate keys changes, new records may no longer match existing ones, resulting in duplicates against older data, and a test that looks exclusively at the latest records would never see them. Likewise, when aggregating, using only partial data (e.g. at the day level) could produce inaccurate outcomes. To counter this, we implemented an additional method that goes beyond timestamps: it calculates the earliest date for the most recent timestamp and performs aggregation or testing from that point forward. This approach has proven to be both efficient and cost-effective, while maintaining reliability. If data is restated from a previous date, such as 01/01/2023, everything from that point onward to the most recent day is retested, so the restated batch cannot silently disrupt results.

Technical overview

We achieve this with a pair of macros:

{% macro get_max_runtime(model=none, timestamp_column=none) %}

    {# Default to our standard load-timestamp column #}
    {% set timestamp_column = timestamp_column or 'runstartedtime' %}

    {# Latest load timestamp in the model; '0'::timestamp falls back to the epoch when the table is empty #}
    {% set query %}
        select coalesce(max({{ timestamp_column }}), '0')::timestamp
        from {{ model }}
    {% endset %}

    {# Only query the warehouse at execution time, not during parsing #}
    {% if execute %}
        {% set query_output = run_query(query).columns[0].values() %}
    {% else %}
        {% set query_output = [none] %}
    {% endif %}

    {% do return(query_output[0]) %}

{% endmacro %}
{% macro get_delta_date(model, observation, date_column=none, timestamp_column=none) %}

    {# Defaults for our standard column names #}
    {% set timestamp_column = timestamp_column or 'runstartedtime' %}
    {% set date_column = date_column or 'day_date' %}

    {# Min or max date (per `observation`) among the rows carrying the latest load timestamp #}
    {% set query %}
        select {{ observation }}({{ date_column }})::date
        from {{ model }}
        where {{ timestamp_column }} >= '{{ get_max_runtime(model=model, timestamp_column=timestamp_column) }}'::timestamp
    {% endset %}

    {# Only query the warehouse at execution time, not during parsing #}
    {% if execute %}
        {% set result_list = run_query(query).columns[0].values() %}
    {% else %}
        {% set result_list = [none] %}
    {% endif %}

    {{ return(result_list[0]) }}

{% endmacro %}
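Called on their own, the macros can be used from any Jinja context. For example (the model name is hypothetical):

{% set min_batch_date = get_delta_date(model=ref('fact_day_sales'), observation='min') %}
{# every row with day_date >= min_batch_date is in scope for batch testing #}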

These macros in turn enable us to implement custom versions of the unique_combination_of_columns, not_null and relationships tests. For example:

{% test unique_combination_of_columns(
    model,
    combination_of_columns,
    date_column='day_date',
    timestamp_column='runstartedtime',
    quote_columns=False,
    batch_test=False,
    enabled=True
) %}

{% if enabled %}

    {# A command-line var takes precedence over the YAML parameter #}
    {% set is_batch_test = var("batch_test", batch_test) %}

    {% if not quote_columns %}
        {%- set column_list = combination_of_columns %}
    {% elif quote_columns %}
        {%- set column_list = [] %}
        {% for column in combination_of_columns -%}
            {% do column_list.append(adapter.quote(column)) %}
        {%- endfor %}
    {% else %}
        {{ exceptions.raise_compiler_error(
            "`quote_columns` argument for unique_combination_of_columns test must be one of [True, False] Got: '" ~ quote_columns ~ "'."
        ) }}
    {% endif %}

    {%- set columns_csv = column_list | join(', ') %}

    with validation_errors as (

        select
            {{ columns_csv }}
        from {{ model }}
        {%- if is_batch_test %}
        {# Restrict the test to everything since the earliest date in the latest batch #}
        where {{ date_column }} >= '{{ get_delta_date(model=model, date_column=date_column, timestamp_column=timestamp_column, observation="min") }}'
        {%- endif %}
        group by {{ columns_csv }}
        having count(*) > 1

    )

    select *
    from validation_errors

{% endif %}

{% endtest %}
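The not_null and relationships variants follow exactly the same pattern. A minimal sketch of a batch-aware not_null (not our exact implementation) could look like this:

{% test not_null(
    model,
    column_name,
    date_column='day_date',
    timestamp_column='runstartedtime',
    batch_test=False,
    enabled=True
) %}

{% if enabled %}

    {% set is_batch_test = var("batch_test", batch_test) %}

    select *
    from {{ model }}
    where {{ column_name }} is null
    {%- if is_batch_test %}
      and {{ date_column }} >= '{{ get_delta_date(model=model, date_column=date_column, timestamp_column=timestamp_column, observation="min") }}'
    {%- endif %}

{% endif %}

{% endtest %}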

Usage example

To use these tests in batch mode, we do the following. First, we configure each test in the model’s YAML file:

- name: fact_day_sales
  tests:
    - unique_combination_of_columns:
        combination_of_columns:
          - day_column
          - dim_one
          - dim_two
          - dim_three
        batch_test: True
        date_column: day_column
        timestamp_column: timestamp_column
        enabled: True

Exactly what happens depends on the defaults we set in the configuration, but with the batch_test parameter set to True, executing the dbt test command will only test the latest batch of data, filtering down to the minimum date for the maximum timestamp. This can be reversed (by setting batch_test: False), in which case we can switch batch testing on from the command line with:

dbt test --vars '{batch_test: True}'

Alternatively, if we wish to override a YAML configuration where batch_test is set to True and test the entire dataset, passing False into the batch_test variable on the command line will achieve this, as the variable takes precedence over the configuration parameter.

dbt test --vars '{batch_test: False}'

Data testing costs down by 80%

This is not an exaggeration. We were literally on a path for our regular production data tests to cost more than the data processing itself. We noticed it after the first couple of days of tests running regularly in production. Thanks to the batching approach, we have been able to reduce the cost of our tests by 80%, which brought them down to a much more acceptable level. It also meant they completed relatively quickly and didn’t clog up the pipeline. There was a risk for a moment that we’d have to scale up our machines significantly to ensure that tests finished before the next batch of data started getting processed. Given developments in dbt core, we might need to go back and refactor, but so far the tests have been working great for us.

Stay tuned for the final part of this series, in which we discuss the last aspect of data testing, namely test scheduling. There were some trade-offs we had to make, as running all tests after every run was prohibitively expensive.
