dbt tests are not enough.

Dave Flynn
Published in In the Pipeline
3 min read · Oct 27, 2023

If you’re a good driver, you check your mirrors often. Mirrors help you maintain awareness of other vehicles behind you and adapt to the changing condition of the traffic. If you’re a great driver, though, you also check your blind spots — the area not covered by mirrors, the unknown.

dbt tests are your rear-view mirrors

dbt tests are like your rear-view mirrors — they cover the areas that you know may have problems, and that need constant checking to ensure your safety (the quality of your data). You set up dbt tests for issues you can foresee, such as validating data formats, null or empty values, uniqueness, relationships, and so on. But what about the data blind spots?
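For reference, foreseeable checks like these are usually declared as generic tests on a model's columns. A minimal sketch, assuming a hypothetical `orders` model and column names:

```yaml
# models/schema.yml (model and column names are hypothetical)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique      # no duplicate keys
          - not_null    # no missing keys
      - name: customer_id
        tests:
          - relationships:   # referential integrity
              to: ref('customers')
              field: customer_id
```

Running `dbt test` (or `dbt build`) executes these checks — but only for the conditions you thought to declare.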

dbt tests cover the areas you know about, but what about your blind spots?

The real danger is in your blind spot

Not looking over your shoulder when you change lanes is risky. You might be okay, but the only way to guarantee safety is to check your blind spot.

The same is true for data projects. You can merge a PR if dbt build passes, but do you really know if the data is good? If you only test for ‘knowns’, you’re essentially merging without looking.

Spot check the data

Doing data spot-checks is like checking your blind spot. A spot-check confirms that a data-modeling change is not messing up the data. dbt tests might pass, but that doesn’t mean the data didn’t change, and you need to verify that any change is desirable. More importantly, historical/production data should not have changed.

You can do these kinds of checks to validate your changes as you work, and then include them as proof of correctness in your pull request comment to help the review process. Together with dbt tests, spot-checks give you full coverage — the structural integrity of the data, and confirmation that the data is correct.

The problem is that data projects are freaking huge now, and there’s no way to see everything. How can you know where to check?

Data drill down

Start by understanding the zone of impact of your changes. You can do this with a lineage DAG diff, which shows how the DAG has changed between before and after your modeling changes. A DAG diff provides an actionable starting point for your data review, as opposed to the standard dbt DAG, which represents only a single state of the lineage.

The lineage DAG diff tells you where to check. From there, you can test modified models and critical downstream models with holistic checks such as profiling or value diffs.
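As a sketch, a simple profiling check might compare summary statistics for a modified model between environments (the `prod`/`dev` schemas, table, and column names here are hypothetical):

```sql
-- Hypothetical profiling check: compare basic stats
-- between prod and dev builds of the same model
SELECT 'prod' AS env,
       COUNT(*)                 AS row_count,
       COUNT(DISTINCT order_id) AS distinct_orders,
       MIN(order_date)          AS earliest,
       MAX(order_date)          AS latest
FROM prod.orders
UNION ALL
SELECT 'dev',
       COUNT(*),
       COUNT(DISTINCT order_id),
       MIN(order_date),
       MAX(order_date)
FROM dev.orders;
```

If the two rows disagree on counts or ranges in ways your change doesn’t explain, the change touched more than you intended.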

If profiling stats reveal possible unintended impact, or you want to double-check the data, you can drop into actual queries and run a query diff to compare the actual data between prod and dev.
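A query diff can be sketched with set operations, again assuming hypothetical `prod` and `dev` schemas holding the same model:

```sql
-- Hypothetical query diff: rows that differ between
-- the prod and dev builds of a model
(SELECT * FROM prod.orders
 EXCEPT
 SELECT * FROM dev.orders)   -- rows missing or changed in dev
UNION ALL
(SELECT * FROM dev.orders
 EXCEPT
 SELECT * FROM prod.orders); -- rows added or changed in dev
```

An empty result means the two builds are identical; any rows returned show you exactly what changed.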

Validations at these varying levels of granularity are extremely useful for checking your work, and for including in your pull request comment to speed up the review of your PR.

We’re working on the solution

We recently released Recce, the data-modeling validation toolkit for dbt, specifically aimed at helping data practitioners validate their work and lowering time-to-merge for critical data-modeling updates. Recce OSS is available now; find out more in the video below.



Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects