Leverage Node-Selection for Data Validation Checks in dbt PR Reviews

Automate data checks on resource groups for comprehensive coverage

Published in

In the Pipeline

4 min readJul 10, 2024

If you’ve used dbt for a while, you know how convenient node selection is. It can save time and compute costs by only running specified resources after updating your project. For instance, the ultra-useful state:modified+ selector allows you to run only modified resources and their downstream, scoping your build to only the modified area of the project.

In this post I’ll show how you can apply the same node-selection logic to data checks on your dbt project as part of PR review. I’m not talking about dbt tests, but checks to automate data validation and make sure that your critical models are covered no matter which part of the lineage you’re working on.

Avoid data ‘spot-the-difference’

After updating models in your dbt project, the first thing you want to see is the impact to data., and you most likely just care about change. E.g.:

Show me data-mismatches from before-and-after updating my dbt project
Help me investigate these changes in a visual way

This can be incredibly difficult for large projects because a single model change can impact numerous downstream models. It’s too many to manually check, and the last thing you want is to be playing data ‘spot-the-difference’ to try and eyeball impact between dev and prod.

Use this updated pull-request comment template for your dbt data projects

Ensure a structured approach to data-change validation in your data project PRs with a better comment template

medium.com

How node-selection helps

Just look at this partial view of Mattermost’s Data Warehouse lineage, a project with around 200 models:

Large dbt projects can be hard to navigate and assess impact

If you make a change in such a large project, how should you go about checking all the models that are downstream of your change?

Manually check those models? That’s a lot of work.
Set up project wide checks? That’s a lot of unnecessary checks on resources that are outside the impact area.

Node-selection makes impact assessment more manageable

This is where node selection is most useful. You can filter models and resources based on your conditions. Here’s an actual PR from the Mattermost project that shows only the modified+ models.

Node selection helps to scope lineage and make impact assessment more manageable

This gives you a more manageable view of the lineage, but the potentially impacted area changes with each PR, so you need a way to automatically apply checks to all models in the impact area.

How to automate checks on subsets of models

You know how to select subsets of resources using node-selection. With Recce you can blanket apply data validation checks to these groups of nodes using the same node-selection syntax that dbt uses.

No need to write checks for each individual model
No need to write checks for each PR

Here’s an example config for Recce that will do the following:

Row count diff on modified+ that are materialized as tables
Schema diff on modified models

checks:
- name: Row count diff
  description: Check the row count diff for all table models
  type: row_count_diff
  params:
    select: state:modified+,config.materialized:table
- name: Schema diff
  description: Check the schema change in modified models
  type: schema_diff
  params:
    select: state:modified

You didn’t have to specify any individual models — Node selection syntax lets you target all resources that are matched by the select option.

High signal with minimal config

If there’s a mismatch on any model Recce will show it in the check summary that’s posted to your PR comment thread.

Recce PR summary will show which resources have data mismatches

Then you can launch Recce in review mode and check out exactly what change occurred and find the root cause of the issue.

Customizable check scopes

The beauty of node selection is that you can really configure which models are checked as you wish.

Want to only check critical models? Just use a tag.
Want to only check modified models? Use modified.
Want to only check materialized tables? use config.materialized:table.

You could group models into categories and assign tags, then target those models with specific checks. No matter what is going on in the PR, if any of those models fall within the impact area, it’ll be checked.

How you can get it

Recce has node-selection based data checks as part of its data validation suite. You can get Recce OSS now, and Recce Cloud for teams is launching soon.

Next-Level Data Validation Toolkit for dbt Data Projects — Introducing Recce

The ultimate data modeling validation toolkit for comprehensive PR review that doesn’t slow down merge times