How was this tested? A guide to validating models for dbt projects

When it comes to checking data after modeling changes, what steps do you take?

Dave Flynn
In the Pipeline
5 min readJun 26, 2024

--

A pull request comment template is a cornerstone for best practices on data projects. The PR comment serves as a record of the due diligence you performed in checking your work.

PR comment templates are extremely common in software engineering, and they’re making their way over to data projects, but adoption is slow. There are still a lot of teams that aren’t using formal PR comment templates, but if your data is critical to business, it’s a must.

A checklist of steps

The PR comment works because you follow a set or predefined sections, or steps, that help you to think about the work that you’ve done and make sure that you’ve tested it in the correct way.

PR comment checklists are usually geared towards code, and that makes sense for software projects — software is code. So a lot of the checklist is all about coding standards and usage.

For data projects, the code is your SQL and you’ll have standards that the SQL must follow and ways of checking that. But data projects are also data (I’m really dropping bombshells now), so if you only check the SQL, you’re only checking half of the project.

What about data projects?

There are data project pull request templates out there. dbt has one, I even made a version of this with some extra sections and refinements. And you can see examples of PR comments on public dbt projects that follow best practices.

The checklists in these PR comments cover:

  • What type of change this is
  • Did the project build
  • Did you follow the style guide
  • Have you documented anything new
  • etc.
How do you validate model changes in dbt data projects?
You’re on your own

What’s missing?

PR comment templates for data projects all have one shortcoming. You get to a section that is called:

  • validation of models, or
  • how was this tested

And then you’re kind of on your own. What are you supposed to do here? You need to validate the data that was created, but how?

Data Impact Validation Checklist

We need a checklist that covers how to validate data impact. A way that helps to:

  • surface unexpected impact
  • validate expected changes

It’s difficult to have a concrete set of steps that help to check every possible type of data work, from refactor to bugfix, but there are steps that every PR should follow.

Seeing inside the black box

Half the problem is being able to see the impact. Even if you know what to look for, how can you check it.

Imagine you made a logical change upstream to an orders staging table, you know the orders mart is downstream, and this is critical for business reporting. In addition to checking the code of your modified models, you also need to check the downstream orders model. If you can follow the trail to these critical models and find no data change, you can be sure that there’s no unintended impact making its way downstream.

Start wide and narrow your focus

Starting from the widest aspect and working inwards, a general checklist for checking data impact could look like this:

Check the scope of impact

  • Check which models have been modified, is this correct?
  • Are there accidental or unexpected modifications to models outside the scope of this work

Check the data in modified models

  • Based on your intent and expectation is the data correct?

Look for potentially impacted resources downstream

  • Could the changes I made have any impact on the usage of this data
  • Are there critical models downstream of my changes

You will have some idea of which models are critical, and ideally these should already be defined by the team.

Check for structural impact on critical models

  • schema changes
  • row count changes

Perform Holistic checks on critical models

  • Data profiling
  • Data reconciliation

Do some data spot checks

  • The results of queries
  • Row comparisons from dev and prod

Let’s put it in an actual checklist and see what it looks like:

The checks might seem obvious, but that’s the point. It should be done with each PR, and therefore it needs to be stated.

Your checklist may vary, but this gives you an idea. The main thing is that the PR author will have these things in mine for every PR. Whether that’s yourself, or a colleague, it’ll help you to validate your work, and then help the PR reviewer to properly review it.

What’s next for data project PR checklists?

Data projects ain’t getting any simpler, changes are often split among multiple pull requests, and data checklists need to be saved as part of the data audit trail. We could do with the ability to:

  • Save checklists
  • Reproduce checks
  • Reference checks across multiple PRs
  • A way to collaborative data checks

The last one is an interesting one, because it really opens the door for collaborative PR review. If multiple team members can perform and review each other’s checks, it turns the PR review from a one-way static process, into a team job.

What are the things that you check for every PR before submitting your work?

Data Validation Checklist

Recce has a built-in data validation checklist specifically designed to validate data modeling changes.

  • Maintain a list of checks during while you’re working
  • Submit your checks with your PR comment
  • Automate checks in CI
Recce data validation checklist for comprehensive PR review
Data Validation Checklist for PR review in DataRecce.io

--

--

Dave Flynn
In the Pipeline

Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects