Identify and Automate Data Checks on Critical dbt Models

Dave Flynn
In the Pipeline
Jun 20, 2024 · 5 min read

Do you know which models in your data project are critical? I’m sure the answer is yes. Even if you don’t have a formal system that ranks models, you can certainly point to the ones you should tread carefully around.

Do you check these critical models for data impact with each pull request? Maybe some of them, but probably only on an ad-hoc basis. If they’re really critical models, you need to be aware of unintended impact. The last thing you want is to mistakenly change historical metrics, or lose data.

Identifying critical models

Knowing the critical models in your project comes from your domain knowledge. You know these models:

  • have particular significance to the business,
  • have a ton of downstream models,
  • or, from experience, are the ones you’ll get a call about if something goes wrong.

If you check these models in an ad-hoc way, it might be time to apply a more formal ranking system to models, which will also help the triage process if something does go wrong.
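If you want to make that ranking explicit, dbt’s built-in tags are one lightweight option. Here’s a minimal sketch; the critical tag name and the customers model are placeholders, not a convention dbt or Recce prescribes.

```yaml
# models/schema.yml: a sketch of marking critical models with a dbt tag.
# 'critical' is just a naming convention; pick whatever fits your project.
version: 2

models:
  - name: customers
    config:
      tags: ['critical']
```

With that in place, dbt node selection can list or build just those models, e.g. dbt ls --select tag:critical, which is also handy during triage.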

Impacted Lineage DAG from Recce showing modified and impacted models on the California Integrated Travel Project dbt project

When to check critical models

Checking that these critical models haven’t changed unexpectedly is really important, and you need to do it before merging your updated models into prod. You might have a data observability tool in place to monitor prod data but, if you merge a breaking change, it’s already too late. The downstream damage is already done.

You can check the data in two ways before merging:

  • During development: This is when you’re making SQL changes, editing models and metrics, checking the data, and proactively looking for data impact.
  • Automated: These checks run automatically in continuous integration (CI). You still need to review the checks, but they are run automatically for each pull request (PR) on your data project.

This mix of manual and automated checks gives you a good chance at comprehensive coverage, but you still need the domain knowledge to identify those critical models.

A plan for checking critical models

Checking the data in critical models involves comparing the data generated from your development branch against production (or staging) data. The process looks like this:

  • Identify critical models — Models that are bottlenecks or important to the business
  • Decide which checks to run on these models — Data should not change in these models, so run structural and data profiling checks
  • Automate the checks — Run the checks in CI with each PR
  • Review check results — As part of PR review, check if there is impact, and whether it’s intended or not

Automating critical model checks

Critical model checks should run with each PR. Here’s how you can do that with the preset checks feature in Recce, by committing a checklist to your data project repo.

Let’s say you want to have the following structural checks run with each PR for your customers model:

  • Row count — You shouldn’t lose any data.
  • Schema — The schema should not change unexpectedly.

These are fundamental checks that all critical models should have, regardless of the type of PR.

Schema and row count checks performed in datarecce.io

You can run these checks in the Recce UI, and any of these checks can also be automated in CI. Here’s how:

1. Generate your check file (recce.yml) and commit it to your project

To add the row count and schema checks to your dbt CI job, you just need to commit a Recce checklist (recce.yml) to your dbt project. From the Recce UI you would:

  1. Add the check to your checklist.
  2. Copy the preset check template to your recce.yml.

Recce.yml in the dbt project root

Then commit the recce.yml in the root of your dbt data project. Each branch will get a copy of these checks, and Recce will run the checks each time a PR is opened.
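For the customers example, the committed recce.yml might look something like the sketch below. The row_count_diff and schema_diff check types are Recce preset checks, but treat the exact keys and node references as assumptions; the template copied from the Recce UI will always be in the correct format for your project and Recce version.

```yaml
# recce.yml: a sketch of preset checks for the customers model.
# Copy the real template from the Recce UI rather than hand-writing it.
checks:
  - name: Row count of customers
    description: The customers model should not lose rows.
    type: row_count_diff
    params:
      node_names:
        - customers

  - name: Schema of customers
    description: The customers schema should not change unexpectedly.
    type: schema_diff
    params:
      node_id: model.your_project.customers  # placeholder node id
```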

2. Run Recce in CI

When Recce runs in CI, the checks in recce.yml are executed automatically. Recce also provides a command to generate a PR summary of the results, which can be posted to the PR comments for reviewers to check.

Run Recce in CI
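What this looks like in practice depends on your CI system and warehouse. As a rough sketch, assuming GitHub Actions, and assuming earlier steps have produced dbt artifacts for both environments (target-base/ for prod, target/ for the PR branch) so Recce has two environments to compare:

```yaml
# .github/workflows/recce-ci.yml: a sketch only; the dbt build steps and
# warehouse credentials are elided and will differ per project.
name: Recce checks
on: pull_request

jobs:
  recce:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dbt and Recce
        run: pip install dbt-core recce  # plus your warehouse adapter

      # Assumption: steps here build target-base/ (prod artifacts) and
      # target/ (PR artifacts) so Recce has two environments to diff.

      - name: Run the preset checks from recce.yml
        run: recce run

      - name: Summarize results
        run: recce summary recce_state.json >> "$GITHUB_STEP_SUMMARY"
```

The same summary output can also be posted as a PR comment with your preferred comment action, so reviewers see it inline.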

3. Review the results

The Recce summary, posted to your PR comments, shows you the following things:

  • A diagram of the impacted lineage — this is just the part of the lineage that has modifications, or is downstream of modified models
  • A list of checks that detected a data mismatch

Review Recce CI Summary

If there is a difference in row count, or the schema has changed, on the customers model, those checks will be listed here.

The idea behind the Recce summary is ‘all signal, no noise’: you only want actionable or useful information. You only need to know when there’s a difference; if everything is the same, there’s no point telling you that, and it would only crowd out the signal.

4. (Optional) Data impact exploration

If one of the checks indicated a data mismatch, or you’ve otherwise seen something you want to inspect, you can download the Recce state file and run Recce in review mode to see the results of the checks.

Download the Recce state file and execute Recce in Review Mode

To perform live checks on the data, you only need to add a dbt_project.yml and profiles.yml; there’s no need to check out the whole dbt project. You can then perform live checks to compare dev with prod/staging.
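Concretely, the local flow is a couple of commands. This is a sketch that assumes the state file was downloaded from the CI run as recce_state.json:

```sh
# Browse the recorded check results from the CI run (no warehouse needed):
recce server --review recce_state.json

# For live checks against the warehouse, first add a minimal dbt_project.yml
# and profiles.yml to the working directory so Recce can connect, then launch
# review mode again and rerun the diffs against dev and prod/staging.
```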

Conclusion

Every data project has critical models. It’s fine to check them ad hoc while you’re working on the project and validating the data generated from your changes, but it’s best practice to identify these critical models and automate checks on them.

  • Identify your critical models
  • Curate a checklist for things you know should not change in these models (project specific)
  • Run this checklist in your PRs with Recce
  • Review impact if necessary


Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects