Next-Level Data Validation Toolkit for dbt Data Projects — Introducing Recce

Dave Flynn
In the Pipeline
Published in
5 min readFeb 22, 2024

Validating data modeling changes and reviewing pull requests for dbt projects can be a challenging task. The difficulty of performing a proper ‘code review’ for data projects, due to both the code and data needing review, means the data validation stage is often omitted, poorly implemented, or drastically slows down time-to-merge for your time sensitive data updates.

How can you maintain data best practices, but speed up the validation and review process?

Recce — The data validation toolkit for dbt projects

Recce (short for reconnaissance) is a data modeling validation toolkit with a focus on environment diffing. Take two dbt environments, such as dev and prod, and compare them using the suite of diff tools in Recce.

Recce makes checking your work easy, and turning Recce checks into validations for PR review enables more comprehensive PR review and faster, error-free, merges.

A diff for every occasion

With Recce you’re able to validate your data modeling changes against a known-good baseline. The only real way to verify modeling changes, is to check it against historical/production data.

Lineage DAG Diff

Start from the zone of impact of your changes, and see which models have been modified, added, and removed. The dbt docs lineage DAG only shows you the current state of the DAG, Recce shows you how the DAG differs from before you made any changes.

Lineage DAG Diff in Recce

Data Profile Diff & Value Diff

Perform holistic checks by diffing the data profile stats for your development branch, then check the percentage of values matching for each column in a model.

Data Profile Diff in Recce

Query Diff

If something needs further investigation, drill down and query the data. One query will run on both environments, and you’ll be able to see the difference on a row-by-row basis. Enable change-only view to see just what’s changed.

Query Diff in Recce

Schema and Row Count

In addition to the above diffs, you can also check the schema and row count, just to be sure you didn’t lose any data, or an important column.

Schema and Row-Count Diff in Recce

Create your checklist

As you create validations, add them to your checklist with notes about what you found, and re-re-run checks if the data changes. When you’re ready, export the checks to your PR comment.

Curate your list of validation checks

Take it to the PR

When you’re ready to share your validations as proof-of-correctness for your work, you can export checks into your PR comment template. You can copy your notes, and export a screenshot of the check as it appears in Recce. By curating the validations for your PR comment, you can create an ‘all-signal, no noise’ comment with the validations that are relevant to the context of your changes.

The reviewer will be able to see the query and results of your data spot-checks and have the comprehensive information required to request further investigation, or sign-off on your changes.

Recce PR comment example

With Recce validations in your PR comment, the data team will all be on the same page about how this PR impacts the data.

Why Recce?

As mentioned above, whether you’re the pull request author, or reviewer, you’ve got the difficult task of understanding what is going on and trying to verify if the intentions of the PR were realized without screwing up production data. Here’s some common issues we’ve heard about working on large, or business critical, dbt data projects:

  • QA for pull requests takes too long — stakeholders want to merge new data features faster.
  • dbt build worked, but the data was actually wrong.
  • I’m sick of downtime from silent errors making it into prod.
  • CI takes too long and current data quality tools are costly to run.
  • I just want to see a summary of what changed for modified models.

If any of these pain points ring true, Recce can help with the code review on your data project.

Open-source and available now

Recce OSS is available on GitHub now. Follow the instructions in our Getting Started guide to start using Recce to validate your data modeling changes.

Try Recce Online

If you want to try Recce out without having to install, check out the demo instance below.

Demo

The demo PR makes a simple change to the dbt’s Jaffle Shop project and changes how customer_lifetime_value (CLV) is calculated (fixes it to only calculated completed orders).

Can you validate this code change using Recce?

The expectation from this change is that CLV will be reduced overall, and that this will also impact the customer segments downstream model, With that in mind, see if you can determine if the if the PR has any issues by checking the data in Recce:

Hint: Run a Profile Diff, then a Query Diff, on the customers model, then check for downstream impact.

Hands-On Data Impact Analysis for dbt Data Projects with Recce

For a hands-on guide through using Recce to perform data impact assessment, check our latest article:

--

--

Dave Flynn
In the Pipeline

Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects