‘Thoughtful PR Review’ is now a requirement for all data jobs

How do you review your data project PRs?

Dave Flynn
In the Pipeline
5 min read · Apr 30, 2024

Experience performing comprehensive PR review is now a required skill for all data roles, from data engineer to data scientist. Your data team is probably adopting software engineering practices like version control and formalized PR review, but there’s a twist: you’ve got to review the data as well as the code.

A mock classified ad from a newspaper wanted section looking for a dbt PR reviewer
Must be able to comfortably use Git and give thoughtful PR reviews

I saw this post on LinkedIn recently by Nirmal Budhathoki, a senior Data Scientist at Microsoft, that discussed Git and PR review as ‘realistic’ skills that data scientists need to adopt:

…we came across a unique DS job (ad) that seems to be interesting and more realistic…It actually talks about GIT and PR reviews. This is one of the skills data scientists need to inherit from software engineering…

Nirmal Budhathoki (Senior Data Scientist @ Microsoft)

Nirmal really hit the nail on the head. Good PR review is often taken for granted, but the truth is that the ability to understand the data impact of a code change is a core skill for any role on a data team, whether or not it’s explicitly mentioned in the job ad.

Why PR review is so important for dbt data pipelines

When reviewing a PR, think of ‘PR’ as meaning ‘Point of no Return’ — as the reviewer you’re the last checkpoint before code is merged and prod data is changed.

Doc from Back to the Future 3 pointing at the ‘point of no return’ sign on the model train track to demonstrate the Delorean being pushed by a train to reach 88mph
Back to the Future III — Universal Pictures

The PR is the point of no return, the fail-safe point. Once you pass it, it’s ‘prod’ or bust!

Merging bad data into prod could mean days of downtime while you roll back code. This is in addition to the impact that the bad data had on the business. You need guardrails in place to validate changes and check the data before signing off and merging code. This is what PR review is for.

Self-serve teams are growing

You’ve probably found that your data team is growing and more people from different roles are accessing your data project. A PR could come from anywhere, so reviewing it is a complex task:

  • Did the submitter do any validation?
  • Is the SQL optimized?
  • Is the business logic correct?
  • What’s the impact on the data?

The increased activity in your data project, and self-serve nature of dbt, means that PRs-per-week (PPW? shall we coin a new initialism?) is on the rise.

PRs-per-week are up

Across some public dbt projects, peak PRs per week range from 9 for Open Source Observer, to 11 for Mattermost, to 18 for Cal-ITP. I’ve also spoken to some data folk who reported up to 100 PRs per week.
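If you want to work out the same number for your own project, here’s a minimal sketch of the arithmetic. The dates below are made up for illustration; in practice you’d pull PR creation dates from your Git host, and `peak_prs_per_week` is just a name I’ve picked for this example.

```python
from collections import Counter
from datetime import date


def peak_prs_per_week(opened_dates):
    """Return the highest number of PRs opened in any single ISO week."""
    # Group by (ISO year, ISO week) so weeks that span a year boundary stay intact.
    weekly_counts = Counter(tuple(d.isocalendar())[:2] for d in opened_dates)
    return max(weekly_counts.values(), default=0)


# Hypothetical PR creation dates, for illustration only.
sample = [
    date(2024, 4, 1), date(2024, 4, 2), date(2024, 4, 5),  # same Mon-Sun week
    date(2024, 4, 8),                                       # following week
    date(2024, 4, 15), date(2024, 4, 16),                   # week after that
]

print(peak_prs_per_week(sample))  # prints 3
```

Counting by ISO week is one reasonable choice; a rolling 7-day window would give slightly different peaks.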

These are just a couple of reasons that better PR review is essential. The challenge now is to stay on top of the PR backlog while still giving each PR a high-quality review, catching errors before merging rather than firefighting afterwards.

How to Review PRs for dbt data projects

Ideally, you should have buy-in from the whole team, and implement a dbt pull request template that will enable your team to replicate a comprehensive PR review each time, with a predefined checklist of tasks. If you don’t have buy-in, and you’re left as the gatekeeper to merges, you can still add guardrails and validate changes yourself.

Intent and expectation

If you boil it down, you get two things from a good PR comment: the intent of the author and the expected impact. It’s the expected impact that you need to validate, and this is where the unique aspect of reviewing a PR on a data project comes into play: understanding exactly what the data impact is. Once you know the expected impact, you know what to look for and what you shouldn’t see.

You need more than dbt tests

You probably already have a series of dbt tests for your models. They’re an essential aspect of data quality and consistency, but maintaining them can quickly get out of hand, and you’ll end up implementing complex custom solutions to manage triaging alerts. Plus, you’ll quickly realize that dbt tests are not enough. They can tell you if the data meet some predetermined requirement, but even if all dbt tests pass, does that mean the data are “correct”?
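To make that concrete, here’s a small sketch using Python’s built-in sqlite3 module. It is not how dbt runs tests; the tables, values, and the `passes_basic_tests` helper are all hypothetical stand-ins for a prod model, a dev model, and a pair of not-null and unique checks on the key column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_orders (order_id INTEGER, amount REAL);
    CREATE TABLE dev_orders  (order_id INTEGER, amount REAL);
    INSERT INTO prod_orders VALUES (1, 100.0), (2, 250.0), (3, 75.0);
    -- Hypothetical logic change in the dev branch alters one amount.
    INSERT INTO dev_orders  VALUES (1, 100.0), (2, 25.0),  (3, 75.0);
""")


def passes_basic_tests(table):
    """Mimic not-null and unique checks on order_id."""
    nulls, total, distinct = conn.execute(
        f"SELECT SUM(order_id IS NULL), COUNT(*), COUNT(DISTINCT order_id) FROM {table}"
    ).fetchone()
    return nulls == 0 and total == distinct


def total_amount(table):
    """Sum the amount column, a simple aggregate to compare across environments."""
    return conn.execute(f"SELECT SUM(amount) FROM {table}").fetchone()[0]


for table in ("prod_orders", "dev_orders"):
    print(table, passes_basic_tests(table), total_amount(table))
```

Both tables pass the checks, yet the totals differ. Whether that difference is intended is exactly the question a reviewer has to answer, and no schema-level test will answer it for you.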

Complex tools are complex

dbt_audit_helper and dbt_profiler can help, but configuring them for each project and PR takes a lot of time and doesn’t scale. Doing it for every PR would be a full-time job in itself, especially if you want to compare data models from prod with your dev branch (before and after modeling changes).

dbt data validation checklist

This is where Recce comes in. Recce helps you validate data impact during development and as part of PR review by providing the mechanism to compare data or queries between two dbt environments and:

  • Create a checklist of data validations with annotations for added context
  • Save and re-run those checks in CI
  • Provide an automated environment to review checks and drill down into data when necessary
A gif showing the checklist feature of Recce for saving data validation checks
Recce Checklist: Data Impact Visibility

Data impact visibility and context

Your data team can use Recce to get visibility into the impact of data modeling changes before merging into production by checking and comparing data from your development branch with production (or other known good data). As the PR reviewer, this gives you the confidence you need that a merge isn’t going to break prod.

  • No more reviewing PRs without context
  • No more struggling to craft queries to compare your data
  • No more firefighting data issues from bad merges
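To give a feel for the kind of comparison involved, here’s a minimal, tool-agnostic sketch of a row-level diff between two environments. This is not Recce’s implementation; `diff_rows` and the sample rows are hypothetical, and each environment is assumed to fit in memory as a mapping from primary key to row.

```python
def diff_rows(prod, dev):
    """Compare two {primary_key: row} mappings and classify the differences."""
    added = sorted(dev.keys() - prod.keys())      # keys only in dev
    removed = sorted(prod.keys() - dev.keys())    # keys only in prod
    changed = sorted(
        key for key in prod.keys() & dev.keys() if prod[key] != dev[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical rows keyed by customer_id, for illustration only.
prod_customers = {
    1: {"name": "Ada", "lifetime_value": 120.0},
    2: {"name": "Grace", "lifetime_value": 80.0},
    3: {"name": "Linus", "lifetime_value": 45.0},
}
dev_customers = {
    1: {"name": "Ada", "lifetime_value": 120.0},
    2: {"name": "Grace", "lifetime_value": 95.0},   # value changed
    4: {"name": "Margaret", "lifetime_value": 10.0},  # new row
}

print(diff_rows(prod_customers, dev_customers))
```

A reviewer can then check the result against the PR’s stated intent: if the author only expected `lifetime_value` to change for existing customers, the added and removed keys are worth a question.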

CI Automated Checklist

Run Recce in CI with your preset checks and Recce will attach an environment state file to your PR comment.

  1. Download the state file
  2. Run Recce in Review mode
  3. View your data validation checklist
Graphic showing the 3 steps to comprehensive data project PR review with Recce: (1) Download the Recce state file (2) Run Recce in Review mode (3) Review the Recce checklist
3 steps to comprehensive data project PR review

Manage multiple PRs

Using this method you can quickly and efficiently review multiple PRs without needing to set up the dbt project, or configure any complex tools.

Let’s say you had 10 PRs to review. All you need to do is:

  1. download the Recce state file for each PR
  2. run Recce Server specifying the state file
  3. review the checklist
  4. load the next PR
  5. repeat.

If you need to perform ‘data drill-down’ to find the root cause of a data issue, just add your dbt project’s profiles.yml and dbt_project.yml and Recce will connect to your data source so you can compare dev and prod in a live environment.
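If you’d like to script the loop above, here’s a rough sketch that builds one command per downloaded state file. Treat the details as assumptions: the `--review` flag is my reading of “Review mode” (confirm it with `recce server --help` for your installed version), the file names are hypothetical, and `review_commands` is a helper invented for this example. It only builds the commands and does not run them.

```python
def review_commands(state_files):
    """Build one 'recce server' command per downloaded state file.

    Assumes review mode is started with `recce server --review <state file>`;
    check the CLI help for your Recce version before relying on this.
    """
    return [["recce", "server", "--review", str(path)] for path in sorted(state_files)]


# Hypothetical state files, one per PR, saved under names of your choosing.
downloaded = ["pr-b-recce_state.json", "pr-a-recce_state.json"]

for command in review_commands(downloaded):
    print(" ".join(command))
```

You could hand each command to `subprocess.run` to start the server, review the checklist, then stop it and move on to the next PR.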

Where to get Recce

Recce is open-source, and you can get started today by installing the package with pip:

pip install -U recce

Here are some links to get you started:


Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects