‘Thoughtful PR Review’ is now a requirement for all data jobs
How do you review your data project PRs?
Experience performing comprehensive PR review is now a required skill for all data roles, from data engineer to data scientist. Your data team is probably adopting software engineering practices like version control and formalized PR review, but there’s a twist: you’ve got to review the data as well as the code.
I saw this post on LinkedIn recently by Nirmal Budhathoki, a senior Data Scientist at Microsoft, that discussed Git and PR review as ‘realistic’ skills that data scientists need to adopt:
…we came across a unique DS job (ad) that seems to be interesting and more realistic…It actually talks about GIT and PR reviews. This is one of the skills data scientists need to inherit from software engineering…
— Nirmal Budhathoki (Senior Data Scientist @ Microsoft)
Nirmal really hit the nail on the head. Good PR review is taken for granted, and the truth is that the ability to understand the data impact of a code change is a core skill for any role on a data team, whether or not it’s explicitly mentioned in a job ad.
Why PR review is so important for dbt data pipelines
When reviewing a PR, think of ‘PR’ as meaning ‘Point of no Return’ — as the reviewer you’re the last checkpoint before code is merged and prod data is changed.
The PR is the point of no return, the fail-safe point: once you pass it, it’s ‘prod’ or bust!
Merging bad data into prod could mean days of downtime while you roll back code. This is in addition to the impact that the bad data had on the business. You need guardrails in place to validate changes and check the data before signing off and merging code. This is what PR review is for.
Self-serve teams are growing
You’ve probably found that your data team is growing and more people from different roles are accessing your data project. A PR could come from anywhere, so reviewing it is a complex task:
- Did the submitter do any validation?
- Is the SQL optimized?
- Is the business logic correct?
- What’s the impact on the data?
The increased activity in your data project and the self-serve nature of dbt mean that PRs-per-week (PPW? Shall we coin a new initialism?) are on the rise.
PRs-per-week are up
Analyzing some public dbt projects, peak PRs per week range from 9 for Open Source Observer, to 11 for Mattermost, and 18 for Cal-ITP. I’ve also spoken to some data folk who reported up to 100 PRs per week.
These are just a couple of reasons why better PR review is essential. The challenge now is to stay on top of the PR backlog while still maintaining a high-quality review of each PR, catching errors before merging to avoid firefighting afterwards.
How to Review PRs for dbt data projects
Ideally, you should have buy-in from the whole team, and implement a dbt pull request template that will enable your team to replicate a comprehensive PR review each time, with a predefined checklist of tasks. If you don’t have buy-in, and you’re left as the gatekeeper to merges, you can still add guardrails and validate yourself.
Intent and expectation
If you boil it down, you get two things from a good PR comment: the intent of the author and the expected impact. It’s the expected impact that you need to validate, and this is where the unique aspect of reviewing a PR on a data project comes into play: understanding exactly what the data impact is. Once the intent and expected impact are stated, you know what to look for and what you shouldn’t see.
You need more than dbt tests
You probably already have a series of dbt tests for your models. They’re an essential aspect of data quality and consistency, but maintaining them can quickly get out of hand, and you’ll end up implementing complex custom solutions to manage triaging alerts. Plus, you’ll quickly realize that dbt tests are not enough. They can tell you whether the data meets some predetermined requirement, but even if all dbt tests pass, does that mean the data is “correct”?
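To make that gap concrete, here is a minimal plain-Python sketch, not dbt itself. The helpers `passes_not_null` and `passes_unique` are hypothetical stand-ins that mimic the spirit of dbt's `not_null` and `unique` tests, and the `order_id` and `amount` rows are made-up sample data. Both datasets pass every check, yet the totals differ.

```python
def passes_not_null(rows, column):
    """Mimics a not_null test: every row has a value in `column`."""
    return all(row[column] is not None for row in rows)


def passes_unique(rows, column):
    """Mimics a unique test: no value in `column` appears twice."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


# Hypothetical sample data: prod is the known-good baseline,
# dev is the same model after a logic change in the PR.
prod = [{"order_id": 1, "amount": 100.0}, {"order_id": 2, "amount": 250.0}]
dev = [{"order_id": 1, "amount": 100.0}, {"order_id": 2, "amount": 25.0}]

# Record whether each environment passes both tests.
test_results = {
    name: passes_not_null(rows, "amount") and passes_unique(rows, "order_id")
    for name, rows in (("prod", prod), ("dev", dev))
}

# The tests say nothing about whether the values themselves moved.
prod_total = sum(row["amount"] for row in prod)
dev_total = sum(row["amount"] for row in dev)

print(test_results)
print(f"prod total: {prod_total}, dev total: {dev_total}")
```

Only a side-by-side comparison of the two environments reveals that the total moved, and only the PR author's stated intent tells you whether that movement was expected.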
Complex tools are complex
dbt_audit_helper and dbt_profiler can help, but configuring them for each project and PR takes a lot of time and doesn’t scale. Doing so for every PR would be a full-time job in itself, especially if you want to compare data models from prod with your dev branch (before and after modeling changes).
dbt data validation checklist
This is where Recce comes in. Recce helps you validate data impact during development and as part of PR review by letting you compare data or queries between two dbt environments. With Recce you can:
- Create a checklist of data validations with annotations for added context
- Save and re-run those checks in CI
- Review checks in an automated environment and drill down into data when necessary
Data impact visibility and context
Your data team can use Recce to get visibility into the impact of data modeling changes before merging into production by checking and comparing data from your development branch with production (or other known good data). As the PR reviewer, this gives you the confidence you need that a merge isn’t going to break prod.
- No more reviewing PRs without context
- No more struggling to craft queries to compare your data
- No more firefighting data issues from bad merges
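To illustrate what comparing development data with production means in practice, here is a minimal plain-Python sketch of a row-level diff keyed on a primary key. It only illustrates the concept and is not how Recce implements its comparisons. The `diff_by_key` helper, the `customer_id` key, and the sample rows are all hypothetical.

```python
def diff_by_key(base_rows, target_rows, key):
    """Compare two row sets by primary key and report what moved."""
    base = {row[key]: row for row in base_rows}
    target = {row[key]: row for row in target_rows}
    return {
        # Keys present only in the target (dev) environment.
        "added": sorted(target.keys() - base.keys()),
        # Keys present only in the base (prod) environment.
        "removed": sorted(base.keys() - target.keys()),
        # Keys present in both whose row contents differ.
        "changed": sorted(
            k for k in base.keys() & target.keys() if base[k] != target[k]
        ),
    }


# Hypothetical sample data standing in for a prod and a dev model.
prod_rows = [
    {"customer_id": 1, "lifetime_value": 120},
    {"customer_id": 2, "lifetime_value": 80},
    {"customer_id": 3, "lifetime_value": 45},
]
dev_rows = [
    {"customer_id": 1, "lifetime_value": 120},
    {"customer_id": 2, "lifetime_value": 95},
    {"customer_id": 4, "lifetime_value": 10},
]

result = diff_by_key(prod_rows, dev_rows, "customer_id")
print(result)
```

A reviewer would compare a summary like this against the PR's stated intent: a change described as a pure refactor should not add, remove, or change any rows.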
CI Automated Checklist
Run Recce in CI with your preset checks, and Recce will attach an environment state file to your PR comment.
- Download the state file
- Run Recce in Review mode
- View your data validation checklist
Manage multiple PRs
Using this method you can quickly and efficiently review multiple PRs without needing to set up the dbt project, or configure any complex tools.
Let’s say you had 10 PRs to review. All you need to do is:
- download the Recce state file for each PR
- run Recce Server specifying the state file
- review the checklist
- load the next PR
- repeat.
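The loop above can be sketched as a small helper that builds one review command per downloaded state file. This is a sketch under assumptions: the `recce server --review <state file>` form reflects my understanding of Recce's CLI and should be confirmed with `recce server --help`, while the `review_commands` helper and the `pr_101.json` style file names are hypothetical. The script only prints the commands and does not launch anything.

```python
import shlex
import tempfile
from pathlib import Path


def review_commands(state_dir):
    """Build one review-mode command per state file found in state_dir."""
    state_files = sorted(Path(state_dir).glob("*.json"))
    # Assumed CLI form; verify the flag against your installed Recce version.
    return [["recce", "server", "--review", str(path)] for path in state_files]


# Demo using empty placeholder files in place of real downloaded state files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("pr_101.json", "pr_102.json"):
        (Path(tmp) / name).touch()
    commands = review_commands(tmp)
    for command in commands:
        # Print each command so it can be run one PR at a time.
        print(shlex.join(command))
```

Running the commands one at a time, rather than in parallel, matches the workflow in the list: review one checklist, close the server, then load the next PR.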
If you need to perform ‘data drill-down’ to find the root cause of a data issue, just add your dbt project’s profiles.yml and dbt_project.yml and Recce will connect to your data source so you can compare dev and prod in a live environment.
Where to get Recce
Recce is open-source, and you can get started today by installing the package with pip:
pip install -U recce
Here are some links to get you started:
- 5-minute Jaffle Shop tutorial
- Use Recce in CI
- Online Demo