The anatomy of a perfect pull-request comment for dbt data projects

Published in

In the Pipeline

5 min read, Dec 20, 2023

What does the ideal pull request comment look like for a dbt data project?

In previous posts I’ve written about the unique challenges of applying software engineering best practices to data projects. Putting the code into version control is only the first step. There’s code review, of course, but also ‘data review’, and that brings its own challenges for both the pull request (PR) submitter and the reviewer.

The anatomy of the perfect PR comment for dbt (Image by author)

As the reviewer on a dbt data project pull request, you just want to sign off on it and give your approval to merge. To do that, you need evidence that sufficient testing has been carried out, and that any impact is justified and within the scope of the pull request’s intention. Providing this proof is the responsibility of the submitter (developer) which, in this case, could be anyone on the data team: a data engineer, analytics engineer, data scientist, or anyone else who is actively developing in your dbt project.

‘Thoughtful PR Review’ is now a requirement for data jobs

PR review is the ‘ Point of no Return’ — The last checkpoint before code is merged and prod data is changed. How do you…


PR Comment sections

Let’s look at each of the sections required for a decent pull request comment to get that approval to merge. We’ll come at this as if we are the PR submitter because, in most teams, the reviewer is also a developer on the project, and roles likely switch back and forth.

Description/Summary

You need a description that details your intent: what you planned to do and why you did it. Include the expected outcome and impacts, together with any related information that will help the reviewer understand where to focus their attention.

Category/Type of change

Together with the description, define the type of change that this is:

Refactoring
Breaking change
Bugfix
Documentation
Project-specific item
etc.

It could be more than one.

Related Issues

If this pull request is a bug fix, or is a response to an open issue, list and link to the relevant ticket(s) here. This could be a link to the GitHub issue, Jira ticket, or whatever issue tracking platform you use.

Lineage

The reviewer first wants to see the zone of impact of the changes, and that’s where lineage comes in: a visual representation of which branch of the pipeline is affected and which downstream resources might be impacted.

From the lineage, the reviewer can form some initial expectations about what should have been tested, and then check that the necessary tests have been done.
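To make the idea of a zone of impact concrete, here is a minimal Python sketch that walks a lineage graph to find everything downstream of a changed model. The model names and the `CHILDREN` mapping are invented for illustration; in a real project you would derive the graph from your own dbt lineage rather than hard-code it.

```python
from collections import deque

# Hypothetical lineage: each model maps to the models that select from it.
CHILDREN = {
    "stg_orders": ["orders"],
    "stg_customers": ["customers"],
    "orders": ["customers", "revenue_report"],
    "customers": ["customer_segments"],
    "revenue_report": [],
    "customer_segments": [],
}


def downstream_of(changed, children):
    """Return every model reachable downstream of the changed models."""
    seen = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    # Changed models are reported in their own section, so exclude them here.
    return sorted(seen - set(changed))


impacted = downstream_of(["orders"], CHILDREN)
print(impacted)
```

A list like this gives the reviewer a quick sanity check: any critical model that shows up in it should also show up in the checks section.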

Changed/Updated models

You should provide a list of models that have been modified, removed, or added. To reduce background noise, and depending on the size of the changes, a simple count of models, and the percentage of the project this represents, might suffice. Again, this is to help the reviewer understand the scope of the changes.
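As a rough sketch of the "count and percentage" summary, the hypothetical helper below turns three lists of model names into a single line for the PR comment. The function name, its arguments, and the choice to measure against the total number of models in the project are all assumptions made for illustration.

```python
def summarize_changes(added, removed, modified, total_models):
    """Build a one-line scope summary for the PR comment.

    total_models is assumed to be the number of models in the project.
    """
    touched = len(added) + len(removed) + len(modified)
    # Guard against an empty project to avoid dividing by zero.
    pct = 100 * touched / total_models if total_models else 0.0
    return (
        f"{touched} of {total_models} models changed ({pct:.1f}%): "
        f"{len(added)} added, {len(removed)} removed, {len(modified)} modified"
    )


print(summarize_changes(["orders_daily"], [], ["orders", "customers"], 40))
```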

Schema Changes

With schema changes, we want more information than just the number of models. As this is a change to the structure of the data, each model with a schema change should be explicitly listed, along with the type of change that occurred.
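One way to produce that explicit list is to compare each model's columns before and after the change. The sketch below assumes you already have the two schemas as simple column-to-type mappings; the column names and types are made up, and how you fetch them depends on your warehouse.

```python
def diff_schema(base, current):
    """Compare two {column: type} mappings and list each change."""
    changes = []
    for col in sorted(set(base) | set(current)):
        if col not in base:
            changes.append((col, "added", current[col]))
        elif col not in current:
            changes.append((col, "removed", base[col]))
        elif base[col] != current[col]:
            changes.append((col, "type changed", f"{base[col]} -> {current[col]}"))
    return changes


# Hypothetical schemas for one model, before and after the PR.
base = {"id": "integer", "amount": "integer", "note": "text"}
current = {"id": "integer", "amount": "numeric", "status": "text"}

for column, change, detail in diff_schema(base, current):
    print(f"{column}: {change} ({detail})")
```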

Checks carried out/How it was tested

Arguably the most important section of the pull request comment is the list of tests and checks that you’ve carried out. This is how you prove to the reviewer that you’ve done your due diligence before submitting.

A check is any validation of impact, or lack thereof, that you did during development. These mainly take the form of queries that you’ve run against the new model transformations, along with how the results meet your expectations. On refactoring jobs, or when critical models are in the zone of impact, you should also compare query results from the production and development branches to check for unintended impact.
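A production-versus-development comparison can be as simple as a row count plus a set difference in each direction. The sketch below uses Python's built-in `sqlite3` with two tiny in-memory tables standing in for the prod and dev builds of a model; the table names and data are invented, and a real check would run the same kind of query against your warehouse.

```python
import sqlite3

# In-memory stand-ins for the production and development builds of a model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_orders (order_id INTEGER, amount REAL);
    CREATE TABLE dev_orders  (order_id INTEGER, amount REAL);
    INSERT INTO prod_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO dev_orders  VALUES (1, 10.0), (2, 25.0), (3, 30.0);
""")


def row_count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def compare_tables(conn, base, target):
    """Return row counts plus the rows that exist on only one side.

    Table names are interpolated directly, so pass trusted names only.
    """
    only_base = conn.execute(
        f"SELECT * FROM {base} EXCEPT SELECT * FROM {target}"
    ).fetchall()
    only_target = conn.execute(
        f"SELECT * FROM {target} EXCEPT SELECT * FROM {base}"
    ).fetchall()
    return {
        "base_rows": row_count(conn, base),
        "target_rows": row_count(conn, target),
        "only_in_base": only_base,
        "only_in_target": only_target,
    }


result = compare_tables(conn, "prod_orders", "dev_orders")
print(result)
```

Here the row counts match but one order carries a different amount in each environment. That is exactly the kind of result worth pasting into the PR comment, with a sentence on whether the difference is intended.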

These checks also help reduce the overhead for the PR reviewer, who would have difficulty (or just not have the time or context) in spinning up the development environment to check stuff. Plus, these checks are things that you would naturally do during development, so rather than waste all that hard work, you should pass on the tests for the reviewer to see.

The ideal checklist should contain:

Queries that verify data in modified and downstream models.
Accompanying descriptions of the results of any queries and what they mean.
Profiling results and/or profile-diff for critical models.
Notes or explanations to cover any changes in schema, lineage, row count, etc. that were shown above.

All signal, no noise

It seems like a lot of information to include, so the important thing to remember is to only include essential information. That will differ depending on your project, but if you find yourself pasting a list of 50 potentially impacted models without context, that’s probably just noise and will do more harm than good for the reviewer trying to decipher how your PR is impacting the project.

The first half of the PR comment is for you to set the scene, define your intentions, and connect any related info. The second half is the impact of your changes, with the tests and checks that prove that your intention was successfully realized.
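If you want to keep that two-part structure consistent across PRs, a small template helper can assemble the comment for you. The sketch below is only one possible approach: the section titles follow the headings in this post, while the function name, the markdown-style `##` headings in the output, and the rule that empty sections are dropped are all assumptions.

```python
# Section titles in the order discussed above: context first, then impact.
SECTION_ORDER = [
    "Description",
    "Type of change",
    "Related issues",
    "Lineage",
    "Changed models",
    "Schema changes",
    "Checks carried out",
]


def build_pr_comment(sections):
    """Assemble a PR comment, skipping sections with no content."""
    parts = []
    for title in SECTION_ORDER:
        body = sections.get(title, "").strip()
        if not body:
            continue  # an empty section is noise, so leave it out
        parts.append(f"## {title}\n\n{body}")
    return "\n\n".join(parts)


comment = build_pr_comment({
    "Description": "Refactor the orders model to remove duplicated joins.",
    "Type of change": "Refactoring",
    "Checks carried out": "Row counts match between prod and dev.",
})
print(comment)
```

Dropping empty sections keeps the comment in line with the "all signal, no noise" rule: a heading with nothing under it tells the reviewer nothing.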

Other potentially useful sections

Depending on your project, or type of change, there may be other sections you can use to help streamline the process, such as:

Alerts or items that might need particular attention from the reviewer. Does this change impact the project in an unexpected way? Is there something related that needs to be done before this can be merged?
Stakeholder sign-off — If any impacts occurred in exposures or critical nodes, you might need to get stakeholders to ‘sign off’ on those changes. You could list the names/tag these folk here.
Follow-up or outstanding items that need to be completed or checked after the changes are merged.
Screenshots or a video demo of anything that would help the reviewer understand the changes better.

Make your own Perfect Data-Project Pull Request Comment

Build your own perfect pull request comment using the data modeling validation checks that come with Recce. Here’s a Loom I recorded about it. (Skip to around 2:15 to see Recce in action).

Recce is open-source and available on GitHub now:

GitHub — DataRecce/recce: PR review tool designed for DBT projects


What do your data project PR comments look like?

Did I miss anything important? Or do you structure your dbt data project PR comments in a better way? Software engineering best practices are relatively new to data projects, so if you have any tips, share ’em and let’s help each other make data more reliable!