Shift Left and Maintain Context — Improving the Pull Request Review Process for Data Projects

Dave Flynn
In the Pipeline
Dec 7, 2023

We’ve been working on PipeRider for over a year now, and while we think that Impact Assessment is the right way to go, there is still lots of room for improving the quality assurance (QA) process for pull requests (PR) and fixing that ‘last mile’ before code (and data) enters production.

Speed up time to merge with a checklist of impacts for review

The unique challenge for data projects

There’s been a lot of talk recently about software engineering (SE) best practices and how to apply them to data projects, and this goes a lot further than just putting the transformation code under source control. Data projects pose a unique challenge: not only does the code need to be reviewed, but so does the resulting data, and for critical applications that data needs to be checked against production for unintended impact. Possibly most importantly, the reviewer needs to know that all impacts were accounted for; otherwise the whole process is a charade.

What your current PR process probably looks like

To improve the PR process, we need to start with the challenges faced by teams attempting to implement SE best practices: how modeling changes are communicated along the way, how the reviewer can understand the intention of the developer, and what the data impacts actually mean.

If we split the roles up for people who are involved in a PR we get:

  • Developer
  • Quality Assurance
  • Pull Request Reviewer

The roles may or may not be distinct people, and the lines between roles may blur, but for each task the person wears a different hat with a different set of responsibilities.

What most likely happens now goes something like this:

  1. The developer makes a modeling change (testing the changes with ad-hoc queries) and submits a pull request
  2. The quality assurance person checks the data impact (running queries against production data)
  3. Automations run, including tests and other programmatic QA
  4. The reviewer checks everything before putting the final stamp of approval on the merge

What’s wrong with it

It’s a good process, but it’s not foolproof. When the reviewer comes to check everything, they are faced with:

  • No way to know whether the QA checks cover all impacts — what was the complete impact radius of the changes, and do the QA queries cover all of it?
  • The QA stage can be slow and error-prone
  • Standard tests are lacking — they’re hard to keep updated, and they don’t help you pinpoint the problem
  • Too much noise in the automated checks — what does a difference of x rows between production and development mean? Is it acceptable? What can the reviewer do with a list of 50 models possibly affected by a modeling change?

A better solution

To successfully sign off on a pull request, the reviewer needs the following things:

  • Context for review — what was the intention of the developer, what were the expected impacts, what QA has taken place, and what were the results.
  • A way to visualize the zone of impact — checking the code alone isn’t enough; it’s better to see the impacted lineage.
  • A prioritized list of actionable impacts — as opposed to a laundry list of models or an ambiguous list of changes.

How do we get there?

Shift left, maintain context

The solution starts by shifting left, to the developer. The developer knows what they are doing and usually has some expectation of how their changes will impact the data, so the QA queries for reviewing changes should start here.

A checklist of impacts

During development, the developer is testing their changes on-the-fly, essentially answering the questions that QA and the reviewer will ask later. We need a way to save these ad-hoc queries so they can be run again later, as part of QA, against production data, and even automated, so that QA has the results on hand. QA can re-run the queries, tweak them, and add more, further bolstering the ad-hoc testing. The queries should also be annotated by the developer: if some change has taken place, the developer can explain the impact and provide context for reviewers down the line. These query-diffs and checks become the beginning of a checklist of impacts.
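
As a rough illustration of what one of those saved query-diffs could look like, here’s a minimal sketch that runs the same ad-hoc query against a development and a production schema and compares the aggregates. The connection string, schema names, table, and columns are hypothetical placeholders, not part of any particular tool.

```python
# A minimal sketch, not PipeRider/Recce code: run the same saved ad-hoc query
# against the dev and prod schemas and diff the aggregates. The connection
# string, schema names, table, and columns are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder warehouse connection; swap in your own (Snowflake, DuckDB, etc.)
engine = create_engine("duckdb:///warehouse.duckdb")

QUERY = """
    SELECT customer_segment,
           COUNT(*)    AS n_orders,
           SUM(amount) AS revenue
    FROM {schema}.orders
    GROUP BY customer_segment
"""


def query_diff(dev_schema: str, prod_schema: str) -> pd.DataFrame:
    """Run the saved query against both environments and return a side-by-side diff."""
    dev = pd.read_sql(QUERY.format(schema=dev_schema), engine)
    prod = pd.read_sql(QUERY.format(schema=prod_schema), engine)
    merged = dev.merge(prod, on="customer_segment", suffixes=("_dev", "_prod"))
    merged["revenue_delta"] = merged["revenue_dev"] - merged["revenue_prod"]
    return merged


if __name__ == "__main__":
    # The developer's annotation is the context the reviewer needs later, e.g.
    # "only the retail segment should change after the recalculation fix"
    print(query_diff("dev_analytics", "prod_analytics"))
```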

Intelligent impact assessment

The next part of the impact-assessment puzzle should come in the form of automated assessment. As mentioned above, though, a laundry list of models or random differences isn’t actionable. Based on the modeling changes, it should be possible to automate the prioritization of which models need inspection: for instance, those with critical downstream uses, defined exposures, or many parents and children, as a start.
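
Purely as an illustration (not how any particular tool implements it), a prioritization pass over a lineage graph might look something like the sketch below; the model names, exposure set, and scoring rule are made up.

```python
# A minimal sketch, not Recce/PipeRider internals: given a lineage graph and the
# models changed in a PR, rank the impacted downstream models so that the ones
# most likely to matter surface first. Names, exposures, and scoring are made up.
from collections import deque

# parent model -> direct child models (hypothetical dbt-style lineage)
LINEAGE = {
    "stg_orders": ["orders"],
    "orders": ["finance_daily", "customer_ltv"],
    "finance_daily": [],
    "customer_ltv": [],
}
EXPOSURES = {"finance_daily"}  # models feeding dashboards or other exposures


def impacted_models(changed: list[str]) -> set[str]:
    """Breadth-first walk downstream from the changed models."""
    seen: set[str] = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in LINEAGE.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def prioritize(changed: list[str]) -> list[str]:
    """Crude scoring: exposures first, then models with the most children."""
    impacted = impacted_models(changed)
    return sorted(
        impacted,
        key=lambda m: (m in EXPOSURES, len(LINEAGE.get(m, []))),
        reverse=True,
    )


print(prioritize(["stg_orders"]))  # ['finance_daily', 'orders', 'customer_ltv']
```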

Keep it in the comments

All of this should be handled within the pull request comments, to help maintain context throughout the process. If the reviewer wants to check the results of a QA query, they should be able to run it and see the results. If they want to drop into the data and perform their own query-diff between production and the development data, it should be a seamless experience.
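
One way to keep results in the thread, sketched here only as an assumption about the plumbing rather than a finished design, is to have the automation post each check’s summary back to the pull request as a comment via the GitHub REST API. The repository name, PR number, and comment body below are placeholders.

```python
# A minimal sketch of keeping results in the PR thread: post a check's summary as
# a pull request comment through the GitHub REST API. The repository, PR number,
# and comment body are placeholders; error handling is kept to the bare minimum.
import os

import requests


def post_pr_comment(repo: str, pr_number: int, body: str) -> None:
    # Pull request comments are created via the issues endpoint in the GitHub API
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    response = requests.post(url, json={"body": body}, headers=headers)
    response.raise_for_status()


if __name__ == "__main__":
    post_pr_comment(
        "acme/analytics-dbt",  # hypothetical repository
        123,                   # hypothetical PR number
        "Query-diff: revenue by segment matches production within 0.1% (see checklist).",
    )
```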

The magic will come in the mechanism that puts it all together and packages it in a way that it can all be managed from the pull request comments.

We’re working on it

It’s been a long road of research, interviews, and trial and error, and we’re still going. Our aim is to provide the mechanism that gives the reviewer the best possible experience for understanding the context and impact of modeling changes on data, so they can sign off on those changes knowing that production data is not adversely impacted.

We’re really interested to hear about your experiences reviewing pull requests and how that process is working for you — is bad data or code making its way into production, even though you’ve got tests running in CI and a QA stage?

If you’re interested in working with us to craft the solution, leave a comment below, or email us at product@infuseai.io

Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects