The dbt PR-reviewer’s nightmare

How can you review data model updates without context?

Dave Flynn
In the Pipeline
3 min read · Mar 13, 2024

There was a poll on LinkedIn recently that asked ‘Who uses dbt the most?’. The options were:

  • Data engineers
  • Data analysts
  • Data scientists
  • Show me the results

As of writing, 72% of the 1,400 votes went to data engineers. That’s a huge share for DEs, and I wouldn’t say it’s “wrong”. Data engineers do use dbt a lot, but comments quickly appeared below the poll discussing the responsibilities of various roles and the absence of ‘analytics engineer’, dbt’s newly defined role, from the options.

Spiderman meme with 7 Spidermen all pointing at each other. Who’s the real Spiderman? Joke is about a dbt project having multiple contributors — who’s really using dbt the most?
Who’s got their fingers in the dbt pie?

Whose fingers are in the dbt pie?

The actual right answer? Well, it honestly depends on your company, and the culture of your data team. The fourth answer should have been ‘all data people’ because there really are many teams in which all roles have their fingers in the dbt pie.

It’s easy to jump to the conclusion that the data engineer is the gatekeeper of the dbt project. That is the case in some teams, with the data engineer being responsible for maintaining the pipeline while also responding to requests for data updates from analysts and other teams.

It’s also very common to find teams in which anyone in a data role can help themselves to a new transformation when they need it, or update some logic to meet their new requirements.

In both situations, if you’re the PR reviewer, you face the potentially daunting task of reviewing a PR without being close to the context of the changes.

The PR reviewer’s nightmare

Here’s a nightmare scenario:

  • A data engineer receives a request from a data analyst to update the calculation on a model. They make the change, but don’t have the context to know if it’s correct. They can’t properly validate their work.
  • Round-robin PR reviewer selection lands on you, or you’re the only person available online at the time, but you’re also not close to the business logic. How can you review it?

Hopefully you never find yourself in such a situation, but the key in those cases is to make sure the changes didn’t cause any adverse impact. You do that by ensuring that data that should stay the same is the same. If there are changes, or there should be, then you need to be able to see them, so you can seek out the context and work out whether the impact is correct.
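
To make that idea concrete, here is a minimal sketch in Python of a "what stayed the same, what changed" check. It is only an illustration, not how any particular tool does it: the `diff_rows` helper, the `order_id` key, and the sample rows are all hypothetical, and it assumes you have already pulled the same model from prod and from your dev branch into lists of rows.

```python
from typing import Any

Row = dict[str, Any]


def diff_rows(prod: list[Row], dev: list[Row], key: str) -> dict[str, list]:
    """Compare two snapshots of the same model, matched on a primary key column.

    Hypothetical helper for illustration only.
    """
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}

    # Keys present on only one side are rows the PR removed or added.
    removed = sorted(prod_by_key.keys() - dev_by_key.keys())
    added = sorted(dev_by_key.keys() - prod_by_key.keys())

    # For keys on both sides, list the columns whose values differ.
    changed = []
    for k in sorted(prod_by_key.keys() & dev_by_key.keys()):
        before, after = prod_by_key[k], dev_by_key[k]
        columns = sorted(
            c for c in before.keys() | after.keys() if before.get(c) != after.get(c)
        )
        if columns:
            changed.append((k, columns))

    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data: the same model built from prod and from the PR branch.
prod_orders = [
    {"order_id": 1, "amount": 100, "status": "paid"},
    {"order_id": 2, "amount": 250, "status": "paid"},
    {"order_id": 3, "amount": 75, "status": "refunded"},
]
dev_orders = [
    {"order_id": 1, "amount": 100, "status": "paid"},
    {"order_id": 2, "amount": 225, "status": "paid"},
    {"order_id": 4, "amount": 40, "status": "paid"},
]

report = diff_rows(prod_orders, dev_orders, key="order_id")
print(report)
```

An empty report means the data that should stay the same did stay the same. Anything that does show up, like the changed `amount` on order 2 here, is the reviewer's cue to go and ask for the context behind it.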

Conclusion

It’s common for everyone on a large team to have access to the dbt project. Changes could come from anywhere. In such situations it’s impossible to have all the context all of the time. This makes validating changes and reviewing data project PRs extremely difficult. Sometimes the only real way to know that things didn’t break is to check that data that shouldn’t change didn’t change, and so maintain the stability of the pipeline.

It’s time for better PR review

We’re working on Recce, a data validation toolkit for dbt data projects, designed especially for validating models during PR review. With Recce you can run multiple checks on your data before merging, comparing dev and prod to ensure that the PR you’re reviewing isn’t going to screw up production data. No more downtime and overtime debugging the root cause of a bad merge, because you’ll have done your due diligence in the PR review process.

Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects