dbt best-practices are in, but merge-times are up

Dave Flynn · In the Pipeline · Feb 7, 2024

Have you noticed that your merge times have increased since moving your data project to dbt, or since adopting dbt best practices? If so, that's not necessarily a bad thing: it probably means you're spending more time on PR review, and systematic PR review is good practice. But does it have to be that way?

What’s your average time-to-merge?

A real-world example: Mattermost

One of my colleagues, Even, was looking for some hard facts about how best practices would affect the PR review process for data projects, and came across the Mattermost data warehouse repo. Public dbt projects that are used in production are few and far between, so it’s interesting to have the data project for Mattermost available to analyze.

How does adhering to best practices affect time-to-merge?

Around a year ago, Mattermost rebooted their dbt project with the aim of following dbt's suggested practices more closely and taking a more systematic approach to handling pull requests. Even wanted to see how this change affected the pull request stats: basically, are there any penalties for adhering to best practices?

Here are the raw stats:

Some of the differences in the stats will be related to how long each project was in use (three years for the old project, one year for the new), but some stats won't be affected by this, and they paint an interesting picture.

Good PR Review takes time

The stats certainly line up with best practices being applied:

  • Commits per PR are down by 40%: the work is more focused
  • No. of models is down by 30%: fewer duplicated tables (hopefully)
  • No. of tests is way up
  • Average review comments (those tied to specific sections of code) are up: more discussion about the impact of changes

The result of all this is that time-to-merge increased from 12 hours to 3 days. So what's the takeaway? That having a review process takes time? Yes, of course. Any process takes longer than no process. Now we have to figure out how to make that process more efficient, so we can keep the benefits that best practices bring, but get them faster.

  • How can we still have those discussions, but make sure they are resolved faster?
  • How can we remove the need for certain discussions by providing more information up-front?
  • Is there any way to speed up the QC process in data project pull request review while keeping the more rigorous testing?

Cal-ITP (California Integrated Travel Project) is another public dbt project that applies best practices to its PR process. Cal-ITP's process is a textbook example of all the right things to do when reviewing PRs, and we take a look at a few specific PRs as examples in this post:

How can we make the process more efficient?

A good place to start would be to look into why those review threads run so long. Is something missing, or is some part of the process taking too long? One aspect is that data projects are just inherently difficult to review: the data is always changing, it's not easy for a reviewer to spin up the environment, and the data impact of a change isn't always obvious.
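
To make that last point concrete, below is the kind of ad-hoc check a reviewer often ends up writing by hand just to see whether a modeling change actually moved the numbers. This is a minimal sketch only: the schemas (analytics_prod, analytics_dev), the model (fct_orders), and the metric column are hypothetical, and the real query depends on your warehouse and project.

```sql
-- Hypothetical sanity check: compare the production build of a model with the
-- PR's dev build. Schema, model, and column names are made up for illustration.
with prod as (
    select count(*) as row_count, sum(order_total) as total_revenue
    from analytics_prod.fct_orders
),

dev as (
    select count(*) as row_count, sum(order_total) as total_revenue
    from analytics_dev.fct_orders
)

select
    prod.row_count                          as prod_rows,
    dev.row_count                           as dev_rows,
    dev.row_count - prod.row_count          as row_diff,
    dev.total_revenue - prod.total_revenue  as revenue_diff
from prod
cross join dev
```

Multiply that by every model a PR touches, and by every metric a reviewer cares about, and it's easy to see where the extra review time goes.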

The Mattermost project is only one example, so I'm interested to know: have you seen similar results after adopting software development best practices in your dbt data project?

Better data modeling validation can speed up the PR review process

This is the problem we're working on solving with Recce, our take on a data validation toolkit that lets data engineers compare dbt environments, specifically to help with PR review.
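
As a rough illustration of the kind of comparison such a toolkit automates, the queries below diff the rows of a single model between two environments. This is not Recce's implementation, just a sketch of the underlying idea, and the schema and model names (analytics_prod, analytics_dev, dim_customers) are again hypothetical.

```sql
-- Hypothetical row-level diff of one model between two environments.

-- Rows added or changed by the PR: in the dev build but not in prod.
select dev.*
from analytics_dev.dim_customers as dev
except
select prod.*
from analytics_prod.dim_customers as prod;

-- Rows removed or changed by the PR: in prod but not in the dev build.
select prod.*
from analytics_prod.dim_customers as prod
except
select dev.*
from analytics_dev.dim_customers as dev;
```

Surfacing results like these automatically in the PR gives reviewers the data impact up-front, instead of each reviewer re-deriving it by hand in a long comment thread.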

Further Reading


Dave Flynn is a Technical Advocate @ DataRecce.io, the data modeling validation toolkit for dbt data projects.