When does proper data validation become a ‘must have’ for dbt projects?

Dave Flynn · Published in In the Pipeline · 4 min read · Mar 18, 2024

There are some data teams that modify dbt models and merge them into production without any testing. Some don’t build locally, opting instead to let CI run and let dbt build fail if there’s an issue. At the other end of the spectrum, some teams follow strict pull-request policies, with PR templates and custom CI workflows to ensure they know what’s happening in each PR.

There are various levels of maturity in dbt projects and data team cultures that, in turn, dictate the kinds of behavior you’ll see. It’s always advisable to follow best practices (not doing so sets you up for failure), but teams deal with different data, for different uses, and at different stages of project maturity.

Crossing the threshold to data culture maturity

For teams dealing with non-critical data, the upfront checks might seem unnecessary, while teams handling business-critical data will consider the extra time spent QA-ing pull requests absolutely essential. There’s a threshold that’s crossed, however, when proper data validation goes from something that’s ‘nice to have’ to something that’s a ‘must have’.

When proper validation becomes a ‘must have’

Everyone hates downtime from bad merges and debugging data issues, but there are some key situations in which you should consider enforcing stricter data checks and PR processes before merging into prod.

Screenshot of the machine gun scene from the 1966 Western, Django. The machine gun is ‘proper data validation’, the bad guys are ‘silent data issues’.
Proper data validation before merging takes care of downtime and silent data issues later on

Reining in a Wild West data culture

If you’re the data engineer responsible for the stability of the pipeline, the last thing you want is to merge untested branches. That’s basically working blind and hoping for the best. If you work in a team that doesn’t build and test projects locally, and downstream usage is regularly impacted, then it’s time to consider implementing stricter dev-time validation checks or CI checks. At the very least, implement a PR comment policy as a first baby step.
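To make “dev-time validation checks” concrete, here’s a minimal sketch of dbt generic tests defined in a schema.yml file; running dbt test (or dbt build) before pushing a branch catches these issues locally. The orders model and its columns are hypothetical placeholders used only for illustration, not something from this article.

```yaml
# models/schema.yml: a minimal sketch; 'orders' and its columns are hypothetical
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique        # no duplicate keys
          - not_null      # no missing keys
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

Even a handful of unique and not_null tests on primary keys will surface the most common silent data issues before anything is merged.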

Multiple roles accessing dbt

This is becoming a common reason for adding data model validations and other data checks. When the data team grows, and multiple roles, each from a different data background, start to access and modify the dbt project, enforcing better PR practices and adding CI checks can be essential for securing the last mile of the pipeline.
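As a sketch of what such a CI check could look like, the GitHub Actions workflow below runs dbt build against only the models modified in a pull request, using dbt’s state comparison. The adapter (dbt-postgres), the profiles location, and the prod-artifacts/ directory holding a production manifest are assumptions that will differ per project.

```yaml
# .github/workflows/dbt-ci.yml: a minimal sketch, not a drop-in config
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dbt
        run: pip install dbt-core dbt-postgres   # adapter choice is an assumption

      # Build and test only models changed in this PR, plus everything downstream.
      # Assumes a production manifest.json has been made available in prod-artifacts/.
      - name: Build modified models
        run: dbt build --select state:modified+ --state prod-artifacts
        env:
          DBT_PROFILES_DIR: .
```

Selecting state:modified+ keeps CI runs fast on large projects while still building and testing everything downstream of a change.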

dbt project complexity

Much like a growing team, a growing project also requires a better data validation process. Once a project reaches a certain scale and can no longer be held fully in your head, you probably need formalized validation techniques to ensure you don’t break something you were unaware of, or that slipped your mind. The more complex the project, the more time it takes to find the root cause of issues from a bad merge and repair them.

Business-critical data

When you have critical downstream uses, such as ML models or data-driven business decisions, any downtime is unacceptable. This is where critical data applications differ from the previous reasons you might seek out data validation checks: it’s not just the cost of the time it takes to fix, but also the cost to the business in terms of bad decisions, the inability to make decisions, lost trust in the data, and so on.

Do your data due diligence

Whether it’s growing teams, growing projects, or critical data, it’s better to perform your due diligence and QA data project pull requests to catch data problems before they enter production, rather than risk the potential fallout and responsibility of the damage bad data could cause.

Implementing best practices doesn’t come without its costs, though: time costs, that is. As the Mattermost project shows, average time-to-merge slowed from half a day to three days. What this shows is more care being taken when merging into production. At some point, the team at Mattermost must have decided that the risk of data downtime wasn’t worth it, and that the extra time spent pre-merge pays off in the long run.


Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects