When does proper data validation become a ‘must have’ for dbt projects?

Dave Flynn · Published in In the Pipeline · 4 min read · Mar 18, 2024

There are some data teams that modify dbt models and merge them into production without any testing. Some don’t build locally, opting instead to let CI run and let dbt build fail if there’s an issue. At the other end of the spectrum, some teams follow strict pull-request policies, with PR templates and custom CI workflows to ensure they know what’s happening in each PR.

There are various levels of maturity in dbt projects and data team cultures that, in turn, dictate the kinds of behavior you’ll see. It’s always advisable to follow best practices (not doing so sets you up for failure), but teams deal with different data, for different uses, and at different stages of project maturity.

Crossing the threshold to data culture maturity

For teams dealing with non-critical data, the upfront checks might seem unnecessary, while teams handling business-critical data will consider the extra time spent QA-ing pull requests absolutely essential. There’s a threshold that’s crossed, however, when proper data validation goes from something that’s ‘nice to have’ to something that’s a ‘must have’.

When proper validation becomes a ‘must have’

Everyone hates downtime from bad merges and debugging data issues, but there are some key situations in which you should consider enforcing stricter data checks and PR processes before merging into prod.

Screenshot of the machine gun scene from the 1966 Western, Django. The machine gun is ‘proper data validation’, the bad guys are ‘silent data issues’.
Proper data validation before merging takes care of downtime and silent data issues later on

Reining in a Wild West data culture

If you’re the data engineer responsible for the stability of the pipeline, the last thing you want is to merge untested branches. That’s basically working blind and hoping for the best. If you work in a team that doesn’t build and test projects locally, and downstream usage is regularly impacted, then it’s time to consider implementing stricter dev-time validation checks or CI checks. At the very least, implement a PR comment policy as a first baby step.
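To make “dev-time validation checks” concrete, here’s a minimal sketch of dbt generic tests defined in a schema.yml file; running dbt test (or dbt build) before pushing a branch catches these issues locally. The orders model and its columns are hypothetical placeholders used only for illustration, not something from this article.

```yaml
# models/schema.yml: a minimal sketch; 'orders' and its columns are hypothetical
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique        # no duplicate keys
          - not_null      # no missing keys
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

Even a handful of unique and not_null tests on primary keys will surface the most common silent data issues before anything is merged.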

Multiple roles accessing dbt

This is becoming a common reason for adding data model validations and other data checks. When the data team grows, and multiple roles, each from a different data background, start to access and modify the dbt project, enforcing better PR practices and adding CI checks can be essential for securing the last mile of the pipeline.
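As a sketch of what such a CI check could look like, the GitHub Actions workflow below runs dbt build against only the models modified in a pull request, using dbt’s state comparison. The adapter (dbt-postgres), the profiles location, and the prod-artifacts/ directory holding a production manifest are assumptions that will differ per project.

```yaml
# .github/workflows/dbt-ci.yml: a minimal sketch, not a drop-in config
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dbt
        run: pip install dbt-core dbt-postgres   # adapter choice is an assumption

      # Build and test only models changed in this PR, plus everything downstream.
      # Assumes a production manifest.json has been made available in prod-artifacts/.
      - name: Build modified models
        run: dbt build --select state:modified+ --state prod-artifacts
        env:
          DBT_PROFILES_DIR: .
```

Selecting state:modified+ keeps CI runs fast on large projects while still building and testing everything downstream of a change.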

dbt project complexity

Much like a growing team, a growing project also requires a better data validation process. Once a project reaches a certain scale and can no longer be held fully in your head, you probably need formalized validation techniques to ensure you don’t break something you were unaware of, or that slipped your mind. The more complex the project, the more time it takes to find the root cause of issues from a bad merge and repair them.

Business-critical data

When you have critical downstream uses, such as ML models or data-driven business decisions, any downtime is unacceptable. This is where critical data applications differ from the previous reasons you might seek out data validation checks: it’s not just the cost of the time it takes to fix, but also the cost to the business in terms of bad decisions, the inability to make decisions, lost trust in the data, and so on.

Do your data due diligence

Whether it’s growing teams, growing projects, or critical data, it’s better to perform your due diligence and QA data project pull requests to catch data problems before they enter production, rather than risk the potential fallout and responsibility of the damage bad data could cause.

Implementing best practices doesn’t come without its costs, though: time costs, that is. As the Mattermost project shows, average time-to-merge slowed from half a day to three days. What this shows is more care being taken when merging into production. At some point, the team at Mattermost must have decided that the risk of data downtime wasn’t worth it, and that the extra time spent pre-merge pays off in the long run.


Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects