So, you think you’ve got dbt test bloat?

Asking for a friend — what can I do about test bloat?

Dave Flynn
In the Pipeline
5 min read · Apr 18, 2024


We know that dbt tests alone are not enough for comprehensive data quality coverage, but when do dbt tests become too much?

In a recent talk at the Singapore/Taipei dbt meetup, Bernard Yeo (analytics engineer at Delivery Hero) discusses when dbt test bloat becomes a problem, what the ramifications are, and how his team tackled the issue.

What constitutes “test bloat”?

In his talk, Bernard mentions (13:06) that at one stage, “just product analytics alone (had) 480 models, running 460 tests on any given day”.

foodpanda had 480 models running 460 tests

480 models and 460 tests give us a baseline, though this test-to-model ratio isn’t actually the highest we’ve seen. Here are a couple of other high-profile dbt projects for comparison:

  • Mattermost: 194 models / 318 tests
  • Cal-ITP (California Integrated Travel Project): 361 models / 941 tests

Clearly, test bloat is not an isolated situation. As Bernard continues, though, the issue is not so much the number of tests as the number of alerts those tests can trigger.
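To see how the counts climb so fast, note that even a couple of generic tests on each key column multiplies quickly across a few hundred models. A minimal dbt schema.yml sketch (the model and column names are illustrative):

```yaml
# models/schema.yml — just two key columns on one model already means four tests;
# repeat across a few hundred models and you're deep into test-bloat territory
models:
  - name: orders              # illustrative model name
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:    # referential integrity against a parent model
              to: ref('customers')
              field: customer_id
```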

A testing situation

The result is alert fatigue, especially when “an upstream issue could trigger hundreds of alerts”. On top of that, while alerts have owners, it’s difficult to know who should react, since the root cause might be upstream while the alerts are being triggered downstream.

In such a situation, it would be impossible for a team to know:

  • the severity of an issue,
  • the root cause of the issue,
  • who should react to the issue.

Tiers for fears

The solution Bernard’s team created involved replacing the above fears with a tiered process:

  • Tiered models: Not all models are created equal; some are more critical than others. Putting models into tiers allowed the team to focus on high-impact issues first.
  • Weighted alerts: Alerts were also categorized by importance (P0, P1, P2).
  • Clear triage responsibilities and expectations: Owners were defined for each level of alert, along with the actions expected in response.

In addition to the above, they also fine-tuned dbt alert thresholds with the help of Monte Carlo.
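dbt’s built-in test configs can express part of such a scheme. Here’s a minimal sketch, assuming tiers are encoded as model tags and alert priority maps onto test severity and failure thresholds; the tier tags, P0/P2 mapping, and model names are illustrative, not the team’s actual config:

```yaml
# models/schema.yml — illustrative tiering, not Foodpanda's actual setup
models:
  - name: fct_orders          # tier-1: critical, any failure pages the owner (P0)
    config:
      tags: ["tier-1"]
    columns:
      - name: order_id
        tests:
          - not_null:
              config:
                severity: error       # hard-fail the run immediately

  - name: stg_page_views      # tier-3: low impact, tolerate noise (P2)
    config:
      tags: ["tier-3"]
    columns:
      - name: view_id
        tests:
          - not_null:
              config:
                severity: error
                warn_if: ">10"        # a handful of bad rows only warns
                error_if: ">1000"     # escalate only on a large breach

# run the critical tier first, e.g.:
#   dbt test --select tag:tier-1
```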


It’s no small feat to set up such a system, and Bernard mentions that the team went through much planning and many iterations before landing on a system that worked for them.

Check out Bernard’s talk on YouTube: Managing dbt test bloat at Food Panda:

Managing dbt test bloat at Food Panda — Bernard Yeo

Test early and often (Shift left)

Bernard’s talk isn’t important because test bloat is an issue (we know that already). It’s important because it focuses on:

  • communication between the team and stakeholders,
  • the team knowing how to react to issues,
  • scoping the issues so the team knows where to look first.

These points are core to our mission with Data Recce. While Recce is useful for uncovering the root cause of data issues, it’s also about shifting left: doing data impact assessment pre-merge, so you hopefully never reach the stage of firefighting and triaging pipeline issues after the data has already hit prod.

It’s a Data RPG

Consider these two quotes from Bernard’s talk:

“Data analysts know the details of their data, they are the subject matter experts, they know what is good quality (and) what is bad…

…As a data engineer, I look from a step back and can only check things like freshness, volume. A DE cannot look at the calculation and know if it’s defined correctly.”
— Bernard Yeo

Different roles have different responsibilities, and different knowledge about the data and the pipeline: some roles are closer to the business logic, others to the structural integrity of the pipeline. Understanding the roles on the team makes for better PR review, especially when it comes to understanding the intention of a change and whether the results are correct.

Intents and expectations

During data modeling, development, and PR review, what’s needed is a way for data roles to communicate, both with each other and with stakeholders. This is why best practices are important in your data project.

The data team and stakeholders should know:

  • What is the intent of this change?
  • What are the expectations?
  • Can we be confident that expectations were met?

This is where Recce checks (data validations) can serve as that communication tool: either as proof-of-correctness of your work, or as a way to ask your team, “does this look right?”

Data Validation Checks in DataRecce.io

With Recce you can:

  1. Maintain a list of checks: multiple data diffing and impact assessment tools are available.
  2. Save check results and re-run individual checks: results are preserved for later review.
  3. Add context to checks: provide your interpretation of the check results.
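As a sketch of what a saved check list can look like: Recce supports preset checks declared in a recce.yml file committed with the repo. The exact field names and check types below are assumptions recalled from the Recce docs, so treat the schema as illustrative and confirm against the current documentation:

```yaml
# recce.yml — preset checks shared with the team (schema is illustrative)
checks:
  - name: Row count diff on core models
    description: >
      Does this PR change row counts where it shouldn't?
      Shared with reviewers as a "does this look right?" question.
    type: row_count_diff
    params:
      node_names:          # illustrative model names
        - orders
        - customers

  - name: Monthly revenue diff
    description: Proof-of-correctness for the revenue logic change.
    type: query_diff
    params:
      sql_template: |
        select date_trunc('month', order_date) as month,
               sum(amount) as revenue
        from {{ ref("orders") }}
        group by 1
```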

A Precog for bad merges

Recce checks work by comparing the state of data across two environments: usually prod (your known-good baseline) and dev (your current working branch). Comparing these states shows you exactly how changes to data model logic impact the data. It’s like a Precog for the project: you’ll know whether it’s safe to merge the PR.

Data Precogs — “Don’t merge that PR!”

By performing proper validations and sharing the results with your team during development and as part of PR review (either manually or in CI), you can stop a bad merge before it happens, because you see the impact in advance. Now you can iterate rapidly on your data without the worry of breaking prod.
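To make the pre-merge gate concrete, here is a hedged sketch of a CI job: build dbt artifacts for the base branch (the known-good baseline), build the PR branch, then run Recce against the two states. The target-base folder convention and the recce run entry point follow Recce’s docs as I recall them; the adapter, targets, and workflow details are assumptions to adapt to your own project:

```yaml
# .github/workflows/recce.yml — illustrative sketch, not an official workflow
name: recce-pr-check
on: pull_request

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                    # we need the base branch too

      - name: Install dbt and Recce
        run: pip install dbt-duckdb recce   # adapter is illustrative

      # 1. Build the known-good baseline from the base branch
      - name: Build base environment
        run: |
          git checkout ${{ github.base_ref }}
          dbt build
          dbt docs generate                 # Recce wants manifest + catalog
          mv target target-base             # base artifacts, per Recce convention

      # 2. Build the PR branch into the default target folder
      - name: Build PR environment
        run: |
          git checkout ${{ github.sha }}
          dbt build
          dbt docs generate

      # 3. Compare the two states and run any preset checks
      - name: Run Recce checks
        run: recce run
```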


Dave Flynn
In the Pipeline

Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects