Mastering data testing with dbt — part 1
We have written before about how we adopted dbt to improve our data transformation layer (e.g. here). It had a significant impact on our efficiency but, most importantly, it forced us to adopt best practices across our data engineering work. One of the areas where we have invested a considerable amount of time is data testing. Until a few years ago, we made only sporadic use of data tests, but we soon realised that this was not good enough for managing a complex platform. Regular incidents, hidden bugs and failed releases followed, which forced us to rethink our data engineering lifecycle. As a result, we have added close to 3,000 tests to our repositories, created a custom extension to the basic dbt test functionality and developed a special scheduling mechanism to balance costs and early detection needs. In this series of posts you can learn about all of these steps. In part 1, I’m going to explain our overall approach to using dbt tests and the benefits it has delivered for us.
Our data pipeline architecture
To better understand the challenges we faced and the choices we made, a few words about our architecture. We manage and run, daily, a number of dbt repositories (9 and growing) which make up our data pipeline. All of them have multi-tenant tables, and the two core ones contain multi-source tables as well. In total, we run over 1,100 models every day, and many of them run multiple times per day. On an average day, the core repositories run around 15 times and the source transformer projects between 1 and 10 times, depending on the day and the data source.
We orchestrate our workloads with Prefect Cloud, with Kubernetes underneath. Additionally, we created an orchestration service, which we could call a “flexible batching” service. The orchestrator looks for new data in the landing source table and, when new data is present, waits at least X minutes and at most Y minutes before kicking off a job, which then triggers the downstream components. This is done to optimise the latency and cost tradeoff, which probably warrants its own post at some point! The reason it’s relevant for data testing is that, thanks to the flexibility of this mechanism, we don’t really know what data (which source? which tenant?) is running through the pipeline, nor do we know exactly when the jobs will run. This introduces additional challenges to our data testing configuration.
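To make the batching idea concrete, here is a minimal sketch of what such a per-source wait window could look like as configuration. The field and source names are hypothetical rather than taken from our actual service; they only illustrate the minimum/maximum wait mechanic described above.

```yaml
# Hypothetical sketch of a "flexible batching" configuration (illustrative only).
# When new data lands for a source, the orchestrator waits at least
# min_wait_minutes (to batch up further arrivals) and at most max_wait_minutes
# (to cap latency) before kicking off the downstream dbt job.
sources:
  - name: crm_events            # example source, not a real one
    min_wait_minutes: 15
    max_wait_minutes: 60
  - name: billing_snapshots     # example source, not a real one
    min_wait_minutes: 30
    max_wait_minutes: 180
```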
Add tests and run tests, simple right?
Before we started adding tests to all of our dbt models, it was immediately apparent that there were a number of issues we needed to address first, such as:
- What tests should be run for each model? Do we even need to test every single model? Do the generic dbt tests suffice or do we need to write custom tests as well?
- Do we need a different suite of tests at different points in the data engineering lifecycle? What should be tested during development? What should be tested after each production run? What should be tested daily?
- Should failed tests stop the pipeline?
These were all challenges identified upfront. However, once we started testing at scale, we soon discovered that running thousands of tests multiple times per day would cost us more than running the data pipeline itself! This is when we faced two further challenges, described in Parts 2 and 3 of this series:
- How could we test just the “affected” parts of the tables, i.e. newly inserted rows or amended rows rather than a full table?
- How could we create a more flexible schedule rather than test everything after every single run?
Data testing framework in dbt
Which tests to run?
We heavily leaned on the MoSCoW method when thinking about the kind of tests that we should run. We decided that grain tests (unique) are an absolute Must for every single model in each repository: most of the issues raised by our customers stemmed from duplicate data present somewhere in our models. What’s more, not_null and referential integrity tests are also Musts for certain columns in certain models. We carefully went through all the models and added the tests where necessary. At this point, we also introduced must, should and could tags and started tagging each test accordingly. Luckily, all of these tests are part of dbt’s generic testing suite, which made our life much easier.
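As a rough sketch of what this looks like in practice, here is an illustrative schema.yml snippet using dbt’s generic tests together with the must/should tags. The model and column names are made up for the example.

```yaml
# Illustrative only: model and column names are hypothetical.
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          # Grain test: a Must on every model.
          - unique:
              tags: ['must']
          - not_null:
              tags: ['must']
      - name: customer_id
        tests:
          # Referential integrity back to the customers dimension.
          - relationships:
              to: ref('dim_customers')
              field: customer_id
              tags: ['should']
```

With the tags in place, a command like `dbt test --select tag:must` runs only the Must tests.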
Should we test all models?
We decided that all models must have tests. However, we wouldn’t test everything all the time.
What to test when?
Following dbt’s best practices introduces a lot of redundancy into your codebase. For example, staging models often perform only minimal transformations, which means that tests would often check exactly the same thing across two or more models when the columns haven’t changed significantly. This is why we decided to further refine our must tag to remove that duplication. By choosing key models for must tests, we are notified immediately when there is a bug in the system (which is what we wanted), and we can then selectively use the full test suite to scope the problem during an investigation.
This informed our testing approach at different points of the data engineering lifecycle as well. During development, we test only the affected model and its children. In production, we run the must tests on the key models that will tell us there is a problem in the system. During incident investigations, or when making expansive changes, we test everything. At the end of each day, we run all the must and should tests to understand the health of our pipeline once most of the daily data has been processed.
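One way to encode suites like these is with dbt’s YAML selectors. The sketch below is illustrative (the selector names are made up), but it captures the idea: a development selector that picks up modified models and their children, a production selector keyed on the must tag, and a nightly selector that unions must and should.

```yaml
# Illustrative selectors.yml; the selector names are hypothetical.
selectors:
  - name: dev_changed_and_children
    description: "Development: test the modified models and everything downstream."
    definition:
      method: state
      value: modified
      children: true

  - name: prod_must
    description: "After each production run: only the must-tagged tests."
    definition:
      method: tag
      value: must

  - name: nightly_health_check
    description: "End of day: all must and should tests."
    definition:
      union:
        - method: tag
          value: must
        - method: tag
          value: should
```

Note that the state:modified method relies on a previous manifest being available via --state; during development, something like `dbt test --select my_model+` achieves the same effect for a single model.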
Should failed tests stop the pipeline?
Our first reaction was: absolutely! Surely, we don’t want bad data to propagate throughout the whole system and then require fixes, clean-up, etc. However, it turned out not to be that simple in our case. Given the multi-source, multi-tenant setup of our data architecture, it is theoretically possible that a few rows of bad data for one tenant from one data source would delay our service for all tenants across all data sources. We thought that maybe we could automatically judge the scale of an incident and then decide whether or not to stop the pipeline.
After some analysis, we noticed that data issues are most often limited to one data source or a few tenants. Therefore, we decided not to stop the pipeline when must tests fail. Instead, we committed to treating each failed grain test as a suspected P1, with the on-call engineer immediately investigating the impact of the failure and classifying the incident accordingly. Other failures are investigated within hours and often resolved within the same day.
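This is not necessarily how we have wired things up end to end, but as a reference point, dbt’s built-in severity config is one way to surface a test failure without failing the job. A minimal sketch, again with hypothetical model and column names:

```yaml
# Illustrative only: downgrading a test's severity so that a failure is reported
# as a warning instead of stopping the dbt invocation.
version: 2

models:
  - name: fct_orders            # hypothetical model
    columns:
      - name: order_id
        tests:
          - unique:
              config:
                severity: warn  # report the failure without failing the job
                tags: ['must']
```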
Test your data extensively and often
Thanks to this process, we have added close to 3,000 tests across our dbt repositories. Approximately 1,800 of them are classified as must tests and are run multiple times per day. The process of adding tests and then running them in production delivered a number of benefits for our data engineering team, the business and our customers:
- Through the process of adding tests, we have actually discovered and sorted out over 100 hidden bugs.
- We have created a detection and alerting mechanism, which means we can deal with issues as they arise rather than letting them accumulate over time.
- We have effectively stopped bugs from making it into production thanks to the improved testing regime during development and release cycles.
The effect on customer satisfaction has been considerable. What’s more, testing has made our data engineers more productive: there are fewer incidents to deal with and more certainty when changes are released to production.
Next parts in the series
In Part 2, we are going to write about a custom extension to the basic dbt test functionality, which we created to help us manage latency and costs. In Part 3, we are going to share how we developed a special scheduling mechanism to balance costs and early detection needs.