Automated testing in the modern data warehouse
Toward a testing philosophy for the data warehouse
Over time, software engineers have developed a strong philosophy for testing applications. Concepts like unit testing, the test pyramid, code coverage, and continuous integration have made application testing robust and have established solid design patterns. Good testing practices are taught and practiced in most computer science programs.
In my experience, a unified testing philosophy is missing in the data world. As a data professional, I tell people that my goal is to provide accurate and timely information to enhance decision-making. However, if I supply our decision-makers with inaccurate data, they might make far-reaching, strategic mistakes. If our website goes down, it’s certainly an issue (we could lose customers, revenue, and our brand could be harmed), but generally bugs in application code don’t have as pervasive consequences as bugs in data. If data issues have lasting consequences, why are we less sophisticated at testing than our software developing counterparts?
I think there are a few reasons for this.
- Many data professionals (myself included) were not formally trained as software engineers, but entered the field out of curiosity and the insatiable demand for more data people. When I started my first job as a data analyst, I knew I probably should have been testing my code more, but I didn’t have a proper understanding of how to do it.
- Testing data is hard — data is complex, interrelated, and changes constantly. Tutorials on unit testing Python class methods didn’t feel relevant to my SQL queries and ETL jobs. In fact, many of the testing paradigms developed for software applications break down when applied to databases.
- Testing can feel like a nuisance. It’s extra work to write and maintain tests, and it can make refactoring harder if you have to refactor your tests in parallel.
As a junior analyst, what did I do instead? I usually tested by visual inspection — that is, I would write a
select * query, check that the first 10–20 results looked normal, and move on. By doing so, I made the implicit assumption that the rest of my data looked basically the same. When we test this way, we’re prone to make incorrect assumptions about our data. We might assume that a column only contains values in a given range, that a join didn’t create fan out, or that a foreign key exists as a primary key in the table it references.
These faulty assumptions can have ripple effects, especially when we write downstream code based on them. For example, let’s say we assumed that the
order_id column in our
orders table is unique. Perhaps we inadvertently introduce a bug into our load job that loads a few duplicate orders from the previous day. If we don’t test our assumption, we would be slightly over-reporting product sales — a slippery bug that may not even be noticed by our end users!
The purpose of testing is to validate our assumptions. Good tests catch bad assumptions. Every time you make an assumption about the state of your data or the result of a query, you should write a test to confirm that assumption.
A successful test means we can be more confident that the assumptions we’ve built upon are firm. Of course, it’s impossible to test everything, but with each test, we decrease the risk of bad assumptions that lead to bad decisions.
Test all the things
If we want to be as comprehensive as possible with our testing, where should we begin? Logically, when testing a process, you should test its inputs, outputs, and any intermediate stages within. In our case, we should test at the boundary of our warehouse, where data enters and exits, and we should test all internal transformations within. This gives us three specific stages of our process to test:
- Input. Extract and load jobs that insert data into or update data in the warehouse. Does the input data to our process match our assumptions?
- Transform. Queries that transform data within the warehouse. Does our process create new data from input data in a way that matches our assumptions?
- Output. Queries that extract data from the warehouse for other purposes (e.g. a business intelligence tool or predictive model). Does our process give us the data we assume it will, in a format that is useful?
This leaves us with our basic approach to data warehouse testing: test assumptions about data entering, being transformed within, and exiting our warehouse. But how do you actually write a data test?
Anatomy of a data test
A data test, which you can compare to a unit test in software development, should have the following characteristics:
- Specific, so it’s clear what a test failure means. One test that tests three conditions will always be harder to debug than three tests for individual conditions.
- Automated, so it can be put into a test suite and run easily. Having an automated test suite means you can quickly assess the data warehouse-wide impact of introducing new SQL.
- Fast, so you’re not waiting forever for the the test suite to finish. If a test takes longer than a minute or so, I’ll attempt to break it up into multiple tests or I’ll test a smaller sample.
- Independent, so it can be run at any time and in any order.
As an example, one of the most important tests you can implement is a uniqueness test for your primary keys. Data duplication can easily occur across your data warehouse due to errors in loads or improper joins. At Milk Bar, every table in our warehouse is tested repeatedly for unique primary keys.
The query to check for duplicate primary keys is simple.
where my_primary_key is not null
group by 1
having count(*) > 1
If this test returns any rows, we consider the test to have failed and we go fix the duplicate rows. You could also
select count(*) and fail the test if the count is greater than zero.
At Milk Bar, we use dbt to manage our data transformations and as our test framework. dbt is a powerful command-line tool that extends SQL with Jinja templating and allows us to build a large test suite of stock and custom tests with very little effort. With dbt, creating and executing the uniqueness test above is as simple as writing the following configuration and running
- name: my_primary_key
dbt is hands-down the best framework I’ve found for creating a robust, automated test suite for your data warehouse and I highly recommend checking it out if you’re serious about upgrading testing in your data warehouse.
Deciding what to test
Deciding what to test can be difficult, as it requires you to anticipate what could go wrong with your data. This is even more challenging when you’re testing code that you wrote, where you might have blind spots.
I prefer a risk-based approach to testing, which basically says: prioritize testing whatever is most likely to fail with the greatest impact to the business. We think about risk as
probability of failure * impact of failure. A null value in a column you don’t ever use doesn’t pose much risk to the business, so it’s reasonable to consider not testing it in favor of a faster test suite. Remember, testing is all about assumptions, so start by attacking the assumptions that would really backfire if you were wrong.
I generally find reading exhaustive lists of test categories to be less helpful than thinking through my data models on my own. Take some time to brainstorm how bad input data could break things, or how buggy transformations could affect calculations of important metrics. With that said, here are a few important assumptions we test every day at Milk Bar as part of our data loads and during development.
- Is the primary key in this table unique?
[important column name]free of null values?
- Do these foreign keys exist as primary keys in their referenced tables (referential integrity)?
- Does the numeric data in this table fall within an expected range?
- Do item-level purchase amounts sum to order-level amounts?
- Do the necessary columns for external work exist in the final table?
- Can we preemptively run popular dashboard queries from our business intelligence tool?
Continuous integration and beyond
Once you’ve built out a solid test suite and are confident you’ve covered your high-risk assumptions, you should absolutely consider adopting continuous integration (CI). CI is where automated testing starts to get really powerful.
The idea behind continuous integration is that it’s better to test and merge new code into the master branch as soon as it’s written than to wait until “release day” when every developer is attempting to merge an entire sprint’s worth of work at the same time. This accelerated integration is made possible by a comprehensive, automated test suite — if my commits pass all tests (and probably a bit of manual code review), I can be confident that merging them into
master should be safe.
At Milk Bar, we use GitLab CI/CD and dbt to manage this. When I push a series of commits to GitLab, GitLab CI/CD launches a pipeline. This pipeline tells dbt to build a replica of our data warehouse in a staging schema and run our test suite against it. If the tests pass and we approve the merge request into
master, GitLab CI/CD triggers dbt to automatically rebuild the production schema using the new changes. We’re still sorting out the best approach, but so far it’s been an incredible workflow. I can commit modifications to a SQL query, push to GitLab, wait for the tests to pass, merge it, and know that my changes will be live in production in a few minutes. CI makes my development cycles faster and increases my confidence in the commits I’m merging into production.
Designing an effective CI pipeline for your data warehouse deserves its own post, but know that the setup is remarkably useful and not as hard to configure as you might think.
All tests passed!
As people in the business of providing accurate information, we should strive to push the limits of automated testing in the data warehouse and beyond: extract and load (EL) code, data science models, business intelligence, and more. Testing our assumptions will make us better engineers and analysts and will increase confidence in our work and data trust across the business.
If you have an interesting approach to automated testing in your data warehouse or suggestions to improve this one, I’d love to hear about it in the comments!