Build trust in your data with regression tests
Have you ever encountered a situation where KPIs from past periods suddenly changed? We have! But we also figured out a way to solve this problem. In this article I will tell you how we introduced regression tests with dbt and what impact they had on our company:
Why did the numbers for my KPI from 3 months ago change?
KPI values from past periods should not change retroactively. To give an example: if you calculate the number of sessions per day, these numbers should not change anymore after the day is over, since users cannot create new visits in the past. Unfortunately, daily business showed us that such changes actually can happen, for multiple reasons:
- Intended changes:
Intended changes can happen if old data is fixed at its source or if bugfixes are pushed to the data stack. Imagine, for example, a bug in the data transformation of a data warehouse setup that led to a duplication of rows due to a wrong join. After fixing this bug we can expect the number of sessions to be reduced.
- Unintended changes:
Changes in the data transformation setup can sometimes have unintended side effects like filtering out certain records or creating duplicates (e.g. via data joins). Additionally, the data stack might not be robust enough against changes in the source. For instance, bulk updates on source data might lead to changes in the update timestamp of those records.
The bad thing about unintended changes is that they happen without anyone noticing. They can even be a result of “unknown unknowns” [1]. Relying on business users to inform you about unintended changes in the data is risky: people lose trust in the data and the official KPIs. They might even start tracking numbers in their own spreadsheets, which can lead to additional business alignment problems. We were not satisfied with this. Instead, we wanted to know about such incidents early on, so we could notify business users proactively or even prevent merging changes that would create such incidents.
When searching for a solution we found a blogpost that served as inspiration for an approach we call data regression testing.
How it works
📷 → ✅ → 🚨 → 🕵️♀️
Note: The main component of our approach is dbt. A basic understanding of dbt is a prerequisite to understand our approach.
A core functionality of dbt is creating table snapshots that can be used to track changes over time. We leverage this functionality to perform data regression tests on those snapshots in order to surface changes to historical data. At a high level, these are the steps we take to detect data changes:
- 📷 We create snapshots of summary statistics for specific tables (e.g. session counts per day) with the assumption that these statistics should not change for data from the past.
- ✅ We test that these summary statistics don’t change, both as part of daily test runs and in every merge request which contains changes to the dbt setup.
- 🚨 In case a data regression test fails we send out Slack notifications.
- 🕵️♀️ Additionally, we have added a dbt operation that generates data diff tables for the most recent run. This supports us in debugging the actual changes more quickly.
Let’s jump a bit more into the details:
📷 Snapshots
Snapshots in dbt consist of:
- the table content you select for the snapshot and
- dbt-specific columns like `dbt_valid_from`, `dbt_valid_to` and `dbt_updated_at` (for more details see the documentation).
In order to track changes we create data snapshots using summary statistics, e.g.:
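For instance, a snapshot of daily session counts could be defined like this (the schema, table and column names here are illustrative assumptions, not our exact setup):

```sql
-- snapshots/sessions_per_day_snapshot.sql (illustrative names)
{% snapshot sessions_per_day_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='session_date',
      strategy='check',
      check_cols=['session_count'],
    )
}}

-- Summary statistic: number of sessions per day.
-- Our assumption: these counts should never change for past days.
select
    session_date,
    count(*) as session_count
from {{ ref('sessions') }}
group by 1

{% endsnapshot %}
```

With the `check` strategy, dbt invalidates a row and inserts a new one whenever one of the `check_cols` changes for a given `unique_key`.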
The result of running such a snapshot for the session example we mentioned before could look like this:
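A mock-up with made-up values (timestamps are illustrative):

```
session_date | session_count | dbt_valid_from      | dbt_valid_to
-------------|---------------|---------------------|-------------
2018-10-14   |          5821 | 2018-10-15 00:05:00 | null
2018-10-15   |          6034 | 2018-10-16 00:05:00 | null
```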
If a change happens (e.g. for the row with 2018-10-15), the old row gets a `dbt_valid_to` timestamp and a new row is created:
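Continuing the made-up example from above, a change to the 2018-10-15 data would leave the snapshot table looking like this:

```
session_date | session_count | dbt_valid_from      | dbt_valid_to
-------------|---------------|---------------------|--------------------
2018-10-15   |          6034 | 2018-10-16 00:05:00 | 2019-01-10 00:05:00
2018-10-15   |          5920 | 2019-01-10 00:05:00 | null
```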
✅ 🚨 Data regression tests
If a row has a `dbt_valid_to` timestamp, it means that the underlying data has changed. Thus, we can check that the `dbt_valid_to` column is `null` to test our assumption that this data should not change anymore.
dbt does not provide this type of test out of the box, but we can easily add it as a macro:
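A minimal version of such a generic test could look like this sketch (the test name is our own illustrative choice):

```sql
-- macros/test_no_regression.sql (sketch; the test name is illustrative)
{% test no_regression(model) %}

-- Any row with a non-null dbt_valid_to means a summary statistic
-- for past data has changed, so the test should flag it.
select *
from {{ model }}
where dbt_valid_to is not null

{% endtest %}
```

The test can then be attached to the snapshot in the corresponding YAML file like any other generic dbt test, and it fails as soon as the snapshot contains invalidated rows.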
In case the data does change, we get notified by the failing tests. This allows us to identify issues early on (sometimes even before changes with side effects get merged into the production codebase) and react proactively.
However, from now on the tests would fail on every run, which is why we need to add some self-healing functionality. There is a nice description of how self-healing could work in this blogpost, so we won’t dive deeper into that topic. At a high level, it means that we move outdated data to another table which holds those “invalidated” records, so that the snapshot table only contains the latest records (having a `dbt_valid_to` of `null`).
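As a rough sketch of that idea (the macro and table names are hypothetical, and adapter-specific details are omitted), such a self-healing operation could look like this:

```sql
-- macros/heal_snapshot.sql (hypothetical sketch)
{% macro heal_snapshot() %}

-- Copy rows that were invalidated (non-null dbt_valid_to) into a
-- separate table that keeps the history of invalidated records ...
{% do run_query("
    insert into snapshots.sessions_per_day_invalidated
    select * from snapshots.sessions_per_day_snapshot
    where dbt_valid_to is not null
") %}

-- ... and remove them from the snapshot table so that only the
-- latest, valid records remain and the regression tests pass again.
{% do run_query("
    delete from snapshots.sessions_per_day_snapshot
    where dbt_valid_to is not null
") %}

{% endmacro %}
```

Such a macro could then be invoked after the change has been investigated and accepted, e.g. via `dbt run-operation heal_snapshot`.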
🕵️♀️ Debugging changes using data diff
Whenever data changes we need to figure out what exactly happened, to be able to tell whether those changes were intended or not. Having a summary statistics snapshot table like the one above helps us get an idea of what could have changed. However, it does not help us identify the raw entries and fields that changed. In order to speed up the debugging process, we built a data diff functionality which helps us quickly find the entries that changed.
Details on the data diff approach (including code snippets) will be covered in a future article.
Outcomes
Implementing data regression tests in our data stack has led to a lot of improvements. Here are a few of them:
- Higher KPI reliability: Data Scientists and business users trust the data more than before.
- Early identification of data issues: We identified multiple workarounds in our CRM system and adjusted some validation rules and processes to prevent dirty data.
- Quicker discovery of bugs and missing coverage in our test setup: Failing data regression tests led to the introduction of fixes and new tests to prevent similar issues in the future.
- More predictable impact of merge / pull requests (PRs): In case a PR could change data, we know it even before merging the new code into production and can verify whether those changes were intended.
[1] Donald Rumsfeld (2002). United States Secretary of Defense.