Data testing & data monitoring: do those need to be separate things?

Mateusz Klimek
Published in re_data
Dec 31, 2021 · 4 min read

Looking at the current state of open-source & commercial solutions helping companies ensure data quality, you may get the impression that there are two very separate approaches you can take.

Data testing approach

Currently, the most popular approach to dealing with data quality problems is data testing. The idea behind data testing is quite simple: you test your data the same way you would normally test software, by writing tests that run on tables, files, etc., and fail if something unexpected happens.

There are a couple of popular open-source libraries helping you do that: Great Expectations, Deequ, and dbt (in dbt, data testing is just one of the features; the other libraries are primarily designed for it).

Apart from these more specialized solutions, there are other ways of running checks on your DB. For example, Apache Airflow has operators letting you run any SQL check. It doesn't give you Great Expectations-style asserts, but when you think about it, many of those asserts are pretty simple SQL scripts you could write yourself.
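
To make that concrete, here is a minimal sketch of a hand-written not-null check (the orders table and customer_id column are made up for illustration). You run the query on a schedule and fail the pipeline or alert whenever it returns a non-zero count:

-- count rows violating the expectation; alert or fail the pipeline if non-zero
select count(*) as null_customer_ids
from orders
where customer_id is null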

Data monitoring approach

There is another set of solutions in the data quality/data observability space. If you have ever heard of Monte Carlo or BigEye, those are SaaS solutions solving very similar problems (your data going bad). They don't rely on data testing but on statistics & ML to detect problems. They usually mention that although data tests are good, they are not enough to ensure data quality. Here is one blog post from Monte Carlo about that. The article mentions two things on which monitoring solutions do better than data testing:

  • Detecting unknown unknowns — problems you are not expecting to have when writing data tests and therefore are not testing for
  • Good testing coverage is hard & costly to achieve as the amount of data grows. With hundreds or thousands of tables, testing becomes a laborious process; monitoring, on the other hand, can be set up so that it covers all your tables from the start.

There is an analogy to the software engineering world, where you most often need both tests and a monitoring setup: without one of them, there are problems you would never catch.

We believe there are more reasons why data tests may not be enough.

The world is not black and white

Data tests always assume that data either meets or doesn't meet some criteria. But sometimes you want to follow specific metrics on your data and investigate for yourself what exactly is happening (usually when alerted, or after looking at a visualization).

A simple example of a metric that many people want to track, but for which tests (even if written) are not the only indicator of a problem, is a daily total_row_count metric for your tables. Row counts may vary for many different reasons. Usually, to know whether there may be an issue, you need to compare against past data, trends, etc. (see the sketch after this list). Testing alone can be useful for checking that values stay within an expected threshold, but that alone doesn't give you the whole story (you have no historical comparison), and it's hard to set up thresholds this way so that they both:

  • catch problems
  • are not too noisy and don't fire too often
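
Here is a minimal sketch of that kind of history-based tracking in plain SQL, assuming a hypothetical events table with a created_at timestamp: it computes the daily row count alongside a trailing average, so you (or an anomaly detector) can judge whether today's value is out of line:

-- daily row counts with a 7-day trailing average for comparison
with daily as (
    select date(created_at) as day, count(*) as total_row_count
    from events
    group by 1
)
select
    day,
    total_row_count,
    avg(total_row_count) over (
        order by day
        rows between 7 preceding and 1 preceding
    ) as trailing_7d_avg
from daily
order by day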

Merging it

Ok, so summing this up, you may be convinced that you need both data testing & monitoring in your stack. Does it mean that you should deploy a pipeline using Great Expectations & buy a Monte Carlo solution?

In our opinion that’s not necessary: although data testing & monitoring are both good, you don’t need a separate solution for those.

re_data is a framework that combines both data testing & monitoring (and we don't want to stop there in terms of helping with data quality, but we'll talk about that some other time).

What re_data does well is let you create time-based metrics about your data quality. Those can help you with both:

  • Data monitoring — a lot of metrics are built in, and it's easy to compute them for all your tables, visualize them in BI tools, and have re_data look for anomalies in them
  • Data testing — you can test metrics (not only those created by us, but also your custom ones), and since re_data is a dbt package, it can be used together with the standard dbt tests you are writing (see the sketch below)
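
To make "testing metrics" concrete: once metrics land in a table, a test is just a query over that table. Here is a sketch in the dbt style, where a test is a select that returns the rows violating the expectation and passes when it returns nothing; the metrics table and its columns here are illustrative, not re_data's actual schema:

-- return metric rows that violate the expectation ('orders' should never be empty);
-- in dbt, the test passes when this query returns zero rows
select *
from metrics
where table_name = 'orders'
  and metric = 'row_count'
  and value = 0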

Using metrics in data tests is not a new idea. The team behind Deequ has researched it, and it's also how GE and Deequ internally compute a lot of their tests. Still, sadly, these frameworks expose metrics to users in a very limited way, making it hard to make proper use of them (for visualization, anomaly detection, etc.).

There are some other benefits of exposing users more directly to data metrics. In this post, let's focus on one of them.

Testing metrics is faster

Why? Especially with dbt tests, which are simple and not optimized for efficiency, it's easy to get your data warehouse overwhelmed with computing tests: each test typically runs as its own query, with its own scan over the table.

When computing metrics, it's possible to bundle a batch of operations into one query instead of running each test separately. Not convinced? Try comparing the running time of:

select max(el) from table union all
select min(el) from table union all
select avg(el) from table

To:

select max(el), min(el), avg(el) from table

Even with just three attributes, you should be able to see a big difference. In reality, you usually also want to test many different columns, and quite often the where part of the query doesn't change between them, as in the sketch below. As in many other places, batching operations (in this case, into metrics) makes computation much faster.
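
For example, one scan can compute metrics for several columns under a shared where clause; the price and created_at columns here are hypothetical additions to the example table above:

-- one pass over the table computes metrics for multiple columns at once,
-- all filtered by the same time window
select
    max(el), min(el), avg(el),
    max(price), min(price), avg(price),
    count(*) as total_row_count
from table
where created_at >= current_date - 7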

Conclusion

At re_data we don’t believe the currently popular approach of doing tests or/and monitoring data as a separate thing makes a lot of sense.

We believe creating data quality metrics can help you with both data monitoring and data testing, and, supplemented with some SQL tests run in dbt (which we love), can give you the best solution for your data observability.

Let us know if you think we missed anything. We are very responsive on our Slack. 🙂 Our goal is to see things from many perspectives.

Mateusz Klimek
Data engineer at heart, author of the re_data framework, helping you monitor data quality.