The Rise of Data Monitoring

New Tools to Monitor What’s Actually Happening with Your Data

Written with Edwin Ong while working on Basejoy.

One of the most exciting recent developments in data engineering is the emergence of a new class of data monitoring tools. In the last decade, companies like Datadog, New Relic, and Sentry automated many of the chores associated with infrastructure and application monitoring, improving the work lives of DevOps engineers. In the coming decade, data monitoring tools aim to do the same for data engineers everywhere.

What Is Data Monitoring?

At its core, data monitoring means continuously running automated health checks against a company’s data and alerting the team when something looks wrong. Example data health checks include freshness (to make sure that new data is coming in appropriately), volume (to make sure that the overall size of the data is as expected), formats (to make sure that data types correctly reflect expectations), and outliers (to make sure that no value is too small or too big).
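
As a rough illustration, here is what those four checks might look like in Python against a pandas DataFrame. The column names, thresholds, and expected types are hypothetical placeholders, not any particular tool’s API:

    import pandas as pd

    def check_freshness(df: pd.DataFrame, ts_col: str, max_age: pd.Timedelta) -> bool:
        """Freshness: the newest row should be recent enough.
        Assumes ts_col holds timezone-aware UTC timestamps."""
        return pd.Timestamp.now(tz="UTC") - df[ts_col].max() <= max_age

    def check_volume(df: pd.DataFrame, min_rows: int, max_rows: int) -> bool:
        """Volume: the overall size of the data should be as expected."""
        return min_rows <= len(df) <= max_rows

    def check_formats(df: pd.DataFrame, expected_dtypes: dict) -> bool:
        """Formats: data types should correctly reflect expectations."""
        return all(str(df[col].dtype) == dtype for col, dtype in expected_dtypes.items())

    def check_outliers(df: pd.DataFrame, col: str, low: float, high: float) -> bool:
        """Outliers: no value should be too small or too big."""
        return df[col].between(low, high).all()

    # e.g. check_volume(events, min_rows=1_000, max_rows=100_000)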

Why Is Data Monitoring Important?

[Image by Elizabeth Maki]

To put it in Monica Rogati’s classic AI Hierarchy of Needs paradigm, data errors can occur in data collection (ex: a web scraper fetching the wrong CSS selector), data flow (ex: a hosted ETL tool turning empty strings into nulls), or even data exploration/transformation (ex: a bug in custom data-cleaning code). From there, the errors propagate upward into data aggregation (ex: erroneous “intelligence” that results in a media buy targeting the wrong segment) or, worse, into hard-to-diagnose errors in learning/optimization (ex: why does my “gender”-prediction ML service think Rachel Maddow is a man?). The ability of data errors to cause financial and reputational harm should not be underestimated.
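
A toy pandas example makes the data-flow failure mode concrete (all data below is made up for illustration):

    import pandas as pd

    # Toy "data flow" stage: an ETL step that coerces empty strings to nulls.
    raw = pd.DataFrame({"segment": ["a", "", "b", ""], "spend": [10, 20, 30, 40]})
    cleaned = raw.replace("", pd.NA)

    # Downstream "aggregation" stage: pandas drops null group keys by default,
    # so the rows with spend 20 and 40 silently vanish from the report that
    # would drive the media buy.
    print(cleaned.groupby("segment")["spend"].sum())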

Data Monitoring Solutions

Elements of an Automated Data Monitoring System

While these tools differ in approach, they typically consist of three core parts: (1) a data collector that connects to the user’s data store, (2) specific health checks (such as volume and freshness) that run on the connected data, and (3) a dashboard/alerting system that lets users observe and act on the overall health of their data.

[Image by Elizabeth Maki]
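
Sketched in code, the three parts fit together as a simple loop. Every name below is a hypothetical placeholder rather than any vendor’s actual API:

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class CheckResult:
        name: str
        passed: bool
        detail: str = ""

    def run_monitor(
        collect: Callable[[], Any],                  # (1) data collector
        checks: list[Callable[[Any], CheckResult]],  # (2) health checks
        alert: Callable[[CheckResult], None],        # (3) dashboard/alerting
    ) -> list[CheckResult]:
        """Collect from the data store, run each health check on the
        collected data, and push any failures to the alerting system."""
        data = collect()
        results = [check(data) for check in checks]
        for result in results:
            if not result.passed:
                alert(result)
        return results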

Health Checks: Rules vs ML

Health checks are generally built using either rules or machine learning. To detect data freshness issues, for example, an explicit rule such as “the newest row in this table should never be older than 5 minutes” may be defined. The machine learning approach instead uses anomaly detection to surface issues, figuring out what is atypical by training on historical data.
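
For concreteness, that five-minute freshness rule could be written as a small check against, say, a SQLite database. The events table and created_at column are hypothetical stand-ins for whatever store the rule actually targets:

    import sqlite3

    FRESHNESS_LIMIT_SECONDS = 5 * 60  # "never older than 5 minutes"

    def freshness_ok(conn: sqlite3.Connection) -> bool:
        # Age in seconds of the newest row in the (hypothetical) events table.
        (age,) = conn.execute(
            "SELECT strftime('%s', 'now') - MAX(strftime('%s', created_at)) FROM events"
        ).fetchone()
        return age is not None and age <= FRESHNESS_LIMIT_SECONDS

    conn = sqlite3.connect("app.db")  # hypothetical database file
    if not freshness_ok(conn):
        print("ALERT: newest row in events is older than 5 minutes")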

The advantages of the rule-based approach are its deterministic nature and its ability to start working immediately: when an alert fires, the user knows exactly what the issue is, and the system needs no historical training data. However, explicitly defining every health check a dataset requires can be time-consuming, and at scale it may not be feasible at all.

For large data repositories with many tables, a machine learning approach can simplify data monitoring drastically. It can also catch “long-tail” data issues that are hard to express as explicit rules. Despite these theoretical advantages, most production systems are not set up to backtest a data monitor against historical incidents, so relying entirely on machine learning can be a leap of faith.
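
As a minimal sketch of the ML flavor, the snippet below learns what “typical” daily volume looks like from history and flags large deviations with a z-score. Production tools use far more sophisticated anomaly detection than this:

    import statistics

    def is_volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
        """Flag today's row count if it deviates too far from the historical norm."""
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0  # guard against zero variance
        return abs(today - mean) / stdev > z_threshold

    daily_row_counts = [10_120, 9_874, 10_433, 10_051, 9_990, 10_210]  # made-up history
    print(is_volume_anomaly(daily_row_counts, today=4_302))  # True: volume dropped sharply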

Data Monitoring for Data Warehouses

Anomalo (https://www.anomalo.com)
Segment focus: Enterprise
Approach: ML-focused

BigEye (https://bigeye.com)
Segment focus: Enterprise
Approach: ML-focused with “Autometrics” and “Autothresholds”

Datafold (https://datafold.com)
Segment focus: Mix of SMBs and enterprise
Approach: Offers data monitoring as part of data regression suite

Metaplane (https://www.metaplane.dev)
Segment focus: Mix of SMBs and enterprise
Approach: Hybrid of rule-based and ML-based

Monte Carlo (https://www.montecarlodata.com)
Segment focus: Enterprise
Approach: Hybrid of rule-based and ML-based

Data Monitoring for Relational Databases

Basejoy (https://basejoy.io)
Segment focus: SMBs
Approach: Hybrid of rule-based and ML-based with RDBMS-specific health checks

Soda (https://www.soda.io)
Segment focus: Mix of SMBs and enterprise
Approach: Rule-based data-scanning solution

Open Source Resources

dbt (https://www.getdbt.com/product/data-testing/)
The popular data transformation tool can be used to verify data quality

Great Expectations (https://github.com/great-expectations/great_expectations)
Data testing, documentation, and profiling tools for data teams
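
As a quick taste of Great Expectations, here is the classic pandas-based API from its older (pre-1.0) releases; newer releases reorganized the API, so treat this as a sketch rather than current usage:

    import pandas as pd
    import great_expectations as ge

    # Wrap a pandas DataFrame so expectation methods become available.
    df = ge.from_pandas(pd.DataFrame({"user_id": [1, 2, None], "age": [34, 51, 29]}))

    result = df.expect_column_values_to_not_be_null("user_id")
    print(result.success)  # False: one user_id is missing

    result = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
    print(result.success)  # True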

Summary

Data monitoring tools aim to do for data engineers what Datadog, New Relic, and Sentry did for DevOps engineers: automatically catch stale, missing, malformed, or anomalous data before the errors propagate into bad decisions and hard-to-diagnose model failures. Whether rule-based, ML-based, or a hybrid of the two, a growing set of commercial and open source options now covers both data warehouses and relational databases.

Thanks to Monica Rogati for reviewing for accuracy.
