Unsupervised Data Monitoring

Jeremy Stanley

Follow

Published in

Anomalo

8 min readJan 26, 2021

--

Part 1 — Monitoring the quality of structured data at scale

To compete in a data-driven world, organizations must consolidate data into centralized warehouses and use it to enhance products and inform decisions.

Data is now a strategic asset. But how can organizations ensure they can trust the data underpinning these products and decisions?

Most data teams conclude that they need to begin testing their data — using a carefully maintained library of rules.

The Sisyphean task of creating and maintaining data quality rules in large warehouses. *Credit: Josie Stanley.*

But monitoring all of the data in an enterprise warehouse can be daunting. It is common for such warehouses to contain tens of thousands or even hundreds of thousands of tables.

Top 10 tables
Drive board-level metrics and company goals
Top 100 tables
Cross-functional data that inform product and operations
Top 1,000 tables
Detailed data owned by single teams driving process and products
Remaining 1,000+ tables
Special purpose project or feature data

The critical tables in an organization should be thoroughly tested and monitored. See our Airbnb-quality data for all post for details on how that can be achieved. But what about the rest?

If the average table contains 50 columns, and each column requires 5 rules or metrics to be well monitored, then a warehouse with 1,000 important tables requires managing 250,000 rules!

Even if each rule requires just 10 minutes to maintain each year, that would require a dedicated team of 20 highly trained data professionals (most of whom would quickly quit in protest of the drudgery):

Yet, an organization still depends upon the quality of that data. This is where unsupervised data monitoring can be critical.

Uber summarized this well in their post on Monitoring Data Quality at Scale:

Conventional wisdom says to use some variant of statistical modeling to explain away anomalies in large amounts of data. However, with Uber facilitating 14 million trips per day, the scale of the associated data defies this conventional wisdom. Hosting tens of thousands of tables, it is not possible for us to manually assess the quality of each piece of back-end data in our pipelines.

This first post in the series will demonstrate how Anomalo uses unsupervised learning to monitor data quality at scale. In subsequent posts, we will cover:

The key requirements of our system and why traditional time series and outlier detection approaches do not work
The architecture of the Anomalo unsupervised learning system — from modeling the data to explaining root causes of issues
How we define and minimize false positives, and how we benchmark our algorithm using our data chaos library

The role of unsupervised data monitoring

To illustrate how our unsupervised monitoring works, we will use a simple demo environment with just one table, public.fact_listing:

A single table of concert and sporting event ticket sales data configured in a demo warehouse.

This is a demo dataset of concert and sporting event ticket sales data. You can see that we have 3 checks passing on this table and that the checks run daily when the data is fresh.

Clicking into the table presents the table homepage:

The table home page, where new checks are added and check results are inspected.

As you can see, Data freshness has already run for this table, which determines when the data is complete each day, and automatically kicks off all other checks.

At the bottom are Key metrics and Validation rules sections where the user would leverage our time series models or library of custom validations to check their data.

In-between is the Table anomalies section, which contains two checks that are automatically configured for any monitored table:

The two unsupervised learning variants used to identify table anomalies.

Our machine learning model learns a representation of the typical data in the table. As new data arrives, it detects if that data is meaningfully different from what appeared in the table before.

We run two variants of this algorithm:

no increase in NULL values
A constrained model looking for significant increases in NULL values
no anomalous records
Our full machine learning algorithm, which identifies changes in continuous distributions, categorical values, time durations, or even relationships between columns

The first variant runs at a high priority level and notifies users when a sudden spike in NULL values is observed, as this may indicate missing data that should be fixed quickly. The second variant, no anomalous records, produces a log of meaningful changes in each table.

For example, in this Oxford COVID-19 Government Response dataset in BigQuery, in the column public_information_campaigns_flag, the value 1 almost entirely disappeared and was replaced with NULL values on October 30th.

A time-series view of table anomalies in a BigQuery public COVID dataset. The table columns are on the vertical axis and time is on the horizontal axis. The circle sizes correspond to the strength of the anomalies.

Of course, not all columns are important, and so the user can control which tables and columns they wish to see notifications for.

Finding and characterizing data chaos with unsupervised learning

To illustrate how this works in practice, let’s introduce an anomaly into this dataset. We will use our command-line client to trigger one of our chaos operations: TimeColumnZero. This introduces zero values into a column at a specific point in time.

Causing chaos in a table by inserting zero values into a column on a given date for a subset of rows.

The column numtickets in fact_listing will now contain 30% zero values on 2021–01–17 whenever the venuestate is equal to “NY.” This illustrates a common data quality issue — an invalid value suddenly appears for a fraction of rows in a key data segment.

Next, we can re-run the table anomaly checks, and we find that the no anomalous records variant fails (as it is looking for any meaningful change):

A failing table anomaly check after having introduced chaos into the table.

We can click into view details to see the explanation:

Summarizing the anomaly in natural language and evaluating its severity.

The table anomaly check has correctly identified that the column numtickets has a sudden increase in 0 values. Note that we never told it to look at this column or to look for zero values.

It also scores the anomaly's severity (this one is strong) and compares that to a learned threshold for this table, which accounts for how much background noise there is in each column. Because the severity exceeds the threshold, this check fails and is highlighted in the user interface (and could notify the user in Slack or Teams).

Scrolling down, we see a custom visualization selected based on the anomaly type and data distribution:

The algorithm chooses a dynamic visualization to help contextualize the anomaly, in this case showing the distribution of the top values in the column on the prior and current dates as a tornado.

In this case, a top values chart shows the most common numtickets values and compares the distribution on 2021–01–16 (the left bars) to the distribution on 2021–01-17 (the right bars). You can see that the value 0 was not there before and is now suddenly 10.4% of records.

Scrolling further, we can see the Root Cause Explanation, which analyzes the raw data underlying our unsupervised model to identify if there are segments of the data where the issue is most prominent:

A root cause analysis performed to identify *where the anomalous records occurred.*

As you can see, the algorithm correctly identifies that NY is where the anomaly occurred. 100% of the anomalous rows are in that state, but only 29.8% of the population rows are in NY.

This automatic identification of where an issue occurs is powerful. Without it, a user would need to examine records, trace lineages, or repeatedly query and visualize the data to isolate the issue.

The algorithm can find even more complex issues, such as when the relationship changes between columns. For example, consider this chaos operation:

Introducing a more complex form of chaos, where a single column is randomly shuffled on a date.

Here we are shuffling the priceperticket column so that the values no longer correspond to the correct rows. The actual values remain the same (and have the same distribution and mean), but the relationship between that column and other columns in the table has been broken.

In this case, the anomaly cannot be easily summarized in natural language, but we can still score its severity.

In this case, the algorithm identifies that priceperticket is the most problematic column, and that the issue is related to listid and totalprice as well.

Examining the distribution of priceperticket values on 1/16 and 1/17 shows that the anomaly is the strongest in the lowest and highest priceperticket values:

This visualization bins the continuous column into deciles. While the distribution is unchanged between the two days, we can see that the anomaly severity (color) is more intense in extreme deciles.

We can examine a sample of individual rows, where we can see that the algorithm is scoring every value in the table for how anomalous it is:

The most anomalous row in a sample of the data — with color indicating how much credit each value contributed towards the anomaly.

In this example, the $1,960 price per ticket doesn’t make any sense in the context of 28 tickets for $1,568. Plus, the price per ticket is unusually high for the Vampire Weekend show.

Zooming out, we can see this effect across many rows:

The most anomalous 25 rows in a sample of the table, again colored by how much each value contributed to the anomaly.

Now you can more clearly see which specific values of the table the algorithm believes are contributing the most to the anomaly, as indicated by the severity color scale on the right-hand side.

This granular allocation of the anomaly into the table's specific values is key. It allows our system to visualize and explain the underlying issue clearly.

When embarking on a journey to monitor and test data quality in your warehouse, it makes sense to start simple. You can use open source libraries, write your own tests, or leverage a platform like ours at Anomalo to thoroughly test your data.

But as warehouses, organizations, and testing ambitions scale, simple rule and time series based approaches fall over. They cannot effectively cover the long tail of data quality issues that commonly occur.

That is where unsupervised data monitoring comes in. You leverage a machine to learn the structure of your data and monitor for significant unexpected changes. It notifies you when a meaningful negative change occurs and presents you with visual summaries and explanations that dramatically accelerate your triage and resolution times.

Stay tuned for our next post in this series, which will explain how our unsupervised algorithm works behind the scenes.

To get started with Anomalo, and begin monitoring your data at scale using our algorithms, be sure to request a demo.

Unsupervised Data Monitoring

Written by Jeremy Stanley