When Data Disappears

Jeremy Stanley
Nov 10, 2020
[Figure: The most common data quality issue is no data at all]

When we think of data quality, the first issues that come to mind are visible problems like duplicate rows, NULL values or corrupted records. But in fact the most common data quality issue is that data has simply disappeared.

In this post, we describe how data disappears, what the common causes are, and what data teams can do to identify these issues.

Consider how companies process data into their warehouses:

[Figure: how raw data flows through loading systems and SQL pipelines into warehouse fact tables]

Raw data is captured through logging systems or from external sources; data loading systems then pre-process the raw data and load it into a data warehouse. Finally, complex SQL pipelines filter for important records, join multiple sources together, and perform complex aggregations.
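
As a concrete illustration of that last stage, a fact table might be produced by a scheduled transformation along these lines. This is a generic sketch, not any particular company's pipeline: the table and column names are invented, and real pipelines are usually far more involved.

```python
# A hypothetical job that builds a daily fact table from raw events.
# All table and column names here are invented for illustration.
BUILD_FACT_ORDERS = """
CREATE OR REPLACE TABLE analytics.fact_orders AS
SELECT
    o.order_date,
    c.region,                           -- joined in from a dimension table
    COUNT(*)      AS orders,            -- aggregations per day and region
    SUM(o.amount) AS revenue
FROM raw.orders AS o
JOIN raw.customers AS c ON c.id = o.customer_id
WHERE o.status = 'complete'             -- filter to the important records
GROUP BY o.order_date, c.region
"""

def rebuild_fact_orders(conn) -> None:
    """Run the transformation against a DB-API-style warehouse connection."""
    conn.cursor().execute(BUILD_FACT_ORDERS)
```

Note that if the raw events stop arriving, or the joined table loads late, this job still "succeeds"; the fact table is simply missing rows.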

The resulting tables, often referred to as “fact tables”, are the golden datasets of an organization. Cross-functional teams leading strategic initiatives rely upon them, product managers make decisions using them and operations and sales teams are managed based on them.

It is hard to overstate the importance of ensuring these tables are reliable.

But important fact tables can be the result of many transformations across disparate systems linking varied source datasets. This complexity increases the likelihood of incomplete data:

[Figure: Data processing stages, popular platforms, and examples of what breaks along the way]

However, unlike some other data quality issues, there isn’t a single SQL query to validate that data is complete. This is because incomplete data can take many forms.

The simplest issue is that there is no recent data at all.

[Figure: The data went entirely missing on the most recent date]

Even this can be fraught with challenges, because we need to know when data should have arrived. Ideally, a monitoring system tracks how long a dataset typically takes to update and alerts when new data is significantly delayed.
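
A minimal sketch of such a freshness check, assuming you can query the latest loaded timestamp for the table (the helper name and the threshold below are invented for illustration):

```python
from datetime import datetime, timedelta, timezone

# Invented rule of thumb: alert when the table is more than twice
# as late as its typical update delay.
TYPICAL_DELAY = timedelta(hours=6)

def is_fresh(latest_loaded_at: datetime) -> bool:
    """Return True if the table has updated recently enough.

    `latest_loaded_at` is the most recent load timestamp in the table,
    e.g. the result of SELECT MAX(loaded_at) FROM your_table.
    """
    lag = datetime.now(timezone.utc) - latest_loaded_at
    return lag <= 2 * TYPICAL_DELAY
```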

A more nuanced issue that can go undetected is that there are fewer records than expected, or that data disappeared for a small period of time.

[Figure: The data appears to be incomplete on the most recent date]

In this case, you need a time series model to predict what range of row counts is expected. Such models need to control for trend, weekday seasonality, annual seasonality, and holidays. Doing this reliably at scale can be challenging.
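
As one way to sketch that idea, the open-source Prophet library fits trend, weekly and yearly seasonality, and holiday effects out of the box. The interval width, country, and column names below are choices made for this example, not requirements:

```python
import pandas as pd
from prophet import Prophet  # open-source time series forecasting library

def expected_row_count_range(daily_counts: pd.DataFrame) -> tuple[float, float]:
    """Predict the plausible row count range for the next day.

    `daily_counts` has two columns, per Prophet's convention:
    ds (date) and y (observed row count on that date).
    """
    model = Prophet(
        interval_width=0.99,      # how wide the "normal" band should be
        weekly_seasonality=True,  # weekday patterns
        yearly_seasonality=True,  # annual patterns
    )
    model.add_country_holidays(country_name="US")  # holiday effects
    model.fit(daily_counts)

    future = model.make_future_dataframe(periods=1)  # history + next day
    forecast = model.predict(future).iloc[-1]
    return forecast["yhat_lower"], forecast["yhat_upper"]
```

An observed count below the predicted lower bound is worth alerting on; the hard part, as noted above, is keeping such a model accurate and its false positives rare across many tables.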

Even more difficult to handle is when data disappears for an important segment of the data.

[Figure: An important segment of the data is almost entirely gone, but overall row counts remain plausible]

The missing segment might be small enough not to affect the overall row counts, but still be mission-critical for your business.
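
One simple way to sketch a per-segment check is to compare each segment's most recent count against its own trailing history rather than against the table total. The tolerance below is an invented choice:

```python
import pandas as pd

def missing_segments(history: pd.DataFrame, tolerance: float = 0.5) -> list[str]:
    """Flag segments whose latest daily count fell far below their norm.

    `history` has columns: date, segment, rows. A segment is flagged
    when its latest count is below `tolerance` times its trailing
    median (50% here, an arbitrary choice for the sketch).
    """
    latest_date = history["date"].max()
    latest = history[history["date"] == latest_date].set_index("segment")["rows"]
    trailing = (
        history[history["date"] < latest_date]
        .groupby("segment")["rows"]
        .median()
    )
    # Segments absent from the latest date count as zero rows.
    return [
        seg
        for seg, norm in trailing.items()
        if latest.get(seg, 0) < tolerance * norm
    ]
```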

So what happens when data disappears? The consequences can be widespread.

Usually, a data organization's first instinct is to rely on monitoring of the systems that produce the data: infrastructure engineering monitors the logging systems with production metrics, and data engineering monitors the coordination and loading systems for outages or missing data.

But as an organization matures, the way data is produced grows more complex, and it becomes dangerous to rely on monitoring individual components:

[Figure: Data processing flows almost always start simply, but become increasingly complex over time]

At Anomalo, we've found the only way to be certain your data is available is to test it independently of the systems producing it.

[Figure: Data processing pipelines are complex and require independent monitoring of the important data in the warehouse]

To ensure that data is available and complete, we run the following sequence of tests:

1. Is there any fresh data at all, or is the table significantly delayed?
2. Does the latest row count fall within the range a time series model predicts?
3. Are all important segments of the data present in their expected volumes?
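
Stitched together, that sequence might look like the sketch below, reusing the helpers from the earlier examples. This illustrates the logic of the checks, not Anomalo's actual implementation:

```python
def check_table(latest_loaded_at, daily_counts, segment_history) -> list[str]:
    """Run the three completeness checks in order; return any issues found."""
    if not is_fresh(latest_loaded_at):
        return ["no recent data: the table is significantly delayed"]

    issues = []
    lower, _upper = expected_row_count_range(daily_counts)
    if daily_counts["y"].iloc[-1] < lower:
        issues.append("fewer rows than the predicted range allows")
    issues += [
        f"segment {seg!r} is missing or far below its normal volume"
        for seg in missing_segments(segment_history)
    ]
    return issues
```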

Setting this up for a new table requires only a few pieces of information:

[Figure: How to configure a table to be monitored for missing data in Anomalo]

Then, if we ever discover an issue with incomplete data we send a notification to the relevant teams’ Slack, Teams, PagerDuty or e-mail:

[Figure: A Slack notification showing an incomplete data issue. The green band is the predicted row count range per day; the dots and lines are the actual row counts]

For example, the above Slack message is for a Google BigQuery public dataset of San Francisco transit stops. We saw a big decrease in records on 2020-10-29: our model predicted that there should be at least 10,053 rows, but only 3,672 rows exist. What happened to the other 6,381 rows?

We also note that this was the first time this table had failed to load data on time in the last 34 runs, providing users with a sense of just how unusual or extreme this behavior is.

Digging deeper, we show exactly what happened on the 29th:

[Figure: Predicted row count range per hour (green band) versus actual (lines and dots)]

It appears the data had disappeared by 8am and never returned that day.

When it is relevant, we show a breakdown of exactly which segments are appearing less frequently than expected:

[Figure: A detailed view by segment, showing the predicted range of rows as green bars and red dots indicating what actually happened for four types of content]

This chart is from a COVID-19 online news dataset; it shows that on November 4th there were fewer articles online about Cases, Quarantine, Prices, and Ventilators than expected.

These details help data teams rapidly triage issues, identify root causes, and communicate with affected teams internally. In many cases, they even accelerate the development and deployment of fixes. Moving quickly and confidently to identify and resolve such issues greatly reduces the chances that they negatively impact the rest of the company.

Before any dataset is used for mission critical decisions or products, data-driven organizations should validate the quality of the data using an automated system. The first step to get right is to ensure the data hasn’t disappeared.

To learn more about Anomalo, request a demo.
