Measuring Data Volatility — A First Line of Defense Against Bad Data

Radius Engineering
Sep 23, 2017

Data is at the heart of everything we build at Radius. The Network of Record is the collation of billions of actively refreshed data points across hundreds of disparate datasets related to businesses around the world. Each of these data points makes some contribution to our customers’ ability to drive revenue, and therefore requires scrutiny before it enters our pipelines. Today, we share some details on how Radius monitors dataset stability with a framework that generalizes to every input we feed into the Network of Record.

Stewards of Data

On the Aggregation team, we are vigilant in our role as the stewards of data. Since we are the first touch point for every dataset, we are also the first (and sometimes only) line of defense against bad data.

With Radius’ wide array of data partners, public sources, and customer network contributions, chaos is the only constant as new data arrives on our doorstep. Not only does business data naturally evolve over time, but there is also unpredictable human influence on every dataset we ingest. As a result, there will always be idiosyncrasies and undesirable anomalies that can pollute our data pipeline if we don’t measure and detect them.

Did a data partner unexpectedly remove a critical field? Are duplicate records creeping into our pipeline? Did some broad formatting change get applied to a field, affecting every record in a dataset? Understanding and managing this volatility is necessary, but can be immensely time-consuming without proper process and weaponry. Moreover, as the Network of Record grows, so does the number of data sources that require some form of quality assurance. While we can throw people at the problem, we seek a higher-leverage solution.

Inverting the Law of the Instrument

The “law of the instrument” generally reads, “If all you have is a hammer, everything looks like a nail.” One could take this to mean that it is a fool’s errand to engineer a solution to data quality assurance that is one-size-fits-all.

But we prefer to flip this on its head to read, “If you can make everything look like a nail, then all you need is a hammer.”

One approach is to stop being distracted by the unique personalities of each dataset and instead focus on the attributes common to every dataset. At your annual check-up, your physician does not know your entire life’s story, and there are an infinite number of signals they could monitor for illness. The best they can do is check that your key vitals are within normal bounds and see if you are trending in any particular direction.

We use a similar process for data quality assurance, and use data volatility as a key vital. Instead of answering the hard question “Does this dataset accurately represent the real world?”, we first answer the easier question “Is the volatility in this dataset expected?” If we can axiomatically reduce data to a set of volatility measurements that is agnostic to where the data came from, when we got it, who gave it to us, how they gave it to us, and even what the data is supposed to represent, then we can fashion a proper tool to robustly solve a large part of our data quality assurance needs.

Our Data Monitoring Cookbook

With this in mind, our recipe for automated data monitoring is as follows:

  1. Convert every incoming dataset to a unified format
  2. Extract volatility metrics from every dataset
  3. Monitor these metrics and alert on abnormalities
  4. If there are no significant abnormalities, promote the dataset into the Network of Record
  5. Otherwise, generate human-readable reports to facilitate troubleshooting

The first step is straightforward: we convert every dataset to JSON and define its structure with jsonschema. Any data points that fail a structural schema check are put into a holding cell for further investigation, while the rest proceed to metrics extraction. Beyond putting all our data in a unified format, these structural checks let us immediately set aside data that is obviously malformed.
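As a rough illustration of this step (the schema and field names here are hypothetical, not our production schema), a structural check using the Python jsonschema library might look like:

    import jsonschema

    # Hypothetical structural schema; real schemas are richer, but the
    # validation pattern is the same.
    RECORD_SCHEMA = {
        "type": "object",
        "properties": {
            "rid": {"type": "string"},
            "business_name": {"type": "string"},
            "annual_revenue": {"type": ["number", "null"]},
        },
        "required": ["rid", "business_name"],
    }

    def split_by_schema(records, schema=RECORD_SCHEMA):
        """Route structurally valid records onward; hold the rest for review."""
        valid, holding_cell = [], []
        for record in records:
            try:
                jsonschema.validate(record, schema)
                valid.append(record)
            except jsonschema.ValidationError:
                holding_cell.append(record)
        return valid, holding_cell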

For the remainder of this discussion, we will focus on the second step and how we define volatility metrics.

Measuring Data Volatility

When discussing our data volatility metrics, it’s helpful to think of a dataset as a set of records (e.g., rows in a table) that each have a set of fields (e.g., columns in a table). Each record may have a record identifier (RID) field that can be used to uniquely identify a record across multiple versions of a dataset.
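To make this concrete, a single (entirely hypothetical) record might look like the following, with rid serving as the record identifier:

    record = {
        "rid": "biz-001042",        # record identifier (RID)
        "business_name": "Acme Widgets",
        "annual_revenue": 2500000,
        "headcount": 40,
    }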

We define our volatility metrics based on the following questions:

  • How can the size of the dataset change over time?
  • How can a field change over time?
  • How can a record change over time?

Our volatility metrics fall into three groups: dataset size changes, field changes, and record changes. We’ll go into each in further detail below.

Dataset Size Change Metrics

These are our simplest metrics, but they are surprisingly informative. By tracking dataset size over time and how RIDs enter and leave a dataset, we not only catch rapid size fluctuations, but also get a sense of any regular cadence in a dataset. For example, for one of our datasets we know to expect a burst of volatility roughly once a month.
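As a sketch, these size metrics reduce to simple set arithmetic over RIDs (assuming each dataset version is a list of records keyed by a rid field, as in the hypothetical record above):

    def size_change_metrics(previous_records, current_records):
        """Track dataset size and how RIDs enter and leave between versions."""
        prev_rids = {r["rid"] for r in previous_records}
        curr_rids = {r["rid"] for r in current_records}
        return {
            "previous_size": len(prev_rids),
            "current_size": len(curr_rids),
            "rids_added": len(curr_rids - prev_rids),
            "rids_removed": len(prev_rids - curr_rids),
            "rids_retained": len(prev_rids & curr_rids),
        }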

Field Change Metrics

Since we are often interested in some critical fields (e.g., business name, mailing address, e-mail, phone, etc.), our most frequently used metrics focus on the changes within a particular field. For the purposes of this post, let’s assume we are interested in a field representing the annual revenue of a business.

Record fill rate

This measures the percentage of records that have a usable value for a particular field, and we track this fill rate over time. For example, this would allow us to catch when the revenue field is suddenly removed, or is not getting parsed correctly due to a recent formatting change.
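A minimal sketch of the calculation, assuming a value is “usable” when it is present and non-empty (our real usability rules are stricter):

    def fill_rate(records, field):
        """Percentage of records with a usable value for the given field."""
        if not records:
            return 0.0
        usable = sum(1 for r in records if r.get(field) not in (None, ""))
        return 100.0 * usable / len(records)

    # e.g., track fill_rate(records, "annual_revenue") for every dataset version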

Type distribution

A field value can be one of the following types: boolean (true or false), string, number, a list of objects, or a nested record. While we generally expect a field to have the same type across all records, we don’t like surprises. For example, if revenue abruptly changed from a number to a string, downstream processes would almost certainly have unexpected behavior.
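A sketch of how the type distribution for a single field might be tallied, with buckets mirroring the types listed above:

    from collections import Counter

    def type_distribution(records, field):
        """Count which type each record uses for a given field."""
        def type_name(value):
            if value is None:
                return "null"
            if isinstance(value, bool):   # must check bool before number
                return "boolean"
            if isinstance(value, (int, float)):
                return "number"
            if isinstance(value, str):
                return "string"
            if isinstance(value, list):
                return "list"
            if isinstance(value, dict):
                return "nested record"
            return "other"
        return Counter(type_name(r.get(field)) for r in records)

    # A result like Counter({"number": 98120, "string": 3}) would flag a surprise.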

Value distribution

Some fields are what we call enumerations because the set of unique values for the field is relatively small. For example, some datasets have revenue as an enumeration of bands, such as:

  • Less than $250,000
  • $250,000 to $10 Million
  • $10 Million to $1 Billion
  • $1 Billion to $10 Billion
  • Over $10 Billion

We track whether bands are added or removed over time, and also if the number of records in any band changes significantly. If many companies are all of a sudden making over $10 billion in revenue, while we would be very happy for them, something is likely wrong with the dataset.
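For enumerated fields like these revenue bands, the value distribution is just a frequency count per band that we compare across dataset versions. A sketch:

    from collections import Counter

    def value_distribution(records, field):
        """Frequency of each distinct value of an enumerated field."""
        return Counter(r[field] for r in records if r.get(field) is not None)

    def band_shifts(previous_records, current_records, field):
        """Flag bands that were added, removed, or changed markedly in size."""
        prev = value_distribution(previous_records, field)
        curr = value_distribution(current_records, field)
        return {
            "bands_added": sorted(set(curr) - set(prev)),
            "bands_removed": sorted(set(prev) - set(curr)),
            "count_deltas": {
                band: curr.get(band, 0) - prev.get(band, 0)
                for band in set(prev) | set(curr)
            },
        }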

Record Change Metrics

Our most complex metrics involve measuring the average change in a record across time. These usually require that we have a well-defined RID so we can compare versions of a record across multiple versions of a dataset.

Field fill rate

One simple metric we track is how many fields are populated per record, and how this changes over time. This is a measure of how “fat” each record is in a dataset, and if we notice that records are getting significantly fatter or skinnier en masse, we pause to investigate why. For example, the fill rates for revenue and headcount may each look stable at 75% across the entire dataset, even though many pre-existing records had both their revenue and headcount values removed while newly added records arrived with both fields populated.
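A sketch of the per-record “fatness” metric (again treating populated as non-null and non-empty):

    def record_fatness(records):
        """Average number of populated fields per record."""
        if not records:
            return 0.0
        filled = sum(
            sum(1 for v in r.values() if v not in (None, "")) for r in records
        )
        return filled / len(records)

    # Tracked over time; with RIDs we can also compare fatness per record
    # across versions, which catches pre-existing records getting skinnier
    # even when aggregate fill rates look flat.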

Number of value changes

This is where things get complicated and fun. It is very useful to know how many records in a new dataset have a revenue update, and what the average magnitude of these updates is.

To calculate this, we do the following:

  • Extract the last observed revenue value and the new revenue value for each unique RID
  • Count the records that had a revenue change (as well as those that stayed the same)
  • For the records with a revenue change, count the different ways the revenue changed. If revenue is a number, we would count things like the following:
      ◦ Whether it got set from a number to a “null” value (i.e., the field was “emptied”)
      ◦ Whether it got set to a number from a “null” value (i.e., the field was “filled”)
      ◦ The distribution of raw change magnitude (e.g., how many records increased by $1 Million, how many records decreased by $10 Million, etc.)
      ◦ The distribution of relative change percentage (e.g., how many records increased by 5%, how many records decreased by 1%, etc.)
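A condensed sketch of the tally described above, keyed on the hypothetical rid field (real change metrics are computed per type and bucketed into finer distributions):

    from collections import Counter

    def revenue_change_metrics(previous_records, current_records,
                               field="annual_revenue"):
        """Count how a numeric field changed between two dataset versions."""
        prev = {r["rid"]: r.get(field) for r in previous_records}
        curr = {r["rid"]: r.get(field) for r in current_records}
        counts = Counter()
        raw_deltas, pct_deltas = [], []
        for rid in prev.keys() & curr.keys():
            old, new = prev[rid], curr[rid]
            if old == new:
                counts["unchanged"] += 1
            elif new is None:
                counts["emptied"] += 1
            elif old is None:
                counts["filled"] += 1
            else:
                counts["changed"] += 1
                raw_deltas.append(new - old)
                if old:
                    pct_deltas.append(100.0 * (new - old) / old)
        # raw_deltas and pct_deltas are then bucketed into the magnitude and
        # percentage distributions described above.
        return counts, raw_deltas, pct_deltas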

Every type (e.g., booleans, lists, etc.) has a different set of change metrics that we don’t have time to completely elaborate on. Suffice it to say there are usually hundreds of change metrics across dozens of fields to monitor per dataset, and these are invaluable both for preemptively debugging odd-looking datasets and for tracing issues in the Network of Record back to a particular dataset.

Would You Like to Know More?

Now that we have hundreds of metrics to monitor, how do we make them useful? In future blog posts, we will elaborate on methods for outlier detection and transforming all these signals into lovely human-readable reports for our data engineers and scientists.

If you enjoy rigging up automated quality assurance systems, get excited by this talk of metrics, and want to help build data pipelines, we encourage you to check out https://radius.com/careers/ and start a conversation with us. We’d love to have you!

Nick Hwang — Engineering Manager @ Radius
