Data Provenance and Coverage Analysis for High Quality ML Models

Published in Doma, Jan 6, 2021 (10 min read)

Before data scientists get to work building machine learning models, they typically conduct a series of steps known as “Exploratory Data Analysis” (EDA). While these steps are taught at data science boot camps and universities everywhere, I would like to argue that ML teams ought to adopt an even more fundamental series of quality checks to ensure that data is usable for production, in advance of attempting EDA.

When shopping for a new home, it is prudent for a prospective buyer to hire an inspector to assess not only the overall quality and condition of the property prior to purchase, but also the land and foundation upon which it is built. Similarly, when building a machine learning model, it’s important for a data scientist to establish the quality, integrity, and accuracy of their data, because without a rock-solid foundation in these areas, ML models are likely to fail in production. In both scenarios, incorrect assumptions about underlying quality can lead to disastrous outcomes.

“Data is at the core of States Title’s machine-learning-based product offerings, and it is extremely important to ensure we properly assess and certify the data used in our models, from initial data evaluation through to model deployment.”

This article focuses on techniques and methodologies that can be used at the early data evaluation and preparation stages of the modeling process, before the EDA steps, to help ensure the quality, accuracy, and integrity of a model’s training and input data.

The data background check

Let’s start at the very beginning: obtaining and assessing source data. First, I want to distinguish between two types of data sources: public and private. For our purposes, we’ll define public data as that which can be obtained by the general public at no cost, such as U.S. census data. Private data is that which is acquired at some cost or through an agreement with a third party, or data obtained within an organization. Though many of the evaluation techniques we’ll explore apply to both types, there are some additional factors at play when acquiring and evaluating private data, particularly from third-party providers. We’ll highlight those as we go along.

Prior to utilizing any data provider or data source for modeling purposes, we need to spend time learning and evaluating as much as we can about the source itself. This allows us to build a surface-level confidence in the data and make sure it is usable for our modeling initiative. It’s helpful to write down as many questions as possible and document the responses and findings related to those inquiries. Think of this as a background check. The data source should satisfy the check prior to being approved for your modeling use case.

It is worth exploring topics including:

1. Raw data collection

Where is the raw data being collected from? Is it manually gathered and compiled? Is it coming from some other private data provider?

2. Content

What features does the data include?

3. Data freshness

How fresh is the data? What is the frequency at which it is being updated?

4. Historical data fidelity

How are data updates handled? Are historical records replaced or retained during updates? Can we reliably retrieve snapshots of what the data would have looked like at a specific point in time?

5. Coverage

What is the data coverage with respect to time, geographic location, etc.?

Let’s take a closer look at each of these as we evaluate a public data set using the following hypothetical scenario:

Our team is building a model to predict New York City residential property sales prices. Historical sales price data is something that we’d like to incorporate into our model. For this purpose, we will evaluate New York City’s Department of Finance (NYC DOF) sales data as a prospective data source for our model.

Let’s assume that our team has access to a separate dataset that we completely trust, which has sales history data for a set of 10,000 NYC property addresses. This will serve as data we’ll use to validate the prospective new data source.

Raw data collection

It is important to understand details about the collection of the prospective data source. It provides initial insight into completeness and relevance for your use case. If working with a private data provider, many times this is uncovered by simply asking. It isn’t rare for private providers to obtain their information from multiple sources and/or data vendors. Though this is considered second-hand data, it may not always be a bad thing, especially if you find benefit in going to one place for a consolidated set of data as opposed to onboarding multiple vendors or sources yourself.

In our case, the NYC DOF gives the following information about their data on their website:

The NYC DOF is responsible for recording and maintaining all official documents related to NYC real estate, across all boroughs.

The sales data is available for Tax Classes 1, 2, and 4, which include residential sales, and goes back to 2003.

Based on this information, we can consider the NYC DOF’s sales data a promising first-hand source of historical NYC residential property sales. And given that it goes back 17 years, it seems like a great data source for model development.

Content

Data dictionaries or other documentation are usually helpful in providing insight into the features of a data set. Spend some time browsing the data. Make note of unclear or ambiguous features and seek understanding of what they represent. Take a look at some of the values of your data and see if you can identify any patterns. Browse the unique values of each feature. Perform basic filters and assess how often certain features are populated or not. This is a good time to make general observations of the content that in the future may help answer questions on what is available or possible with the data source for modeling purposes.
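The browsing steps above (unique values, how often a feature is populated) can be roughed out in a few lines. The following is a minimal sketch, not the workflow used for this article: the column names and the sample rows are made up for illustration and do not come from the NYC DOF files.

```python
from collections import Counter


def profile_columns(rows):
    """Summarize each column: share of populated values, distinct count, top values."""
    columns = sorted({name for row in rows for name in row})
    profile = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        # Treat None and empty strings as "not populated".
        populated = [v for v in values if v not in (None, "")]
        profile[name] = {
            "fill_rate": len(populated) / len(rows) if rows else 0.0,
            "distinct": len(set(populated)),
            "top_values": Counter(populated).most_common(3),
        }
    return profile


# Made-up rows for illustration only; hypothetical column names.
sample_rows = [
    {"borough": "Brooklyn", "sale_price": "650000", "year_built": "1925"},
    {"borough": "Queens", "sale_price": "0", "year_built": ""},
    {"borough": "Brooklyn", "sale_price": "810000", "year_built": None},
    {"borough": "Bronx", "sale_price": "0", "year_built": "1960"},
]

for column, stats in profile_columns(sample_rows).items():
    print(column, stats)
```

A low fill rate or a suspiciously dominant value (such as a repeated "0") is the kind of observation worth writing down and following up on.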

Below is a list of features available and other descriptive statistics of our NYC DOF dataset.

Documentation is also available on the NYC DOF website.

Sales Date and Sales Price are two key sales features we will definitely need. Square footage and year built are property characteristics that are available and could also be useful for our model. Other than these, no other property or area-specific characteristics exist in the data set. Assuming we choose to proceed with the NYC DOF dataset, we may want to consider joining this data with public census data to gather facts about a property’s neighborhood. Some of those characteristics should prove to be strong predictors of sales price.

Data freshness

Understanding how up-to-date a dataset is helps shape understanding of how useful it may be. These days, it is rare to find a scenario where more-frequent data updates are not desired, but again, think about what your data needs are and consider the update frequency that would work best for your use case. Just like understanding raw data collection, if working with a vendor, ask them how fresh their data is and how often it is updated. Otherwise, do some online research and see if you can determine this on your own. For our purposes, per the NYC DOF website:

A file is published monthly which includes property sales data collected over the past rolling 12 months.

Given that their data is updated monthly, this is likely good enough to capture general real estate market fluctuations over time, and could be used for model retraining every month. Note that, in an industry where markets can fluctuate more dramatically over a shorter time period (such as the stock market), a once-per-month update might be a major cause for concern.
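Because each monthly file covers a rolling 12 months, consecutive files overlap, so consolidating them means removing repeated records before measuring how stale the data is. Here is a rough sketch under stated assumptions: the field names are hypothetical, the records are made up, and treating (address, sale date, sale price) as the identity of a sale is a simplification that real data may not support.

```python
from datetime import date


def consolidate(monthly_files):
    """Merge overlapping rolling files, keeping one copy of each sale."""
    unique = {}
    for records in monthly_files:
        for rec in records:
            # Assumed identity of a sale; a real source may offer a better key.
            key = (rec["address"], rec["sale_date"], rec["sale_price"])
            unique.setdefault(key, rec)
    return sorted(unique.values(), key=lambda rec: rec["sale_date"])


def staleness_days(records, as_of):
    """Days between the most recent sale in the data and the as_of date."""
    latest = max(rec["sale_date"] for rec in records)
    return (as_of - latest).days


# Made-up records standing in for two consecutive monthly files.
november_file = [
    {"address": "1 EXAMPLE ST", "sale_date": date(2020, 9, 14), "sale_price": 500000},
    {"address": "2 SAMPLE AVE", "sale_date": date(2020, 10, 30), "sale_price": 725000},
]
december_file = [
    {"address": "2 SAMPLE AVE", "sale_date": date(2020, 10, 30), "sale_price": 725000},
    {"address": "3 DEMO BLVD", "sale_date": date(2020, 11, 30), "sale_price": 610000},
]

consolidated = consolidate([november_file, december_file])
print(len(consolidated), staleness_days(consolidated, as_of=date(2020, 12, 15)))
```

Tracking the staleness figure each time a new file arrives gives an empirical check on the advertised update frequency.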

Historical data fidelity

When evaluating data that will also be used for model training, historical data fidelity is critical. We need to make sure that historical records are preserved as of the point in time the event occurred and are not overwritten in any way. This allows our models to be trained on a true representation of what was known at the time, free of "time travel," where information from the future leaks into a training example.
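To make the point-in-time idea concrete, here is a small sketch of building a "what was known then" view from dated records. It assumes hypothetical field names and made-up sales, and it only works if the source really does retain historical records rather than overwriting them.

```python
from datetime import date


def sales_known_as_of(sales, as_of):
    """Return only the sales that had already occurred on or before as_of."""
    return [s for s in sales if s["sale_date"] <= as_of]


def last_prior_sale(sales, address, as_of):
    """Most recent sale of an address strictly before as_of, or None."""
    prior = [
        s for s in sales
        if s["address"] == address and s["sale_date"] < as_of
    ]
    return max(prior, key=lambda s: s["sale_date"], default=None)


# Made-up sales for illustration only.
sales = [
    {"address": "1 EXAMPLE ST", "sale_date": date(2017, 5, 1), "sale_price": 500000},
    {"address": "1 EXAMPLE ST", "sale_date": date(2019, 8, 15), "sale_price": 620000},
    {"address": "2 SAMPLE AVE", "sale_date": date(2018, 3, 10), "sale_price": 410000},
]

# A training example for the 2019 sale may only use the 2017 sale as history.
print(last_prior_sale(sales, "1 EXAMPLE ST", as_of=date(2019, 8, 15)))
```

The strict inequality in `last_prior_sale` is a deliberate choice: the sale being predicted must never appear among its own input features.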

Using our property sales example, we should be able to identify and distinguish the sales that occurred on a property up to a point in time. For private sources, asking the data provider may be helpful. For public data sources, documentation may provide insight. Nevertheless, in both cases, it is best to verify for yourself. This is where a validation data set starts to become handy. Let’s dive in…

By quickly eyeballing the NYC DOF sales history data, we can see that Sales Date is an available feature, which suggests that any given sales history record is as of the sales date and that historical data fidelity is intact. But we can take this a step further by comparing our validation data set to the data source.

Our validation set includes 10,000 NYC residential property addresses with sales history, including sale date and sales price, between 2017 and 2020.

We joined our validation set to the consolidated NYC DOF files using address. We then attempted to align each property’s sale date and sales price from our validation set with the NYC DOF files.

We successfully matched sales prices and sale dates for some property addresses, which gives us confidence that sales history is preserved, but there were also some mismatches. Though these could indicate a data quality issue, they provide a good opportunity to investigate the differences and determine how to address them during our model build.
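The comparison described above can be sketched as a join on a normalized address followed by a check of sale date and price. This is an illustrative outline, not the procedure actually used here: the normalizer is deliberately crude, the field names are hypothetical, and the records are made up.

```python
import re
from collections import Counter


def normalize_address(address):
    """Uppercase, drop punctuation, collapse whitespace (a crude normalizer)."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", address.upper())
    return " ".join(cleaned.split())


def compare_to_source(validation, source):
    """Classify each trusted record as exact match, sale mismatch, or not found."""
    source_index = {}
    for rec in source:
        key = normalize_address(rec["address"])
        source_index.setdefault(key, set()).add((rec["sale_date"], rec["sale_price"]))

    outcomes = Counter()
    for rec in validation:
        candidates = source_index.get(normalize_address(rec["address"]))
        if candidates is None:
            outcomes["address_not_found"] += 1
        elif (rec["sale_date"], rec["sale_price"]) in candidates:
            outcomes["exact_match"] += 1
        else:
            outcomes["sale_mismatch"] += 1
    return outcomes


# Made-up records for illustration only.
validation = [
    {"address": "12 Example St.", "sale_date": "2018-04-02", "sale_price": 700000},
    {"address": "7 Sample Ave", "sale_date": "2019-01-15", "sale_price": 450000},
    {"address": "99 Nowhere Rd", "sale_date": "2020-06-30", "sale_price": 380000},
]
source = [
    {"address": "12 EXAMPLE ST", "sale_date": "2018-04-02", "sale_price": 700000},
    {"address": "7 sample ave", "sale_date": "2019-01-15", "sale_price": 0},
]

print(compare_to_source(validation, source))
```

Keeping the three outcomes separate matters: an address that is missing entirely is a coverage question, while a found address with a different date or price is a fidelity or quality question.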

For example, we identified a number of records in the NYC DOF dataset with a $0 sales price that did not match our validation dataset. Since we are building a property sales price predictor, and a genuine market sale cannot have a price of $0, we may want to either omit these transactions during model data preparation to avoid skewing our results, or evaluate whether the presence of a $0 transaction is itself a useful feature for predicting sales price.
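Both options for the $0 records can be prototyped side by side. The sketch below assumes integer prices and hypothetical field names, uses made-up records, and introduces an illustrative flag name (`has_zero_price_record`) that is not part of any real dataset.

```python
def split_by_price(sales):
    """Separate priced sales from $0 records so each can be handled explicitly."""
    priced = [s for s in sales if s["sale_price"] > 0]
    zero_priced = [s for s in sales if s["sale_price"] == 0]
    return priced, zero_priced


def add_zero_transfer_flag(priced, zero_priced):
    """Option 2: keep the signal by flagging addresses that also have a $0 record."""
    flagged_addresses = {s["address"] for s in zero_priced}
    return [
        dict(s, has_zero_price_record=s["address"] in flagged_addresses)
        for s in priced
    ]


# Made-up records for illustration only.
sales = [
    {"address": "1 EXAMPLE ST", "sale_price": 500000},
    {"address": "1 EXAMPLE ST", "sale_price": 0},
    {"address": "2 SAMPLE AVE", "sale_price": 725000},
]

priced, zero_priced = split_by_price(sales)          # Option 1: train on priced only
training_rows = add_zero_transfer_flag(priced, zero_priced)
print(training_rows)
```

Whether the flag carries any predictive value is an empirical question for the modeling phase; the point at this stage is only to make the handling of these records a recorded decision.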

Coverage

Understanding the scope and span of the data source is another part of the evaluation. For example, time range is often a valuable dimension: how many days, months, or years does your data span? Other coverage factors depend on the use case; geographic coverage, for instance, may matter. For our sales data use case, both time and location coverage are requirements. For private data sources, attempt to gather these facts from the data provider, but also verify them yourself using a validation set. The results may prompt questions that need clarification before proceeding.

Using our validation set, we joined our sales data to the consolidated NYC DOF dataset using property address. Our results are as follows:

Although not ideal, since our hit rate is not 100 percent across all boroughs, it is at an acceptable level for purposes of our initial data discovery. Oftentimes, joins across different data sets may require more involved data cleaning and formatting to yield a better match rate. This can be completed in a later phase of data prep once it is agreed to proceed with the data set. For this initial evaluation, the results are satisfactory and we will move forward. If match rates were much lower for any given borough, less than 20 percent for Staten Island, for example, it may be questionable to rely on a model built on this dataset for that specific borough given what appears to be low coverage in the area. A data scientist may choose to handle the uncertainty within the model as a function of the borough a prediction is being made for. The model could then be tested to see how well it performs. Alternatively, the decision could be made to seek an additional data source that provides greater coverage in the weaker boroughs.
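A per-borough hit rate of the kind discussed above can be computed once the join has identified which validation addresses were found. The following sketch uses made-up records, hypothetical field names, and the 20 percent figure from the example above as an illustrative threshold rather than a recommended one.

```python
from collections import defaultdict


def hit_rate_by_group(validation, matched_addresses, group_field="borough"):
    """Share of validation records found in the source, per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in validation:
        group = rec[group_field]
        totals[group] += 1
        if rec["address"] in matched_addresses:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


def low_coverage(rates, threshold=0.20):
    """Groups whose hit rate falls below the (illustrative) threshold."""
    return sorted(group for group, rate in rates.items() if rate < threshold)


# Made-up records for illustration only; not actual match results.
validation = [
    {"address": "1 EXAMPLE ST", "borough": "Brooklyn"},
    {"address": "2 SAMPLE AVE", "borough": "Brooklyn"},
    {"address": "3 DEMO BLVD", "borough": "Staten Island"},
    {"address": "4 TEST LN", "borough": "Staten Island"},
]
matched_addresses = {"1 EXAMPLE ST"}

rates = hit_rate_by_group(validation, matched_addresses)
print(rates, low_coverage(rates))
```

The same function works for time coverage by grouping on a sale-year field instead of borough.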

Documentation

A word of advice — document all information and work performed during the data evaluation. I know this isn’t the most exciting thing to do, but as annoying as it may seem, it will be worth it when you or anybody else in your organization questions the work performed upon making a data source selection. It will save you a lot of time and pain in the end when questions arise and you can’t seem to remember all the details.

Exploratory Data Analysis

Once we are confident in the provenance, historical availability, and overall quality of the data, we can proceed to what is typically the data scientist’s first step in evaluating data: EDA. Given the number of excellent articles already available on that topic, there is no need to cover it in this post.

In conclusion

Every use case is different, and the inquiries and validations performed for an initial data source evaluation will vary, but carrying out these exercises helps establish an initial assessment and understanding of the data that validates its provenance, historical availability, and richness. We don’t want to reach the EDA, model building, and productionization phases using data that ultimately is incomplete and unreliable for our use case. ML success depends crucially on validating the foundational quality of a data set before handing it over to data scientists. A strong foundation will enable a solid build, and allow for a high-quality final product. Happy analyzing!

Written by Yalixa De La Cruz