The Expense of Poorly Labeled Data

This blog post is adapted from a talk Nikhil Kumar gave on The Expense of Poorly Labeled Data at ODSC West 2019.

Introduction

Do you ever wonder what happens when you train a machine learning model on really bad data? In this article, we show the effect of bad data on a machine learning model. Specifically, we take a good data set that is conducive to modeling and distort it in two different ways. First, we randomly distort the labels and train models on the randomly distorted label field. Second, we distort the labels in a biased way and again train models on the distorted label field. These two types of distortion let us see how different kinds of bad data affect a model.

Data

For this analysis, we will use a relatively easy-to-manage, publicly available healthcare data set related to heart disease. The table below lists the features in this data set.

From a machine learning perspective, the objective is to predict the probability that a patient will have heart disease (CHD), given the other features. As such, the CHD feature contains the data labels, and this is the field that will be distorted later in the analysis. The data set appears in The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, and can be downloaded here: https://web.stanford.edu/~hastie/ElemStatLearn//datasets/SAheart.data.
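
For readers who want to follow along, here is a minimal loading sketch in Python. We assume the file at the URL above is comma-separated with a leading row-name column and a binary chd label, and that the famhist column is recorded as Present/Absent and needs to be encoded numerically.

```python
import pandas as pd

# South African heart disease data set; assumed to be comma-separated
# with a leading row-name column, as served at the Stanford URL above.
URL = "https://web.stanford.edu/~hastie/ElemStatLearn//datasets/SAheart.data"
df = pd.read_csv(URL, index_col=0)

# famhist is categorical (Present/Absent); encode it as 0/1 so the
# models below can consume it directly.
df["famhist"] = (df["famhist"] == "Present").astype(int)

X = df.drop(columns="chd")  # predictors
y = df["chd"]               # label: 1 = heart disease, 0 = no heart disease
print(X.shape, y.mean())    # quick sanity check on size and label balance
```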

Baseline Models

As stated earlier, the goal is to predict the probability that someone will have heart disease given the other features. To do so, we consider two modeling methods, logistic regression and random forests, and observe their performance on this data set. Using each model's default settings, we observe the following out-of-sample performance.
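
A minimal sketch of this baseline step, continuing from the loading snippet above, is shown below. The original split ratio and random seeds are not stated, so the 70/30 split and the seeds here are assumptions; max_iter is raised on the logistic regression only so that it converges cleanly.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out a test set to measure out-of-sample performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
```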

As can be seen, the AUCs and out-of-sample accuracy measurements are relatively strong, indicating that heart disease can be predicted reasonably well from the features in this data set. Furthermore, the variable importance plot shows that the tobacco feature is one of the most important features in predicting heart disease.
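
One way to produce a variable importance plot like the one described, continuing from the baseline sketch, is from the random forest's feature importances; the original plot may have been generated differently.

```python
import matplotlib.pyplot as plt

# Feature importances from the fitted random forest baseline.
rf = models["random forest"]
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot.barh(title="Random forest variable importance")
plt.tight_layout()
plt.show()
```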

Now that we have gotten a sense of the predictive power of this data set, we can see what happens when we distort the data in two different ways.

Random Distortion

The first type of distortion we will consider is a random distortion. This involves taking a random percentage of the records in the training set and flipping their labels (i.e., a patient coded as having heart disease is recoded as not having it, and vice versa). We then plot each model's out-of-sample accuracy against the percentage of training records that have been distorted.
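
A sketch of this random label flip, continuing from the earlier snippets, is shown below. The grid of distortion percentages is an assumption, and only the random forest is retrained here for brevity.

```python
import numpy as np

def random_distort(labels, fraction, seed=0):
    """Flip the label for a random `fraction` of the training records."""
    rng = np.random.default_rng(seed)
    flipped = labels.to_numpy().copy()
    idx = rng.choice(len(flipped), size=int(fraction * len(flipped)), replace=False)
    flipped[idx] = 1 - flipped[idx]
    return pd.Series(flipped, index=labels.index)

# Retrain at increasing levels of random distortion and track
# out-of-sample accuracy on the untouched test set.
for fraction in np.arange(0.0, 0.95, 0.05):
    noisy = random_distort(y_train, fraction)
    model = RandomForestClassifier(random_state=0).fit(X_train, noisy)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{fraction:.0%} distorted -> accuracy {acc:.3f}")
```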

This graph shows how accuracy is negatively affected by random distortion. However, for small levels of distortion (<25%), the effect on accuracy is minimal. Only after roughly 25% of records are distorted do we see a sharp decrease in accuracy. This is empirical evidence of a model showing some resistance to randomly distorted labels.

Biased Distortion

The second type of distortion we will consider is a biased distortion. This involves distorting the labels based on the value of another feature. The variable importance plot showed that tobacco was one of the most important features in this data set, so we will distort whether or not someone has heart disease based on their tobacco consumption. Specifically, if a patient's tobacco consumption is greater than the median level of tobacco consumption, the patient is coded as not having heart disease. Conversely, if the patient's tobacco consumption is less than the median, the patient is coded as having heart disease. After applying this form of distortion, we construct the same graph as before and observe what happens.
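
A sketch of the biased rule, continuing the same setup, is shown below. The article does not spell out how partially distorted training sets are formed, so here a random subset of records receives the tobacco-based label, which is one plausible reading; again only the random forest is retrained for brevity.

```python
def biased_distort(features, labels, fraction, seed=0):
    """Relabel a random `fraction` of records using the tobacco rule:
    above-median tobacco -> no heart disease, otherwise -> heart disease."""
    rng = np.random.default_rng(seed)
    relabeled = labels.to_numpy().copy()
    idx = rng.choice(len(relabeled), size=int(fraction * len(relabeled)), replace=False)
    tobacco_rule = (features["tobacco"] <= features["tobacco"].median()).astype(int)
    relabeled[idx] = tobacco_rule.to_numpy()[idx]
    return pd.Series(relabeled, index=labels.index)

for fraction in np.arange(0.0, 0.95, 0.05):
    biased = biased_distort(X_train, y_train, fraction)
    model = RandomForestClassifier(random_state=0).fit(X_train, biased)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{fraction:.0%} distorted -> accuracy {acc:.3f}")
```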

We can now compare this type of distortion to the previous one. For both models, the biased distortion degrades accuracy at a much faster rate: the green and black curves descend much more steeply than the original purple and orange curves. The resistance to small amounts of random distortion also disappears when the distortion is biased. At the previous inflection point of 25% of records distorted, the models' accuracy measurements are dangerously lower than they were under random distortion. This implies that biased distortion has a much stronger effect on a model's accuracy than random distortion.

Variable Importance

In addition to looking at model accuracy, we can re-run the variable importance plot at different levels of biased distortion. This gives us a sense of how inference might be affected by this type of distortion.
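
Continuing the sketch, the importance profile can be tracked across a few levels of biased distortion. The specific levels plotted in the article are not known; the ones below are illustrative.

```python
# Watch how the tobacco feature's importance inflates as more labels
# are distorted by the tobacco-based rule.
for fraction in (0.0, 0.25, 0.5, 0.75):
    biased = biased_distort(X_train, y_train, fraction)
    rf = RandomForestClassifier(random_state=0).fit(X_train, biased)
    importances = pd.Series(rf.feature_importances_, index=X_train.columns)
    top = importances.sort_values(ascending=False).head(3).round(3)
    print(f"\n{fraction:.0%} biased distortion, top features:\n{top}")
```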

We can observe how the relative importance of the tobacco feature increases as the percentage of distorted labels increases. As the percentage gets very high, tobacco's importance is completely inflated at the cost of every other feature. This shows how any inference, or any policy or business decision built on it, would be deeply flawed if you were working with biased data.

Key Takeaways

To recap, we have shown how distorted data can ruin a model and any inference or decision making that comes from it. Specifically, we have shown that a biased distortion is demonstrably worse than a random distortion: it has a much stronger negative effect on model accuracy and inference. A model can even show some resistance to small amounts of random distortion, a resistance that disappears when the distortion is biased.

Furthermore, it is important to remember that the random distortion in this analysis was truly random; it was generated programmatically. In practice, if a data set is distorted, it is very unlikely that the distortion is due to complete randomness. It is more likely a symptom of a systemic issue in the data generation process.

Examining Different Types of Bias

Having established that biased data can ruin a machine learning model, it is worth examining this type of bias in detail and considering ways to diagnose it. The type of bias used to distort the data in this analysis is commonly referred to as systematic value distortion. This refers to systematic errors introduced during data collection, measurement, recording, or labeling, which cause systematic deviations between the true and observed values.

There are some common symptoms associated with models trained on this type of bias. The first is that the model consistently over- or under-estimates a predicted value or category; in our case, we saw this when we plotted the out-of-sample error against different levels of distortion. The second symptom is consistent over- or under-inflation of a feature's importance at the cost of other features; we saw this directly when we generated the variable importance plot at different levels of distortion. The third common symptom is the model generating surprising and unexpected results or outliers.

In terms of fixing this type of bias, there are a few solutions to consider. First, and probably most importantly, always maintain subject matter expertise when analyzing data. Analyzing data without subject matter expertise can easily lead to wrong and dangerous conclusions; subject matter experts serve as a referee against bad analytical conclusions or infeasible assumptions about your data. Another solution is to use automated data capture when possible, which avoids the human error associated with manual capture or labeling. If you need human judgements to capture or label your data, use multiple independent judgements, as sketched below. Doing this mitigates human error and reduces the chance of human bias in data capture or labeling.
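
As a small illustration of combining multiple independent judgements, a simple majority vote across labelers can be used; the labeler names and votes below are hypothetical.

```python
import pandas as pd

# Hypothetical example: three independent labelers annotate the same records.
votes = pd.DataFrame({
    "labeler_a": [1, 0, 1, 0],
    "labeler_b": [1, 0, 0, 0],
    "labeler_c": [1, 1, 1, 0],
})

# Majority vote: a record is labeled 1 only if most labelers agree.
consensus = (votes.mean(axis=1) > 0.5).astype(int)

# Records where labelers disagree can be flagged for expert review.
needs_review = votes.nunique(axis=1) > 1

print(consensus.tolist())     # [1, 0, 1, 0]
print(needs_review.tolist())  # [False, True, True, False]
```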

Conclusion

This analysis illustrates how bias can ruin a data set and any model trained on it. There is empirical evidence that the biased distortions made in this analysis are demonstrably worse than the random distortions. There are ways to observe and diagnose the type of bias used here, which is why subject matter expertise is critical to producing an accurate analysis. Bias is a common issue that requires constant attention, and overlooking the effect of biased data can have dire consequences for any modeling initiative or for any policy and business decisions that originate from the data.

Learn More

To learn more about what causes poor quality training data and how your business can invest in preventing it, check out The Cost of Poorly Labeled Data.

Interested in learning more about bias in machine learning? Check out No BiAS, a podcast about the emerging and ever-shifting terrain of artificial intelligence and machine learning.
