Published in

MLearning.ai

# Keep It Down! Don’t Make Me Come Over There!

Defining (and Dealing With) Noisy Data — Halloween Style!

Quick question for you: Does the following picture contain a crab?

There are a number of ways to look at what the “right” answer is. In fact, if you had two human beings manually labeling images, “crab” may well be one of the labels they’d give this image. This, however, would be a great example of potential “noise” in a data set that could cause errors in predictions.

Noise can be thought of as errors or omissions in data that lead to incorrect predictions. For example, let’s say we are using the k-Nearest Neighbors algorithm and we happen to have data that contains two kinds of pictures: crabs and humans. The picture above is part of the data set, and has been labeled “crab.” Now, we take another data point, say this picture:

Our algorithm runs exactly like we would expect it to, and the closest image we have to the one we are trying to predict turns out to be the crab picture. So what’s the prediction?

It depends, as you might remember from our introduction of k-Nearest Neighbors, on what you set your value for k to. Let’s take a look at a visual example. In this data set, we are representing our crab pictures as X’s and our human pictures as O’s.

Let’s start at the left with the image that’s been labeled as “human” and has the line running to it asking if it’s “Legit or ‘Noise’?” What do you think? If it’s noise, shouldn’t we just remove it from the data and make our predictions better?

If only it were that simple.

The problem is we don’t actually know, just based on the visual above, the answer to the legit vs. noise question. It may well be that this is a crab picture that was mislabeled, which would make it noise. Or maybe this is an image of a crab tattoo on a human arm, and this is a legit label for this data point. Again, we don’t know. And unless you’ve got the time to go through every point in your data and verify that it is accurate (and no, you don’t have time, and no, there’s not an app for that…), you’re just going to have to accept the fact that almost all data sets of any significant size are going to contain noise in the form of mislabeled data, or data that was typed in incorrectly (for example, someone left off or added a digit to the price of something, decreasing or increasing the reported sales price by a factor of 10).

Some people will try to approach this noise by simply “deleting outliers” in the data. To those who remember some popular details from the famous O.J. Simpson trial: “If the data doesn’t fit, you just delete it!” Unfortunately, this is often a costly move, since errors in data aren’t the only source of noise. Noise can also come in the form of information not captured in your data set that has a material impact on the prediction. For example, say there are a number of homes that have ridiculously low prices per square foot in your data, which are really skewing your numbers for home prices in that area. You decide to delete the outliers. Later, you learn those homes are in the vicinity of a nuclear waste dump site that is at risk of contaminating the soil / water / air in the area, people are getting sick at an accelerated rate if they live there, and so on. Those data points, it turns out, weren’t errors or outliers. They were signals that something was terribly wrong in the general vicinity, and the housing prices were legit.
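One gentler alternative to deleting is to flag suspicious points and go look at them. Here is a rough sketch of that idea in plain Python. The home prices are invented for illustration, `flag_low_price_homes` is a hypothetical helper, and the "less than half the median" cutoff is an arbitrary assumption, not a recommended rule.

```python
from statistics import median

# Invented example data: price and square footage for five homes.
homes = [
    {"id": "A", "price": 300_000, "sqft": 1500},
    {"id": "B", "price": 420_000, "sqft": 2000},
    {"id": "C", "price": 380_000, "sqft": 2000},
    {"id": "D", "price": 60_000, "sqft": 1500},
    {"id": "E", "price": 410_000, "sqft": 2050},
]


def flag_low_price_homes(homes, ratio=0.5):
    """Return ids of homes priced below `ratio` times the median price per sqft.

    Nothing is deleted here. The flagged homes are candidates for a closer
    look, because they might be typos or they might be a real signal.
    """
    per_sqft = {home["id"]: home["price"] / home["sqft"] for home in homes}
    typical = median(per_sqft.values())
    return [home_id for home_id, value in per_sqft.items() if value < ratio * typical]


print(flag_low_price_homes(homes))
```

The flagged list is a to-do list for investigation, so the nuclear-waste-dump homes would stay in the data until someone figures out why they look odd.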

So noise is hard to deal with. Back to our initial problem:

In our visualization, the question mark represents the horrifying picture of my face. Er, the picture of my horrified face. The closest data point to this test image is the X (which is labeled crab) which is the first picture in the blog post. If you’ll recall, k-Nearest Neighbors computes a distance from the test data point to every single data point in the data set, and then finds the k closest ones. The value for k is a value the data scientist (or aspiring data scientist) sets when running the algorithm.
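To make the distance step concrete, here is a minimal sketch in plain Python. The coordinates are made up (real images would be turned into much longer feature vectors), and `k_nearest` is a hypothetical helper name rather than anything from a library.

```python
import math

# Made-up 2-D coordinates standing in for image features.
# "X" = labeled crab, "O" = labeled human.
training_data = [
    ((1.0, 1.0), "X"),
    ((1.5, 2.0), "X"),
    ((2.0, 1.2), "X"),
    ((5.0, 5.0), "X"),  # the crab-costume picture, sitting among the humans
    ((6.0, 6.2), "O"),
    ((4.5, 6.3), "O"),
    ((6.4, 4.6), "O"),
    ((4.4, 4.2), "O"),
]

query = (5.4, 5.4)  # the "?" in the visualization


def k_nearest(training_data, query, k):
    """Return the k (distance, label) pairs closest to the query point."""
    # Distance from the query to every single training point.
    distances = [(math.dist(point, query), label) for point, label in training_data]
    # Sort by distance and keep the k closest.
    distances.sort(key=lambda pair: pair[0])
    return distances[:k]


for distance, label in k_nearest(training_data, query, k=3):
    print(f"{label} at distance {distance:.2f}")
```

With these toy numbers, the closest neighbor is the X in the costume, followed by two O's.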

Because we’ve created a visualization for this, we can see that our test data point is surrounded by O’s, so if we had the benefit of this visualization, we would want to predict that this was an O, meaning the image is a human. However, the closest picture to our test point is the other one of me in the “pretend I’m a crab” costume. So if we set our k value to 1, the algorithm would incorrectly interpret the image of my horrified face as “crab.”

<Insert comment here from my kids about how that’s not necessarily incorrect…>

This is why, when we run nearest neighbors algorithms, we typically don’t set k to 1. We want to set it to something higher than that so, in the event our data point happens to land next to a data point that would be classified as noise, this is mitigated by the next few nearest neighbors, which are likely correctly labeled. So if we set k to 3 or 5, the vote would be 2 or 4 O’s and only 1 X, therefore the label human would be (correctly?) applied.
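The vote described above can be sketched in a few lines of plain Python. As before, the coordinates are invented for illustration and `knn_predict` is a hypothetical helper, so treat this as a toy version of the idea and not a production implementation.

```python
import math
from collections import Counter

# Made-up 2-D coordinates standing in for image features.
labeled_points = [
    ((1.0, 1.0), "crab"),
    ((1.5, 2.0), "crab"),
    ((2.0, 1.2), "crab"),
    ((5.0, 5.0), "crab"),  # the noisy point: costume picture labeled "crab"
    ((6.0, 6.2), "human"),
    ((4.5, 6.3), "human"),
    ((6.4, 4.6), "human"),
    ((4.4, 4.2), "human"),
]

query = (5.4, 5.4)  # the horrified face we want to classify


def knn_predict(labeled_points, query, k):
    """Predict a label by majority vote among the k closest points."""
    nearest = sorted(labeled_points, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


for k in (1, 3, 5):
    print(k, knn_predict(labeled_points, query, k))
# k=1 -> crab, k=3 -> human, k=5 -> human
```

With k set to 1 the single noisy neighbor decides everything. With k set to 3 or 5 it gets outvoted 2 to 1 or 4 to 1.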

This can go too far the other way, too. If you set your k too large, you can create problems depending on the shape of your data, so finding the “Goldilocks k Value” (the one that’s not too small but not too large, and thus returns correctly predicted labels with the highest accuracy) is part of the process of putting a Nearest Neighbors type algorithm into production.
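One common way to hunt for that value is to try several values of k and score each one on data the model did not get to peek at. The sketch below uses leave-one-out scoring on a tiny invented data set with a single made-up feature per picture. The helper names and the candidate values of k are assumptions for illustration only.

```python
import math
from collections import Counter

# One made-up feature per picture, with its label.
labeled_points = [
    ((1.0,), "crab"), ((1.5,), "crab"), ((2.0,), "crab"), ((2.5,), "crab"),
    ((6.0,), "human"), ((6.5,), "human"), ((7.0,), "human"),
    ((7.5,), "human"), ((8.0,), "human"), ((8.5,), "human"),
    ((7.2,), "crab"),  # the noisy point: sits with the humans, labeled crab
]


def knn_predict(points, query, k):
    """Predict a label by majority vote among the k closest points."""
    nearest = sorted(points, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(points, k):
    """Hold out each point in turn, predict it from the rest, and score the hits."""
    correct = 0
    for i, (features, label) in enumerate(points):
        others = points[:i] + points[i + 1:]
        if knn_predict(others, features, k) == label:
            correct += 1
    return correct / len(points)


scores = {k: leave_one_out_accuracy(labeled_points, k) for k in (1, 3, 5, 9)}
best_k = max(scores, key=scores.get)  # on ties, the smallest candidate wins

for k, score in scores.items():
    print(f"k={k}: {score:.2f}")
print("best k:", best_k)
```

On this toy data, k=1 gets 8 of 11 right because the noisy point drags down its neighbors, k=3 and k=5 get 10 of 11, and k=9 drops to 6 of 11 because the larger human group outvotes every crab. The noisy point itself always counts as a miss, since its stored label disagrees with everything around it.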

Bottom line: noise can mean an incorrect label, errors in the data, or factors not captured in your data that have significant impacts on predictions. It’s a nearly inescapable problem, so assuming it exists and finding ways to mitigate it is an important goal when building predictive models.

## Jason Eden

Data Science & Cloud nerd with a passion for making complex topics easier to understand. All writings and associated errors are my own doing, not work-related.