Outliers in Linear Regression

Wanze (Russell) Xie
Nov 3

How would you deal with outliers in a linear regression problem? What if the dataset is huge or high-dimensional, so that you cannot identify them visually? A couple of days ago I was asked this question, and I think this is a good chance to put my thoughts down here.

Remove Outliers

The first approach you might want to try is to remove the outliers from the data. (1) You may run a graph analysis (box plot / scatter plot) to identify the outliers visually and remove them; (2) you may compute the Z-score (the number of standard deviations from the mean) of each data point:

Z = (x − x̄) / S

where x̄ is the sample mean and S is the sample standard deviation.

Then you can set a threshold (e.g. 3) and filter out the data points whose |Z-score| is larger than that.
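As a minimal sketch of this filter in NumPy (the threshold of 3 and the toy data are my own illustrative choices):

```python
import numpy as np

def remove_outliers_zscore(x, threshold=3.0):
    """Drop points whose |Z-score| exceeds the threshold."""
    z = (x - x.mean()) / x.std(ddof=1)  # ddof=1 gives the sample standard deviation S
    return x[np.abs(z) <= threshold]

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=1.0, scale=0.1, size=100), 15.0)  # one planted outlier
clean = remove_outliers_zscore(data)  # the 15.0 is filtered out
```

One caveat worth knowing: with very few data points, a single outlier inflates S enough that its own Z-score can stay below the threshold, so this works best on reasonably large samples.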

Use L1-loss (L1-estimator)

The second approach is to use an L1-estimator, i.e. to use the L1-norm in the loss function instead of the L2-norm used in least squares. To see why this helps, consider estimating a single location parameter a for samples {x_i}. The least-squares solution to

min over a of Σ_i (x_i − a)²,

found by setting the derivative with respect to a to zero (equivalently, finding the MLE for a), is the mean of {x_i}, whereas the solution to the L1 version

min over a of Σ_i |x_i − a|

is, as is well known, the median of {x_i}. The median is insensitive to the magnitude of extreme values, so a small number of outliers won't drag the solution the way they drag the mean.
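As a tiny numeric illustration of why the median resists outliers while the mean does not (the sample values are made up):

```python
import numpy as np

samples = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier

mean = samples.mean()        # minimizes the sum of squared deviations
median = np.median(samples)  # minimizes the sum of absolute deviations
# the single outlier drags the mean to 22.0, while the median stays at 3.0
```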

But the disadvantage of the L1-norm is that it is not differentiable everywhere (the absolute value has a kink at zero). So we want to stay with differentiable objectives while still incorporating robustness into the estimation.

Locally Weighted Linear Regression

The third approach, which is pretty popular, is to apply a non-negative weight w_i to each of the squared errors:

J(θ) = Σ_i w_i (y_i − θᵀx_i)²

According to Andrew Ng's lecture notes [4], if the weight w_i is large for a particular squared error, then when picking θ we will try hard to make that squared error small. If w_i is small, then that squared error term will pretty much be ignored in the fit.

Therefore, the motivation is to have a weight function that gives lower weight to the outliers and higher weight to the inliers, and a fairly standard choice for the weights is:

w_i = exp( −(x_i − x)² / (2τ²) )

Note that the numerator inside the exponential measures how far the current data point x_i is from the query point x that we're trying to evaluate; in practice, x can simply be the mean or the median. The parameter τ, called the bandwidth parameter, controls how quickly the weight of a training example falls off as its distance from the query point x grows.

Now if you look at this function: when the distance is small or close to 0, w_i is close to 1; as the distance grows, w_i shrinks toward 0 because of the negative sign inside the exponential.
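A minimal sketch of locally weighted linear regression along these lines (the Gaussian kernel and the closed-form weighted least-squares solve are standard; the function and variable names are my own):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted linear regression: fit a weighted least-squares
    line around x_query and return its prediction there."""
    # Gaussian kernel weights: nearby points get weight ~1, far points ~0
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    A = np.column_stack([np.ones_like(X), X])  # design matrix with intercept column
    W = np.diag(w)
    # Weighted normal equations: theta = (A^T W A)^-1 A^T W y
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return theta[0] + theta[1] * x_query

X = np.linspace(0.0, 10.0, 50)
y = 2.0 * X + 1.0                     # perfectly linear toy data
pred = lwr_predict(X, y, x_query=5.0, tau=2.0)  # recovers 2*5 + 1 = 11
```

Note that a new weighted fit is solved for every query point, which is why this family of methods is called "locally weighted."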

Use Sublinear Wrapper Function for Robustness

The next, and also one of the most common approaches, is to introduce a sublinear (growing slower than linear) function ρ(z) that wraps the squared residuals in the least-squares optimization:

min over θ of Σ_i ρ( r_i(θ)² )

and we can see how this extends the previous idea: the core trick is to attenuate the penalty on data points that have large squared errors (residuals). A couple of popular choices for the ρ function, as offered by SciPy's least_squares:

(1) linear: ρ(z) = z (ordinary least squares)
(2) soft_l1: ρ(z) = 2(√(1 + z) − 1)
(3) huber: ρ(z) = z if z ≤ 1, else 2√z − 1
(4) cauchy: ρ(z) = ln(1 + z)
(5) arctan: ρ(z) = arctan(z)

According to the SciPy documentation [1], (2) and (3) are relatively mild and give approximately L1-norm loss for large residuals, while (4) and (5) give significant attenuation for outliers.
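These choices can be written down directly to see the sublinear growth; the formulas below follow SciPy's least_squares documentation, where z stands for the squared residual:

```python
import numpy as np

# The five rho(z) options of scipy.optimize.least_squares (z = squared residual)
rho = {
    'linear':  lambda z: z,                           # plain least squares
    'soft_l1': lambda z: 2 * (np.sqrt(1 + z) - 1),    # smooth approximation of L1
    'huber':   lambda z: np.where(z <= 1, z, 2 * np.sqrt(z) - 1),
    'cauchy':  lambda z: np.log1p(z),
    'arctan':  lambda z: np.arctan(z),
}

# For a large squared residual, every robust choice penalizes far less
# than plain least squares, e.g. soft_l1(100) ~ 18.1 versus linear(100) = 100.
```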

Generalized form of Robust Regression

The loss function above assumes that the (soft) threshold between inliers and outliers is 1.0. Once we have that idea, we can generalize the form by adding a scale parameter C to the loss:

min over θ of Σ_i C² ρ( r_i(θ)² / C² )

For example, if we use the soft L1 loss above, the loss function becomes:

min over θ of Σ_i 2C² ( √(1 + r_i(θ)² / C²) − 1 )

where C is the scale factor f_scale, and in Python the code looks like:

res_robust = least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train))

where fun computes the residuals r_i = 𝜑(t_i; x) − y_i of your model function 𝜑, and x0 is the initial guess for the parameters x of 𝜑(t_i; x), which depend on how your model computes predictions (it may be just a linear combination, as in LWR).
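Putting it together, a self-contained sketch of a robust fit with SciPy (the linear model, the planted outliers, and the starting point x0 are my own illustrative choices):

```python
import numpy as np
from scipy.optimize import least_squares

def fun(x, t, y):
    """Residuals of a linear model phi(t; x) = x[0]*t + x[1]."""
    return x[0] * t + x[1] - y

rng = np.random.default_rng(0)
t_train = np.linspace(0.0, 10.0, 50)
y_train = 3.0 * t_train + 2.0 + rng.normal(scale=0.1, size=50)
y_train[::10] += 20.0  # plant a few large outliers

x0 = np.array([1.0, 0.0])  # initial guess for [slope, intercept]
res_robust = least_squares(fun, x0, loss='soft_l1', f_scale=0.1,
                           args=(t_train, y_train))
# res_robust.x stays close to the true [3.0, 2.0] despite the outliers
```

Fitting the same data with the default `loss='linear'` instead would let the planted outliers drag the estimated intercept noticeably upward.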

One more trick

One more trick I learned, though I am not sure how practical it is: in a naive setting (e.g. 2D linear regression), you can randomly pick two points in the dataset and compute the line through them. The goal is then to minimize the error of all the data points with respect to this line, i.e. to find the "best two points" that minimize this error. The error function can be either the L1 norm or the L2 norm.

In practice, you probably don't want to iterate through all possible pairs of points; instead, set an error threshold and stop as soon as a pair's error falls below your goal, returning the estimates based on those two points.
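A rough sketch of this trick, close in spirit to random sample consensus (RANSAC); the trial count, early-stopping threshold, and toy data are illustrative assumptions:

```python
import numpy as np

def two_point_fit(X, y, n_trials=200, err_threshold=None, rng=None):
    """Randomly pick pairs of points, fit the line through each pair,
    and keep the line with the lowest total L1 error over all points.
    Stops early if the error drops below err_threshold."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_err, best_line = np.inf, None
    for _ in range(n_trials):
        i, j = rng.choice(len(X), size=2, replace=False)
        if X[i] == X[j]:
            continue  # skip vertical lines
        slope = (y[j] - y[i]) / (X[j] - X[i])
        intercept = y[i] - slope * X[i]
        err = np.abs(y - (slope * X + intercept)).sum()  # L1 error of all points
        if err < best_err:
            best_err, best_line = err, (slope, intercept)
        if err_threshold is not None and err < err_threshold:
            break
    return best_line

X = np.arange(10, dtype=float)
y = 2.0 * X + 1.0
y[5] += 50.0  # plant an outlier
slope, intercept = two_point_fit(X, y)  # any pair avoiding the outlier recovers y = 2x + 1
```

Any pair of clean points recovers the true line exactly here, and with the L1 criterion that line beats every line passing through the outlier.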

Hope you enjoyed the read. Let me know if you have any thoughts :)

Reference

[1] https://scipy-cookbook.readthedocs.io/items/robust_regression.html

[2] https://link.springer.com/chapter/10.1007%2F3-540-44480-7_21

[3] https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/

[4] http://cs229.stanford.edu/notes2019fall/cs229-notes1.pdf
