Ways to detect and handle outliers

clarence wu
6 min read · Nov 18, 2022


https://www.sciencenews.org/wp-content/uploads/2022/01/010322_sg_positive-deviance_feat.jpg

Identifying outliers is an important step in EDA (exploratory data analysis). It can improve the quality of the data and thereby affect downstream statistical and machine learning models. In this post, I will cover what outliers are, why we need to handle them, and how to detect and handle them. Let’s get started.

What are outliers

According to the Wikipedia definition, “In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter is sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.”

So, an outlier is a data point whose value is much higher or much lower than the rest of the data. For example, in a high school class where almost all students are around 18 years old, a student aged 35 would be an outlier.

Outliers have many possible causes, such as a change in the sensitivity of a sensor, experimental errors, or data handling errors. In short, outliers can be introduced at any step before we data analysts or scientists process the data.

Why we need to handle outliers

“Garbage in, garbage out.” The quality of the data determines the upper bound of what machine learning models can achieve, because models are sensitive to the range and distribution of values. You can read these two articles, “Three ways to handle imbalanced data” and “Which models require scaling data?”, to understand the side effects. Similarly, outliers can distort models, leading to longer training times, lower accuracy, and poorer performance. For example, the RMSE loss function squares each error, so a few outliers make it much larger, and the model will try to fit these outlier values even at the expense of other samples. In addition, boosting models such as AdaBoost increase the weights of misclassified points on each iteration and therefore might put higher weights on outliers, which tend to be misclassified. This becomes an issue if the outlier is an error of some type, or if we want our model to generalize well and not chase extreme values. However, not all models are affected by outliers; for some models, you can ignore them.

Outlier-sensitive algorithms: Linear Regression, Logistic Regression, Support Vector Machine, K-means, KNN

Outlier-robust algorithms: tree-based or complex algorithms
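
To make the RMSE point above concrete, here is a minimal sketch that compares RMSE and MAE on the same predictions with and without a single extreme target. The numbers and the helper names rmse and mae are made up for illustration and are not taken from the Boston dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical targets and predictions: every prediction is off by exactly 1.
y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = y_true + 1.0

# Same predictions, but the last target is replaced by an extreme value,
# so the error on that single point becomes 49 instead of 1.
y_true_outlier = y_true.copy()
y_true_outlier[-1] = 62.0

print("clean:   RMSE =", rmse(y_true, y_pred), "MAE =", mae(y_true, y_pred))
print("outlier: RMSE =", rmse(y_true_outlier, y_pred),
      "MAE =", mae(y_true_outlier, y_pred))
```

With one outlier, MAE rises from 1.0 to 10.6 while RMSE rises from 1.0 to about 21.9, because squaring lets the single large error dominate the average.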

How to detect outliers

In this post, I will introduce four ways to detect outliers: the histogram, the box plot, the 3-sigma rule, and the scatter plot, all illustrated with the Boston housing prices dataset.

Load data

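# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet needs scikit-learn < 1.2.
from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()
X = boston.data
y = boston.target
# One column per feature, named after the dataset's feature names
df = pd.DataFrame(X, columns=boston.feature_names)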

Histogram

A histogram helps us understand the distribution of the data and how often each range of values appears, which gives a direct sense of where outliers might be. The pandas .hist method gives a quick overview of the distributions of all features, and then we can choose some of them to study in detail.

df.hist(bins=15, figsize=(10, 20))

We can see that the CRIM, NOX, DIS, INDUS, and LSTAT features appear to have outliers. We can use methods such as the box plot and 3-sigma to inspect them.

Boxplot

A box plot is a visualization of quartiles, so we first need to understand quartiles.

https://cdn1.byjus.com/wp-content/uploads/2021/03/interquartile-range.png

The median is the value of the middle element after sorting. The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. Similarly, the third quartile (Q3) is the middle value between the median and the highest value. The interquartile range (IQR) is defined as Q3 - Q1. With these definitions, outliers are data points that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, as shown below.

https://www.statology.org/wp-content/uploads/2021/01/iqrOutlier1.png
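
The IQR rule described above can be sketched on a small hypothetical sample (the values are made up for illustration). Note that np.percentile uses linear interpolation by default, so its quartiles can differ slightly from the "middle of each half" definition on small samples.

```python
import numpy as np

# Hypothetical sample: ten ordinary values plus one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50], dtype=float)

# Quartiles (linear interpolation between neighbouring sorted values).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Here Q1 is 3.5 and Q3 is 8.5, so the fences are -4.0 and 16.0 and only the value 50 is flagged.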

Let’s try it

import seaborn as sns  

sns.boxplot(x=df['DIS'])

We can see from the diagram that several data points lie above Q3 + 1.5*IQR.

3-sigma

Sigma here means the standard deviation. The 3-sigma region is the interval between u - 3*sigma and u + 3*sigma, where u is the mean of the data. For normally distributed data, this region contains about 99.73% of the values, so a data point outside this range can be identified as an outlier.

We can compute how far a data point x is from the mean, in units of standard deviations, using the z-score formula:

z = (x - u) / sigma

If z is smaller than -3 or bigger than 3, the point is an outlier.

# IQR rule from the Boxplot section, applied to DIS
Q1 = df["DIS"].quantile(0.25)
Q3 = df["DIS"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['DIS']>Q3+1.5*IQR)|(df['DIS']<Q1-1.5*IQR)]
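
The 3-sigma rule itself can be sketched as follows. This is a minimal illustration on made-up values rather than the Boston data, and it assumes the population standard deviation (NumPy's default, ddof=0).

```python
import numpy as np

# Hypothetical data: the integers 1..20 plus one extreme value.
values = np.array(list(range(1, 21)) + [100], dtype=float)

mean = values.mean()
std = values.std()            # population standard deviation (ddof=0)
z = (values - mean) / std     # z-score of every point

# Flag points more than 3 standard deviations away from the mean.
outlier_mask = np.abs(z) > 3
print(values[outlier_mask])
```

Only the value 100 is flagged here. Keep in mind that an extreme value also inflates the mean and standard deviation it is compared against, so on small samples the 3-sigma rule can miss outliers that the IQR rule would catch.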

Scatter plot

A scatter plot shows the joint distribution or correlation of two variables. Points that lie significantly far from the general cluster or correlation line are considered outliers.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(df['INDUS'], df['TAX'])
ax.set_xlabel('INDUS')
ax.set_ylabel('TAX')
plt.show()

From the scatter plot, we can see that most data points lie in the bottom left, while several points lie in the top right, far from the main cluster.

How to handle outliers

Once we have detected outliers, we should take some measures to handle them. Here are some methods I have summarized:

1. Drop outliers. Deleting outliers means losing information, so you should be very sure that these outliers are caused by errors, for example system errors or measurement errors. You should also have many data points and only a few outliers. You can do it by hand, or you can use a tool such as Winsorizer (note that winsorizing caps extreme values rather than deleting them).

2. Replace outliers. This method replaces outliers with a specified value such as the mean, the median, the 5th percentile, or the 90th percentile.

3. Use tree-based models such as Random Forests and Gradient Boosting.

4. Use MAE instead of RMSE, unless you want to predict these outliers.

import numpy as np
from scipy import stats

# Remove outliers based on the IQR rule
Q1 = df["DIS"].quantile(0.25)
Q3 = df["DIS"].quantile(0.75)
IQR = Q3 - Q1
valid_data = df[(df['DIS']>=Q1-1.5*IQR)&(df['DIS']<=Q3+1.5*IQR)]

# Remove outliers based on percentiles
low = df["DIS"].quantile(0.01)
high = df["DIS"].quantile(0.99)
valid_data = df[(df['DIS']>=low)&(df['DIS']<=high)]

# Remove outliers based on the Z-score (keep rows with |z| <= 3)
z_score = pd.Series(np.abs(stats.zscore(df['DIS'].to_numpy())), index=df.index, name='z_score')
DIS_zscore = pd.concat([df['DIS'], z_score], axis=1)
valid_data = DIS_zscore[DIS_zscore['z_score']<=3]
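
The snippets above all drop rows. The "replace outliers" option from the list can be sketched as below, on a small made-up column rather than the Boston data. Capping at the IQR fences and filling with the median are only two of several reasonable choices, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical column with one extreme value.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option A: cap values at the IQR fences (winsorize-style).
capped = s.clip(lower=lower, upper=upper)

# Option B: replace flagged values with the median of the column.
is_outlier = (s < lower) | (s > upper)
median_filled = s.mask(is_outlier, s.median())

print(capped.tolist())
print(median_filled.tolist())
```

Both options keep the number of rows unchanged: capping turns the value 100 into the upper fence (14.5), while the median fill turns it into 5.5.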

Conclusion

In this article, we went through four questions: what outliers are, why we need to handle them, how to detect them, and how to handle them. When choosing a method to handle outliers, you can try several and compare their performance.

clarence wu

data science master graduated from Glasgow University