Detecting Outliers Using Univariate Method

Vineet Tripathi
The Startup
Published in
4 min read · Dec 5, 2020

You may have heard about removing outliers as part of pre-processing/cleaning data before training a model. What are outliers? Why should they be treated before further processing? What are the different ways to treat them? We will cover this part of Exploratory Data Analysis in some depth in this article.

Photo by Will Myers on Unsplash

If we search the Deep AI glossary, we find that "A statistical outlier is any data point in a dataset that is beyond a pre-defined distribution range, usually representing a measurement error or abnormal data that should not be included". The critical thing to note in this definition is that an outlier is often a measurement error made while the data was being collected. For example, an extra digit (like a trailing 0) may be typed in during data entry. An outlier may also be a genuine abnormality, like a credit card fraud case (say, 1 in 100 transactions), which cannot be ignored and is essential to the modelling. Natural variation produces outliers too. So we should not remove outliers simply to get a better score: if we can explain the reason for an outlier, we should leave it in the dataset, because it may carry vital information. The definition also tells us that an outlier is defined with respect to some assumed distribution, a Gaussian distribution for example, which is where statistics comes in.

Let us look at the first case (measurement error or corrupted data) and see why removing such outliers is so important. Suppose you plot house prices on the y-axis and house area on the x-axis, and you train a linear/polynomial regression model on data that contains an outlier. The fitted curve will be pulled towards the outlier to accommodate it, and it will not be a good approximation in general. When the same model is used on the test data set, it will show high error (measured by, say, Mean Absolute Error (MAE) or Mean Squared Error (MSE)) and hence a less successful implementation. For beginner readers: overfitting is the tendency of a model to adapt too closely to the training data (in layman's terms, fitting the curve to the noise). So removing erroneous outliers makes our model more robust and less prone to overfitting, which is another advantage.

Outlier in a housing prediction dataset

There are different ways to identify and remove outliers. In this article, we will discuss two statistical methods, the Z-score and the IQR (Inter-Quartile Range), because they go a step further and help in removing the outliers rather than just identifying them, as we see with a boxplot or scatterplot. Still, it is always good to visualize the data and seek the reason behind those abnormalities.

Z-Score

The Z-score measures how many standard deviations a data point lies from the mean. Although it can be computed for any distribution, we should stick to using it for Gaussian-like distributions, because standardizing such data yields a standard normal distribution (SND), for which the familiar empirical rule applies:

  • About 68% of the data lies within one standard deviation (SD) of the mean
  • About 95% lies within 2 SD
  • About 99.7% lies within 3 SD.

Therefore, a value that falls outside 3 SD is a very rare outcome and can be flagged as an outlier or noise. Hence we compute the Z-score of every data point and remove the points lying outside 3 SD.

Z-score = (x − μ) / σ
x = data point
μ = mean
σ = standard deviation

It is interesting to note that some standard Python libraries (pandas, for instance) use (n − 1) in the denominator of the standard deviation instead of n. Dividing by (n − 1) gives an unbiased estimator for a sample distribution, whereas n is used for a population distribution. (NumPy's np.std divides by n by default; this is controlled by its ddof argument.) For the moment, we need not worry about this distinction.
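A quick way to see the n versus (n − 1) difference in practice (the small dataset here is arbitrary, chosen only for illustration):

```python
import numpy as np
import pandas as pd

data = [2, 4, 4, 4, 5, 5, 7, 9]

# NumPy divides by n by default (population SD, ddof=0)
print(np.std(data))            # 2.0
# Passing ddof=1 divides by (n - 1) instead (sample SD)
print(np.std(data, ddof=1))
# pandas Series.std defaults to ddof=1, matching the sample formula
print(pd.Series(data).std())
```

The last two printed values agree with each other and are slightly larger than the first, since dividing by a smaller denominator inflates the estimate.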

Source- K2analytics.co.in

IQR

The Inter-Quartile Range, or IQR, is another method, and it can be used for non-Gaussian distributions where the Z-score is not much help. Remember, though, that in both cases we are assuming a roughly symmetric distribution. Here we use the concept of the median: the median is the middlemost observation in a distribution, i.e. the 50th percentile. Similarly, the 25th and 75th percentiles are computed, and their difference is called the IQR.

IQR = 75th percentile − 25th percentile

By convention, all values between (25th percentile − 1.5 × IQR) and (75th percentile + 1.5 × IQR) are considered for further processing, and the rest are treated as outliers and removed. This is the same fence a boxplot uses for its whiskers.

We have discussed both methods in detail, and we can apply each of them in a few simple steps.
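As a sketch of the Z-score method, here is how the 3-SD filter can be applied with pandas. The column name and values are hypothetical; the 500 plays the role of an injected outlier:

```python
import pandas as pd

# Hypothetical data: thirty ordinary prices plus one injected outlier (500)
df = pd.DataFrame({"price": list(range(95, 105)) * 3 + [500]})

# Z-score of each point: (x - mean) / standard deviation
z = (df["price"] - df["price"].mean()) / df["price"].std()

# Keep only the points within 3 standard deviations of the mean
df_clean = df[z.abs() <= 3]
print(len(df), "->", len(df_clean))  # one row (the outlier) is dropped
```

Note that a single extreme value inflates both the mean and the SD, so in very small samples an outlier can mask itself; the filter works best when the bulk of the data is well behaved.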

Applying IQR is even easier.
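A minimal sketch of the IQR fence on the same kind of hypothetical price column:

```python
import pandas as pd

# Same hypothetical data: thirty ordinary prices plus one outlier (500)
df = pd.DataFrame({"price": list(range(95, 105)) * 3 + [500]})

q1 = df["price"].quantile(0.25)   # 25th percentile
q3 = df["price"].quantile(0.75)   # 75th percentile
iqr = q3 - q1

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is treated as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_clean = df[df["price"].between(lower, upper)]
```

Because quartiles are rank-based, the fence is unaffected by how extreme the outlier is, which is exactly why IQR is preferred when the Z-score's mean and SD would be distorted.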

At last, I will list a few ML algorithms that are sensitive to outliers and a few that are not:

  1. Linear Regression — Sensitive
  2. Logistic Regression — Sensitive
  3. SVM — Not sensitive
  4. Decision Tree — Not sensitive
  5. Ensemble methods — Not sensitive
  6. K-Means — Sensitive
  7. PCA — Sensitive
  8. Neural Networks — Sensitive
  9. KNN — Not sensitive

I hope you enjoyed learning these analytic methods to identify and remove outliers, and that you now appreciate why outliers matter.


Sophomore at IIT Indore in Electrical Engineering