Outliers in data.

Nishantthakur
3 min read · Aug 12, 2022


I tend to spell outliers as out-liars. They are so different from the crowd that they start to give a false representation of the data. An outlier is that student who manages to score 98% in a class where nobody else even managed passing marks. An outlier is like one of my favourite NBA players, Muggsy, who was 5'3" and still one of the best (well, according to me). An outlier is like the black sheep in the picture above. There are so many examples of what an outlier could be.

But the real question is: do they hinder the data quality?

In most cases, yes.

They tend to give false information about the data and can be ‘dangerous’ even in a simple linear regression model, because a single extreme point can completely change the slope of the fitted regression line.
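To make the slope claim concrete, here is a minimal sketch using only the Python standard library. The helper `ols_slope` and the toy numbers are my own illustration, not taken from any real dataset: five points on a perfect line, and then the same points with the last one swapped for an extreme value.

```python
from statistics import mean

def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x (hypothetical helper)."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator

xs = [1, 2, 3, 4, 5]
ys_clean = [2, 4, 6, 8, 10]     # perfectly linear, y = 2x
ys_outlier = [2, 4, 6, 8, 40]   # same data, last point replaced by an outlier

slope_clean = ols_slope(xs, ys_clean)      # 2.0
slope_outlier = ols_slope(xs, ys_outlier)  # 8.0

print(slope_clean, slope_outlier)
```

One bad point out of five is enough to quadruple the slope in this toy case.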

How to detect Outliers?

  1. The Z score method

If the data is normally distributed, we can use the Z score to find the outliers.

In normally distributed data, about 99.7% of the data points lie within 3σ of the mean. Data points that fall more than 3σ away from the mean, on either side, can therefore be considered outliers.

The Z score is calculated as z = (x - μ) / σ, where x is a particular data point, μ is the mean and σ is the standard deviation. Let’s say the Z score for a particular data point is 2. That means it is only 2σ away from the mean. So a data point with a Z score between -3 and 3 is not considered an outlier, while a data point with a Z score greater than 3 or less than -3 can be considered an outlier.
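The Z score rule can be sketched in a few lines of plain Python. This is only an illustration: the function name `zscore_outliers`, the sample data and the choice of the population standard deviation (`pstdev`) are my assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(data, threshold=3.0):
    """Return the points whose |z| exceeds the threshold (hypothetical helper)."""
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation (an assumption)
    if sigma == 0:
        return []         # all values identical, so nothing can be an outlier
    return [x for x in data if abs((x - mu) / sigma) > threshold]

# Twenty values close to 50 plus one extreme value (made-up data).
data = [48, 49, 50, 51, 52] * 4 + [120]
outliers = zscore_outliers(data)

print(outliers)  # [120]
```

Keep in mind that with very few data points no Z score can ever reach 3, so this method needs a reasonably sized sample.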

  2. The IQR (Interquartile Range) method

If the data is skewed, we can still use the IQR method. It is my favourite technique because it is so easy to visualize.

For that you will need to plot a box plot. A box plot is built on percentiles. To explain what a percentile is, let us take the example of a competitive exam. If you got a percentile of 70, your score was better than that of 70 percent of test takers.

In a box plot, any data points above the “Maximum” or below the “Minimum” are considered outliers. Q1 is the 25th percentile and Q3 is the 75th percentile. The IQR is the interquartile range, the difference between Q3 and Q1 (IQR = Q3 - Q1), which means it covers the middle 50% of the data points (75 - 25 = 50). In the usual box plot convention, the “Maximum” is Q3 + 1.5 × IQR and the “Minimum” is Q1 - 1.5 × IQR, so they are whisker limits rather than the largest and smallest values in the data.
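The exam example can be written as a tiny function. This is a sketch under one assumption: percentile is taken to mean "percentage of scores strictly below yours" (other definitions exist). The name `percentile_rank` and the scores are made up for illustration.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores strictly below the given score (hypothetical helper)."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

# Ten made-up exam scores.
scores = [35, 42, 50, 55, 61, 64, 70, 78, 85, 93]
rank = percentile_rank(78, scores)

print(rank)  # 70.0, i.e. better than 70 percent of test takers
```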
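Here is a minimal sketch of the IQR rule using the standard library. The 1.5 multiplier is the common box plot convention and is exposed as a parameter; the helper name `iqr_limits`, the `inclusive` quantile method and the data are my own assumptions.

```python
from statistics import quantiles

def iqr_limits(data, k=1.5):
    """Return (lower, upper) box plot limits: Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up data: eight ordinary values and one extreme value.
data = [12, 14, 15, 16, 18, 19, 20, 22, 95]
lower, upper = iqr_limits(data)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)  # 7.5 27.5
print(outliers)      # [95]
```

Different quantile methods give slightly different Q1 and Q3 on small samples, so the exact limits can shift a little depending on the tool you use.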

How to deal with outliers?

There are 4 major techniques to deal with outliers.

  1. Trimming

We just remove the outliers. Simple and straightforward. Still, we should only use this technique when the number of outliers is small, because every removed point shrinks the dataset.
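A minimal trimming sketch: the limits 7.5 and 27.5 are hypothetical, the kind of bounds an IQR or Z score rule might produce, and `trim` is just an illustrative name.

```python
def trim(data, lower, upper):
    """Keep only the values inside [lower, upper] (hypothetical helper)."""
    return [x for x in data if lower <= x <= upper]

data = [12, 14, 15, 16, 18, 19, 20, 22, 95]
trimmed = trim(data, lower=7.5, upper=27.5)

print(trimmed)                   # [12, 14, 15, 16, 18, 19, 20, 22]
print(len(data) - len(trimmed))  # 1 point was dropped
```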

2. Capping

In capping, we replace the outlier values with the boundary values: the values at a Z score of ±3 (that is, μ ± 3σ) for the Z score method, or the “Maximum” and “Minimum” values for the IQR method.
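Capping can be sketched as clamping each value into an allowed range. The limits below are hypothetical (in practice they would come from the Z score or IQR method), and `cap` is an illustrative name.

```python
def cap(data, lower, upper):
    """Clamp every value into [lower, upper] (hypothetical helper)."""
    return [min(max(x, lower), upper) for x in data]

data = [12, 14, 15, 16, 18, 19, 20, 22, 95]
capped = cap(data, lower=7.5, upper=27.5)

print(capped)  # [12, 14, 15, 16, 18, 19, 20, 22, 27.5]
```

Unlike trimming, the number of data points stays the same; only the extreme value is pulled back to the limit.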

3. Missing value

We can treat the outliers as missing values and then handle them with whatever technique we already use for missing data.
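One way to sketch this: mark the outliers as `None` and then fill them in. Filling with the median of the remaining values is my own assumption, just one of several possible ways to handle missing data; the limits and helper names are hypothetical too.

```python
from statistics import median

def outliers_to_missing(data, lower, upper):
    """Replace values outside [lower, upper] with None (hypothetical helper)."""
    return [x if lower <= x <= upper else None for x in data]

def impute_median(values):
    """Fill None entries with the median of the known values (an assumption)."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

data = [12, 14, 15, 16, 18, 19, 20, 22, 95]
masked = outliers_to_missing(data, lower=7.5, upper=27.5)
imputed = impute_median(masked)

print(masked)   # the 95 is now None
print(imputed)  # the None is replaced by 17.0
```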

  4. Discretization

Discretization is the process of transforming continuous data into a discrete form. We do this by creating intervals (or bins). For example, take continuous values that mostly range from 0 to 100, with 190 as an outlier. We can group that data into the bins 0–20, 20–40, 40–60, 60–80 and 80–100, with the outlier 190 placed in the 80–100 interval.
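The binning example above can be sketched as follows. The function name `bin_label`, the sample values, and the rule that a value sitting exactly on an edge goes into the higher bin are my assumptions.

```python
def bin_label(x, edges):
    """Return the bin for x; values past the last edge fall in the last bin."""
    for lo, hi in zip(edges, edges[1:]):
        if x < hi:
            return f"{lo}-{hi}"
    # Anything at or above the last edge (such as the outlier) goes in the last bin.
    return f"{edges[-2]}-{edges[-1]}"

edges = [0, 20, 40, 60, 80, 100]
values = [5, 20, 47, 63, 99, 190]
labels = [bin_label(v, edges) for v in values]

print(labels)
# ['0-20', '20-40', '40-60', '60-80', '80-100', '80-100']
```

Once binned, 190 is indistinguishable from 99, so it no longer dominates the data.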
