Outlier Detection in Machine Learning

akhil anand
Published in Analytics Vidhya
5 min read · Nov 29, 2020



What are outliers?

Outliers are data points that differ significantly from the other observations in a dataset. They can arise from variability in measurement or from mistakes made when the data is recorded.

Suppose we have a dataset of the distances, in km, between home and school for a group of students: 5, 6, 7, 8, 9, 10, 6, 7, 100. In this dataset, 100 behaves like an outlier.

Should we remove outliers or not?

It is not always necessary to remove outliers. Depending on the business case, we as machine learning engineers decide whether to remove them. We can drop outliers from a dataset of people's favourite TV shows, but we cannot remove them from a dataset about credit card fraud, where the unusual records are exactly what we are looking for. It is up to your common sense and observation whether to remove them.

If a large share of the data points (say, at least 30%) turn out to be outliers, they probably carry some interesting and meaningful insight, and you should not remove them.
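To make this rule of thumb concrete, the share of flagged points can be computed before deciding anything. The sketch below is only illustrative: the helper name `outlier_share`, the sample values, and the bounds are hypothetical, and the 30% threshold is the heuristic from the paragraph above, not a fixed rule.

```python
def outlier_share(values, lower, upper):
    """Return the fraction of values that fall outside [lower, upper]."""
    flagged = [v for v in values if v < lower or v > upper]
    return len(flagged) / len(values)

# Hypothetical transaction amounts; the bounds are assumed to come from
# one of the detection methods discussed later in this post.
amounts = [12, 15, 14, 13, 250, 300, 16, 275, 14, 260]
share = outlier_share(amounts, lower=0, upper=50)

if share >= 0.30:
    decision = "investigate the outliers, do not drop them"
else:
    decision = "dropping the outliers may be reasonable"

print(f"{share:.0%} flagged -> {decision}")
```

Here 4 of the 10 values fall outside the bounds, so the flagged points deserve a closer look rather than deletion.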

Why should we remove outliers?

Outliers increase the variability in a dataset, which makes it harder to obtain statistically significant results and can make our model less accurate. There are various methods to detect them; in this blog I will discuss detecting outliers based on the distribution of the data.
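As a quick illustration of how a single outlier inflates variability, the sketch below compares summary statistics for the school-distance example with and without the value 100. It uses only Python's standard `statistics` module; the variable names are my own.

```python
import statistics

distances = [5, 6, 7, 8, 9, 10, 6, 7, 100]   # km, from the example above
without_outlier = [d for d in distances if d != 100]

for label, data in [("with 100", distances), ("without 100", without_outlier)]:
    mean = statistics.mean(data)
    std = statistics.stdev(data)      # sample standard deviation
    median = statistics.median(data)
    print(f"{label:>12}: mean={mean:.2f}, std={std:.2f}, median={median}")
```

The mean moves from 7.25 to about 17.56 and the standard deviation from about 1.67 to about 30.96 once 100 is included, while the median stays at 7.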

Detection of outliers based on Distributions

i. Normally Distributed data :

For normally distributed data, any data point that lies outside the range from (μ − 3σ) to (μ + 3σ) is considered an outlier, where μ is the mean and σ is the standard deviation.

Normal Distribution
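Before applying the rule to a real column, here is a minimal sketch of it on a plain Python list. The function name `three_sigma_outliers` and the sample data are hypothetical; the sample standard deviation is used, which is also what pandas' `.std()` computes by default.

```python
import statistics

def three_sigma_outliers(values):
    """Return (lower, upper, outliers) using the mean +/- 3 * std rule."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)      # sample std, like pandas' .std()
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical sample: 24 ordinary distances plus one extreme value.
sample = [5, 6, 7, 8, 9, 10, 6, 7] * 3 + [100]
lower, upper, outliers = three_sigma_outliers(sample)
print(outliers)

# With only the nine original distances, 100 is NOT flagged: the extreme
# value inflates the standard deviation enough to hide itself.
small = [5, 6, 7, 8, 9, 10, 6, 7, 100]
print(three_sigma_outliers(small)[2])
```

The second call is a reminder that the 3σ rule is best suited to reasonably large samples, because in a very small sample a single extreme value dominates the standard deviation.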
# Plotting and observing whether the dataset has outliers or not
# (assumes df is a pandas DataFrame that already contains the "RM" column)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

plt.figure(figsize=(16, 4))  # figure size
plt.subplot(1, 3, 1)  # 1 row, 3 columns, first position
# histplot replaces the deprecated distplot; is the data normally distributed?
sns.histplot(df["RM"], bins=30, kde=True)
plt.title('Histogram')
plt.subplot(1, 3, 2)  # second position of the figure
sns.boxplot(y=df["RM"])  # boxplot
plt.title('Boxplot')
plt.subplot(1, 3, 3)  # third position of the figure
# Q-Q plot: compares the data with a normal distribution
stats.probplot(df["RM"], dist="norm", plot=plt)
plt.ylabel('RM quantiles')
plt.show()
figure 1

As figure 1 shows, the distribution plot suggests that the data is approximately normally distributed, while the box plot and Q-Q plot show that some outliers are present. Our next step is to find the minimum and maximum boundary values; every data point outside them will be considered an outlier.

# outlier boundary values for a normally distributed column
def min_max_boundary(data, col):
    min_value = data[col].mean() - 3 * data[col].std()
    max_value = data[col].mean() + 3 * data[col].std()
    return min_value, max_value

min_max_boundary(df, "RM")
Minimum and maximum boundary value

Any value greater than 8.39 or less than 4.17 is considered an outlier.

Removing outliers :

# keep only the rows that lie between the minimum and maximum boundary values
min_value, max_value = min_max_boundary(df, "RM")
df = df[(df["RM"] > min_value) & (df["RM"] < max_value)]
--------------------------------------------------------------------
# plotting df["RM"] after removing outliers
plt.figure(figsize=(16, 4))
plt.subplot(1, 3, 1)
sns.histplot(df["RM"], bins=30, kde=True)  # histplot replaces the deprecated distplot
plt.title('Histogram')
plt.subplot(1, 3, 2)
sns.boxplot(y=df["RM"])
plt.title('Boxplot')
plt.subplot(1, 3, 3)
stats.probplot(df["RM"], dist="norm", plot=plt)
plt.ylabel('RM quantiles')
plt.show()
--------------------------------------------------------------------
figure 2

ii. Skewed Distributed data :

A data point is considered an outlier if it does not lie within the range from 25th percentile − (1.5 × IQR) to 75th percentile + (1.5 × IQR).

Here IQR = Q3 − Q1 (as shown in the figure given below).

Box Plot
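As a small worked example of this rule, the sketch below applies it to the school-distance values from the start of the post, using only the standard library. The helper name `iqr_bounds` is hypothetical; `method="inclusive"` is chosen because it interpolates linearly between data points, which matches the default behaviour of pandas' `quantile()`.

```python
import statistics

def iqr_bounds(values):
    """Return (lower, upper) limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

distances = [5, 6, 7, 8, 9, 10, 6, 7, 100]   # km, from the opening example
lower, upper = iqr_bounds(distances)
outliers = [d for d in distances if d < lower or d > upper]
print(lower, upper, outliers)   # 1.5 13.5 [100]
```

Here Q1 = 6 and Q3 = 9, so IQR = 3 and the limits are 1.5 and 13.5; only the value 100 falls outside them.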
# plotting different plots to analyse the presence of outliers
plt.figure(figsize=(16, 4))  # figure size
plt.subplot(1, 3, 1)  # 1 row, 3 columns, first position
# histplot replaces the deprecated distplot; is the data normally distributed?
sns.histplot(df["LSTAT"], bins=30, kde=True)
plt.title('Histogram', fontsize=20)
plt.subplot(1, 3, 2)  # second position of the figure
sns.boxplot(y=df["LSTAT"])  # boxplot
plt.title('Boxplot', fontsize=20)
plt.subplot(1, 3, 3)  # third position of the figure
# Q-Q plot: compares the data with a normal distribution
stats.probplot(df["LSTAT"], dist="norm", plot=plt)
plt.title("Q-Q plot", fontsize=20)
plt.show()
figure 3

As figure 3 shows, the distribution plot indicates that the data is right-skewed, and the box plot shows some data points beyond the upper whisker, so outliers are present in the dataset. In the Q-Q plot the points bend away from the 45-degree reference line, which also indicates the presence of outliers. Our main task now is to find the minimum and maximum boundary values outside which a data point is considered an outlier.

# finding the lower and upper boundary limits
def non_normal_outliers(data, col):
    IQR = data[col].quantile(0.75) - data[col].quantile(0.25)
    lower_limit = data[col].quantile(0.25) - (1.5 * IQR)
    upper_limit = data[col].quantile(0.75) + (1.5 * IQR)
    return "lower limit of dataset : {0}, upper limit of dataset : {1}".format(lower_limit, upper_limit)

non_normal_outliers(df, "LSTAT")

We can write the same code in another way, so that the maximum and minimum values are stored in a list.

list1 = []

def outer_function(data, col):
    # The nested helpers are hidden from the outer code and share this IQR
    IQR = data[col].quantile(0.75) - data[col].quantile(0.25)

    def max_value(data, col):
        return data[col].quantile(0.75) + (1.5 * IQR)

    list1.append(max_value(data, col))  # list1[0]: maximum boundary

    def min_value(data, col):
        return data[col].quantile(0.25) - (1.5 * IQR)

    list1.append(min_value(data, col))  # list1[1]: minimum boundary

outer_function(df, "LSTAT")
list1
--------------------------------------------------------------------
[out]>> [31.962500000000006, -8.057500000000005]

Removing outliers :

# keep only the values between the minimum (list1[1]) and maximum (list1[0]) boundaries
df = df.loc[(df["LSTAT"] < list1[0]) & (df["LSTAT"] > list1[1])]
--------------------------------------------------------------------
# plotting the dataset after eliminating outliers
plt.figure(figsize=(16, 4))
plt.subplot(1, 3, 1)
sns.histplot(df["LSTAT"], bins=30, kde=True)  # histplot replaces the deprecated distplot
plt.title('Histogram')
plt.subplot(1, 3, 2)
sns.boxplot(y=df["LSTAT"])
plt.title('Boxplot')
plt.subplot(1, 3, 3)
stats.probplot(df["LSTAT"], dist="norm", plot=plt)
plt.ylabel('LSTAT quantiles')
plt.show()
figure 4

Conclusion

I hope you have gained sufficient knowledge about this topic. Please share your opinion about this blog. Keep learning, keep growing.
