Removing Outliers using Z-Score, IQR

Dharmaraj
2 min read · Apr 20, 2022

When preparing data for a model, we often find outliers in the data set. Outliers are extreme values: values that diverge from the rest of the data and do not follow its overall pattern. This post covers two methods for detecting and removing them: the Z-score and the interquartile range (IQR).

Using Z-Score

In the Z-score method, a value is treated as an outlier if it lies more than 3 standard deviations from the mean. The z-score of a value x is (x - mean) / std, so the test is |z| > 3.

import numpy as np

mydata = [2,3,3,12,3,9,12,4,340,5,2,2,6,1,8,6,1,300,7,5,4,3,9]

def detect_outliers(data, threshold=3):
    """Return the values more than `threshold` standard deviations from the mean."""
    outliers = []  # local list, so repeated calls do not accumulate results
    mean = np.mean(data)
    std = np.std(data)
    for value in data:
        z_score = (value - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(value)
    return outliers

print(detect_outliers(mydata))
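
The loop above can also be written with NumPy array operations. The following is a minimal sketch rather than the only way to do it: the function name `zscore_outliers` and its `threshold` parameter are illustrative names introduced here. Like the code above, it uses the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    arr = np.asarray(values, dtype=float)
    std = arr.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values are identical, so nothing deviates from the mean
        return arr[:0]
    z_scores = (arr - arr.mean()) / std
    return arr[np.abs(z_scores) > threshold]

mydata = [2,3,3,12,3,9,12,4,340,5,2,2,6,1,8,6,1,300,7,5,4,3,9]
print(zscore_outliers(mydata))
```

For `mydata` this flags 340 and 300. Note that the outliers themselves inflate the mean and the standard deviation, so on small samples the Z-score test can miss values that look extreme by eye.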

Using Interquartile Range (IQR)

Quartiles

Quartiles mark each 25% of a set of data:

  1. The first quartile Q1 is the 25th percentile
  2. The second quartile Q2 is the 50th percentile
  3. The third quartile Q3 is the 75th percentile
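
As a quick illustration, the three quartiles can be computed with `np.percentile`. The `sample` list below is made up for this example and does not come from the data used elsewhere in this post.

```python
import numpy as np

# Illustrative sample, not taken from the article's data
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# np.percentile interpolates linearly between data points by default
q1, q2, q3 = np.percentile(sample, [25, 50, 75])
print(q1, q2, q3)  # 3.0 5.0 7.0
```

Other interpolation rules exist, so different tools may report slightly different quartiles for the same data.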

Interquartile Range

The interquartile range IQR is the range in values from the first quartile Q1 to the third quartile Q3. Find the IQR by subtracting Q1 from Q3.

IQR = Q3 - Q1

Lower and Upper fence

The lower fence is the “lower limit” and the upper fence is the “upper limit” of the data. Any value lying outside these bounds can be considered an outlier.

LF = Q1 - 1.5 * IQR
UF = Q3 + 1.5 * IQR
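
The two formulas can be wrapped in a small helper. This is a sketch under stated assumptions: `iqr_fences` and its multiplier parameter `k` are hypothetical names introduced here, and `sample` is a made-up list used only to show the arithmetic.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the given values."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative sample, not taken from the article's data
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lower, upper = iqr_fences(sample)
print(lower, upper)  # -3.0 13.0
```

Here Q1 = 3 and Q3 = 7, so IQR = 4, LF = 3 - 1.5 * 4 = -3 and UF = 7 + 1.5 * 4 = 13.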

Once the lower and upper fences are computed, any value that falls outside them is treated as an outlier.

import numpy as np

data = [240,220,200,2,3,12,3,-35,12,-80,54,11,9,12,4,5,2,54,6,1,85,6,1,7,5,4,3,9]
dataset = sorted(data)

# First and third quartiles
q1, q3 = np.percentile(dataset, [25, 75])
iqr = q3 - q1

# Fences: values outside [lower_fence, higher_fence] are outliers
lower_fence = q1 - (1.5 * iqr)
higher_fence = q3 + (1.5 * iqr)

outliers = []
final_data = []
for i in dataset:
    if i < lower_fence or i > higher_fence:
        outliers.append(i)
    else:
        final_data.append(i)

print("Outliers :", outliers)
print("Data After removed Outliers :", final_data)

Output

Outliers : [-80, -35, 54, 54, 85, 200, 220, 240]
Data After removed Outliers : [1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 9, 9, 11, 12, 12, 12]
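
The same split can be done without an explicit loop by using a NumPy boolean mask. This is a minimal sketch: the function name `split_outliers_iqr` and the parameter `k` are illustrative names introduced here. Values are converted to floats, so the printed numbers carry a decimal point.

```python
import numpy as np

def split_outliers_iqr(values, k=1.5):
    """Return (outliers, kept) as sorted arrays, using the IQR fences."""
    arr = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # True where a value lies strictly outside the fences
    is_outlier = (arr < lower) | (arr > upper)
    return arr[is_outlier], arr[~is_outlier]

data = [240,220,200,2,3,12,3,-35,12,-80,54,11,9,12,4,5,2,54,6,1,85,6,1,7,5,4,3,9]
outliers, kept = split_outliers_iqr(data)
print("Outliers :", outliers.tolist())
print("Kept     :", kept.tolist())
```

For this data set the function flags the same eight values as the loop above and keeps the remaining twenty.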

Boxplot

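A boxplot shows the same result visually. The code below reuses data and final_data from the IQR example and draws them side by side.

import seaborn as sns
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2)  # separate axes so the plots do not overlap
sns.boxplot(x=data, ax=ax1)  # original data, outliers drawn as individual points
sns.boxplot(x=final_data, ax=ax2)  # data after removing outliers
plt.show()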

Have doubts? Need help? Contact me!

LinkedIn: https://www.linkedin.com/in/dharmaraj-d-1b707898

Github: https://github.com/DharmarajPi


Dharmaraj

I have worked on projects that involved Machine Learning, Deep Learning, Computer Vision, and AWS. https://www.linkedin.com/in/dharmaraj-d-1b707898/