Outlier Treatment with Python

Sangita Yemulwar
Published in Analytics Vidhya
4 min read · Sep 16, 2019
Photo by Jessica Ruscello on Unsplash

1 — What is an Outlier?

An outlier is a data point that lies far from all the other observations in a data set.

In other words, it is a data point that lies outside the overall distribution of the dataset.

Many people confuse extreme values with outliers.

2 — What is an Extreme Value?

An extreme value is just a minimum or a maximum; it need not be much different from the rest of the data.

3 — What is the difference between an Extreme Value & an Outlier?

An extreme value is just a minimum or a maximum and need not be much different from the rest of the data, whereas a point that is far away from the other points is called an outlier.

Example: Age of employees

Age = 21, 23, 24, 25, 26, 28, 30, 45

Where

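Extreme value = 30 (the maximum once the outlier is set aside; it sits close to the other ages)

Outlier = 45 (far away from every other age)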
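One informal way to see the difference is to sort the ages and look at the gap between each value and its neighbour. The sketch below does this in plain Python; the variable names are illustrative and not part of the original example.

```python
# Employee ages from the example above.
ages = [21, 23, 24, 25, 26, 28, 30, 45]

ordered = sorted(ages)

# Gap between each age and the next larger one.
gaps = [b - a for a, b in zip(ordered, ordered[1:])]

print("minimum:", ordered[0])
print("maximum:", ordered[-1])
print("gaps between neighbours:", gaps)
```

Every gap is 1 or 2 years except the last one, which is 15 years. The age of 45 sits far from everything else, while 30 is only 2 years above its neighbour.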

4 — What are the reasons for an outlier to exist in a dataset?

4.1 - Variability in the data

4.2 - An experimental measurement error

5 — How can we Identify an outlier?

5.1 - Using box plots

5.2 - Using scatter plots

5.3 - Using the Z score

6 — There are Two Methods for Outlier Treatment

  1. Interquartile Range (IQR) Method
  2. Z Score method

6.1 — IQR Method

Using the IQR, we can find outliers.

6.1.1 — What are the criteria to identify an outlier?

A data point is an outlier if it falls more than 1.5 times the interquartile range above the 3rd quartile (Q3) or below the 1st quartile (Q1), that is, above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR, where IQR = Q3 - Q1.
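As a quick check of this rule, the sketch below applies it to the employee ages from section 3, using only the Python standard library. The names `lower_fence` and `upper_fence` are illustrative. The `method="inclusive"` option interpolates linearly between data points, which should agree with the default used by pandas' `quantile()`.

```python
from statistics import quantiles

ages = [21, 23, 24, 25, 26, 28, 30, 45]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences counts as an outlier.
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(q1, q3, iqr)               # 23.75 28.5 4.75
print(lower_fence, upper_fence)  # 16.625 35.625
print(outliers)                  # [45]
```

The age of 30 falls inside the fences, so it is only an extreme value, while 45 lies above the upper fence and is flagged as an outlier.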

6.1.2 — Removing Outliers using IQR

Step 1: — Collect and Read the Data

Step 2: — Check shape of data

Step 3: — Check Outliers

import seaborn as sns
sns.boxplot(data=df, x='hp')

Step 4: — Implementation

Q1 = df['hp'].quantile(0.25)
Q3 = df['hp'].quantile(0.75)
IQR = Q3 - Q1
print(Q1)
print(Q3)
print(IQR)
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
print(Lower_Whisker, Upper_Whisker)

Output: -

96.5
180.0
83.5
-28.75 305.25

Step 5: — Outlier Treatment

Outliers are any points below Lower_Whisker or above Upper_Whisker. Apply both conditions to keep only the rows inside the whiskers:

df = df[(df['hp'] >= Lower_Whisker) & (df['hp'] <= Upper_Whisker)]

Step 6: — Check shape of data
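The six steps can be sketched end to end as follows. The article does not show its data file, so the data here is an assumption: the values are the hp column of the public mtcars dataset, used as a stand-in because it gives the same quartiles as the output printed in Step 4. The lower-case variable names are illustrative.

```python
import pandas as pd

# Assumed stand-in data: hp column of the public mtcars dataset.
hp = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180,
      205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113,
      264, 175, 335, 109]
df = pd.DataFrame({"hp": hp})
print(df.shape)  # shape before treatment

q1 = df["hp"].quantile(0.25)
q3 = df["hp"].quantile(0.75)
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# True for rows inside the whiskers (both bounds inclusive).
inside = df["hp"].between(lower_whisker, upper_whisker)
cleaned = df[inside]

print(df[~inside])    # rows treated as outliers
print(cleaned.shape)  # shape after treatment
```

Comparing the two shapes shows how many rows the treatment dropped. With this stand-in data a single row (hp = 335) lies above the upper whisker, and nothing lies below the lower one.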

6.2 — Z Score Method

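Using the Z score, we can find outliers.

6.2.1 — What are the criteria to identify an outlier?

A data point is an outlier if it falls more than 3 standard deviations from the mean. The z score measures this distance in units of standard deviation, so a point whose absolute z score is greater than 3 is flagged. A stricter cutoff of 2 standard deviations is sometimes used instead.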
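To make the criterion concrete, the sketch below computes z scores by hand with NumPy for the employee ages from section 3. It uses the population standard deviation (ddof=0), which is also the default of `scipy.stats.zscore`; the variable names are illustrative.

```python
import numpy as np

ages = np.array([21, 23, 24, 25, 26, 28, 30, 45])

# z = (x - mean) / standard deviation, taken as an absolute value.
z = np.abs((ages - ages.mean()) / ages.std())

print(np.round(z, 2))
print(ages[z > 3])  # ages flagged with a cutoff of 3
print(ages[z > 2])  # ages flagged with a cutoff of 2
```

With only eight values, the age of 45 has a z score of about 2.45, so a cutoff of 3 does not flag it while a cutoff of 2 does. In a sample of n points, no z score computed this way can exceed the square root of n - 1 (about 2.65 for n = 8), so the 3 standard deviation rule is better suited to larger datasets.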

6.2.2 — Following are the steps to remove outliers

Step 1: — Collect data and Read file

Step 2: — Check shape of data

Step 3: — Get the Z-score table.

import numpy as np
from scipy import stats

z = np.abs(stats.zscore(df['hp']))
print(z)

Step 4: — Identify the outliers

We find the z-score for each data point in the dataset, and if the absolute z-score is greater than 3 then we classify that point as an outlier. Any point more than 3 standard deviations from the mean is an outlier.

threshold = 3
print(np.where(z > threshold))

Output: -

(array([  8,  13,  95, 116], dtype=int64),)

Step 5: — Remove the outliers

df1 = df[z < 3]
print(df1)

Step 6: — Check shape of data

The above steps will remove the outliers from the dataset.
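A minimal end-to-end sketch of these steps is shown below. The data is hypothetical (30 ordinary readings plus one extreme value) because the article's file is not shown, and the z score is computed directly with pandas and NumPy using the population standard deviation, the same convention as the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 30 ordinary readings plus one extreme value.
df = pd.DataFrame({"hp": list(range(90, 120)) + [500]})
print(df.shape)  # (31, 1)

# Absolute z score of every row (population standard deviation).
z = np.abs((df["hp"] - df["hp"].mean()) / df["hp"].std(ddof=0))

threshold = 3
print(np.where(z > threshold))  # positions of the outliers

# Keep only the rows whose z score is below the threshold.
df1 = df[z < threshold]
print(df1.shape)  # (30, 1)
```

Only the last row (hp = 500) has a z score above 3, so the cleaned DataFrame has one row fewer than the original.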
