Outlier Treatment with Python
1 — What is an Outlier?
An outlier is a data point that lies far from the other observations in a data set.
In other words, it is a data point that falls outside the overall distribution of the data set.
Many people confuse extreme values with outliers.
2 — What is an Extreme Value?
An extreme value is simply a minimum or a maximum; it need not differ much from the rest of the data.
3 — What is the difference between an Extreme Value & an Outlier?
An extreme value is simply a minimum or a maximum and need not differ much from the rest of the data, whereas an outlier is a point that lies far away from the other points.
Example: Age of employees
Age = 21, 23, 24, 25, 26, 28, 30, 45
Where
Extreme value = 30 (the largest age that is still close to the others)
Outlier = 45 (far away from every other age)
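As a rough check of this example, the sketch below applies the 1.5 × IQR rule from section 6.1 to the ages using NumPy. The variable names are illustrative, and the quartiles use NumPy's default linear interpolation, so other quartile conventions may give slightly different fences.

```python
import numpy as np

ages = np.array([21, 23, 24, 25, 26, 28, 30, 45])

# Quartiles with NumPy's default (linear) interpolation
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper)  # 16.625 35.625
print(outliers)      # [45]
```

Under these assumptions only 45 falls outside the fences, while 30 stays inside them.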
4 — Why do outliers exist in a dataset?
4.1 - Variability in the data
4.2 - An experimental measurement error
5 — How can we identify an outlier?
5.1 - Using box plots
5.2 - Using scatter plots
5.3 - Using the z-score
6 — There are Two Methods for Outlier Treatment
- Interquartile Range (IQR) method
- Z Score method
6.1 — IQR Method
Using the IQR, we can find outliers.
6.1.1 — What are the criteria to identify an outlier?
A data point is an outlier if it lies more than 1.5 times the interquartile range (IQR = Q3 - Q1) above the 3rd quartile (Q3) or below the 1st quartile (Q1).
6.1.2 — Removing Outliers using IQR
Step 1: — Collect and Read the Data
Step 2: — Check shape of data
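Steps 1 and 2 might look like the sketch below. The article does not name its data file, so the CSV text and its `model` column are made-up stand-ins; with a real file, you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real data file
csv_text = "model,hp\nA,110\nB,93\nC,175\nD,335\n"

# Step 1: read the data into a DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: shape is (number of rows, number of columns)
print(df.shape)  # (4, 2)
```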
Step 3: — Check Outliers
import seaborn as sns
# Box plot of the hp column; points drawn beyond the whiskers are potential outliers
sns.boxplot(data=df, x=df['hp'])
Step 4: — Implementation
Q1 = df['hp'].quantile(0.25)
Q3 = df['hp'].quantile(0.75)
IQR = Q3 - Q1
print(Q1)
print(Q3)
print(IQR)
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
print(Lower_Whisker, Upper_Whisker)
Output:
96.5
180.0
83.5
-28.75 305.25
Step 5: — Outlier Treatment
Outliers are any points below Lower_Whisker or above Upper_Whisker. Apply both conditions to keep only the points between the whiskers:
df = df[(df['hp'] > Lower_Whisker) & (df['hp'] < Upper_Whisker)]
Step 6: — Check shape of data
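The whole IQR workflow can be wrapped in a small helper, sketched below. The function name `remove_outliers_iqr`, the parameter `k`, and the horsepower values are all illustrative assumptions, not part of the original walkthrough. This version keeps values that sit exactly on a whisker, which is a design choice.

```python
import pandas as pd


def remove_outliers_iqr(df, column, k=1.5):
    """Return the rows of df whose `column` value lies inside the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep rows between the fences (inclusive)
    return df[(df[column] >= lower) & (df[column] <= upper)]


# Made-up horsepower values; 335 is far from the rest
df = pd.DataFrame({"hp": [90, 95, 100, 105, 110, 115, 120, 125, 130, 335]})
cleaned = remove_outliers_iqr(df, "hp")

# Comparing shapes shows how many rows were dropped
print(df.shape, cleaned.shape)  # (10, 1) (9, 1)
```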
6.2 — Z Score Method
Using the z-score, we can find outliers.
6.2.1 — What are the criteria to identify an outlier?
A data point is an outlier if it lies more than 3 standard deviations from the mean, that is, if the absolute value of its z-score is greater than 3.
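For reference, a z-score is a value's distance from the mean measured in standard deviations. The sketch below computes it by hand with NumPy; the helper name `z_scores` and the sample values are illustrative. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np


def z_scores(values):
    """Return (x - mean) / std for each value, using the population std."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std()


# Sample with mean 5 and population standard deviation 2
z = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```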
6.2.2 — Steps to remove outliers
Step 1: — Collect the data and read the file
Step 2: — Check shape of data
Step 3: — Get the Z-score table.
import numpy as np
from scipy import stats
# Absolute z-score of every value in the hp column
z = np.abs(stats.zscore(df.hp))
print(z)
Step 4: — Identify Outliers
We compute the z-score for each data point in the dataset, and if the absolute z-score is greater than 3, then we classify that point as an outlier. In other words, any point more than 3 standard deviations from the mean is an outlier.
threshold = 3
print(np.where(z > threshold))
Output:
(array([ 8, 13, 95, 116], dtype=int64),)
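The output is a tuple holding one array of positions, because `np.where` with a single condition returns one index array per dimension. A minimal sketch with made-up z-scores shows how to pull out the row positions:

```python
import numpy as np

# Made-up absolute z-scores for four rows
z = np.array([0.5, 3.2, 1.0, 4.1])

# np.where returns a tuple; element 0 holds the positions where z > 3
outlier_positions = np.where(z > 3)[0]
print(outlier_positions)  # [1 3]
```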
Step 5: — Outlier Treatment
df1 = df[z < 3]
print(df1)
Step 6: — Check shape of data
The above steps remove the outliers from the dataset.
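Put together, the z-score steps might be sketched as one helper. The function name `remove_outliers_zscore` and the data are illustrative assumptions. The z-scores are computed directly with NumPy using the population standard deviation, which matches the default behaviour of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd


def remove_outliers_zscore(df, column, threshold=3.0):
    """Return the rows of df whose absolute z-score in `column` is below threshold."""
    values = df[column].to_numpy(dtype=float)
    z = np.abs((values - values.mean()) / values.std())
    return df[z < threshold]


# Made-up horsepower values: 19 typical readings and one far-away point (500)
hp = [90, 92, 94, 96, 98, 100, 102, 104, 106, 108,
      110, 95, 97, 99, 101, 103, 105, 93, 107, 500]
df = pd.DataFrame({"hp": hp})
cleaned = remove_outliers_zscore(df, "hp")

# Comparing shapes shows how many rows were dropped
print(df.shape, cleaned.shape)  # (20, 1) (19, 1)
```

One caveat is that with only a handful of rows no point can reach an absolute z-score above 3, so this method suits larger samples.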