Beyond the Norms: World of Outliers
How outliers keep data scientists on their toes
Outliers are points that lie at an abnormal distance from most of the other points in a dataset. Before diving into any machine learning problem, a key step is to detect outliers and remove those that are not needed. Let us start with where outliers come from.
Common causes of Outliers
- Data Entry error (human error)
- Experimental error
- Measurement error (instrumental error)
- Sampling error (extracting data from various/wrong sources)
- Intentional error (dummy outliers created to test detection methods)
- Natural variation (not an error, but genuine novelties in the data)
Common methods of determining outliers
1. IQR (Interquartile range)
The IQR is the range covered by the middle 50% of the dataset: the difference between the third quartile and the first quartile (Q3 - Q1). It measures variability by dividing the dataset into quartiles. Quartiles are the values that divide your data into 4 equal parts once the data is sorted in ascending order.
IQR = Q3 - Q1
Q1 = 1st quartile (lower quartile, the 25th percentile, below which the lowest 25% of the data falls)
Q2 = 2nd quartile (the median, which is the 50th percentile)
Q3 = 3rd quartile (upper quartile, the 75th percentile, above which the highest 25% of the data falls)
Note: Percentage and percentile are two different things. If the 25th percentile is 8, it simply means that 25% of the data is less than 8. If the 75th percentile is 40, it simply means that 75% of the data is less than 40.
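To make the note concrete, here is a minimal sketch, assuming a toy dataset of the integers 1 to 100 (made up for illustration, not taken from any real data). It checks what share of the values falls below the 25th and 75th percentiles.

```python
import numpy as np

# Toy data (an assumption for illustration): the integers 1..100
values = np.arange(1, 101)

# Percentiles are values from the data's scale, not percentages
p25 = np.percentile(values, 25)
p75 = np.percentile(values, 75)

# Share of observations strictly below each percentile
share_below_p25 = np.mean(values < p25)
share_below_p75 = np.mean(values < p75)

print(p25, share_below_p25)
print(p75, share_below_p75)
```

The percentile is a cut-off value in the units of the data, while the share below it is the percentage.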
If a data value < Q1 - 1.5 × IQR or a data value > Q3 + 1.5 × IQR, it is treated as an outlier.
# Python code for finding outliers with the IQR rule
import numpy as np

data = [1, 4, 6, 3, 500, 24, 53]

# np.percentile does not need pre-sorted input
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

low_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Collect every value that falls outside the limits
outlier = []
for i in data:
    if i > upper_limit or i < low_limit:
        outlier.append(i)

print(outlier)  # [500]
2. Z score
The z-score, also known as the standard score, tells us how far a data point is from the mean, measured in standard deviations. If the data follow a normal distribution, about 99.7% of the points lie within 3 standard deviations of the mean, so points beyond that on either side can be treated as outliers.
So a z-score of 2.5 means the value is 2.5 standard deviations above the mean, and a z-score of -2.5 means it is 2.5 standard deviations below the mean. In other words, the z-score is the number of standard deviations by which a value lies above or below the mean.
The main advantage of the z-score is that it puts every value on a common scale, so under a normal distribution you can tell what percentage of the data is expected to be as extreme as a given point.
Z Score = (x-μ)/σ
x is an observation in the sample
μ is the mean of the observations in the sample
σ is the standard deviation of the observations in the sample
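The formula can be sketched in a few lines of NumPy. The function name `zscore_outliers`, the threshold of 3 and the sample values below are assumptions made for illustration; the standard deviation used here is the population one (`ddof=0`).

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    x = np.asarray(values, dtype=float)
    mu = x.mean()
    sigma = x.std()  # population standard deviation (ddof=0)
    z = (x - mu) / sigma
    return x[np.abs(z) > threshold].tolist()

# Hypothetical sample: 19 ordinary values and one extreme value
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 11, 10, 13, 12, 11, 12, 10, 13, 100]

print(zscore_outliers(sample))  # [100.0]
```

One caveat with this approach: in a very small sample no point can reach a z-score of 3, because with the population standard deviation the largest possible |z| is sqrt(n - 1). For 7 points that limit is about 2.45, so a threshold of 3 would flag nothing.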
3. Sort data and see extreme values
This is the most basic method: sort the data, then look for extreme values, which are your candidate outliers.
For example, suppose we are given ages 4, 6, 9, 2, 10, 12, 102.
Step 1: Sort the data: 2, 4, 6, 9, 10, 12, 102
Step 2: Look for extreme values. Here 102 is far from the rest, so it could be an outlier.
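The same two steps can be sketched in code. Looking at the gaps between neighbouring sorted values is an extra aid assumed here to make the extreme value easier to spot; it is not a formal test.

```python
import numpy as np

ages = [4, 6, 9, 2, 10, 12, 102]

# Step 1: sort the data
sorted_ages = np.sort(ages)

# Step 2: look for extreme values; a large gap between
# neighbouring sorted values hints at where they start
gaps = np.diff(sorted_ages)
largest_gap_at = int(np.argmax(gaps))

print(sorted_ages)
print("Largest gap:", gaps[largest_gap_at],
      "between", sorted_ages[largest_gap_at],
      "and", sorted_ages[largest_gap_at + 1])
```

Here the largest gap sits between 12 and 102, which matches what you would spot by eye.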
4. Plotting a scatter plot or boxplot
- Scatterplot: A scatter plot is a great way to see whether there is a pattern between two variables. It is used when you have paired numerical data or want to determine the relationship between two variables. You can also use it for outlier detection, since points that sit far from the main cloud stand out visually.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("heart.csv")
fig,ax = plt.subplots(figsize = (10,6))
# Scatter with positive examples
pos = ax.scatter(df.age[df.target==1], df.thalach[df.target == 1], color="salmon", label="Heart Disease")
# Scatter with negative examples
neg = ax.scatter(df.age[df.target==0], df.thalach[df.target == 0], color="lightblue", label="No Heart Disease")
#customization
plt.title("Max heart rate in comparison to age")
plt.xlabel("Age")
plt.ylabel("Max heart rate")
plt.legend()
plt.show()
- Boxplot: It summarizes sample data using the 25th, 50th and 75th percentiles. It gives insight into the quartiles, the median and outliers, which are drawn as individual points beyond the whiskers.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("diabetes.csv")
#visualizing each feature to detect the presence of outliers
fig, axis = plt.subplots(nrows=1, ncols=8, figsize=(16, 6))
sns.boxplot(data=df[['Pregnancies']], ax=axis[0]);
sns.boxplot(data=df[['Glucose']], ax=axis[1]);
sns.boxplot(data=df[['BloodPressure']],ax=axis[2]);
sns.boxplot(data=df[['SkinThickness']], ax=axis[3]);
sns.boxplot(data=df[['Insulin']], ax=axis[4]);
sns.boxplot(data=df[['BMI']], ax=axis[5]);
sns.boxplot(data=df[['DiabetesPedigreeFunction']],ax=axis[6]);
sns.boxplot(data=df[['Age']], ax=axis[7]);
5. Hypothesis Testing
You can use hypothesis tests to find outliers. Many outlier tests exist, but I’ll focus on one to illustrate how they work. I will demonstrate Grubbs’ test, which tests the following hypotheses:
Null: All values in the sample were drawn from a single population that follows the same normal distribution.
Alternative: One value in the sample was not drawn from the same normally distributed population as the other values.
If the p-value for this test is less than your significance level, you can reject the null and conclude that one of the values is an outlier.
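As a rough illustration, here is a minimal sketch of a two-sided Grubbs' test, assuming SciPy is available. The function name `grubbs_test` and both samples are made up for this example, and the p-value is the usual approximation built from the t distribution with n - 2 degrees of freedom, capped at 1.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    Returns (suspect value, G statistic, approximate p-value, reject null?).
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = np.abs(x - x.mean())
    idx = int(np.argmax(deviations))        # most extreme observation
    G = deviations[idx] / x.std(ddof=1)     # uses the sample standard deviation

    # Convert G to a t statistic with n - 2 degrees of freedom
    denom = (n - 1) ** 2 - n * G ** 2
    if denom <= 0:                          # G is at its theoretical maximum
        p_value = 0.0
    else:
        t_stat = np.sqrt(n * (n - 2) * G ** 2 / denom)
        p_value = min(1.0, float(2 * n * stats.t.sf(t_stat, df=n - 2)))
    return float(x[idx]), float(G), p_value, bool(p_value < alpha)

# Hypothetical samples: one with an extreme value, one without
with_outlier = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                12, 11, 10, 13, 12, 11, 12, 10, 13, 100]
without_outlier = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]

print(grubbs_test(with_outlier))
print(grubbs_test(without_outlier))
```

Keep in mind that the test assumes the rest of the data is approximately normal and that it checks only one suspect value at a time.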