All the code files are available at: https://github.com/ashwinhprasad/Outliers-Detection/blob/master/Outliers.ipynb
What is an Outlier?
Anything that is unusual and deviates from the standard "normal" is called an anomaly or an outlier.
Detecting these anomalies in the given data is called anomaly detection.
For more theoretical information about outlier or anomaly detection, check out: How Anomaly Detection Works?
Why do we need to detect or remove outliers?
Case 1: Consider a large company that is manufacturing an airplane. An airplane has many parts, and we don't want any of them to behave in an unusual way. Such unusual behaviour can have various causes. We want to detect these parts before they are fitted to the airplane; otherwise the lives of the passengers might be in danger.
Case 2: As the image above shows, outliers can change the equation of the line of best fit. So, before fitting a model such as a regression line, it is important to remove outliers in order to get more accurate predictions.
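To make Case 2 concrete, here is a minimal sketch of how a single outlier can pull a fitted line. The toy data (points on y = 2x + 1 plus one injected point at (9, 100)) is made up for illustration and is not the dataset used later in this post.

```python
import numpy as np

# Hypothetical toy data: ten points lying exactly on y = 2x + 1
x_clean = np.arange(10, dtype=float)
y_clean = 2.0 * x_clean + 1.0

# The same data with one injected outlier at (9, 100)
x_dirty = np.append(x_clean, 9.0)
y_dirty = np.append(y_clean, 100.0)

# Fit a straight line (degree-1 polynomial) to both versions
slope_clean, intercept_clean = np.polyfit(x_clean, y_clean, 1)
slope_dirty, intercept_dirty = np.polyfit(x_dirty, y_dirty, 1)

print("slope without outlier:", slope_clean)
print("slope with outlier:   ", slope_dirty)
```

The fit on the clean data recovers a slope of about 2, while the single extra point pulls the slope far away from it.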
In this post, I will be using the multivariate normal distribution to detect outliers.
1. Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. Creating Custom Dataset
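#define x1 and x2
x1 = np.arange(1,50,1)
#note: randint without a size argument returns a single integer,
#so the same random offset is added to every value of x2
x2 = np.square(x1) + np.random.randint(-200,200)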
3. Adding an outlier to the dataset
x1 = np.append(x1,17)
x2 = np.append(x2,1300)
data = np.stack((x1,x2),axis=1)
4. Visualizing the Dataset
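The original plot is not reproduced here. A minimal sketch of how the dataset could be visualized is shown below; the colours, labels and the choice to highlight the injected point are assumptions, not the author's original figure.

```python
import numpy as np
import matplotlib.pyplot as plt

# Recreate the dataset from steps 2 and 3
x1 = np.arange(1, 50, 1)
x2 = np.square(x1) + np.random.randint(-200, 200)
x1 = np.append(x1, 17)
x2 = np.append(x2, 1300)

# Scatter plot; the last point is the injected outlier
fig, ax = plt.subplots()
ax.scatter(x1[:-1], x2[:-1], label="generated points")
ax.scatter(x1[-1:], x2[-1:], color="red", label="injected outlier")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.legend()
# In a script, call plt.show() to display the figure
```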
Now, we have to try to detect the outlier in this dataset.
Anomaly Detection with Multivariate Normal Distribution
from scipy.stats import multivariate_normal
#calculate the covariance matrix (np.cov expects one variable per row)
data = np.stack((x1,x2),axis=0)
covariance_matrix = np.cov(data)
#calculating the mean
mean_values = [np.mean(x1),np.mean(x2)]
#multivariate normal distribution
model = multivariate_normal(cov=covariance_matrix,mean=mean_values)
#restack so that each row is one datapoint (x1, x2)
data = np.stack((x1,x2),axis=1)
#finding the outliers
threshold = 1.0e-07
outlier = model.pdf(data).reshape(-1) < threshold
- In the first step, we stack x1 with x2 and store the result in the variable data.
- We calculate the covariance matrix of data and the mean of both x1 and x2.
- multivariate_normal is an object in scipy.stats; calling it with a mean and a covariance matrix returns a distribution whose pdf method evaluates the probability density at each datapoint. (The theory behind this is covered in How Anomaly Detection Works.)
- We calculate this density for all the datapoints in the dataset.
- We also choose a threshold value. Any datapoint whose density is lower than this threshold is considered an anomaly. The comparison produces a boolean array, which we store in the variable outlier.
- Here, True = anomaly and False = not an anomaly.
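To show what the density evaluation in these steps involves, here is a minimal NumPy-only sketch of the multivariate normal density. The helper name gaussian_density is hypothetical, and the dataset is rebuilt without the random offset (an assumption made so the example is reproducible). For a well-conditioned covariance matrix it should agree with scipy's pdf up to floating-point error.

```python
import numpy as np

def gaussian_density(points, mean, cov):
    """Evaluate the multivariate normal density at each row of `points`."""
    points = np.atleast_2d(points).astype(float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    k = mean.shape[0]                      # number of variables
    diff = points - mean
    # Squared Mahalanobis distance: diff^T * cov^-1 * diff for every row
    maha_sq = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(cov))
    return np.exp(-0.5 * maha_sq) / norm

# Dataset as in the post, but without the random offset (assumption)
x1 = np.append(np.arange(1, 50), 17)
x2 = np.append(np.square(np.arange(1, 50)), 1300)
data = np.stack((x1, x2), axis=1)

mean_values = data.mean(axis=0)
covariance_matrix = np.cov(data, rowvar=False)

density = gaussian_density(data, mean_values, covariance_matrix)
outlier = density < 1.0e-07               # same threshold as in the post
print(data[outlier])
```

The density falls off exponentially with the squared Mahalanobis distance from the mean, which is why a point far from the overall pattern gets a very small value.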
#print every datapoint that was flagged as an outlier
for index, is_outlier in enumerate(outlier):
    if is_outlier:
        print(data[index]," is an Outlier")
output:
[ 17 1300] is an Outlier
- All the datapoints whose density is lower than the threshold are marked True in the outlier array.
- We print the flagged datapoints and compare them with the outlier that we injected into the dataset in the data creation step.
- With this method, we have successfully found the outlier that stood out from the pattern of the data.
The multivariate normal distribution is a useful tool for finding outliers because, through the covariance matrix, it also takes into account how the variables vary together, which many simpler per-variable methods do not.
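A small sketch can illustrate this point by comparing per-variable z-scores with the Mahalanobis distance, which uses the covariance matrix. The dataset is rebuilt here without the random offset (an assumption for reproducibility), and the z-score comparison is an illustration added here, not part of the original notebook.

```python
import numpy as np

# Dataset as in the post, but without the random offset (assumption)
x1 = np.append(np.arange(1, 50), 17)
x2 = np.append(np.square(np.arange(1, 50)), 1300)
data = np.stack((x1, x2), axis=1).astype(float)

mean = data.mean(axis=0)
std = data.std(axis=0, ddof=1)

# Per-variable view: how many standard deviations from the mean, per column
z_scores = np.abs((data - mean) / std)

# Joint view: Mahalanobis distance, which accounts for the covariance
cov = np.cov(data, rowvar=False)
diff = data - mean
maha = np.sqrt(np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1))

print("z-scores of the injected point:", z_scores[-1])
print("Mahalanobis distance of the injected point:", maha[-1])
print("index with the largest Mahalanobis distance:", int(np.argmax(maha)))
```

Taken one variable at a time, the injected point (17, 1300) sits within one standard deviation of the mean in both x1 and x2, so a per-variable check would not flag it. Its Mahalanobis distance, however, is the largest in the dataset, because the pair of values does not fit the relationship between x1 and x2.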
Outlier detection is used in many fields, as in the examples given at the top, and is a must-learn.
Just a side note: anomaly detection and removal is as important as removing an imposter in Among Us.
If not removed, an outlier might affect the entire model or cause problems, just like an imposter killing the crewmates or sabotaging the ship.