# Outlier Detection with Multivariate Normal Distribution in Python

Oct 16 · 4 min read

All the code files are available at: https://github.com/ashwinhprasad/Outliers-Detection/blob/master/Outliers.ipynb

# What is an Outlier?

Anything that is unusual and deviates from the standard “normal” is called an anomaly or an outlier.
Detecting these anomalies in a dataset is called anomaly detection.

For more theory on outlier or anomaly detection, see: How Anomaly Detection Works?

# Why do we need to detect or remove outliers?

Case 1: Consider a big manufacturing company that is building an airplane. An airplane has many different parts, and we don’t want any of them to behave in an unusual way. Such unusual behaviour can have various causes. We want to detect these parts before they are fitted to an airplane; otherwise the lives of the passengers might be in danger.

Case 2: As the image above shows, outliers can change the equation of the line of best fit. So, before fitting a regression, it is important to remove outliers in order to get the most accurate predictions.

In this post, I will be using the Multivariate Normal Distribution to detect outliers.
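To make Case 2 concrete, here is a minimal sketch of how one extreme point can change a least-squares line. The data and the injected point `(9, 100)` are made up for illustration and are not from the original notebook.

```python
import numpy as np

# Ten points that lie exactly on the line y = 2x + 1 (illustrative data)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Append one hypothetical outlier far above the line
x_out = np.append(x, 9.0)
y_out = np.append(y, 100.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, deg=1)

print("slope without outlier:", slope_clean)
print("slope with outlier:   ", slope_out)
```

A single point is enough to pull the fitted slope well away from the slope of the remaining data, which is why outliers are usually handled before fitting.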

# Data Preparation

1. Importing the libraries

```python
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
```

2. Creating Custom Dataset

```python
# define x1 and x2
x1 = np.arange(1, 50, 1)
x2 = np.square(x1) + np.random.randint(-200, 200)
```
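One detail worth knowing about the snippet above: `np.random.randint(-200, 200)` called without a `size` argument returns a single integer, so the same offset is added to every point. If independent noise per point is wanted, a sketch along these lines would do it. The seed `0` and the names `rng`, `noise` and `x2_noisy` are my own assumptions, not part of the original notebook.

```python
import numpy as np

# Seeded generator so the dataset is reproducible (seed value is arbitrary)
rng = np.random.default_rng(0)

x1 = np.arange(1, 50, 1)

# One random integer in [-200, 200) per data point instead of one shared offset
noise = rng.integers(-200, 200, size=x1.shape)
x2_noisy = np.square(x1) + noise

print(x2_noisy[:5])
```

Either version works for the rest of the post; the shared offset keeps the points on a smooth curve, while per-point noise makes the dataset look more like real measurements.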

3. Adding an outlier to the dataset

```python
# adding outliers
x1 = np.append(x1, 17)
x2 = np.append(x2, 1300)
data = np.stack((x1, x2), axis=1)
plt.scatter(x1, x2)
```

4. Visualizing the Dataset

Now we have to try to detect the outlier in this dataset.

# Anomaly Detection with Multivariate Normal Distribution

```python
from scipy.stats import multivariate_normal

# calculate the covariance matrix (np.cov expects one row per variable)
data = np.stack((x1, x2), axis=0)
covariance_matrix = np.cov(data)

# calculating the mean
mean_values = [np.mean(x1), np.mean(x2)]

# multivariate normal distribution
model = multivariate_normal(cov=covariance_matrix, mean=mean_values)

# one row per data point, the layout used when evaluating the density
data = np.stack((x1, x2), axis=1)

# finding the outliers
threshold = 1.0e-07
outlier = model.pdf(data).reshape(-1) < threshold
```

1. In the first step, we stack x1 and x2 and store the result in the variable `data`, with one row per variable, which is the layout `np.cov` expects by default.
2. We calculate the covariance matrix of `data` and the mean of both x1 and x2.
3. `multivariate_normal` is an object in SciPy with a method named `pdf`, which evaluates the probability density of the fitted distribution at a given point. A low density means the point lies in a region where the distribution expects few observations. (The theory behind this is covered in How Anomaly Detection Works.)
4. We evaluate this density for all the data points in the dataset.
5. We also choose a threshold value. Any data point whose density is lower than this threshold is considered an anomaly. The comparison produces a boolean array, which we store in the variable `outlier`.
6. Here, `True` = anomaly and `False` = not an anomaly.
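To see what `pdf` computes in the steps above, the sketch below evaluates the multivariate normal density by hand and compares it with SciPy. It uses a deterministic stand-in for the dataset (no random offset, which is my assumption), and `manual_pdf` is a hypothetical helper written for this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Deterministic stand-in for the article's dataset (assumption: offset of 0)
k = np.arange(1, 50, 1)
x1 = np.append(k, 17)
x2 = np.append(np.square(k), 1300)
data = np.stack((x1, x2), axis=1).astype(float)

mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)  # same matrix as np.cov on the axis=0 stack


def manual_pdf(points, mean, cov):
    """Multivariate normal density evaluated at each row of `points`."""
    dim = mean.shape[0]
    diff = points - mean
    # Squared Mahalanobis distance of every row from the mean
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** dim * np.linalg.det(cov))
    return np.exp(-0.5 * d2) / norm


manual = manual_pdf(data, mean, cov)
scipy_vals = multivariate_normal(mean=mean, cov=cov).pdf(data)
print(np.allclose(manual, scipy_vals, rtol=1e-6, atol=0.0))
```

The density depends on the data only through the squared Mahalanobis distance, so thresholding the density is equivalent to thresholding that distance.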
```python
for i, is_outlier in enumerate(outlier):
    if is_outlier:
        print(data[i], " is an Outlier")
```

Output:

```
[  17 1300]  is an Outlier
```

1. All the data points whose density is lower than the threshold are marked `True` in the `outlier` array.
2. We print the flagged points and compare them with the outlier that we injected in the data creation part.
3. With this method, we have successfully found the outlier that stood out from the pattern of the data.
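The same result can be obtained without a loop by using the boolean array as a mask. This is a small sketch with a hand-made four-row `data` array and `outlier` mask (both made up here so the snippet runs on its own).

```python
import numpy as np

# Toy stand-ins for the arrays produced above (values chosen by hand)
data = np.array([[1, 1], [2, 4], [17, 1300], [3, 9]])
outlier = np.array([False, False, True, False])

# Rows of `data` where the mask is True
outlier_rows = data[outlier]

# Positions of the flagged rows in the original array
outlier_idx = np.flatnonzero(outlier)

print(outlier_rows)
print(outlier_idx)
```

Boolean indexing returns the flagged rows directly, and `~outlier` can be used in the same way to keep only the non-anomalous rows.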

# Conclusion

The Multivariate Normal Distribution is a very powerful tool for finding outliers because it also takes into account how the variables change with one another, which simpler methods that check each variable on its own do not do.
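A short sketch of that point, using a deterministic version of the dataset (no random offset, which is my assumption): the injected point looks ordinary when each variable is checked separately with a z-score, but it stands out once the covariance between the variables is taken into account through the Mahalanobis distance.

```python
import numpy as np

# Deterministic stand-in for the article's dataset (assumption: offset of 0)
k = np.arange(1, 50, 1)
x1 = np.append(k, 17)
x2 = np.append(np.square(k), 1300)  # injected point is the last row
data = np.stack((x1, x2), axis=1).astype(float)

mean = data.mean(axis=0)

# Per-variable check: absolute z-score of every value, column by column
z_scores = np.abs(data - mean) / data.std(axis=0)

# Joint check: squared Mahalanobis distance, which uses the covariance matrix
cov = np.cov(data, rowvar=False)
diff = data - mean
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

print("per-variable |z| of injected point:", z_scores[-1])
print("index of largest Mahalanobis distance:", int(np.argmax(d2)))
```

In this setup the injected point lies within one standard deviation of the mean on each axis, so a per-variable rule would not flag it, yet it has the largest Mahalanobis distance in the dataset.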

Outlier detection is used in a lot of fields, as in the examples given at the top, and is a must-learn.

Just a side note: anomaly detection and removal is as important as removing an impostor in Among Us.
If not removed, an outlier might affect the entire model or cause problems, just like an impostor killing the crewmates or sabotaging the ship.

# Thank You

## Analytics Vidhya
