# Outlier Detection with Multivariate Normal Distribution in Python

*All the code files are available at: https://github.com/ashwinhprasad/Outliers-Detection/blob/master/Outliers.ipynb*

**What is an Outlier?**

Anything that is unusual and deviates from the standard “normal” is called an **anomaly** or an **outlier**. Detecting these anomalies in the given data is called anomaly detection.

For more theoretical information about outlier or anomaly detection, check out: **How Anomaly Detection Works?**

# Why do we need to detect or remove outliers?

**Case 1:** Consider a big manufacturing company that builds airplanes. An airplane has many different parts, and we don’t want any of them to behave in an unusual way. Such unusual behaviour can have various causes. We want to detect these parts before they are fitted into an airplane; otherwise the lives of the passengers might be in danger.

**Case 2:** As the image above shows, outliers can affect the equation of the line of best fit. So, before fitting a regression line, it is important to remove outliers in order to get more accurate predictions.

In this post, I will be using the multivariate normal distribution to detect outliers.

# Data Preparation

1. Importing the libraries

import pandas as pd

import numpy as np

import random

import matplotlib.pyplot as plt

2. Creating Custom Dataset

*#define x1 and x2*

x1 = np.arange(1,50,1)

x2 = np.square(x1) + np.random.randint(-200,200)
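One detail worth knowing about the line above: `np.random.randint(-200,200)` called without a `size` argument returns a single integer, so every point is shifted by the same amount. The sketch below shows this, next to a hypothetical variant that draws one noise value per point. The per-point variant is an assumption about what one might want, not something this post uses; the rest of the post keeps the original line, and the names `shared_shift`, `per_point_noise` and the seed are introduced here only for illustration.

```python
import numpy as np

x1 = np.arange(1, 50, 1)

# Without `size`, randint returns one integer: every point gets the same shift.
shared_shift = np.random.randint(-200, 200)
x2_shared = np.square(x1) + shared_shift

# Hypothetical variant: one independent noise value per point.
rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible
per_point_noise = rng.integers(-200, 200, size=x1.shape)
x2_noisy = np.square(x1) + per_point_noise

print("shared shift:", shared_shift)
print("first five per-point noise values:", per_point_noise[:5])
```

With per-point noise the data no longer lie on a clean curve, so the threshold used later in the post might need to be re-tuned.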

3. Adding an outlier to the dataset

*#adding outliers*

x1 = np.append(x1,17)

x2 = np.append(x2,1300)

data = np.stack((x1,x2),axis=1)

4. Visualizing the Dataset

plt.scatter(x1,x2)

Now we have to try to detect the outlier in this dataset.

# Anomaly Detection with Multivariate Normal Distribution

from scipy.stats import multivariate_normal

*#calculate the covariance matrix*

data = np.stack((x1,x2),axis=0)

covariance_matrix = np.cov(data)

*#calculating the mean*

mean_values = [np.mean(x1),np.mean(x2)]

*#multivariate normal distribution*

model = multivariate_normal(cov=covariance_matrix,mean=mean_values)

data = np.stack((x1,x2),axis=1)

*#finding the outliers*

threshold = 1.0e-07

outlier = model.pdf(data).reshape(-1) < threshold

- In the first step, we stack **x1** with **x2** and store the result in the variable **data**.
- We calculate the **covariance matrix** of the data and the **mean** of both **x1** and **x2**.
- **multivariate_normal** from **scipy.stats** provides a method named **pdf**, which evaluates the probability density of the fitted distribution at each datapoint (the theory behind this is covered in How Anomaly Detection Works).
- We calculate this density for all the datapoints in the dataset.
- We also choose a **threshold value**. Any datapoint whose density is lower than this threshold is considered an anomaly, and we store the resulting boolean array in the variable **outlier**.
- Here, True = anomaly and False = not an anomaly.
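For reference, the quantity that `pdf` evaluates for two variables is the standard bivariate normal density. The symbols are introduced here and are not part of the original post: $\mu$ stands for `mean_values`, $\Sigma$ for `covariance_matrix`, and $x$ for one row of `data`.

$$
p(x) = \frac{1}{2\pi\,\lvert\Sigma\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\,\Sigma^{-1}\,(x-\mu)\right)
$$

Because this is a density rather than a probability, its scale depends on the units and spread of the data, so a threshold such as `1.0e-07` should be read as specific to this dataset rather than as a general-purpose value.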

for index, is_outlier in enumerate(outlier):

    if is_outlier:

        print(data[index]," is an Outlier")

output:

[ 17 1300] is an Outlier

- All the datapoints whose value is lower than the threshold are marked True in the **outlier** array.
- We print the flagged points and compare them with the outlier that we injected into the dataset in the data creation part.
- With this method, we have successfully found the outlier that stood out from the pattern of the data.
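The loop above can also be expressed with NumPy boolean indexing. The following is a minimal, self-contained sketch of the same steps, not the notebook's exact code: it leaves out the random offset so that the result is reproducible (an assumption made here), and the names `base`, `flagged_points` and `flagged_indices` are introduced only for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Dataset as in the post, but without the random offset
# (an assumption made here so the result is reproducible).
base = np.arange(1, 50, 1)
x1 = np.append(base, 17)
x2 = np.append(np.square(base), 1300)
data = np.stack((x1, x2), axis=1)

# Fit the distribution: per-column means and the 2x2 covariance matrix.
model = multivariate_normal(mean=data.mean(axis=0),
                            cov=np.cov(data, rowvar=False))

# Flag every point whose density falls below the threshold.
threshold = 1.0e-07
outlier = model.pdf(data) < threshold

# Boolean indexing returns the flagged rows; flatnonzero returns their positions.
flagged_points = data[outlier]
flagged_indices = np.flatnonzero(outlier)
print(flagged_points)
print(flagged_indices)
```

Indexing with the boolean array avoids the explicit loop and gives the flagged rows as an array that can be inspected or dropped in one step.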

# Conclusion

The multivariate normal distribution is a useful tool for finding outliers because it also takes into account how the variables change together (their covariance), which many other methods do not do.
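To make that point concrete, here is a small sketch comparing a per-variable check (z-scores) with a joint measure (the squared Mahalanobis distance, which the multivariate normal density decreases with). It rebuilds the post's dataset without the random offset so that the numbers are reproducible; that simplification, and the names `z_scores` and `mahalanobis_sq`, are assumptions introduced here.

```python
import numpy as np

# Same construction as the post's dataset, with the random offset left out
# (an assumption made here so the result is reproducible).
base = np.arange(1, 50, 1)
x1 = np.append(base, 17)
x2 = np.append(np.square(base), 1300)
data = np.stack((x1, x2), axis=1).astype(float)

mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

# Per-variable view: z-score of each coordinate, ignoring the other variable.
z_scores = (data - mean) / data.std(axis=0, ddof=1)

# Joint view: squared Mahalanobis distance, which uses the full covariance matrix.
diff = data - mean
mahalanobis_sq = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

print("z-scores of the injected point:", z_scores[-1])
print("index of the largest Mahalanobis distance:", int(np.argmax(mahalanobis_sq)))
```

Taken one variable at a time, the injected point (17, 1300) looks ordinary, since both of its coordinates lie well inside the range of the data. Only when the relationship between the two variables is taken into account does it stand out.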

Outlier detection is used in many fields, as in the examples given at the top, and is well worth learning.

Just a side note: anomaly detection and removal is as important as removing an impostor in Among Us.

If not removed, an outlier might affect the entire model or cause problems, just like an impostor killing the crewmates or sabotaging the ship.