Outlier Detection with Multivariate Normal Distribution in Python

Ashwin Prasad
Oct 16 · 4 min read

All the code files are available at: https://github.com/ashwinhprasad/Outliers-Detection/blob/master/Outliers.ipynb

What is an Outlier?

Anything that is unusual and deviates from the standard “normal” is called an anomaly or an outlier.
Detecting these anomalies in a given dataset is called anomaly detection.

For more theoretical background on outlier or anomaly detection, check out: How Anomaly Detection Works ?

Why do we need to detect or remove outliers?

Case 1: Consider a large company that is manufacturing an airplane. An airplane has many different parts, and we don’t want any of them to behave in an unusual way. Such unusual behaviour can have various causes. We want to detect these parts before they are fitted to an airplane; otherwise, the lives of the passengers might be in danger.

Image for post: effect of an outlier on the line of best fit

Case 2: As the image above shows, outliers can noticeably change the equation of the line of best fit. So, before fitting a line of best fit, it is important to remove outliers in order to get more accurate predictions.
In this post, I will be using the multivariate normal distribution to detect outliers.
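
To make Case 2 concrete, here is a small sketch that fits a least-squares line with and without a single outlier. The toy data (points on the line y = 2x + 1) and the injected point (9, 100) are made up for illustration and are not the dataset used later in this post.

```python
import numpy as np

# Hypothetical toy data: ten points lying exactly on y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Least-squares line on the clean data (polyfit returns slope first for deg=1)
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Same data plus one injected outlier (value chosen only for illustration)
x_out = np.append(x, 9.0)
y_out = np.append(y, 100.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, deg=1)

print(f"clean:        slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with outlier: slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

A single point out of eleven is enough to move the fitted slope well away from 2, which is why outliers are worth finding before fitting.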

Data Preparation

1. Importing the Libraries
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

2. Creating Custom Dataset

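# define x1 and x2; size=x1.shape draws a separate noise value for each point
# (without it, one random offset would be added to every point)
x1 = np.arange(1, 50, 1)
x2 = np.square(x1) + np.random.randint(-200, 200, size=x1.shape)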

3. Adding an outlier to the dataset

#adding outliers
x1 = np.append(x1,17)
x2 = np.append(x2,1300)
data = np.stack((x1,x2),axis=1)
plt.scatter(x1,x2)

4. Visualizing the Dataset

Image for post: scatter plot of x1 against x2, including the injected outlier at (17, 1300)

Now we have to try to detect the outlier in this dataset.

Anomaly Detection with Multivariate Normal Distribution

from scipy.stats import multivariate_normal
  1. In the first step, we stack the column x1 with x2 and store the result in the variable data.
  2. We calculate the covariance matrix of data and the mean of both x1 and x2.
  3. multivariate_normal is an object in scipy.stats with a method named pdf, which evaluates the probability density of the fitted distribution at each datapoint in the dataset. A low density means the point is unlikely under that distribution. (The theory behind this is covered in How Anomaly Detection Works.)
  4. We calculate this density for all the datapoints in the dataset.
  5. We also choose a threshold value. Any datapoint whose density is lower than this threshold is considered an anomaly, and we store the resulting boolean values in a variable named outlier.
  6. Here, True = anomaly and False = not an anomaly.
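
The steps above can be sketched roughly as follows. This is a minimal sketch and not necessarily the exact code from the notebook: the fixed seed, the per-point noise, and the quantile-based threshold are assumptions made here so that the example is reproducible, since the text does not state which threshold value was used.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Rebuild the dataset (assumption: fixed seed and one noise value per point)
rng = np.random.default_rng(0)
x1 = np.arange(1, 50, 1)
x2 = np.square(x1) + rng.integers(-200, 200, size=x1.shape)
x1 = np.append(x1, 17)      # injected outlier, as in the data preparation step
x2 = np.append(x2, 1300)

# Step 1: stack x1 and x2 into an (n_samples, 2) array
data = np.stack((x1, x2), axis=1)

# Step 2: mean vector and covariance matrix (rowvar=False: columns are variables)
mean = np.mean(data, axis=0)
covariance = np.cov(data, rowvar=False)

# Steps 3-4: probability density of the fitted distribution at every datapoint
model = multivariate_normal(mean=mean, cov=covariance)
density = model.pdf(data)

# Step 5: assumed threshold, set so that only the least likely point falls below it
threshold = np.quantile(density, 0.02)
outlier = density < threshold   # step 6: True = anomaly, False = not an anomaly

print(data[outlier])
```

A quantile is used here instead of a hard-coded number because the raw density values depend heavily on the scale of the data; a fixed threshold would have to be tuned by inspecting the densities first.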
# print every datapoint that was flagged as an outlier
for index, is_outlier in enumerate(outlier):
    if is_outlier:
        print(data[index], "is an Outlier")
  1. We now know that every datapoint with a lower probability density than the threshold is marked True in the outlier variable.
  2. Now we print the outliers and compare them with the outlier that we injected into the dataset in the data creation step.
  3. With this method, we have successfully found the outlier that stood out from the pattern of the data.

Conclusion

The multivariate normal distribution is a powerful tool for finding outliers because it also takes into account how the variables change with one another (their covariance), which many other algorithms do not do.
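
To illustrate why the covariance matters, the sketch below compares the full covariance matrix with a diagonal one that treats x1 and x2 as independent. The seed, the per-point noise, and the helper names (diag_cov, rank_full, rank_diag) are assumptions introduced for this illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Same style of dataset as in this post (assumption: fixed seed, per-point noise)
rng = np.random.default_rng(0)
x1 = np.arange(1, 50, 1)
x2 = np.square(x1) + rng.integers(-200, 200, size=x1.shape)
# The injected outlier (17, 1300) is the last row
data = np.stack((np.append(x1, 17), np.append(x2, 1300)), axis=1)

mean = data.mean(axis=0)
full_cov = np.cov(data, rowvar=False)
# Keep only the variances: this ignores how x1 and x2 move together
diag_cov = np.diag(np.diag(full_cov))

density_full = multivariate_normal(mean=mean, cov=full_cov).pdf(data)
density_diag = multivariate_normal(mean=mean, cov=diag_cov).pdf(data)

# Number of points that look LESS likely than the injected outlier (0 = least likely)
rank_full = int((density_full < density_full[-1]).sum())
rank_diag = int((density_diag < density_diag[-1]).sum())

print("points less likely than the outlier, full covariance:    ", rank_full)
print("points less likely than the outlier, diagonal covariance:", rank_diag)
```

Taken one variable at a time, neither x1 = 17 nor x2 = 1300 is unusual, so the diagonal model ranks many ordinary points as less likely than the injected one. Only the full covariance exposes that the combination of the two values breaks the pattern.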

Outlier detection is used in many fields, as in the examples given at the top, and is well worth learning.


Just a side note: anomaly detection and removal is as important as removing an impostor in Among Us.
If not removed, an anomaly might affect the entire model or cause problems, just like an impostor killing the crewmates or sabotaging the ship.

Thank You



Ashwin Prasad

Written by

Machine Learning | Deep Learning | Data Science | Web Dev

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
