# Finding outliers in dataset using python

*In this article, we will use z score and IQR -interquartile range to identify any outliers using python*

*Jupyter notebook is available at- *https://github.com/arshren/MachineLearning/blob/master/Identifying%20outliers.ipynb

*What is an outlier?*

An outlier is a data point in a data set that is distant from all other observations. A data point that lies outside the overall distribution of the dataset.

*What are the criteria to identify an outlier?*

- Data point that falls outside of 1.5 times of an interquartile range above the 3rd quartile and below the 1st quartile
- Data point that falls outside of 3 standard deviations. we can use a z score and if the z score falls outside of 2 standard deviation

*What is the reason for an outlier to exists in a dataset?*

An outlier could exist in a dataset due to

- Variability in the data
- An experimental measurement error

*What is the impact of an outlier?*

causes serious issues for statistical analysis

- skew the data,
- significant impact on mean
- significant impact on standard deviation.

*How can we identify an outlier?*

- using scatter plots
- using Z score
- using the IQR interquartile range

## Using Scatter Plot

We can see the scatter plot and it shows us if a data point lies outside the overall distribution of the dataset

**Using Z score**

**Formula for Z score = (Observation — Mean)/Standard Deviation**

**z = (X — μ) / σ**

we first import the libraries

`import numpy as np`

import pandas as pd

we will use a list here

`dataset= [10,12,12,13,12,11,14,13,15,10,10,10,100,12,14,13, 12,10, 10,11,12,15,12,13,12,11,14,13,15,10,15,12,10,14,13,15,10]`

we write a function that takes numeric data as an input argument.

we find the mean and standard deviation of the all the data points

We find the z score for each of the data point in the dataset and if the z score is greater than 3 than we can classify that point as an outlier. Any point outside of 3 standard deviations would be an outlier.

`import numpy as np`

import pandas as pd

outliers=[]

def detect_outlier(data_1):

threshold=3

mean_1 = np.mean(data_1)

std_1 =np.std(data_1)

for y in data_1:

z_score= (y - mean_1)/std_1

if np.abs(z_score) > threshold:

outliers.append(y)

return outliers

we now pass dataset that we created earlier and pass that as an input argument to the detect_outlier function

`outlier_datapoints = detect_outlier(dataset)`

print(outlier_datapoints)

## Using IQR

IQR tells how spread the middle values are. It can be used to tell when a value is too far from the middle.

An outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.

we will use the same dataset

step 1:

- Arrange the data in increasing order
- Calculate first(q1) and third quartile(q3)
- Find interquartile range (q3-q1)
- Find lower bound q1*1.5
- Find upper bound q3*1.5
- Anything that lies outside of lower and upper bound is an outlier

Fist sorting the dataset

`sorted(dataset)`

Finding first quartile and third quartile

`q1, q3= np.percentile(dataset,[25,75])`

q1 is 11 and q3 is 14

Find the IQR which is the difference between third and first quartile

`iqr = q3 - q1`

iqr is 3

Find lower and upper bound

`lower_bound = q1 -(1.5 * iqr) `

upper_bound = q3 +(1.5 * iqr)

lower_bound is 6.5 and upper bound is 18.5, so anything outside of 6.5 and 18.5 is an outlier.