Outliers, How to find Outliers, and 5 Number Summary

Sachin Dev
3 min read · Dec 6, 2022


Have you ever come across outliers while training a machine learning model, or ever wondered what they are?

What are Outliers?

Outliers are data points that are significantly different from the rest of the data in a dataset. They can be caused by errors in data collection, measurement, or analysis, or they can be entirely legitimate but unusual values. Outliers can skew the results of an analysis, so they are often investigated, and removed where appropriate, before further statistical analysis is performed.

Example: We have a list of scores of students in a class such as
x = [3,30,35,25,45,50,48,40,32,38,22,25,29,90]
Here, 3 and 90 lie far from the rest of the scores, so both are potential outliers.
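To see how much a single extreme score can pull the mean, here is a minimal sketch using Python's standard `statistics` module on the scores above. The variable names are only for illustration.

```python
import statistics

# Scores from the example above
x = [3, 30, 35, 25, 45, 50, 48, 40, 32, 38, 22, 25, 29, 90]

# Same scores with the extreme high score (90) left out
x_without_90 = [score for score in x if score != 90]

mean_all = statistics.mean(x)
mean_trimmed = statistics.mean(x_without_90)
median_all = statistics.median(x)
median_trimmed = statistics.median(x_without_90)

print(f"mean:   {mean_all:.2f} -> {mean_trimmed:.2f}")
print(f"median: {median_all} -> {median_trimmed}")
```

Dropping that one score moves the mean by about 4 points, while the median moves by only 1.5.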

Why are outliers important?

Outliers are important because they can provide valuable insight into the data and reveal patterns and trends that might otherwise go unnoticed. They can also point to errors or anomalies that need further investigation. In addition, they can help identify relationships between variables and show which variables have a greater influence on the overall data.

Graph Showing Outlier

How to Detect Outliers?

We can easily find outliers by plotting a box plot.

BOXPLOT

Any point above the upper whisker or below the lower whisker is an outlier.

Boxplot showing Outliers

5 Number Summary

The five-number summary is a set of descriptive statistics that provide information about a dataset. It is commonly used to detect, and if needed remove, outliers. It consists of the following:

1: Minimum

The minimum value present in a dataset.

2: First Quartile (25th Percentile) or Q1

A percentile is a value below which a certain percentage of observations lie.

For example, if a student scores in the 99th percentile, the student has done better than 99% of all students.

So Q1, or the 25th percentile, marks the lower quarter of a distribution. It is the value for which 25% of the data points are below it and 75% are above it.

3: Median

A median is the middle number in a dataset when the data is arranged in numeric order (ascending or descending). With an even number of values, it is the average of the two middle numbers. It is also called the middle value or 50th percentile. It often describes the center of the data better than the mean because it is much less affected by outliers.

4: Third Quartile (75th Percentile) or Q3

Similar to Q1, Q3 or 75th Percentile is the value for which 75% of the data points are below it and 25% are above it.

5: Maximum

The maximum value present in a dataset.
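The five values above can be computed in a few lines. This is a minimal sketch using Python's standard `statistics` module; the function name `five_number_summary` and the variable `scores` are my own, and other tools may use a slightly different quartile rule and return slightly different Q1 and Q3 values.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    # Its default "exclusive" method places the p-th quantile at
    # position p*(n+1) in the sorted data and interpolates between neighbours.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)

# Student scores from the earlier example
scores = [3, 30, 35, 25, 45, 50, 48, 40, 32, 38, 22, 25, 29, 90]
print(five_number_summary(scores))
```

Note that this version reports the raw minimum and maximum of the data, before any outliers have been set aside.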

Example of the 5 Number Summary and how it is used to remove outliers

Let’s say we have a dataset such that

X = [1,2,2,2,3,3,3,4,5,5,5,5,6,6,6,7,8,8,9,27]

We will build a fence: values are expected to fall between the lower fence and the higher fence, and anything outside that range is treated as an outlier. Here, 27 is the outlier.

Inter Quartile Range (IQR) = Q3 - Q1
Lower Fence = Q1 - 1.5(IQR)
Higher Fence = Q3 + 1.5(IQR)
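The fence formulas translate directly into code. The sketch below is only illustrative: the helper names `iqr_fences` and `is_outlier` are hypothetical, and the quartiles passed in are made-up numbers chosen to keep the arithmetic easy to follow.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_fence, higher_fence) for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3):
    """True if value falls outside the fences built from q1 and q3."""
    lower, higher = iqr_fences(q1, q3)
    return value < lower or value > higher

# Made-up quartiles, just to show the arithmetic: IQR = 20 - 10 = 10
lower, higher = iqr_fences(q1=10, q3=20)
print(lower, higher)                   # -5.0 35.0
print(is_outlier(40, q1=10, q3=20))    # True
print(is_outlier(12, q1=10, q3=20))    # False
```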

1: Minimum = 1

2: Q1 position = (25/100)*(n+1) = (25/100)*(21) = 5.25
where n is the total number of data points (here n = 20) and positions are counted from 1 in the sorted data.

To approximate the value at position 5.25, we can take the average of the 5th and 6th values.

Q1 = (3+3)/2 = 3

3: Median = 5

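4: Q3 position = (75/100)*(n+1) = 15.75

The 15th value is 6 and the 16th value is 7, so

Q3 = (6+7)/2 = 6.5

5: Maximum = 9 (the largest value that is not an outlier, since 27 falls above the higher fence computed below)

IQR = 6.5 - 3 = 3.5
Lower Fence = 3 - (1.5)(3.5) = -2.25
Higher Fence = 6.5 + (1.5)(3.5) = 11.75

Every value except 27 lies between -2.25 and 11.75, so 27 is the only outlier.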
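The same check can be run in code. This is a sketch under one assumption worth stating: Python's `statistics.quantiles` uses the same (n+1) position rule but interpolates between neighbouring values instead of averaging them, so its quartiles can differ slightly from a hand calculation. The conclusion about which value is an outlier does not change.

```python
import statistics

X = [1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8, 8, 9, 27]

# Q1, median and Q3 using the default "exclusive" (n+1) rule with interpolation
q1, median, q3 = statistics.quantiles(X, n=4)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
higher_fence = q3 + 1.5 * iqr

# Split the data into values inside the fences and outliers
inside = [v for v in X if lower_fence <= v <= higher_fence]
outliers = [v for v in X if v < lower_fence or v > higher_fence]

print("Q1, median, Q3:", q1, median, q3)
print("fences:", lower_fence, higher_fence)
print("min and max inside the fences:", min(inside), max(inside))
print("outliers:", outliers)
```

The minimum and maximum inside the fences (1 and 9) are where the whiskers of the box plot end, and 27 is reported as the only outlier.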

Boxplot using 5 number Summary

Thanks for reading this article! Leave a comment below if you have any questions.
