Leveraging Boxplot and Percentile for Outlier Detection and Removal

Data Preparation and Preprocessing for Machine Learning

Oyebamiji Micheal
3 min read · Aug 27, 2023

What are Outliers?

In statistical terms, outliers are data points that deviate significantly from the majority of the dataset. They can arise for various reasons, such as measurement errors, data entry mistakes, or genuine extreme values. For example, suppose we want to find the average salary of people in a certain neighborhood and, because of a data collection error, Elon Musk’s information is included in the data. Our calculation would then not be a true representation of the average salary, because his single value would pull the average far above what a typical resident earns. Identifying and handling outliers is crucial because they can distort statistical analyses and machine learning models. In this article, we will delve into two techniques for outlier detection and removal: boxplots and percentiles.

The dataset I will use for illustration is the Paris Housing data.

Steps in Detecting and Removing Outliers

  1. The first step in identifying potential outliers is to look at the descriptive statistics of the data. Here, we can observe a very high standard deviation in both columns. We can also see that the largest squareMeters value is 6071330, while 75% of the houses have a squareMeters value below 71547. This is where we start getting suspicious.
Descriptive statistics of columns
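
As a rough illustration of this step, the sketch below calls pandas' `describe()` on a small made-up table. The values are invented for the example and are not the Paris Housing data; the `price` column name is an assumption, since only `squareMeters` is named here.

```python
import pandas as pd

# Made-up stand-in data (NOT the real Paris Housing values). One implausibly
# large squareMeters entry is planted on purpose. "price" is an assumed name.
df = pd.DataFrame({
    "squareMeters": [35_000, 42_000, 51_000, 58_000, 63_000, 71_000, 6_000_000],
    "price": [3_500_000, 4_200_000, 5_100_000, 5_800_000, 6_300_000, 7_100_000, 7_200_000],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)

# A maximum far above the 75th percentile is a first hint of outliers.
ratio = summary.loc["max", "squareMeters"] / summary.loc["75%", "squareMeters"]
print(f"max squareMeters is {ratio:.1f}x the 75th percentile")
```

A large gap between the `75%` row and the `max` row, together with a standard deviation much larger than the median, is the kind of signal described in this step.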

  2. A picture speaks a thousand words, they say. Below is a boxplot of both columns. Here we can clearly see the outliers and how much they differ from the rest of the data.

A boxplot showing outliers
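
A plot like this could be produced along the following lines. This is a minimal sketch using matplotlib on the same made-up values as before (not the real dataset, and `price` is an assumed column name); the `Agg` backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data (NOT the real Paris Housing values).
df = pd.DataFrame({
    "squareMeters": [35_000, 42_000, 51_000, 58_000, 63_000, 71_000, 6_000_000],
    "price": [3_500_000, 4_200_000, 5_100_000, 5_800_000, 6_300_000, 7_100_000, 7_200_000],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
fliers = {}
for ax, column in zip(axes, df.columns):
    # By default the whiskers reach the furthest points within 1.5 * IQR of
    # the quartiles; anything beyond them is drawn individually as a "flier".
    artists = ax.boxplot(df[column].to_numpy())
    ax.set_title(column)
    fliers[column] = artists["fliers"][0].get_ydata()

fig.tight_layout()
print({name: values.tolist() for name, values in fliers.items()})
```

The points drawn beyond the whiskers are the same ones that the 1.5 × IQR rule flags, so the plot and the numeric rule should agree.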

  3. Tukey’s (1977) technique can be used to detect outliers in skewed or non-bell-shaped data, since it makes no distributional assumptions. However, it may not be appropriate for small sample sizes. The general rule is that anything outside the range from Q1 - 1.5 × IQR to Q3 + 1.5 × IQR is an outlier and can be removed.

The interquartile range (IQR) rule is one of the most widely used procedures for outlier detection and removal.

Procedure:

Find the first quartile, Q1.
Find the third quartile, Q3.
Calculate the IQR: IQR = Q3 - Q1.
Define the normal data range with lower limit Q1 - 1.5 × IQR and upper limit Q3 + 1.5 × IQR.

Removing outliers programmatically
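
The procedure above can be sketched in pandas as follows. This is a minimal sketch under stated assumptions rather than the article's original code: the helper name `remove_outliers_iqr`, its `k` parameter, and the sample values are all made up for illustration.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep only rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper written for this example.
    """
    q1 = df[column].quantile(0.25)   # first quartile
    q3 = df[column].quantile(0.75)   # third quartile
    iqr = q3 - q1                    # interquartile range
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]

# Made-up stand-in data (NOT the real Paris Housing values).
df = pd.DataFrame({
    "squareMeters": [35_000, 42_000, 51_000, 58_000, 63_000, 71_000, 6_000_000],
})

cleaned = remove_outliers_iqr(df, "squareMeters")
print(len(df), "rows before,", len(cleaned), "rows after")
print(cleaned["squareMeters"].max())
```

Filtering returns a new DataFrame and leaves the original untouched, which makes it easy to compare results with and without the flagged rows before deciding whether to drop them.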

Conclusion

A point to note is that we do not remove outliers in all cases. Removing outliers without proper justification can lead to loss of valuable information. We might not want to drop outliers when our results are critical or sensitive. Also, it might not be feasible when we have a lot of outliers. However, if outliers are due to data entry errors or measurement issues, removal might be appropriate.

Ultimately, it is crucial to strike a balance between preserving data integrity and ensuring the reliability of analyses when dealing with outliers.

All code used in this article can be found on GitHub.

You made it to the end of the article! Thanks for reading; I hope you learned a lot. If you like my content and want to connect with me, you can do so by:

  1. Following me on Medium.
  2. Connecting with me on Twitter.
  3. Checking out my work on Github.
