Leveraging Boxplot and Percentile for Outlier Detection and Removal
Data Preparation and Preprocessing for Machine Learning
What are Outliers?
In statistical terms, outliers are data points that deviate significantly from the majority of the dataset. They can arise for various reasons, such as measurement errors, data entry mistakes, or genuinely extreme values. For example, suppose we want to find the average salary of people in a certain neighborhood and, perhaps because of a data collection error, Elon Musk’s information was included in the data. The mean would then be pulled far above what a typical resident earns, so it would no longer represent the neighborhood. Identifying and handling outliers is crucial because they can distort statistical analyses and machine learning models. In this article, we will look at two techniques for outlier detection and removal: boxplots and percentiles.
The dataset I will use for illustration is the Paris Housing data, which can be found below.
Steps in Detecting and Removing Outliers
1. The first step in identifying potential outliers is to look at the descriptive statistics of the data. Here, we can observe a very high standard deviation across both columns. We can also see that the house with the highest squareMeters has a value of 6071330, while 75% of the houses have a squareMeters value below 71547. This is where we start getting suspicious.
2. A picture is worth a thousand words, as they say. Below is a boxplot of both columns. Here we can clearly see the outliers and how much they differ from the rest of the data.
3. Tukey’s (1977) technique is used to detect outliers in skewed or non-bell-shaped data since it makes no distributional assumptions. However, Tukey’s method may not be appropriate for a small sample size. The general rule is that any value outside the range from Q1 - 1.5 × IQR to Q3 + 1.5 × IQR is considered an outlier and can be removed.
The interquartile range (IQR) method is one of the most widely used procedures for outlier detection and removal.
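To make the first step concrete, here is a minimal sketch using only Python's standard library. The `square_meters` values are made up for illustration and are not taken from the Paris Housing data. With pandas, `DataFrame.describe()` reports the same kind of summary.

```python
import statistics

# Hypothetical squareMeters values; the last one mimics a data-entry error.
square_meters = [45, 52, 60, 68, 75, 83, 90, 98, 110, 6000]

# "inclusive" treats the data as the whole population and interpolates linearly.
q1, median, q3 = statistics.quantiles(square_meters, n=4, method="inclusive")

summary = {
    "mean": statistics.mean(square_meters),
    "std": statistics.stdev(square_meters),
    "75%": q3,
    "max": max(square_meters),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

A maximum far above the 75th percentile, together with a standard deviation larger than most of the values, is the kind of pattern that should prompt a closer look.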
Procedure:
1. Find the first quartile, Q1.
2. Find the third quartile, Q3.
3. Calculate the IQR: IQR = Q3 - Q1.
4. Define the normal data range, with the lower limit as Q1 - 1.5 × IQR and the upper limit as Q3 + 1.5 × IQR.
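The procedure above can be sketched in a few lines of Python. This is an illustration rather than the exact code behind the article: the helper names `iqr_bounds` and `remove_outliers` and the sample values are hypothetical, and the quartiles are computed with the standard library's `statistics.quantiles` using the "inclusive" method. Other quartile conventions can give slightly different limits.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR-based limits."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical squareMeters values; the last one mimics a data-entry error.
square_meters = [45, 52, 60, 68, 75, 83, 90, 98, 110, 6000]

lower, upper = iqr_bounds(square_meters)
cleaned = remove_outliers(square_meters)

print(f"normal range: [{lower}, {upper}]")
print(f"kept {len(cleaned)} of {len(square_meters)} values")
```

The multiplier `k` is left as a parameter so the limits can be widened when 1.5 flags too many legitimate values.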
Conclusion
Note that outliers should not be removed in every case. Removing them without proper justification can lead to the loss of valuable information. We might not want to drop outliers when our results are critical or sensitive, and removal might not be feasible when there are a lot of outliers. However, if the outliers are due to data entry errors or measurement issues, removal might be appropriate.
Ultimately, it is crucial to strike a balance between preserving data integrity and ensuring the reliability of analyses when dealing with outliers.
All the code used in this article can be found on GitHub.