Outlier Detection and Removal Using the IQR Method

4 min read · Jul 3, 2024

Outliers can significantly skew the results of data analysis and machine learning models. Identifying and handling outliers is crucial for creating robust models. In this article, we will explore outlier detection and removal for skewed data using Python, pandas, seaborn, and matplotlib.

Loading the Data

Let’s start by loading the dataset and taking a quick look at the first few rows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('placement.csv')
df.head()

The dataset contains three columns: cgpa, placement_exam_marks, and placed. We'll focus on the placement_exam_marks column for outlier detection and removal.

Visualizing the Data Distribution

Before detecting outliers, it’s important to understand the data distribution. We will use seaborn to plot the distribution of cgpa and placement_exam_marks.

plt.figure(figsize=(16,5))

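plt.subplot(1,2,1)
# histplot with kde=True replaces sns.distplot, deprecated since seaborn 0.11
sns.histplot(x=df['cgpa'], kde=True, stat='density')
plt.title('CGPA Distribution')

plt.subplot(1,2,2)
sns.histplot(x=df['placement_exam_marks'], kde=True, stat='density')
plt.title('Placement Exam Marks Distribution')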

plt.show()

df['placement_exam_marks'].describe()
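To put a number on the skew that the plots suggest, we can also call pandas' `Series.skew()`. The sketch below uses a small made-up series rather than `placement.csv`, so the values are purely illustrative, and the 0.5 cut-off is only a rough rule of thumb assumed for this example.

```python
import pandas as pd

# Hypothetical stand-in data (not taken from placement.csv): most marks are
# low, with a few large values that create a long right tail.
marks = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 60, 95],
                  name="placement_exam_marks")

# Sample skewness: positive means a longer right tail, negative a longer left tail.
skewness = marks.skew()
print(f"Skewness: {skewness:.2f}")

# Assumed rule of thumb for this sketch: treat > 0.5 as clearly right-skewed.
is_right_skewed = bool(skewness > 0.5)
print("Right-skewed:", is_right_skewed)
```

On the real data, the same call would be `df['placement_exam_marks'].skew()`.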

Box Plot Visualization

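A box plot makes outliers easy to spot: by default, seaborn draws any point lying more than 1.5 × IQR beyond the quartiles as an individual marker past the whiskers.

# Pass the column as x= so the call behaves the same across seaborn versions
sns.boxplot(x=df['placement_exam_marks'])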
plt.title('Boxplot of Placement Exam Marks')
plt.show()

Finding the IQR

The interquartile range (IQR) is the distance between the 75th percentile (Q3) and the 25th percentile (Q1), so it measures the spread of the middle 50% of the data. Under the common 1.5 × IQR rule, values above Q3 + 1.5 × IQR or below Q1 − 1.5 × IQR are treated as outliers.

percentile25 = df['placement_exam_marks'].quantile(0.25)
percentile75 = df['placement_exam_marks'].quantile(0.75)

iqr = percentile75 - percentile25

upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr

print("Upper limit", upper_limit)
print("Lower limit", lower_limit)
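The same calculation can be wrapped in a small helper so it is reusable across columns. This is a minimal sketch: the function name `iqr_bounds`, the parameter `k`, and the toy series are assumptions made for illustration, not part of the original dataset.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower_limit, upper_limit) under the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data with one obviously extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# With pandas' default linear interpolation: Q1 = 3.25, Q3 = 7.75, IQR = 4.5
lower, upper = iqr_bounds(values)
print("Lower limit", lower)   # -3.5
print("Upper limit", upper)   # 14.5
```

A larger `k` (for example 3) widens the limits and flags only more extreme values.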

Detecting Outliers

We can now identify the outliers that are above the upper limit or below the lower limit.

outliers_above = df[df['placement_exam_marks'] > upper_limit]
outliers_below = df[df['placement_exam_marks'] < lower_limit]

print("Outliers above upper limit:")
print(outliers_above)

print("Outliers below lower limit:")
print(outliers_below)

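It can also help to know how many rows are affected before deciding what to do with them. The sketch below builds a single boolean mask for both sides and counts the flagged rows; the toy DataFrame and the names `is_outlier`, `n_outliers`, and `share` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real DataFrame
toy = pd.DataFrame({"placement_exam_marks": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
col = toy["placement_exam_marks"]

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# True for rows outside either limit
is_outlier = (col < lower_limit) | (col > upper_limit)

n_outliers = int(is_outlier.sum())   # number of flagged rows
share = float(is_outlier.mean())     # fraction of flagged rows
outlier_rows = toy[is_outlier]

print(f"{n_outliers} outlier(s), {share:.0%} of the data")
print(outlier_rows)
```

If a large share of the rows is flagged, that is usually a hint to look at the data more closely before dropping anything.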

Removing Outliers (Trimming)

One way to handle outliers is by removing them. This method is known as trimming.

# Keep only the rows that lie within both limits (inclusive)
new_df = df[df['placement_exam_marks'].between(lower_limit, upper_limit)]
new_df.shape
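Trimming can be packaged as a function that also makes it easy to report how many rows were dropped. This is a sketch under assumptions: `trim_outliers` is a hypothetical helper name, and the toy DataFrame stands in for the real data.

```python
import pandas as pd

def trim_outliers(df, column, k=1.5):
    """Return a copy of df without rows outside the k * IQR limits of `column`."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends by default
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep].copy()

toy = pd.DataFrame({
    "cgpa": [6.1, 6.5, 6.8, 7.0, 7.1, 7.3, 7.4, 7.8, 8.0, 8.2],
    "placement_exam_marks": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
})

trimmed = trim_outliers(toy, "placement_exam_marks")
rows_removed = len(toy) - len(trimmed)
print("Before:", toy.shape, "After:", trimmed.shape, "Removed:", rows_removed)
```

Note that the whole row is dropped, so the corresponding `cgpa` value is lost as well.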

Comparing Before and After Trimming

We can compare the distribution and box plot before and after trimming the outliers.

plt.figure(figsize=(16,8))

plt.subplot(2,2,1)
# histplot with kde=True replaces sns.distplot, deprecated since seaborn 0.11
sns.histplot(x=df['placement_exam_marks'], kde=True, stat='density')
plt.title('Original Distribution')

plt.subplot(2,2,2)
sns.boxplot(x=df['placement_exam_marks'])
plt.title('Original Boxplot')

plt.subplot(2,2,3)
sns.histplot(x=new_df['placement_exam_marks'], kde=True, stat='density')
plt.title('Trimmed Distribution')

plt.subplot(2,2,4)
sns.boxplot(x=new_df['placement_exam_marks'])
plt.title('Trimmed Boxplot')

plt.show()

Capping Outliers

Another method to handle outliers is capping: values above the upper limit are replaced with the upper limit, and values below the lower limit are replaced with the lower limit, so no rows are dropped.

new_df_cap = df.copy()

new_df_cap['placement_exam_marks'] = np.where(
    new_df_cap['placement_exam_marks'] > upper_limit,
    upper_limit,
    np.where(
        new_df_cap['placement_exam_marks'] < lower_limit,
        lower_limit,
        new_df_cap['placement_exam_marks']
    )
)
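The nested `np.where` works, but pandas' `Series.clip` expresses the same capping in one call. The sketch below checks that the two give identical results on a toy series; the limits used here are hard-coded illustrative values, not ones computed from `placement.csv`.

```python
import numpy as np
import pandas as pd

# Toy data and illustrative limits (assumed values for this sketch)
marks = pd.Series([-20.0, 1.0, 5.0, 9.0, 100.0])
lower_limit, upper_limit = -3.5, 14.5

# Capping with nested np.where, as in the article
capped_where = np.where(
    marks > upper_limit,
    upper_limit,
    np.where(marks < lower_limit, lower_limit, marks),
)

# Equivalent capping with Series.clip
capped_clip = marks.clip(lower=lower_limit, upper=upper_limit)

print(capped_clip.tolist())  # [-3.5, 1.0, 5.0, 9.0, 14.5]
print(np.array_equal(capped_where, capped_clip.to_numpy()))  # True
```

`clip` also keeps the original index and returns a Series, which can be convenient when assigning back into a DataFrame column.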

Comparing Before and After Capping

Finally, we compare the distribution and box plot before and after capping the outliers.

plt.figure(figsize=(16,8))

plt.subplot(2,2,1)
# histplot with kde=True replaces sns.distplot, deprecated since seaborn 0.11
sns.histplot(x=df['placement_exam_marks'], kde=True, stat='density')
plt.title('Original Distribution')

plt.subplot(2,2,2)
sns.boxplot(x=df['placement_exam_marks'])
plt.title('Original Boxplot')

plt.subplot(2,2,3)
sns.histplot(x=new_df_cap['placement_exam_marks'], kde=True, stat='density')
plt.title('Capped Distribution')

plt.subplot(2,2,4)
sns.boxplot(x=new_df_cap['placement_exam_marks'])
plt.title('Capped Boxplot')

plt.show()
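Alongside the plots, a small table of summary statistics shows how each method changes the data. The sketch below compares the original, trimmed, and capped versions of a toy series; the data and the helper name `summarize` are assumptions for illustration.

```python
import pandas as pd

def summarize(s):
    """Collect a few summary statistics for a numeric Series."""
    return pd.Series({
        "count": s.count(),
        "mean": s.mean(),
        "std": s.std(),
        "skew": s.skew(),
    })

# Toy data with one extreme value
marks = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = marks.quantile(0.25), marks.quantile(0.75)
iqr = q3 - q1
lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

trimmed = marks[marks.between(lower_limit, upper_limit)]  # drops the outlier row
capped = marks.clip(lower=lower_limit, upper=upper_limit)  # keeps it, capped at the limit

summary = pd.DataFrame({
    "original": summarize(marks),
    "trimmed": summarize(trimmed),
    "capped": summarize(capped),
})
print(summary.round(2))
```

Trimming reduces the row count, while capping keeps every row but pulls the extreme value in to the limit; both lower the mean and standard deviation in this toy example.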

Conclusion

Handling outliers is a crucial step in data preprocessing, especially for skewed data. Trimming and capping are two common ways to do it: trimming drops the rows that fall outside the IQR limits, while capping keeps every row and replaces the extreme values with the limits. Handling outliers carefully can make data analysis and machine learning models less sensitive to a handful of extreme values.