Using the Interquartile Range to Remove Outliers From a Dataset | Data Cleaning | Exploratory Data Analysis

Rina Mondal
3 min read · Jun 29, 2024

Outliers are data points that differ significantly from other observations in a dataset. They can be unusually high or low compared to the rest of the data. Outliers can arise due to variability in the data or may indicate measurement or input errors.

They can have significant effects on statistical analyses and can skew results. Hence, it is important to identify and handle outliers before you model your dataset. I have also explained this concept on my YouTube channel.

Now, let’s look at a dataset that contains outliers:

import pandas as pd

# Create the dataset
data = {
'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25],
'Age': [25, 30, 22, 35, 45, 50, 40, 55, 28, 60, 26, 32, 24, 29, 37, 23, 27, 48, 52, 39, 41, 31, 33, 20, 19],
'Income': [300, 52000, 48000, 55000, 60000, 62000, 58000, 65000, 49500, 70000, 51000, 53000, 46000, 51500, 150000, 170000, 50500, 59000, 63000, 57500, 58500, 52500, 54000, 45000, 44000],
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('dataset_with_outliers.csv', index=False)

The following tools are helpful when dealing with outliers in a dataset:

df.describe(): A DataFrame method that returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column, which can help spot outliers.

df.describe()

This summary gives us a first idea of whether the dataset contains outliers.
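
As a rough sketch of how to read that summary, the snippet below rebuilds the Income column from the dataset above and compares a few of the values that describe() reports. The variable names and the comparisons are my own illustrative choices, not a formal test.

```python
import pandas as pd

# Income values copied from the dataset defined earlier
income = pd.Series(
    [300, 52000, 48000, 55000, 60000, 62000, 58000, 65000, 49500, 70000,
     51000, 53000, 46000, 51500, 150000, 170000, 50500, 59000, 63000,
     57500, 58500, 52500, 54000, 45000, 44000],
    name="Income",
)

stats = income.describe()

# A mean that sits well away from the median (the 50% row) hints at extreme values
print("mean:", stats["mean"], "median:", stats["50%"])

# A min or max far outside the 25%-75% range is another warning sign
print("min:", stats["min"], "25%:", stats["25%"])
print("75%:", stats["75%"], "max:", stats["max"])
```

Here the mean is pulled above the median, and both the minimum and the maximum lie far from the quartiles, which suggests that a closer look is worthwhile.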

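Next, we can draw a boxplot to check visually whether the dataset contains outliers.

sns.boxplot(): By default, data points more than 1.5 times the interquartile range (IQR) below the first quartile (Q1) or above the third quartile (Q3) are drawn as individual points and are considered outliers.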

Let’s do a boxplot using Seaborn:

1. Generate and Plot the Data:

import matplotlib.pyplot as plt
import seaborn as sns

# Create the figure for the boxplot
plt.figure(figsize=(15, 5))

# Boxplot for Income
plt.subplot(1, 3, 2)
sns.boxplot(y=df['Income'])
plt.title('Boxplot of Income')

# Show the plot
plt.show()

2. Find the Outliers:

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)

# Calculate the IQR (Interquartile Range)
IQR = Q3 - Q1

# Calculate the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify the outliers
outliers = df[(df['Income'] < lower_bound) | (df['Income'] > upper_bound)]

# Display the outliers
print("Outliers in the 'Income' column:")
print(outliers)
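
To make the arithmetic concrete, here is a minimal sketch that wraps the same calculation in a small helper. The function name iqr_bounds and the parameter k are hypothetical names of mine; the quantiles use pandas' default linear interpolation. For this dataset, Q1 = 50,500 and Q3 = 60,000, so IQR = 9,500 and the bounds are 50,500 - 14,250 = 36,250 and 60,000 + 14,250 = 74,250.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Hypothetical helper: return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Income values copied from the dataset defined earlier
income = pd.Series(
    [300, 52000, 48000, 55000, 60000, 62000, 58000, 65000, 49500, 70000,
     51000, 53000, 46000, 51500, 150000, 170000, 50500, 59000, 63000,
     57500, 58500, 52500, 54000, 45000, 44000]
)

lower, upper = iqr_bounds(income)
print(lower, upper)  # 36250.0 74250.0

# Values outside the bounds are flagged as outliers
flagged = income[(income < lower) | (income > upper)]
print(sorted(flagged.tolist()))  # [300, 150000, 170000]
```

The multiplier 1.5 is the conventional choice; a larger k flags fewer points.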

3. Remove Outliers and Plot Cleaned Data:

# Remove the outliers
df_cleaned = df[(df['Income'] >= lower_bound) & (df['Income'] <= upper_bound)]

# Save the cleaned dataset to a new CSV file
df_cleaned.to_csv('dataset_without_outliers.csv', index=False)

# Confirm that the cleaned dataset was saved
print("Dataset without outliers saved to 'dataset_without_outliers.csv'.")

# Create the boxplot for the cleaned dataset
plt.figure(figsize=(15, 5))

# Boxplot for Income (cleaned data)
plt.subplot(1, 3, 2)
sns.boxplot(y=df_cleaned['Income'])
plt.title('Boxplot of Income (Cleaned Data)')

# Show the plot
plt.show()
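
Besides looking at the new boxplot, it can help to verify the cleanup numerically. The sketch below is one possible check, with a hypothetical helper named inlier_mask: it counts the rows before and after filtering, then runs the IQR rule a second time on the cleaned values, because the quartiles shift once rows are removed and new points can fall outside the recomputed bounds.

```python
import pandas as pd

def inlier_mask(values, k=1.5):
    """Hypothetical helper: True where a value lies within the IQR bounds."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends by default, matching >= and <=
    return values.between(q1 - k * iqr, q3 + k * iqr)

# Income values copied from the dataset defined earlier
income = pd.Series(
    [300, 52000, 48000, 55000, 60000, 62000, 58000, 65000, 49500, 70000,
     51000, 53000, 46000, 51500, 150000, 170000, 50500, 59000, 63000,
     57500, 58500, 52500, 54000, 45000, 44000]
)

cleaned = income[inlier_mask(income)]
print(len(income), "rows before,", len(cleaned), "rows after")

# Second pass: count values flagged by bounds recomputed on the cleaned data
second_pass_outliers = int((~inlier_mask(cleaned)).sum())
print("Outliers found on second pass:", second_pass_outliers)
```

On this dataset the first pass drops three rows and the second pass flags nothing, which matches a cleaned boxplot with no individual points beyond the whiskers.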

Why do we need to remove Outliers?

Outliers can create several types of problems in data analysis and statistical modeling:

1. Skewing Statistical Measures: Outliers can significantly affect the mean and standard deviation of the dataset, leading to misleading interpretations of central tendency and variability.

2. Impact on Parametric Tests: Outliers can violate the assumptions of many parametric statistical tests (e.g., t-tests, ANOVA), leading to inaccurate results and conclusions.

3. Distorted Relationships: Outliers can distort relationships and correlations between variables, making them appear stronger or weaker than they actually are.

4. Reduced Model Accuracy: In predictive modeling, outliers can lead to overfitting, where the model performs well on training data but poorly on new data due to fitting to noise.

Addressing outliers appropriately through techniques like outlier detection, handling, or removal is crucial to mitigate these issues and ensure robust data analysis and modeling.
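
The first point is easy to see on the Income column used in this article. The following sketch, with variable names of my own choosing, compares the mean, standard deviation, and median with and without the values that the 1.5 x IQR rule flags.

```python
import pandas as pd

# Income values copied from the dataset defined earlier
income = pd.Series(
    [300, 52000, 48000, 55000, 60000, 62000, 58000, 65000, 49500, 70000,
     51000, 53000, 46000, 51500, 150000, 170000, 50500, 59000, 63000,
     57500, 58500, 52500, 54000, 45000, 44000]
)

# Apply the same 1.5 x IQR rule used earlier
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
cleaned = income[income.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Put the three statistics side by side for both versions of the data
summary = pd.DataFrame({
    "with_outliers": [income.mean(), income.std(), income.median()],
    "without_outliers": [cleaned.mean(), cleaned.std(), cleaned.median()],
}, index=["mean", "std", "median"])
print(summary)
```

The mean and especially the standard deviation change a lot once the three flagged values are dropped, while the median barely moves. That is why the median is often described as robust to outliers.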

If you want to mitigate the effects of outliers instead of deleting them, you can apply a log transformation.
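
As a hedged illustration of what a log transformation does to this particular column, the sketch below measures how far the largest and smallest incomes sit from the median, in units of the IQR, before and after applying np.log. The helper distance_in_iqrs is a hypothetical name, and this distance is only one of several ways to judge the effect.

```python
import numpy as np
import pandas as pd

# Income values copied from the dataset defined earlier
income = pd.Series(
    [300, 52000, 48000, 55000, 60000, 62000, 58000, 65000, 49500, 70000,
     51000, 53000, 46000, 51500, 150000, 170000, 50500, 59000, 63000,
     57500, 58500, 52500, 54000, 45000, 44000]
)

# All values are positive, so the natural log is defined everywhere
log_income = np.log(income)

def distance_in_iqrs(values, point):
    """Hypothetical helper: how many IQRs `point` lies from the median."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    return abs(point - values.median()) / (q3 - q1)

high_raw = distance_in_iqrs(income, income.max())
high_log = distance_in_iqrs(log_income, log_income.max())
low_raw = distance_in_iqrs(income, income.min())
low_log = distance_in_iqrs(log_income, log_income.min())

print("largest value:  raw", round(high_raw, 1), "log", round(high_log, 1))
print("smallest value: raw", round(low_raw, 1), "log", round(low_log, 1))
```

On this data the largest income moves closer to the median after the transformation, while the very small value 300 moves further away. A log transformation is therefore best suited to taming large values in right-skewed data, and unusually small values may still need separate handling.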

Blogs Related to Data Cleaning:

  1. Complete Data Cleaning.
  2. Remove Outliers using Z-Score
  3. Using Log Transformation to mitigate the effect of outliers

Complete Data Science Roadmap.
