Outlier Analysis in Python

Mustafa Çelik
4 min read · Feb 1, 2024

I will explain how to find and handle outliers in Python using the Interquartile Range (IQR) method.

Outliers are values that differ significantly from the other values in the data set. They can distort summary statistics such as the mean and bias the models fitted to the data.

Outliers are an important issue to consider when preparing data for data analysis and machine learning. Outliers can affect the measures of central tendency, distribution and statistical test results. Therefore, there are various methods to detect outliers and handle them appropriately.

In this article, I will talk about the Interquartile Range (IQR) method, which is one of the commonly used univariate methods for detecting outliers. I will use pandas to implement this method and, as an example, the diabetes dataset loaded from the scikit-learn (sklearn) library.

We can visualize outliers graphically with a box plot and a histogram.
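As a rough illustration of the two plots, here is a minimal sketch using matplotlib. The sample values, the variable names and the output file name are my own assumptions for the example; with the article's data you would pass df["bmi"] instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values; 45.0 is placed far from the rest on purpose.
values = pd.Series(
    [21.5, 22.0, 23.1, 24.8, 25.3, 26.0, 27.4, 28.9, 30.2, 45.0], name="bmi"
)

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Box plot: by default the whiskers extend up to 1.5 * IQR from the box,
# and points beyond them are drawn individually.
ax_box.boxplot(values.to_numpy())
ax_box.set_title("Box plot")
ax_box.set_ylabel(values.name)

# Histogram: isolated bars far from the bulk of the data hint at outliers.
ax_hist.hist(values.to_numpy(), bins=10)
ax_hist.set_title("Histogram")
ax_hist.set_xlabel(values.name)

fig.tight_layout()
fig.savefig("bmi_outliers.png")  # hypothetical output file name
```

In a notebook you would typically call plt.show() instead of saving the figure.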

The Interquartile Range (IQR) method does not require the data to be close to a normal distribution. It detects outliers using the interquartile range, which is the difference between the third quartile (Q3) and the first quartile (Q1) of the data and indicates the spread of the middle 50 percent of the data. The IQR method uses lower and upper limits calculated by the following formulas:

lower limit = Q1 − 1.5 × IQR

upper limit = Q3 + 1.5 × IQR

Here Q1 is the first quartile of the variable, Q3 is the third quartile, and IQR = Q3 − Q1 is the difference between them. Values below the lower limit or above the upper limit are considered outliers.
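To make the formulas concrete, here is a small worked example on toy data that I made up for illustration. It assumes pandas' default (linear interpolation) quantile definition; other quantile definitions can give slightly different limits.

```python
import pandas as pd

# Hypothetical toy data: nine ordinary values and one extreme value.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = s.quantile(0.25)  # 3.25 with the default linear interpolation
q3 = s.quantile(0.75)  # 7.75
iqr = q3 - q1          # 4.5

lower_limit = q1 - 1.5 * iqr  # 3.25 - 6.75 = -3.5
upper_limit = q3 + 1.5 * iqr  # 7.75 + 6.75 = 14.5

# Keep only the values that fall outside the limits.
outliers = s[(s < lower_limit) | (s > upper_limit)]
print(outliers.tolist())  # [50]
```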

Note that the IQR method can only be applied to numeric columns.
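One possible way to restrict the calculation to numeric columns is select_dtypes. The sketch below uses a small made-up DataFrame (df_demo and its columns are hypothetical) and computes the limits for every numeric column at once.

```python
import pandas as pd

# Hypothetical mixed-type frame; "city" is not numeric and has no quartiles.
df_demo = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 33, 35, 90],
    "income": [30, 32, 35, 36, 38, 40, 41, 43],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
})

# Keep only the numeric columns before computing quartiles.
numeric = df_demo.select_dtypes(include="number")

q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
low = q1 - 1.5 * iqr
up = q3 + 1.5 * iqr

# The limits are Series indexed by column name, so the comparison
# lines them up with the matching DataFrame columns.
is_outlier = (numeric < low) | (numeric > up)
print(is_outlier.sum())  # number of flagged values per numeric column
```

Here only the value 90 in "age" is flagged, and "city" is left out of the calculation entirely.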

To apply the IQR method, first load the data set and calculate the IQR, lower bound and upper bound of the values in the ‘bmi’ column.

from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset and build a DataFrame with named columns
# (note: sklearn returns these features mean-centered and scaled by default)
diabetics = load_diabetes()
df = pd.DataFrame(diabetics.data, columns=diabetics.feature_names)

# We can use the pandas module to calculate IQR, lower bound and upper bound
q1 = df["bmi"].quantile(0.25)
q3 = df["bmi"].quantile(0.75)
iqr = q3 - q1
low_limit = q1 - 1.5 * iqr
up_limit = q3 + 1.5 * iqr

df[(df["bmi"] < low_limit) | (df["bmi"] > up_limit)]  # rows with outlier values

((df["bmi"] < low_limit) | (df["bmi"] > up_limit)).any()  # True if there is at least one outlier

After calculating the IQR, lower bound and upper bound, we can find the indices of the values that are below the lower bound or above the upper bound. Using these indices, we can remove or replace outliers from the data set.

# Let's find the indices of values below the lower bound or above the upper bound
outlier_ind = df[(df["bmi"] < low_limit) | (df["bmi"] > up_limit)].index

The advantage of the IQR method is that it does not require the data to be close to a normal distribution, and because quartiles are robust statistics, the limits are barely affected by a moderate number of extreme values. It also allows a quick inference before an in-depth analysis and can be applied to several variables at once for comparison. However, the limits are placed symmetrically around the quartiles (1.5 × IQR on each side), so the method works best on roughly symmetric data; on skewed data it can flag many legitimate values in the long tail. This assumption may not always hold for real data sets. Also, since the IQR method relies only on the quartiles, it says little about the overall shape of the distribution.
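The robustness of the quartiles can be seen in a small experiment. The sketch below is only an illustration on made-up data: iqr_limits and std_limits are hypothetical helper names, and the mean ± 3 standard deviations rule is included purely as a point of comparison, it is not part of the IQR method.

```python
import pandas as pd

# Hypothetical toy data: the two series differ only in their largest value.
mild = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
extreme = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 5000])


def iqr_limits(s, k=1.5):
    """Return (lower, upper) limits of the k * IQR rule for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def std_limits(s, k=3.0):
    """Return (lower, upper) limits of a mean +/- k standard deviations rule."""
    return s.mean() - k * s.std(), s.mean() + k * s.std()


# The quartiles do not depend on how extreme the largest value is,
# so both series get the same IQR limits.
print(iqr_limits(mild), iqr_limits(extreme))

# The mean and standard deviation are dragged by the extreme value,
# so these limits change a lot between the two series.
print(std_limits(mild), std_limits(extreme))
```

In this toy example the standard deviation grows so much that 5000 still lies inside the mean ± 3 standard deviations limits, while the IQR limits flag it.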

How do we solve the outlier problem?

  • Drop outliers

The simplest way to deal with outliers is to delete the observations that contain them. However, deleting data shrinks the sample and can change its variability, so it should not be the first choice.

# Let's use indices to remove outliers from the data set
new_df = df.drop(outlier_ind, axis=0)
print(new_df.shape)

When we delete outliers, statistical properties such as variability, central tendency, correlation and dispersion in our data set may change. In other words, when we delete an observation because of an outlier in one column, we also delete that observation's perfectly valid values in the other columns. This can have a significant negative impact on data preprocessing for data analysis and machine learning.
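The side effect on other columns can be checked with a quick before/after comparison. This is a minimal sketch on a made-up two-column frame (df_demo and its values are hypothetical); only "bmi" contains an extreme value.

```python
import pandas as pd

# Hypothetical two-column frame; only "bmi" contains an extreme value.
df_demo = pd.DataFrame({
    "bmi": [21, 22, 23, 24, 25, 26, 27, 28, 29, 60],
    "age": [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
})

q1, q3 = df_demo["bmi"].quantile(0.25), df_demo["bmi"].quantile(0.75)
iqr = q3 - q1
mask = (df_demo["bmi"] < q1 - 1.5 * iqr) | (df_demo["bmi"] > q3 + 1.5 * iqr)

# Dropping the flagged rows removes the whole observation, including "age".
trimmed = df_demo[~mask]

# Compare summary statistics before and after dropping the flagged rows.
summary = pd.DataFrame({
    "mean_before": df_demo.mean(),
    "mean_after": trimmed.mean(),
    "std_before": df_demo.std(),
    "std_after": trimmed.std(),
})
print(summary)
```

Although "age" has no outlier of its own, its mean and standard deviation also change, because the row that held the extreme "bmi" value is gone.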

  • Re-assignment with thresholds

Another strategy to reduce the impact of outliers in the data set is suppression (also called capping). It can be preferred when we do not want to lose data through deletion. Values below the lower limit are replaced with the lower limit, and values above the upper limit are replaced with the upper limit.

# selecting outliers
df.loc[((df["bmi"] < low_limit) | (df["bmi"] > up_limit)), "bmi"]

# Re-assignment with thresholds
df.loc[(df["bmi"] < low_limit), "bmi"] = low_limit
df.loc[(df["bmi"] > up_limit), "bmi"] = up_limit
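As an alternative to the two assignments above, pandas' Series.clip can apply both thresholds in one call and return a new Series, which leaves the original data untouched. A minimal sketch on made-up values (the series below is hypothetical):

```python
import pandas as pd

# Hypothetical sample values with one extreme observation.
s = pd.Series([21, 22, 23, 24, 25, 26, 27, 28, 29, 60], name="bmi")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low_limit, up_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values below low_limit become low_limit, values above up_limit become up_limit.
capped = s.clip(lower=low_limit, upper=up_limit)
print(capped.tolist())
```

The number of rows stays the same; only the extreme value is pulled back to the upper limit.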

See you in the next articles…

For more: https://github.com/mstffclkk/machine_learning
