A Beginner's Guide to Identifying and Handling Outliers

For Aspiring Data Scientists

Amsavalli Mylasalam
Variablz Academy
4 min read · Jan 3, 2023


Have you ever noticed, on a rainy day, that one person strolls along without an umbrella and without a care while everyone else runs for cover? That person is the outlier 😅. Jokes aside, let's talk technically…

Handling Outliers — Credits: Aatomz

Outliers are values in a dataset that differ markedly from the typical values of that attribute. They may indicate variability in the measurement, errors in observation, or genuine novelty.

Should I Remove All the Outliers?

Whether to keep or drop an outlier depends on its effect. If an outlier distorts the result of an analysis, you can drop it. If it carries information that the analysis needs, keep it.

In some applications, the outliers are exactly what you are looking for. Anomaly detection finds and identifies outliers, and it helps to prevent fraud, adversarial attacks, and network intrusions that can compromise your company's future.

If you're confident that the outlier is an incorrect value, or if the dataset is large enough that removing it won't hurt the analysis, then it is safe to drop it.

I usually run the analysis with and without the outlier to see whether there is any substantial difference, and then pick the better outcome rather than dropping the outlier outright.
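As a rough illustration of that with-and-without comparison, here is a minimal sketch. The height values are made up for the example (250 is a planted outlier), and `summarize` is a hypothetical helper name, not something from the dataset used later.

```python
import numpy as np

# Hypothetical heights in cm; 250 is a deliberately planted outlier.
heights = np.array([158, 162, 165, 167, 170, 172, 175, 250])


def summarize(values):
    """Return the mean, median, and sample standard deviation of the values."""
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values, ddof=1)),
    }


with_outlier = summarize(heights)
without_outlier = summarize(heights[heights != 250])

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

In this toy sample the mean and standard deviation move a lot when the outlier is removed, while the median barely changes. If your own results shift like this, the outlier deserves a closer look before you decide.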

What Causes Outliers in a Dataset?

The most common causes of outliers in a dataset are typos (data entry errors), measurement errors, data processing errors, and natural variation, where the extreme value is genuine and not an error at all.

Handling Outliers

  1. Sort Values
  2. Box Plot Visualization/Inter-Quartile Range (IQR)
  3. How to Graph Your Data to Find Outliers
  4. Z-Score Method to Identify Outliers

1. Sort Values

In the sorting method, you sort a numeric variable from low to high and look for unusually low or unusually high values. Mark any extreme values that you find. Sorting is a quick way to check whether specific data points need a closer look before you move on to other methods.
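A minimal sketch of the sorting check with pandas. The small `height` sample is invented for illustration.

```python
import pandas as pd

# Hypothetical sample; 55 and 250 are planted extreme values.
df = pd.DataFrame({"height": [165, 172, 55, 158, 250, 169, 163, 175]})

# Sort from low to high, then inspect both ends of the sorted column.
sorted_heights = df["height"].sort_values()
lowest = sorted_heights.head(3)
highest = sorted_heights.tail(3)

print("lowest values: ", lowest.tolist())
print("highest values:", highest.tolist())
```

The values 55 and 250 stand apart from their neighbours at each end, so those are the points to examine further.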

2. Box Plot Visualization/Inter-Quartile Range (IQR)

A box plot visually summarizes a dataset: the minimum, Q1, Q2 (median), Q3, and the maximum. It also flags outliers according to the IQR method, where IQR = Q3 - Q1. Points above the upper limit (Q3 + 1.5*IQR) or below the lower limit (Q1 - 1.5*IQR) are drawn individually as outliers, and the whiskers stop at the most extreme data points that still lie inside those limits.

Quartile 1 (Q1) represents the 25th percentile
Quartile 2 (Q2) represents the 50th percentile
Quartile 3 (Q3) represents the 75th percentile
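To make the quartiles and limits concrete, here is a small worked sketch with NumPy. The ten numbers are made up, and `np.percentile` uses linear interpolation by default, so other quantile conventions can give slightly different quartiles.

```python
import numpy as np

# Hypothetical values; 50 is a planted outlier.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Quartiles (default method: linear interpolation between data points).
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Limits used by the IQR method.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are flagged as outliers.
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"lower limit={lower_limit}, upper limit={upper_limit}")
print("outliers:", outliers.tolist())
```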

Implementation of IQR

To explain the IQR method, I am using the Cardiovascular Disease dataset from Kaggle. Let's import the libraries and read the file.

Input:
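A minimal sketch of this step. The file name `cardio_train.csv`, the semicolon separator, and the column names `height` and `weight` are assumptions about how the Kaggle download is laid out, so adjust them to match your copy. `load_data` and `boxplot_features` are hypothetical helper names.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line to open plot windows
import matplotlib.pyplot as plt
import pandas as pd


def load_data(path, sep=";"):
    """Read the CSV file into a DataFrame (the separator is an assumption)."""
    return pd.read_csv(path, sep=sep)


def boxplot_features(df, columns):
    """Draw one box plot per column, side by side, and return the figure."""
    fig, axes = plt.subplots(
        1, len(columns), figsize=(4 * len(columns), 4), squeeze=False
    )
    for ax, col in zip(axes[0], columns):
        ax.boxplot(df[col].dropna().to_numpy())
        ax.set_title(col)
    fig.tight_layout()
    return fig


# Usage (assumed file and column names):
# df = load_data("cardio_train.csv")
# fig = boxplot_features(df, ["height", "weight"])
# fig.savefig("boxplots.png")
```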

From this box plot, you can see whether outliers exist. Let's create a function that uses the IQR method to detect and remove outliers for these features.
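One way such a function might look is sketched below. `remove_outliers_iqr` is a hypothetical name, the multiplier `k=1.5` follows the limits described earlier, and the small `height`/`weight` sample is invented so the example runs on its own.

```python
import pandas as pd


def remove_outliers_iqr(df, columns, k=1.5):
    """Drop rows whose value in any of `columns` lies outside the IQR limits.

    The limits are computed per column on the original data. Rows with a
    missing value in one of the columns are dropped as well.
    """
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        keep &= df[col].between(lower, upper)  # inclusive on both ends
    return df[keep]


# Hypothetical sample; the height of 250 is a planted outlier.
sample = pd.DataFrame(
    {
        "height": [150, 160, 165, 170, 175, 180, 250],
        "weight": [50, 60, 65, 70, 75, 80, 85],
    }
)
cleaned = remove_outliers_iqr(sample, ["height", "weight"])
print(len(sample), "rows before,", len(cleaned), "rows after")
```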

Remove the outliers first, then visualize these features again using a box plot.

3. How to Graph Your Data to Find Outliers

Three kinds of graph are commonly used to find outliers: scatter plots, histograms, and box plots.

Using these graphical methods, I am finding outliers for the Height attribute.

Input:
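A minimal sketch of the three plots with matplotlib. It uses randomly generated heights with two planted extremes instead of the real Height column, and `plot_outlier_views` is a hypothetical helper name.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line to open plot windows
import matplotlib.pyplot as plt
import numpy as np


def plot_outlier_views(values, name="height"):
    """Draw a histogram, a box plot, and a scatter plot of the same values."""
    values = np.asarray(values, dtype=float)
    fig, (ax_hist, ax_box, ax_scatter) = plt.subplots(1, 3, figsize=(12, 4))

    ax_hist.hist(values, bins=30)
    ax_hist.set_title(f"Histogram of {name}")

    ax_box.boxplot(values)
    ax_box.set_title(f"Box plot of {name}")

    # Plot each value against its row position so isolated points stand out.
    ax_scatter.scatter(np.arange(len(values)), values, s=10)
    ax_scatter.set_title(f"Scatter plot of {name}")

    fig.tight_layout()
    return fig


# Synthetic stand-in for the Height column, with two planted extremes.
rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(165, 8, 500), [55.0, 250.0]])
fig = plot_outlier_views(heights)
# fig.savefig("height_outliers.png")  # uncomment to write the figure to disk
```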

Output:

Histograms make it simple to identify outliers. For instance, the extreme left point in the preceding graph is an outlier.

The box plot shows that values above 200 and below 130 are outliers.

A convenient way to define an outlier is a point that lies more than 1.5 times the interquartile range above the third quartile or below the first quartile.

4. Z-Score Method to Identify Outliers

The z-score tells us whether a data value is greater or smaller than the mean and how far from the mean it lies. It expresses the distance of a data point from the mean in standard deviations: z = (x - mean) / standard deviation.

|Z| > 3 → OUTLIER

A data point is significantly different from the other data points if the absolute value of its z-score is greater than 3. Such a data point may be an outlier.

Implementing a function to calculate the z-score for the Height attribute
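A minimal sketch of such a function, run on an invented height sample and not on the real column. `z_scores` and `find_outliers_zscore` are hypothetical names, and the population standard deviation (NumPy's default) is an assumption; use `ddof=1` if you prefer the sample standard deviation.

```python
import numpy as np


def z_scores(values):
    """Return (x - mean) / std for every value, using the population std."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical, so no point deviates from the mean.
        return np.zeros_like(values)
    return (values - values.mean()) / std


def find_outliers_zscore(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    return values[np.abs(z_scores(values)) > threshold]


# Hypothetical heights: 160 to 178 cm plus one planted outlier of 250 cm.
heights = list(range(160, 179)) + [250]
outliers = find_outliers_zscore(heights)
print("outliers:", outliers.tolist())
```

Note that in very small samples no point can reach a z-score of 3, because the outlier itself inflates the standard deviation. The method works best when you have plenty of data.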

This article covered commonly used methods for identifying outliers, the ways outliers can end up in a dataset, and how to handle them so that your analysis yields reliable insights.

Handling outliers is a complex process. However, for analytical people, it’s so exciting 👩‍🏫.

Connect with me for data science talks

https://www.linkedin.com/in/amsavalli-datascientist/

Thanks & Regards

AMSAVALLI — Data Scientist
