Outliers: Understanding, Detecting, and Handling the Data Anomalies

Saurabh Dhande
5 min read · Mar 10, 2023


In this blog, we are going to discuss outliers. We will cover the following points:

1] What is an outlier

2] Importance of outlier analysis

3] Outlier detection methods

4] How to handle outliers

Let’s dive into the details of each topic.

1] What is an outlier:

Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set.

Or

An outlier is an observation that is substantially different from the other observations. Outliers are important because they can change the results of our data analysis.

Or

In statistics, outliers are data points that don’t belong to a certain population. An outlier is an abnormal observation that lies far away from other values and diverges from otherwise well-structured data.

2] Importance of outlier analysis:

Outlier analysis is an essential part of data science, as it helps to improve the accuracy and reliability of statistical analyses and machine learning models. Outliers can significantly impact the results of data analyses, and identifying and addressing them can help ensure that data scientists are working with accurate information. By removing outliers or adjusting for their influence, data scientists can better understand their data and draw more meaningful conclusions from it.

3] Outlier detection methods:

We can detect outliers in a dataset using four common methods as follows:

A] Standard Deviation (Z-Score):

In statistics, if a data distribution is approximately normal, then the majority of the data values (about 68%) are within one standard deviation of the mean, an even larger percentage (about 95%) are within two standard deviations, and an overwhelming majority (about 99.7%) lie within three standard deviations.
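These percentages can be checked numerically. The sketch below is only an illustration: it relies on the standard identity P(|Z| < k) = erf(k / sqrt(2)) for a normal distribution, and the helper name `share_within` is made up for this example.

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Print the share of values within 1, 2 and 3 standard deviations
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The three printed shares should line up with the 68%, 95% and 99.7% figures quoted above.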

Therefore, any data point that lies more than 3 standard deviations away from the mean is very likely to be anomalous, i.e. an outlier.

Z-Score = (X-Mean)/Std. Deviation

e.g. dataset = [1,2,3,1,2,3,1,2,3,31,3,2,1,3,2,1,3,2,1]

Step 1: Find the mean of the dataset

Mean = (1+2+3+1+2+3+1+2+3+31+3+2+1+3+2+1+3+2+1) / 19 = 67 / 19 = 3.53

Step 2: Calculate Standard Deviation (S.D.)

σ = sqrt(Σ(x-μ)²/N)

standard deviation = sqrt(808.74 / 19) = 6.52

Step 3: z-score calculation

The z-score formula is:

z = (x - μ) / σ

where:

x is the raw score or value

μ is the mean of the population or sample

σ is the standard deviation of the population or sample

z-score for 1 = (1 - 3.53) / 6.52 = -0.39

z-score for 2 = (2 - 3.53) / 6.52 = -0.23

z-score for 3 = (3 - 3.53) / 6.52 = -0.08

z-score for 31 = (31 - 3.53) / 6.52 = 4.21

etc.

Step 4: Identify the outlier

In this case, exactly one data point, 31, has a z-score greater than 3 or less than -3. Therefore, 31 is an outlier in this dataset.
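The four steps above can be sketched in Python using only the standard library. This is a minimal illustration rather than a definitive implementation: the variable names are chosen for this example, and `statistics.pstdev` is used because it divides by N, which matches the population formula given in Step 2.

```python
import statistics

dataset = [1, 2, 3, 1, 2, 3, 1, 2, 3, 31, 3, 2, 1, 3, 2, 1, 3, 2, 1]

# Step 1: mean of the dataset
mean = statistics.fmean(dataset)

# Step 2: population standard deviation (divides by N)
std_dev = statistics.pstdev(dataset)

# Step 3: z-score of every value
z_scores = [(x - mean) / std_dev for x in dataset]

# Step 4: flag values more than 3 standard deviations from the mean
THRESHOLD = 3
outliers = [x for x, z in zip(dataset, z_scores) if abs(z) > THRESHOLD]

print(f"mean = {mean:.2f}, standard deviation = {std_dev:.2f}")
print("outliers:", outliers)
```

Note that a single extreme value inflates both the mean and the standard deviation, so on very small datasets the z-score of a genuine outlier can stay below the threshold.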

B] Box plot:

A box plot is a graphical depiction of numerical data through its quartiles. It is a very simple but effective way to visualize outliers. Think of the lower and upper whiskers as the boundaries of the data distribution. Any data points that fall above the upper whisker or below the lower whisker can be considered outliers or anomalous.

(Figure: box plot. Source: Analytics Vidhya)
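The whisker boundaries can also be computed directly. The sketch below assumes the widely used 1.5 × IQR convention for where the whiskers end (the text above does not name a specific rule), and it reuses the dataset from the z-score example. Quartile definitions differ slightly between tools; `method="inclusive"` is one reasonable choice, not the only one.

```python
import statistics

dataset = [1, 2, 3, 1, 2, 3, 1, 2, 3, 31, 3, 2, 1, 3, 2, 1, 3, 2, 1]

# Quartiles: the edges of the box (q1, q3) and the median line
q1, median, q3 = statistics.quantiles(dataset, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, the height of the box

# Assumed convention: whiskers reach at most 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in dataset if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = [{lower_fence}, {upper_fence}]")
print("outliers:", outliers)
```

Because quartiles are barely affected by a single extreme value, this rule tends to be more robust than the z-score on small samples.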

C] Violin plots:

Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Typically a violin plot will include all the data that is in a box plot: a marker for the median of the data, a box or marker indicating the interquartile range, and possibly all sample points if the number of samples is not too high.
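To make the "smoothed by a kernel density estimator" part concrete, here is a minimal Gaussian kernel density sketch in plain Python. The function name `kde` and the fixed `BANDWIDTH` of 1.0 are assumptions made for this illustration; plotting libraries normally choose the bandwidth automatically.

```python
import math

dataset = [1, 2, 3, 1, 2, 3, 1, 2, 3, 31, 3, 2, 1, 3, 2, 1, 3, 2, 1]

BANDWIDTH = 1.0  # assumed smoothing width, picked by hand for this example

def kde(x, data, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at x."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

# The width of the violin at a given value is proportional to this density
for x in (2, 15, 31):
    print(f"density at {x}: {kde(x, dataset, BANDWIDTH):.4f}")
```

The density is high around 1 to 3, essentially zero in the gap, and shows a small isolated bump at 31. In a violin plot that bump appears as a thin, detached bulge far from the main body, which is the visual cue for an outlier.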

D] Scatter plots:

A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. Points that lie very far from the general spread of the data and have very few neighbours are considered to be outliers.
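The "very few neighbours" idea can be turned into a simple check. The sketch below is one possible reading of it, not a standard algorithm from this article: the points are made up for illustration, and `RADIUS` and `MIN_NEIGHBOURS` are assumed values that would need tuning on real data.

```python
import math

# Made-up (x, y) points: a tight cluster plus one distant point
points = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (1.2, 1.1), (1.0, 0.8), (8.0, 9.0)]

RADIUS = 1.0        # assumed neighbourhood size
MIN_NEIGHBOURS = 2  # assumed minimum number of neighbours for a typical point

def neighbour_counts(pts, radius):
    """For each point, count how many other points lie within `radius` of it."""
    counts = []
    for i, p in enumerate(pts):
        n = sum(1 for j, q in enumerate(pts) if i != j and math.dist(p, q) <= radius)
        counts.append(n)
    return counts

counts = neighbour_counts(points, RADIUS)
outliers = [p for p, n in zip(points, counts) if n < MIN_NEIGHBOURS]

print("neighbour counts:", counts)
print("outliers:", outliers)
```

Here every point in the cluster has four neighbours, while the distant point has none and is flagged.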

4] How to handle outliers:

The following are approaches to handling outliers:

1. Drop the outlier records.

2. Assign a new value: If an outlier seems to be due to a mistake in your data, you can try imputing a replacement value.

3. Analyze them separately: If the outliers make up only a small percentage of the data but are still numerous in absolute terms, dropping them might cause a loss of insight. In that case, we should group them and run our analysis on them separately.
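The three approaches can be sketched as follows, reusing the dataset and the 3-standard-deviation rule from the z-score section. The helper `is_outlier` is a name invented for this example, and using the median of the remaining values as the replacement is an assumed choice; the right imputation depends on the data.

```python
import statistics

dataset = [1, 2, 3, 1, 2, 3, 1, 2, 3, 31, 3, 2, 1, 3, 2, 1, 3, 2, 1]

mean = statistics.fmean(dataset)
std_dev = statistics.pstdev(dataset)

def is_outlier(x, threshold=3):
    """True if x lies more than `threshold` standard deviations from the mean."""
    return abs(x - mean) / std_dev > threshold

# 1. Drop the outlier records
dropped = [x for x in dataset if not is_outlier(x)]

# 2. Assign a new value (assumed choice: median of the non-outlier values)
replacement = statistics.median(dropped)
imputed = [replacement if is_outlier(x) else x for x in dataset]

# 3. Group the outliers so they can be analyzed separately
unusual = [x for x in dataset if is_outlier(x)]

print("after dropping:", dropped)
print("after imputing:", imputed)
print("separate group:", unusual)
```

Each option keeps the original `dataset` untouched, so the choice can be revisited later.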

Conclusion:

Outliers can significantly impact statistical analyses and machine learning algorithms, leading to incorrect conclusions or predictions. Therefore, it is crucial to identify and handle outliers appropriately to ensure accurate results. Understanding the various techniques available for outlier detection and handling can help data analysts and machine learning practitioners improve the quality and reliability of their analyses and predictions.

This blog covered the basics of outliers. In the next blog, I will demonstrate how to remove them from a dataset using Python code.

If you’re interested in high-quality content from the field of data science, please follow my channel. Thank you for reading.
