Handling Outliers in Datasets

Dhiraj Mishra
7 min read · Sep 16, 2020

Photo by Danielle MacInnes on Unsplash

Table of Contents

  1. Definition of Outliers
  2. Different types of Outliers
  3. Ways to deal with Outliers
  4. Optional Content about SD & Variance
  5. Standard Deviation Method
  6. Interquartile Range Method (IQR)
  7. Automatic Outlier Detection

Definition of Outliers

An outlier is an unlikely observation in a dataset. It is rare, or distinct, or does not fit in some way.

Different types of Outliers:

Outliers can have many causes, such as:

  • Measurement or Manual error
  • Data generation flaw
  • Data corruption
  • True outlier observation (e.g. Sachin Tendulkar/Virat Kohli in cricket)

There is no precise way to identify an outlier; a domain expert needs to interpret the raw data and decide whether a value is an outlier or not.

Ways to deal with Outliers

  • Standard Deviation Method
  • Interquartile Range Method (IQR)
  • Automatic Outlier Detection

Optional Content about SD & Variance

Variance: In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.
Informally, it measures how far a set of numbers is spread out from their average value.

S² = Σ (X − μ)² / N

S² = sample variance
X = the value of one observation
μ = the mean value of all observations
N = the number of observations

Standard Deviation: In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values.
A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

Variance (σ²) = average squared deviation of the values from the mean.
Standard deviation (σ) = square root of the variance.

Because we square the deviations while calculating the variance, the unit changes: e.g. lengths measured in metres (m) have a variance measured in metres squared (m²). Taking the square root of the variance brings us back to the units of the original scale, and this is the standard deviation.
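
As a minimal sketch of the two definitions above, the snippet below computes the variance and the standard deviation step by step. The `lengths_m` values are made up for illustration. Note the assumption that we divide by N (the plain average), which is also what `np.var` and `np.std` do by default.

```python
import numpy as np

# Hypothetical lengths in metres (illustrative values only)
lengths_m = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = lengths_m.mean()
# Variance: average of the squared deviations from the mean (unit: m²)
variance = np.mean((lengths_m - mean) ** 2)
# Standard deviation: square root of the variance (unit: m again)
std_dev = np.sqrt(variance)

print(mean, variance, std_dev)
```

For these numbers the mean is 5 m, the variance is 4 m² and the standard deviation is 2 m.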

Standard deviation is a measure of spread around the mean. As it is closely linked with the mean, it is greatly affected by outliers.
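
To see how strongly a single extreme value can move these numbers, here is a small sketch. The measurements are invented for illustration, and the median is printed only as a point of comparison.

```python
import numpy as np

# Hypothetical measurements (illustrative values only)
values = np.array([10, 12, 11, 13, 12, 11, 10, 12])
with_outlier = np.append(values, 100)  # the same data plus one extreme value

for label, arr in (("without outlier", values), ("with outlier", with_outlier)):
    print(f"{label}: mean={arr.mean():.2f}, sd={arr.std():.2f}, "
          f"median={np.median(arr):.1f}")
```

The mean and especially the SD jump once the value 100 is added, while the median barely moves.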

# When two datasets have the same or almost the same mean

  • SD is useful when comparing the spread of two separate data sets that have approximately the same mean. The data set with the smaller standard deviation has a narrower spread of measurements around the mean and therefore has comparatively fewer high or low values.
    But there are a few things to consider when comparing the SD of data sets, even when their means are similar.

Suppose you are comparing two cricket players. A difference of 3 in the number of centuries looks pretty close, while a difference of 30 in the number of matches played looks far apart. Whether a spread is large or small depends on the scale of the values, so it is always useful to assess the SD relative to the mean value.
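
One possible way to assess the SD against the mean is to divide one by the other (often called the coefficient of variation). The sketch below is only an illustration: the career figures are made up, and `relative_sd` is a hypothetical helper name, not something from a library.

```python
import numpy as np

# Hypothetical career figures for five players (made-up numbers)
centuries = np.array([40, 43, 38, 41, 44])
matches = np.array([400, 430, 380, 410, 440])

def relative_sd(x):
    """Standard deviation divided by the mean (coefficient of variation)."""
    return np.std(x) / np.mean(x)

for name, arr in (("centuries", centuries), ("matches", matches)):
    print(f"{name}: sd={np.std(arr):.2f}, sd/mean={relative_sd(arr):.3f}")
```

The SD of `matches` is ten times larger in absolute terms, yet relative to the mean both columns are spread out equally.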

Important facts about Standard deviation & Variance

  1. Standard deviation:

  • Standard deviation is never negative.
  • Standard deviation is sensitive to outliers.
  • If all values of a data set are the same, the standard deviation is zero.

  2. Variance:

  • Variance is expressed in the squared unit of measurement (e.g. m² for metres).
  • Variance is never negative.

When analyzing normally distributed data, standard deviation can be used in combination with the mean in order to calculate data intervals.

If x bar = mean, SD = standard deviation and x = a value in the data set, then approximately

  • 68% of the data lie in the interval: mean − SD < x < mean + SD
  • 95% of the data lie in the interval: mean − 2SD < x < mean + 2SD
  • 99.7% of the data lie in the interval: mean − 3SD < x < mean + 3SD
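
These intervals can be checked empirically. The sketch below draws a large sample from a normal distribution and counts how many points fall within 1, 2 and 3 SDs of the mean. The mean of 30, the SD of 4, the sample size and the seed are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
sample = rng.normal(loc=30, scale=4, size=100_000)  # assumed mean and SD

mean, sd = sample.mean(), sample.std()
coverage = {}
for k in (1, 2, 3):
    inside = np.abs(sample - mean) < k * sd
    coverage[k] = inside.mean()  # fraction of points within k SDs
    print(f"within {k} SD: {coverage[k]:.2%}")
```

The printed fractions should land close to 68%, 95% and 99.7%, with small differences because the sample is random.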

Standard Deviation Method

If we know that the data follows a normal distribution, we can use the standard deviation method to handle outliers.
The Gaussian distribution is also commonly called the “normal distribution” and is often described as a “bell-shaped curve”.

For a Gaussian, the probability of a value lying within one standard deviation of the mean is 0.683, i.e. the interval of one standard deviation on either side of the mean covers about 68% of the data.

For example, if the mean is 30 and the standard deviation is 4, then the values between 26 and 34 account for about 68% of the data sample. We can cover more of the data sample if we expand the range as follows:

1 Standard Deviation from the Mean: 68%
2 Standard Deviations from the Mean: 95%
3 Standard Deviations from the Mean: 99.7%
4 Standard Deviations from the Mean: 99.99%
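
The percentages above are rounded. For a normal distribution, the exact share of data within k standard deviations of the mean is erf(k / √2), which the short sketch below evaluates with Python's standard library. The helper name `coverage_within` is just an illustrative choice.

```python
import math

def coverage_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(f"{k} SD from the mean: {coverage_within(k):.4%}")
```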

How to implement using Python

import numpy as np

# `data` is assumed to be a 1-D array or list of numeric values
# calculate mean & standard deviation
data_mean = np.mean(data)
data_std = np.std(data)
# identify outliers: anything more than 3 SD away from the mean
# (see the optional section above for why 3 SD is a common choice)
cut_off = data_std * 3
lower = data_mean - cut_off
upper = data_mean + cut_off

We can then identify outliers as those examples that fall outside of the defined lower and upper limits.

outliers = [x for x in data if x < lower or x > upper]
actual_data = [x for x in data if lower <= x <= upper]
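
Putting the steps together, here is a self-contained sketch that can be run as is. It is only one possible arrangement: `split_by_sd` is a hypothetical helper name, and the data is synthetic (normal values with two extreme points added by hand).

```python
import numpy as np

def split_by_sd(data, n_sd=3):
    """Split data into (outliers, inliers) using mean +/- n_sd * SD."""
    data = np.asarray(data, dtype=float)
    mean, sd = data.mean(), data.std()
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    is_outlier = (data < lower) | (data > upper)
    return data[is_outlier], data[~is_outlier]

# Synthetic data: 1000 roughly normal values plus two planted extremes
rng = np.random.default_rng(seed=1)
data = np.append(rng.normal(loc=30, scale=4, size=1000), [80.0, -20.0])

outliers, actual_data = split_by_sd(data)
print("outliers found:", len(outliers))
print("remaining observations:", len(actual_data))
```

Keep in mind that the planted extremes also inflate the mean and SD used for the cut-off, which is one reason this method works best when the data is close to normal.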

Interquartile Range Method (IQR)

Not all real-world data is normally distributed. The interquartile range is a measure of where the “middle fifty” percent of a data set lies.
Where a range is a measure of where the beginning and end are in a set, an interquartile range is a measure of where the bulk of the values lie.

In simple words, the IQR is the first quartile subtracted from the third quartile.
Formula:

IQR = Q3 − Q1
where Q3 is the 75th percentile and Q1 is the 25th percentile.

The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5.

A factor k of 3 or more can be used to identify values that are extreme outliers, or “far outs”, when described in the context of box and whisker plots.
On a box and whisker plot, these limits are drawn as fences on the whiskers (the lines drawn from the box). Values that fall outside these limits are drawn as dots.
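
The fences described above can be sketched in a few lines. This is an illustration under assumptions: `iqr_outliers` is a hypothetical helper name, the values are made up, and the quartiles come from `np.percentile` with its default interpolation.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using Q1 - k*IQR and Q3 + k*IQR."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (data < lower) | (data > upper)
    return data[mask], (lower, upper)

# Hypothetical values with one obviously extreme point
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 95]

outliers, (lower, upper) = iqr_outliers(values)          # common choice k = 1.5
extreme, (lower_3, upper_3) = iqr_outliers(values, k=3)  # "far out" values
print(outliers, lower, upper)
print(extreme, lower_3, upper_3)
```

With these values only 95 falls outside the fences, for both k = 1.5 and k = 3.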

Ways to calculate percentiles in Python

  • The describe() function of a pandas DataFrame.
  • The percentile() function in NumPy.
  • A box plot or violin plot.
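
The first two options might look like the sketch below. The `scores` values are invented for illustration, and the plotting option is left out to keep the example free of a plotting backend.

```python
import numpy as np
import pandas as pd

# Hypothetical scores (illustrative values only)
scores = pd.Series([55, 61, 64, 70, 72, 75, 78, 81, 90])

# Option 1: describe() reports the 25%, 50% and 75% percentiles
summary = scores.describe()
q1_pd, q3_pd = summary["25%"], summary["75%"]

# Option 2: np.percentile() for any percentile you ask for
q1_np, q3_np = np.percentile(scores, [25, 75])

print(summary)
print("IQR:", q3_np - q1_np)
```

Both routes use linear interpolation by default, so they agree on Q1 and Q3 here.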

Example Box plot dataset

How to calculate it using pen & paper

# Put the numbers in order.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27
# Find the median.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27 (the median is 9)
# Place parentheses around the numbers above and below the median.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27)
# Find Q1 and Q3. Think of Q1 as the median of the lower half of the data and Q3 as the median of the upper half.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). Q1 = 5 and Q3 = 18.
# Subtract Q1 from Q3 to find the interquartile range.
18 − 5 = 13
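
The pen-and-paper steps can be mirrored in code. The sketch below uses only the standard library. `quartiles_by_halves` is a hypothetical helper name, and it assumes at least two values.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves.

    With an odd number of values, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # numbers below the median
    upper = ordered[-half:]   # numbers above the median
    return median(lower), median(upper)

data = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 5 18 13
```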

What if I Have an Even Set of Numbers?

# Put the numbers in order.
3, 5, 7, 8, 9, 11, 15, 16, 20, 21
# Make a mark in the center of the data.
3, 5, 7, 8, 9, | 11, 15, 16, 20, 21
# Place parentheses around the numbers above and below the mark you made in the previous step; it makes Q1 and Q3 easier to spot.
(3, 5, 7, 8, 9), | (11, 15, 16, 20, 21)
# Find Q1 and Q3
(3, 5, 7, 8, 9), | (11, 15, 16, 20, 21). Q1 = 7 and Q3 = 16.
# Subtract Q1 from Q3.
16 − 7 = 9
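
One caveat when moving from paper to code: libraries do not necessarily use the median-of-halves rule. As a sketch, the snippet below runs `np.percentile` with its default settings on the two worked examples. NumPy interpolates linearly between neighbouring values, so its quartiles can differ from the hand-calculated ones.

```python
import numpy as np

odd_set = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]
even_set = [3, 5, 7, 8, 9, 11, 15, 16, 20, 21]

results = {}
for name, values in (("odd", odd_set), ("even", even_set)):
    q1, q3 = np.percentile(values, [25, 75])  # default: linear interpolation
    results[name] = (q1, q3, q3 - q1)
    print(f"{name}: Q1={q1}, Q3={q3}, IQR={q3 - q1}")
```

By hand the IQRs were 13 and 9, while this should give 11 and 8.5. Neither answer is wrong; they follow different quartile conventions, so it is worth stating which one you use.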

Automatic Outlier Detection
