Outlier detection 101: Median and Interquartile range.

David H
4 min read · Aug 13, 2019

--

Introduction

  • This article assumes that you are somewhat familiar with basic statistics.

Suppose you have these data points:

[ 30, 50, 63, 474, 78, 999, 997, 61, 74, 83, 92, 100, 55, 56, 77 ]

Just by looking at it, it is reasonable to consider 474, 999, and 997 to be outliers. Let’s visualize this.

Fig 1. Visualization of the data. The red dots are the outliers.

Now, suppose you want to develop a SYSTEMATIC approach to detect the outliers of similar data sets. The first thing that comes to most people’s mind is using standard deviation and mean:

mean = 219.27

standard deviation (std) = 322.04

One common approach is to treat any data point outside the range [mean-std : mean+std] as an outlier. Let’s apply and visualize this.

Fig 2. Detecting outliers using mean and std. The blue region indicates the range [mean-std : mean+std]. The middle blue line is the mean, and the two blue lines that enclose the blue region are mean+std and mean-std.

In the above figure, the colored region indicates the range [mean-std : mean+std]. However, the outlier with a value of 474 falls inside the blue region, so this approach fails to detect it as an outlier, even though it clearly is one. Similar misses will often occur if you apply the mean and std method to other data sets.
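The check above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name mean_std_outliers is made up for illustration, and the population standard deviation (pstdev) is assumed because it is the variant that reproduces the 322.04 quoted above.

```python
from statistics import mean, pstdev

data = [30, 50, 63, 474, 78, 999, 997, 61, 74, 83, 92, 100, 55, 56, 77]

def mean_std_outliers(values, s=1.0):
    """Return the points outside [mean - s*std, mean + s*std]."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (assumption: matches the 322.04 above)
    low, high = mu - s * sigma, mu + s * sigma
    return [v for v in values if v < low or v > high]

mu, sigma = mean(data), pstdev(data)
print(f"mean = {mu:.2f}, std = {sigma:.2f}")
print(f"range = [{mu - sigma:.2f} : {mu + sigma:.2f}]")
print("flagged:", mean_std_outliers(data))  # 474 is not in this list
```

The parameter s is there only so the same function can be reused with a different multiple of the standard deviation.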

Is there a way to fix this? Sure. You can introduce a scalar factor s so that [mean-s*std : mean+s*std] becomes the new range, and pick s small enough to produce a narrower blue region. However, while that may fix the immediate problem, you would have to find a value of s that fits each data set, which is far from efficient. The scalar factor also does not fix the fundamental problem with the mean and standard deviation, which is that both are heavily contaminated by the outliers themselves. One or a small number of data points that are very large in magnitude (outliers) can significantly inflate both the mean and the standard deviation, especially when the number of data points is not very large.
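To get a feel for how strong the contamination is, one can compare the mean and standard deviation of the full data set with those of the same data after the three suspicious points are dropped. The sketch below does exactly that; the names suspects and clean are illustrative, and the population standard deviation is again an assumption.

```python
from statistics import mean, pstdev

data = [30, 50, 63, 474, 78, 999, 997, 61, 74, 83, 92, 100, 55, 56, 77]
suspects = {474, 999, 997}  # the three points picked out by eye earlier
clean = [v for v in data if v not in suspects]

for label, values in (("with outliers", data), ("without outliers", clean)):
    print(f"{label:>16}: mean = {mean(values):7.2f}, std = {pstdev(values):7.2f}")
```

Three points out of fifteen are enough to roughly triple the mean (from 68.25 to about 219.27) and to push the standard deviation from under 20 to over 300, which is why the range built from them ends up so wide.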

Median + Interquartile Range

So we need a different approach. This is where the median and interquartile range come in. To recap your statistics knowledge, some definitions are in order.

The median is the value that is literally in the middle: roughly 50% of the data are greater than the median, and roughly 50% are smaller. (I say roughly because if the number of data points is even, the median is the average of the two middle values; if it is odd, the median is exactly the middle value.) For example, the median of [0,10,100] is 10.

The x-th percentile is a value that is greater than or equal to x% of the data points and less than or equal to (100-x)% of them. For example, if you are in the 99th percentile of a test score, you scored better than or equal to 99% of the people who took the same test.

The interquartile range (IQR) is simply the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
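The definitions above leave open how a percentile is computed when it falls between two data points. One common convention, assumed here, is to interpolate linearly between the two neighbouring values in the sorted data. The functions percentile and iqr below are a hand-written sketch of that convention, not a reference implementation; other conventions exist and can give slightly different quartiles.

```python
def percentile(values, p):
    """p-th percentile (0 <= p <= 100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100  # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return percentile(values, 75) - percentile(values, 25)

print(percentile([0, 10, 100], 50))      # odd count: the middle value, 10.0
print(percentile([0, 10, 20, 100], 50))  # even count: average of 10 and 20, 15.0
```

With this convention the median is just the 50th percentile, so the odd and even cases described above fall out of the same formula.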

Now let’s apply this to our data:

[ 30, 50, 63, 474, 78, 999, 997, 61, 74, 83, 92, 100, 55, 56, 77 ]

median = 77.0

Q1 = 25th percentile = 58.5

Q3 = 75th percentile = 96.0

Interquartile range = Q3 - Q1 = 96.0 - 58.5 = 37.5

Similar to what we did in the mean + std case, we consider values outside the range [Q1-1.5*IQR : Q3+1.5*IQR] to be outliers. The factor 1.5 is the standard choice: a factor of 1 gives a fairly narrow range that often flags ordinary, non-outlier data points when the data set has medium to high variance. The result is shown below.

Fig 3. Outlier detection using median and interquartile range. The middle blue line is the median, and the blue lines that enclose the blue region are Q1-1.5*IQR and Q3+1.5*IQR.

Unlike the mean and std approach, the blue region created by the median and IQR tightly encloses only the black dots, avoiding the outliers like the plague.
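The same detection can be sketched in code. The example below uses statistics.quantiles from the Python standard library (Python 3.8+); the function name iqr_outliers and the choice of method="inclusive" are assumptions made for this illustration, and other quantile conventions would shift Q1 and Q3 slightly.

```python
from statistics import median, quantiles

data = [30, 50, 63, 474, 78, 999, 997, 61, 74, 83, 92, 100, 55, 56, 77]

def iqr_outliers(values, k=1.5):
    """Return (low, high, outliers) for the range [Q1 - k*IQR : Q3 + k*IQR]."""
    # "inclusive" interpolates linearly between sorted data points (assumed convention).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # the interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return low, high, [v for v in values if v < low or v > high]

low, high, outliers = iqr_outliers(data)
print("median:", median(data))
print(f"range = [{low} : {high}]")
print("flagged:", outliers)  # 474, 999 and 997
```

Because Q1 and Q3 are taken from the middle of the sorted data, the three extreme values barely influence the range, so all three end up outside it.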

Discussion

In practice, applying this technique isn’t as simple as the above example. Unlike most real data, the above example is time-independent, has no high-order polynomial dependence, no oscillations, and minimal noise. To use the interquartile range effectively for outlier detection, proper pre-processing is necessary. This includes subtracting the linear/polynomial dependence, a Fourier fit, applying a logarithm, performing the calculation on a per-unit basis, and more.
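As a small illustration of why pre-processing matters, the sketch below applies the IQR rule to a made-up series with a steady upward trend and a single spike, first on the raw values and then on the residuals left after subtracting a least-squares straight line. Everything here is hypothetical: the series is synthetic, the helper names are invented for this example, and a plain linear fit stands in for whatever detrending a real data set would need.

```python
from statistics import quantiles

def iqr_outlier_indices(values, k=1.5):
    """Indices of the points outside [Q1 - k*IQR : Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < low or v > high]

def linear_residuals(values):
    """Subtract the least-squares line fitted against the index 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    return [y - (slope * t + intercept) for t, y in enumerate(values)]

# Made-up series: a steady trend, a little jitter, and one spike at index 10.
series = [10 * t + (1 if t % 2 == 0 else -1) for t in range(20)]
series[10] += 60

print("raw:      ", iqr_outlier_indices(series))                    # []
print("detrended:", iqr_outlier_indices(linear_residuals(series)))  # [10]
```

On the raw values the trend itself stretches the interquartile range so much that the spike hides inside it; once the trend is removed, the residuals cluster tightly and the spike stands out.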

Conclusion

The median and interquartile range provide a powerful tool for detecting outliers. They can be used instead of the mean and standard deviation because they are far more resistant to contamination by outliers.
