Photo by Will Myers on Unsplash

Outlier Detection: A Comprehensive Overview of Different Methods

Mathematical and statistical background of various outlier detection methods, with a detailed comparison of their use cases

Imdadul Haque Milon
7 min read · May 8, 2021



Most real-world data sets contain outliers: values unusually small or large compared to the rest of the data. Outliers can distort data analysis if not detected correctly, but they can also reveal significant information and characteristics of the data. There are several methods for detecting outliers in a data set, and it is important to understand their characteristics: they can be quite powerful on large, normally distributed data, but applying them to non-normal data or small samples without proper knowledge of their behavior is problematic. In this article, we will discuss different outlier detection methods and how to apply them properly to a data set.

Outlier

An outlier is an observation that lies at an unusual distance from the other observations in a random sample of a population. There is no single way to define "unusual distance"; it depends on the data set, the use case, and the application, and generally the person working with the data decides what counts as unusual. Outliers can arise for many reasons, e.g., data entry mistakes, mixing data from different populations, or wrong measurements. To see the impact of an outlier, we can take a simple data set, 1, 2, 3, 4, 5, 6, 7, 8, 9, and explore its various statistics.

Now, if we change the last observation from 9 to 99,

As we can see, the mean and variance have become much larger because one value increased, while the median remains the same. The 95% confidence interval has also become wider. Because of this one data point, we can draw the wrong conclusions about the whole data set, and it can distort our analysis.
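The effect described above can be reproduced with Python's standard library; a minimal sketch comparing the two versions of the data set:

```python
import statistics

original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
modified = [1, 2, 3, 4, 5, 6, 7, 8, 99]  # last observation changed from 9 to 99

for data in (original, modified):
    print(f"mean={statistics.mean(data):.2f}  "
          f"median={statistics.median(data):.2f}  "
          f"variance={statistics.variance(data):.2f}")
# mean=5.00   median=5.00  variance=7.50
# mean=15.00  median=5.00  variance=997.50
```

The mean triples and the variance explodes, while the median is untouched, which is exactly why median-based methods later in the article are more robust.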

Outlier Detection Methods

Outliers can be univariate (involving one variable) or multivariate (involving more than one variable). Outlier detection methods fall into two types: formal tests and informal tests. Formal tests are also known as tests of discordancy, while informal tests are known as outlier labeling methods.

Standard Deviation Method

One of the simplest and most classical ways of screening for outliers is the standard deviation method. We define an interval centered at the mean x̅, with endpoints x̅ − 2SD and x̅ + 2SD. This is called the 2SD method, and observations that fall outside the interval are flagged as outliers. The 3SD method, with endpoints x̅ ± 3SD, is generally used by default.

According to the Chebyshev inequality, if a random variable X has mean μ and variance σ², then for any k > 0,

P(|X − μ| ≥ kσ) ≤ 1/k²

From this inequality, at least a 1 − (1/k²) proportion of the data must fall within k standard deviations of the mean; e.g., about 75%, 89%, and 94% of the data lie within 2, 3, and 4 standard deviations of the mean, respectively. This bound holds for any distribution. Now, let's look at an example data set X to illustrate the method:

The mean of this data set is x̅ = 14.53 and SD = 14.45. The 2SD method gives the interval (−14.37, 43.43), which detects 45 and 55 as outliers. The 3SD interval, (−28.82, 57.88), detects none. The drawback of this method is that the standard deviation depends on the mean, and the mean itself can be affected by outliers: a large outlier inflates the standard deviation and widens the interval.
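As a sketch, the kSD screening can be written with Python's standard library. The sample below is a hypothetical data set (the article's own example values are not fully reproduced here), chosen so that it shows the same behavior: the 2SD interval flags the two large values, while the wider 3SD interval flags nothing.

```python
import statistics

def sd_outliers(data, k=2):
    """Flag values outside the interval mean ± k·SD (the kSD method)."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    lo, hi = mean - k * sd, mean + k * sd
    return [x for x in data if x < lo or x > hi]

# Hypothetical sample with two suspiciously large values
data = [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 45, 55]
print(sd_outliers(data, k=2))  # → [45, 55]
print(sd_outliers(data, k=3))  # → []  (3SD interval is too wide)
```

The same two extreme values that the 2SD rule flags are also what inflate the standard deviation enough for the 3SD rule to miss them.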

Z-score

One of the most common tools in statistics is the Z-score. The Z-score of an observation is the number of standard deviations, σ, it lies above or below the mean, μ:

Z = (X − μ)/σ

In the SD method, we formed the interval (x̅ − 2SD, x̅ + 2SD). In the Z-score method, we take the difference between an observation and the mean and divide it by the standard deviation, which tells us how many standard deviations the observation lies from the mean. If the absolute value of an observation's Z-score exceeds 3, that observation is considered an outlier; looked at closely, this rule is equivalent to the 3SD method. The maximum possible Z-score depends on the sample size and equals (n − 1)/√n for a sample of size n.

For case 1, with all observations included, no observation's Z-score exceeds 3 in absolute value, even though 45 and 55 are outliers. For case 2, after excluding 55, the most extreme value, 45 is detected as an outlier. This is because multiple extreme values inflate the standard deviation and mask one another.
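The masking effect just described can be sketched on the same hypothetical sample used earlier (not the article's original data): with both extreme values present no |Z| exceeds 3, but once the most extreme value is removed, the next one is exposed.

```python
import statistics

def z_scores(data):
    """Z-score of each observation: (x − mean) / sample SD."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [(x - mean) / sd for x in data]

data = [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 45, 55]

# Case 1: both extreme values inflate the SD, so nothing exceeds |Z| = 3
print([x for x, z in zip(data, z_scores(data)) if abs(z) > 3])  # → []

# Case 2: drop the most extreme value and recompute
trimmed = [x for x in data if x != 55]
print([x for x, z in zip(trimmed, z_scores(trimmed)) if abs(z) > 3])  # → [45]
```

This is why the Z-score rule, like the SD method it is equivalent to, is unreliable when several outliers occur together.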

Modified Z-score Method

In the Z-score method, we used two estimators, the sample mean and the standard deviation, both of which can be affected by one or more extreme values. To avoid these issues, the modified Z-score method uses the median and the Median Absolute Deviation (MAD). The modified Z-score Mᵢ is computed as:

Mᵢ = 0.6745 (xᵢ − x̃) / MAD

where x̃ is the sample median.

To find the MAD, let's take a set of numbers: 1, 2, 3, 4, 5. The median is 3. Subtracting the median from each x-value and taking absolute values gives 2, 1, 0, 1, 2. The MAD is the median of these absolute deviations, which is 1.

We consider an observation as an outlier if |Mᵢ|>3.5.

Here a comparison of the Z-score method and the modified Z-score method is presented on the previous data set. Even though the Z-score method failed, the modified Z-score detects 45 and 55 as outliers. This is because the modified Z-score is far less sensitive to extreme values.
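A minimal sketch of the modified Z-score rule on the same hypothetical sample, using the 0.6745 scaling constant and the |Mᵢ| > 3.5 cutoff from above:

```python
import statistics

def modified_z_outliers(data, threshold=3.5):
    """Flag values whose modified Z-score exceeds the threshold."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    return [x for x in data
            if abs(0.6745 * (x - med) / mad) > threshold]

data = [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 45, 55]
print(modified_z_outliers(data))  # → [45, 55]
```

Because the median and MAD barely move when 45 and 55 are present, both extremes are flagged here even though the ordinary Z-score missed them on this same sample.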

MADₑ Method

The MADₑ method is similar to the SD method, but it uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation. The MADₑ method is defined by the interval

Median ± 2MADₑ (or 3MADₑ), where MADₑ = 1.483 × MAD

Scaled by the factor 1.483, the MAD is comparable to the standard deviation of a normal distribution.
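A sketch of the 2MADₑ rule, again on the hypothetical sample rather than the article's own data:

```python
import statistics

def made_outliers(data, k=2):
    """Flag values outside median ± k·MADe, where MADe = 1.483 × MAD."""
    med = statistics.median(data)
    made = 1.483 * statistics.median(abs(x - med) for x in data)
    lo, hi = med - k * made, med + k * made
    return [x for x in data if x < lo or x > hi]

data = [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 45, 55]
print(made_outliers(data, k=2))  # → [45, 55]
print(made_outliers(data, k=3))  # → [45, 55]  (still flagged by the wider interval)
```

Unlike the SD method, widening the interval from 2 to 3 scale units here does not hide the outliers, since the median-based scale is not inflated by them.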

Tukey’s Boxplot Method

Tukey’s boxplot method is a well-known, simple graphical method for displaying the five-number summary and finding outliers in univariate data. It is less sensitive to extreme values than the previous methods because it uses quartiles rather than the sample mean and variance. From the five-number summary we take Q1 and Q3 and compute the interquartile range, IQR = Q3 − Q1.

Using this, we define the interval (Q1 − k·IQR, Q3 + k·IQR); any value outside this interval is flagged as an outlier.

John Tukey proposed k = 1.5 for the inner fences, beyond which a point is a possible outlier, and k = 3.0 for the outer fences, beyond which a point is "far out." There is no formal statistical basis for Tukey's choice of 1.5 for the inner and 3 for the outer fences. For our example data set,

So the defined interval according to Tukey will be

We can see that the lowest value in our data set falling inside this interval is 3 and the largest is 15.00, while 45 and 55 are detected as outliers.
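The fences can be sketched with `statistics.quantiles`; note that quartile conventions differ between libraries (the standard library's default is the "exclusive" method), so exact fence values can vary slightly. The data set is again the hypothetical sample, not the article's:

```python
import statistics

def tukey_outliers(data, k=1.5):
    """Flag values outside (Q1 − k·IQR, Q3 + k·IQR)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # exclusive quartile method
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

data = [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 45, 55]
print(tukey_outliers(data, k=1.5))  # inner fences → [45, 55]
print(tukey_outliers(data, k=3.0))  # outer fences → [45, 55]
```

On this sample both extreme values fall beyond even the outer fences, i.e., they are "far out" in Tukey's terminology.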

Carling Median Rule

The median rule was introduced by Carling. It is essentially a substitute for Tukey’s method in which the quartiles are replaced by the median as the center and a different scale constant is used. The method defines the interval

Median ± k·IQR, with k ≈ 2.3
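A sketch of the median rule on the same hypothetical sample; the scale constant 2.3 is an approximation here (Carling's exact constant depends on the sample size), and the quartiles come from the standard library's default quantile method:

```python
import statistics

def carling_outliers(data, k=2.3):
    """Carling's median rule: flag values outside median ± k·IQR.
    k = 2.3 is an approximate constant, assumed here for illustration."""
    q1, med, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return [x for x in data if x < med - k * iqr or x > med + k * iqr]

data = [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 45, 55]
print(carling_outliers(data))  # → [45, 55]
```

Centering the fences on the median rather than spanning Q1 to Q3 makes the rule symmetric around the most robust location estimate.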

Conclusion

In this article, we have seen different outlier detection methods along with their strengths and weaknesses, and the statistical and mathematical background behind them. Outliers can play a crucial role in data analysis, and these methods can help us detect them, and where appropriate remove them, efficiently.

Originally published at https://gadictos.com on July 23, 2020.
