Robust Location and Scale Estimator in Outlier Detection

Vahid Naghshin
Published in Analytics Vidhya · 8 min read · May 13, 2020

One of the main goals of statistical analysis is to find the location and scale parameters for a statistical distribution. The location parameter specifies the typical value, i.e., the central value of the distribution while the scale parameter is used to measure the dispersion or variation of the distribution.


For location parameter, three common definitions can be used:

1. Mean value: the arithmetic mean of the data samples, usually referred to as the average. The mean is easily affected by extreme values in the tails.

2. Median value: the middle data point, such that half of the data is smaller and half is larger than this point. In contrast to the mean, the median can be an actual data point from the sample. When the number of observations is odd, the median is the data point that sits in the middle once the observations are sorted in ascending order; when it is even, the median is the average of the two middle data points. One advantage of the median is that it is less affected by extreme values than the mean, which makes it a strong candidate for robust location estimation when there are many outliers.

3. Mode value: the value that occurs with the highest probability. It is usually obtained from a histogram of the observation samples.

Depending on the shape of the distribution, the mean, median, or mode can represent the location parameter. If the underlying distribution is symmetric around its central value and light-tailed, such as the normal distribution, the mean is a good candidate for the location; for a normal distribution the mean, median, and mode are essentially the same. For a skewed distribution such as the exponential or log-normal, however, the mean differs from the median. In a skewed distribution it is not always obvious which measure describes the location best, so it is better to report all of them. For distributions that are symmetric but heavy-tailed, the median is a better location estimator than the mean. An example of a heavy-tailed distribution is the Cauchy distribution, whose mean is not defined: the sample mean does not converge to a single value as the sample size increases, because it is heavily affected by extreme values in the tails. In this case the median, being a rank-based estimator, is a good location estimator. In the context of robust statistics, various alternatives have been proposed to combat non-normality of the data, since the mean is a good representative only when the underlying distribution is close to normal. Two common robust alternatives to the mean are:

Mid-mean: computes the mean using the data between the 25th and 75th percentiles.

Trimmed mean: computes the mean using the data between the 5th and 95th percentiles.
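Both of these can be computed with `scipy.stats.trim_mean`. A quick sketch (the data and the outlier values here are illustrative assumptions, not from the article):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# 100 roughly normal samples around 10, plus three gross outliers
x = np.concatenate([rng.normal(10, 2, 100), [500.0, 650.0, 800.0]])

mean = x.mean()                       # pulled far away from 10 by the outliers
midmean = stats.trim_mean(x, 0.25)    # mean of data between the 25th and 75th percentiles
trimmed = stats.trim_mean(x, 0.05)    # mean of data between the 5th and 95th percentiles

print(round(mean, 1), round(midmean, 1), round(trimmed, 1))
```

Trimming 5% from each end is already enough here to discard all three outliers, so both robust versions land near the true location of 10 while the plain mean does not.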

In measuring the scale parameter of a distribution, two key components should be taken into account:

  1. the dispersion around the central value, i.e., the location parameter;
  2. how dispersed the tails are.

Several numerical measures of dispersion have been proposed, each emphasising one of the components mentioned above. Some common estimators of the scale parameter are:

  1. Variance: the arithmetic average of the squared distances from the mean.
  2. Standard deviation (SD): the square root of the variance. It has the same unit as the data samples.
  3. Range: the difference between the minimum and maximum of the data samples. It does not reflect dispersion around a central value such as the mean or median.
  4. Average absolute deviation: the arithmetic average of the absolute differences between the data samples and the mean. Since its dependence on distance is linear, it is less affected by extreme values than the variance or SD.
  5. Median absolute deviation (MAD): the median of the absolute differences between the data samples and the median.
  6. Interquartile range (IQR): the difference between the 75th and 25th percentiles. It measures variation around the centre.
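All six measures are available in NumPy/SciPy. A minimal sketch, using a small made-up dataset with mean 5 and median 4.5:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

variance = x.var(ddof=1)                    # sample variance
sd = x.std(ddof=1)                          # standard deviation, same unit as x
data_range = x.max() - x.min()              # max minus min
aad = np.mean(np.abs(x - x.mean()))         # average absolute deviation from the mean
mad = stats.median_abs_deviation(x)         # median absolute deviation (unscaled)
iqr = stats.iqr(x)                          # 75th minus 25th percentile

print(variance, sd, data_range, aad, mad, iqr)
```

Note how the squared-distance measures (variance, SD) weight the points 7 and 9 much more heavily than the linear and rank-based measures do.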

Similarly, depending on the underlying distribution, you can choose the scale estimator that best reflects the dispersion around the location parameter and the spread in the tails. When the underlying distribution is symmetric and light-tailed, such as the normal, the SD and variance are good scale estimators. For a symmetric but heavy-tailed distribution such as the Cauchy, the SD and variance are poor scale estimators because they do not converge to a single value as the data size increases; in other words, their accuracy (like that of the mean) does not improve with more data. For heavy-tailed distributions, median-based location and scale estimators are robust to the extremes in the tails.
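This non-convergence is easy to see empirically. In the sketch below (the seed and sample sizes are arbitrary choices), the running median of Cauchy samples settles near the true location of 0 while the running mean keeps wandering:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_cauchy(1_000_000)   # symmetric, heavy-tailed, true location 0

for n in (1_000, 100_000, 1_000_000):
    # The median tightens around 0 as n grows; the mean does not,
    # because the Cauchy distribution has no defined mean.
    print(n, round(float(np.median(x[:n])), 4), round(float(x[:n].mean()), 4))
```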

The relationship between standard deviation and MAD

A common approach to outlier detection in univariate data is to mark as an outlier any point that lies more than three SDs away from the sample mean; the standard deviation is a well-adopted measure of distance for identifying outliers. However, the SD and the sample mean are themselves very susceptible to outliers, which makes them problematic for outlier detection: because deviations are squared, extreme points have a far larger effect on the SD than points close to the mean. We therefore need a measurement that is robust against outliers. One candidate is the median absolute deviation from the median, commonly shortened to the median absolute deviation (MAD). The MAD is defined as the median of the absolute deviations from the median of the observation set.

The application of location and scale parameters

Location and scale parameters are a quintessential part of exploratory data analysis. They are used in many applications to reveal patterns that are, in most cases, contaminated with noise. One such application is outlier detection. Outliers are data samples that deviate markedly from the majority of the data; some result from erroneous measurements, others from genuinely different distributions. However you decide to treat outliers, one thing you should ensure is that the parameter of interest is not significantly affected by them, and using robust estimators can subdue their impact. There are many techniques for identifying outliers in observed samples, and their efficacy depends on the nature of the underlying distribution.

One well-accepted method for identifying outliers relies on the notion of the normal distribution. In other words, assuming the samples are drawn from a symmetric, light-tailed distribution such as the normal, outliers are identified by determining a rejection region. For the normal distribution, the probability that an observation falls more than three SDs from the location value is only about 0.3%. See the normal distribution below.
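The tail probabilities behind this rule can be checked directly with SciPy (a quick sketch):

```python
from scipy import stats

# Two-sided probability that a normal observation falls more than
# k standard deviations away from the mean
for k in (1, 2, 3):
    print(k, 2 * stats.norm.sf(k))
```

At three SDs the two-sided tail probability is roughly 0.0027, which is why observations beyond that range are treated as suspicious.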

Normal distribution

Thus, any observed value outside the three-SD range is considered a potential outlier. The acceptance range for valid data is

[M − 3 × SD, M + 3 × SD]

where M is the mean value. However, the mean and SD used to identify and remove outliers are themselves affected by the outliers! This is why we need robust location and scale estimators to obtain a truly robust rejection range. A robust estimator becomes even more important when the underlying distribution is skewed and the number of samples is small [2]. In a highly skewed distribution, a large share of the observations lies densely on one side of the median while the rest spread sparsely on the other side.

Application of MAD: absolute deviation from the median was (re-)discovered and popularised by Hampel (1974), who attributes the idea to Carl Friedrich Gauss (1777–1855) [1]. The median, like the mean, is a measure of central tendency, but it is very robust to outliers. This is rooted in the fact that the breakdown point of the median is 0.5, while that of the mean is 0. The breakdown point is the smallest fraction of observations that must be contaminated before the estimator can give an arbitrarily wrong result. For example, if one observation is recorded as arbitrarily large, i.e., infinity, the mean becomes infinity; the breakdown point of the mean is therefore zero. For the median, more than 50% of the observations must be contaminated to produce a false result. Furthermore, the MAD is immune to sample size. A caveat of the MAD method is that it takes a symmetric view of the underlying distribution, assuming equal dispersion on both sides, i.e., equal positive and negative deviations from the location parameter. Fortunately, to tackle the non-normality problem, we can apply a transformation such as Box-Cox to bring the distribution closer to normal, and then apply the MAD approach to remove outliers.
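As a sketch of that last step, SciPy's `boxcox` can symmetrise right-skewed data before a symmetric rule like the MAD is applied (the log-normal sample here is an illustrative assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=0.7, size=300)   # strictly positive, right-skewed

# Box-Cox estimates the power transform lambda that makes the data
# most normal-like; the input must be strictly positive
transformed, lam = stats.boxcox(skewed)

print(round(stats.skew(skewed), 2), round(stats.skew(transformed), 2))
```

After the transform the skewness drops close to zero, so the symmetric-dispersion assumption behind the MAD becomes reasonable.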

Using the MAD, we can still apply the three-SD formula to identify outliers. Since outliers adversely affect the estimation of the SD, we can instead estimate the SD from the MAD via the relationship

SD ≈ MAD / Q(0.75)

where Q(0.75) is the 0.75 quantile of the underlying standard distribution. In the case of normality, 1/Q(0.75) ≈ 1.4826, so SD ≈ 1.4826 × MAD. We can then use the three-SD rule to rule out all observations farther than three estimated SDs from the median of the observed data. That is, we keep all observations that lie in the range

[median − 3 × 1.4826 × MAD, median + 3 × 1.4826 × MAD].
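Putting it together, here is a minimal sketch of this median/MAD rejection rule (the function name and sample data are my own assumptions):

```python
import numpy as np
from scipy import stats

def mad_outlier_mask(x, n_sd=3.0):
    """Return True for points farther than n_sd robust SDs from the median."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    # scale='normal' multiplies the raw MAD by ~1.4826, making it a
    # consistent estimate of the SD under normality
    sd_hat = stats.median_abs_deviation(x, scale='normal')
    return np.abs(x - med) > n_sd * sd_hat

data = np.array([9.1, 10.0, 9.7, 10.4, 10.2, 9.8, 10.1, 55.0])
print(data[mad_outlier_mask(data)])   # only the gross outlier, 55.0, is flagged
```

Because both the median and the MAD have a breakdown point of 0.5, the gross outlier barely moves the rejection range, unlike a mean/SD-based rule.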

When removing outliers, it is informative to report how many points were removed and how large they were. There is one elegant point here: this procedure captures points farther than three SDs from the median at both extremes, which makes sense when the underlying distribution is symmetric (such as the normal). For a highly skewed distribution, however, it can fall flat, since the observations are dense on one side of the median and sparse on the other.

Conclusion

In this article, robust location and scale estimators were discussed in the presence of outliers. Depending on the type of underlying distribution, an appropriate estimator should be adopted. Based on these robust estimators, a simple robust outlier-detection procedure was presented.

References

[1] Leys, Christophe, et al. “Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median.” Journal of Experimental Social Psychology 49.4 (2013): 764–766.

[2] Cousineau, D., & Chartier, S. (2010). “Outliers detection and treatment: A review”. International Journal of Psychological Research, 3(1), 58–67.
