Statistical Methods for Identifying Outliers (For Univariate Data) (Part-I)

Chetan Borse
Published in Analytics Vidhya
4 min read · Jun 29, 2020

Definition of an Outlier:

An outlier is an observation that deviates markedly from, or lies far away from, the other observations. Outlier detection is an important step in data analysis. The purpose of this study is to investigate outliers in small samples or non-normal data sets, where their characteristics make them problematic. We therefore bring the data closer to normality by deleting the outliers.

Statistical Techniques and tools

1.1 Grubbs’ Test
1.2 Inter-Quartile Range (IQR)
1.3 Dixon’s Test
1.4 Boxplot

1.1 Grubbs’ Test :
Grubbs’ test (Grubbs, 1969) detects a single outlier in a univariate data set. It assumes that the data follow an approximately normal distribution; here it is applied to samples of fewer than 30 observations. Grubbs’ test is defined by the following two hypotheses.
Ho : There is no outlier in the data set.
H1 : There is exactly one outlier in the data set.
There are several statistics for Grubbs’ test; for an ordered data sample they test whether the minimum or the maximum value is an outlier. The two-sided statistic is

$$G = \frac{\max_i |x_i - \bar{X}|}{S}$$

where x_i is the i-th element of the data set, and \bar{X} and S denote the sample mean and standard deviation respectively. The test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation.
The calculated value of G is compared with the critical value for Grubbs’ test. When the calculated value is higher than the critical value at the chosen level of statistical significance, the corresponding observation is regarded as an outlier.
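Criteria:
Reject Ho at the chosen level of significance if G is greater than the tabulated critical value of Grubbs’ test for the sample size n.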
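The Grubbs statistic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `grubbs_statistic` and the sample values are hypothetical (they are not the break strength data), and the critical value is not computed here, so it still has to be looked up in a Grubbs table for the given n and significance level.

```python
import math
import statistics


def grubbs_statistic(data):
    """Return (G, index) where G = max|x_i - mean| / s and index is the
    position of the most extreme observation (hypothetical helper)."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    deviations = [abs(x - mean) for x in data]
    idx = max(range(len(data)), key=lambda i: deviations[i])
    return deviations[idx] / sd, idx


# Hypothetical sample, NOT the article's data
sample = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 7.0]
g, idx = grubbs_statistic(sample)
print(f"G = {g:.3f} for observation {idx + 1} (value {sample[idx]})")

# G can never exceed (n - 1) / sqrt(n), which is a handy sanity check
g_upper_bound = (len(sample) - 1) / math.sqrt(len(sample))
```

The observation is flagged only if `g` is larger than the tabulated critical value; the sketch deliberately leaves that comparison to the reader rather than hard-coding a table entry.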

1.2 Inter-Quartile Range (IQR)
This is a quartile-based method used to detect outliers in univariate data sets. It does not require statistical tables. The following steps are used in this method.
i) First, we find Q1 and Q3, i.e. the first and third quartiles.
ii) Then we find their difference, i.e. H = Q3 - Q1
Criteria:
A value lower than Q1 - 1.5H or higher than Q3 + 1.5H is considered to be a mild outlier. A value lower than Q1 - 3H or higher than Q3 + 3H is considered to be an extreme outlier.
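As a rough sketch of the IQR rule, the following Python snippet computes the quartiles, the mild fences (1.5H) and the extreme fences (3H), with the upper fences measured from Q3. The function name `iqr_outliers` and the sample values are hypothetical, and the choice of `method="inclusive"` is an assumption: quartile conventions differ between tools, so the fences can shift slightly with another convention.

```python
import statistics


def iqr_outliers(data):
    """Classify observations as mild or extreme outliers using the
    1.5*H and 3*H fences, where H = Q3 - Q1 (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    h = q3 - q1
    mild_low, mild_high = q1 - 1.5 * h, q3 + 1.5 * h
    extreme_low, extreme_high = q1 - 3 * h, q3 + 3 * h

    mild, extreme = [], []
    for i, x in enumerate(data):
        if x < extreme_low or x > extreme_high:
            extreme.append((i, x))   # beyond the 3*H fences
        elif x < mild_low or x > mild_high:
            mild.append((i, x))      # between the 1.5*H and 3*H fences
    return mild, extreme


# Hypothetical sample, NOT the article's data
sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 27, 40]
mild, extreme = iqr_outliers(sample)
print("mild:", mild)
print("extreme:", extreme)
```

Each result is a list of `(index, value)` pairs, with zero-based indices, so the offending observation can be traced back to its position in the data.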

1.3 Dixon’s Test :
This test was developed by W. Dixon (1950) and is appropriate for small sample sizes. It is limited to n ≤ 30.
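Dixon defined the test statistic for detecting an outlier as a ratio of differences between ordered observations. In its simplest form, for ordered data x_(1) ≤ x_(2) ≤ ... ≤ x_(n) with the smallest value under suspicion, it is

$$Q = \frac{x_{(2)} - x_{(1)}}{x_{(n)} - x_{(1)}}$$

and the suspect value is declared an outlier when Q exceeds the tabulated critical value.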
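A minimal Python sketch of the simplest Dixon ratio (gap divided by range, often written Q or r10) is given below. The function name `dixon_q` and the sample values are hypothetical. Dixon's test also has other ratio variants for larger samples, which this sketch does not cover, and the critical values must still be taken from a Dixon table.

```python
def dixon_q(data):
    """Return (q_low, q_high): the gap/range ratios for the smallest and
    the largest observation (simplest Dixon ratio; hypothetical helper)."""
    xs = sorted(data)
    if len(xs) < 3:
        raise ValueError("Dixon's test needs at least 3 observations")
    spread = xs[-1] - xs[0]
    if spread == 0:
        raise ValueError("all observations are identical")
    q_low = (xs[1] - xs[0]) / spread      # is the minimum suspicious?
    q_high = (xs[-1] - xs[-2]) / spread   # is the maximum suspicious?
    return q_low, q_high


# Hypothetical sample, NOT the article's data
sample = [10.0, 10.2, 9.9, 10.1, 7.0]
q_low, q_high = dixon_q(sample)
print(f"Q (lowest value)  = {q_low:.3f}")
print(f"Q (highest value) = {q_high:.3f}")
```

A ratio close to 1 means the suspect value sits far from the rest of the sample relative to the overall range; whether it is significant depends on the tabulated critical value for the given n.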

1.4 Boxplot :
A boxplot is a graphical tool for detecting outliers. The boxplot function accepts several arguments that control how outliers are detected. It draws a box with whiskers for the given observations; observations that lie beyond the whiskers are plotted as individual points and are treated as outliers.

Break Strength Data:
This data set contains 14 observations of break strength.

Break Strength Data

Here we first check the normality of the given data, so that we can apply the appropriate tests to identify outliers, if any are present.

Ho : Distribution of data is normal vs
H1 : Distribution of data is not normal

Conclusion:
Here the p-value > alpha (0.05), so we fail to reject Ho at the 5% level of significance (l.o.s.). We therefore conclude that the given data are consistent with a normal distribution.

Since the given data are normal, we may use Grubbs’ and Dixon’s tests to identify any outliers present in the data.

Conclusion:
Here we use the type=10 option in grubbs.test, which tests whether a single extreme value of the data set is an outlier. The lowest observation, 12.38 (observation number 10), is identified as an outlier.

Conclusion:
Here 12.38 (observation no. 10) is again identified as an outlier.

IQR

Conclusion:
Here the 10th observation is less than Q1 - 1.5 IQR. Hence 12.38 (observation no. 10) is an outlier.

Box-Plot:

Fig: Boxplot of break strength data

Conclusion:
The boxplot also shows one outlier in the data, namely 12.38 (observation no. 10).

Conclusion:
Here we directly find that 12.38 (observation no. 10) is an outlier.

Summary of the above tests:
Every method applied above flags the same value, 12.38 (observation no. 10), as an outlier.
