Outlier detection in R: Grubbs’ Test

Dima Diachkov
Published in Data And Beyond
Dec 28, 2023

Warm welcome!

Today, I’d like to share some thoughts on a crucial yet often overlooked aspect of data analysis, especially when working with model examples and exploratory analysis on dummy data: outliers.

Outliers can significantly skew your results, leading to misleading conclusions. There are many techniques available for identifying outliers, and Grubbs’ Test stands out as one of these methods, much like a knight in shining armor.

From my experience, particularly in fields such as economics, finance, and scientific research, Grubbs’ Test is invaluable in pinpointing outliers, thereby ensuring the robustness of statistical analysis. What’s more, this approach is straightforward and can be employed without advanced coding skills.

Please note that in other fields of application, this method is not as frequently used (but I will definitely publish more articles on different methods in other industries in the future).

I. What is Grubbs’ Test?

Grubbs’ Test, developed by Frank E. Grubbs, is a statistical test designed to detect outliers in a univariate dataset that follows a normal distribution. It serves as a detective tool in your statistical toolkit, adept at identifying the odd data point that seems out of place.

Frank E. Grubbs wrote his dissertation on outlier detection in 1949 (source) and later published his method for detecting outliers in 1950 (source). Over the decades, his method has become one of the most standard tools for outlier detection.

The Essence of the Test

The test identifies an outlier by measuring how far a data point deviates from the mean, relative to the standard deviation of the dataset. It’s based on the assumption that the largest absolute deviation from the mean is a potential outlier.

Please note that this test is designed for data that follow a normal distribution. For real-world applications, if you intend to use this test, please clearly state the normality assumption in your research.

II. The Mechanics of Grubbs’ Test

Grubbs’ test is designed to test for one outlier at a time in a dataset. This test assumes that the dataset comes from a normally distributed population without outliers. Grubbs’ test detects the presence of an outlier by measuring the largest deviation from the sample mean, considering both the maximum and minimum values in the dataset.

Here’s how it works for both the maximum and minimum values.
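In their standard textbook form, the two one-sided Grubbs’ statistics are

$$G_{\max} = \frac{\max_i x_i - \bar{x}}{s}, \qquad G_{\min} = \frac{\bar{x} - \min_i x_i}{s}$$

where x̄ is the sample mean and s is the sample standard deviation. The variant based on the largest absolute deviation, $G = \max_i |x_i - \bar{x}| / s$, combines the two: whichever observation produces the larger G is the candidate outlier.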

The critical values for Grubbs’ test depend on the significance level (e.g., 0.05 for a 95% confidence level) and the number of observations in the dataset. These values can be found in statistical tables or calculated with statistical software such as R, where the “outliers” package includes functions for Grubbs’ test.
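As a quick illustration, here is a minimal sketch of how the one-sided critical value can be computed directly in R using the usual t-distribution approximation (the helper name grubbs_critical and the chosen alpha are my own illustrative choices):

# Approximate one-sided critical value for Grubbs' test at level alpha
# (standard t-distribution based formula; grubbs_critical is an illustrative name)
grubbs_critical <- function(n, alpha = 0.05) {
  t_crit <- qt(alpha / n, df = n - 2, lower.tail = FALSE)
  ((n - 1) / sqrt(n)) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
}

grubbs_critical(n = 101) # e.g. for 100 normal observations plus one suspected outlier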

After identifying an outlier, it should be removed from the dataset, and the test should be run again to check for the presence of more than one outlier. It’s important to do this sequentially because the basic Grubbs’ test is only valid for a single outlier at a time (there are also variants for two outliers, which follow the same logic, as we will see below).
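If you want to automate that sequential check, here is a minimal sketch of the idea (the loop, the safety cap of five iterations, the planted outliers, and the 0.05 threshold are my own illustrative choices):

# Sequentially test and remove the most extreme value while Grubbs' test is significant
library(outliers)

x <- c(rnorm(100, mean = 50, sd = 5), -20, 100) # sample data with two planted outliers
for (i in 1:5) {                                # safety cap on the number of iterations
  test <- grubbs.test(x)
  if (test$p.value >= 0.05) break               # stop when no further outlier is detected
  extreme <- which.max(abs(x - mean(x)))        # the observation the test flagged
  x <- x[-extreme]
}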

III. Implementing Grubbs’ Test in R

R makes conducting Grubbs’ Test a super easy task. Here’s a quick example using the outliers package:

# Install and load the necessary package
install.packages("outliers")
library(outliers)

Now let’s generate some normal data and purposefully attach one outlier to the right side (a value significantly higher than the rest of the data).

# Generate normal data
set.seed(123) # Setting seed for reproducibility
normal_data <- rnorm(100, mean = 50, sd = 5)

# Attach an outlier on the right
right_outlier <- c(normal_data, 100)

# Perform Grubbs' test for the right-side outlier
grubbs.right <- grubbs.test(right_outlier)
print(grubbs.right)

Run it and you will see this output.

Output for the code above

This G statistic measures the extremeness of the most extreme value; when compared against a critical value from the Grubbs’ distribution, it tells us whether such an extreme value falls within the normal variation of the data. The output provides the G-value and the corresponding p-value. A small p-value (typically < 0.05) indicates that the suspected point is statistically different from the rest of the data, suggesting it’s an outlier.
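Because grubbs.test() returns a standard htest object, you can also read these numbers programmatically instead of parsing the printout (the 0.05 threshold below is simply the conventional choice):

# Access the test result programmatically
grubbs.right$statistic # the G statistic reported in the printout
grubbs.right$p.value   # the corresponding p-value
if (grubbs.right$p.value < 0.05) {
  cat("Suspected outlier:", grubbs.right$alternative, "\n")
}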

So now we know that 100 is indeed an outlier. Let’s quickly look at the generated data to confirm the statistical inference with our own eyes, which are always the last resort.

# Make simple plot
plot(right_outlier)
Output for the code above

Next, we’ll attach an outlier on the left side (a value significantly lower than the rest of the data).

# Attach an outlier on the left
left_outlier <- c(normal_data, 0)

# Perform Grubbs' test for the left-side outlier
grubbs.left <- grubbs.test(left_outlier)
print(grubbs.left)
Output of the code above

So let’s have a look at the new data to make sure the test’s conclusion makes sense.

# Make simple plot
plot(left_outlier)
Output for the code above

As I have mentioned before, this test is one-sided by default. However, the grubbs.test function also offers a variant that looks at both tails at once. Hence, finally, we can conduct Grubbs’ test considering both sides, which checks for values that significantly deviate from the rest, either too high or too low.

# Attach outliers on both sides
two_sided_outliers <- c(0, normal_data, 100)

# Perform Grubbs' test for two-sided outliers
grubbs.two.sided <- grubbs.test(two_sided_outliers, type=11)
print(grubbs.two.sided)

In the code above, type = 11 gives us the two-sided variant, which tests for two outliers on opposite tails. Please refer to the documentation for the detailed explanation or ask in the comments.
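For quick reference, these are the variants of the test exposed through the type argument, as I understand them from the package documentation:

# type = 10 - test for one outlier (the default used in the earlier examples)
# type = 11 - test for two outliers on opposite tails (used above)
# type = 20 - test for two outliers in one tail (documented for small samples only)
grubbs.test(two_sided_outliers, type = 10) # would flag only the single most extreme value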

The output of this two-sided test provides clear statistical evidence of TWO outliers in our dataset. With a G statistic of 11.95850 and a practically zero p-value (p < 2.2e-16), the test strongly rejects the null hypothesis, indicating that the values 0 and 100 are indeed outliers.

This result is significant because it not only identifies the presence of outliers but also quantifies their deviation from the expected range of the data. The exceptionally low p-value underscores the confidence we can have in this result, suggesting that these are not values we would anticipate encountering by chance within a normally distributed dataset. This forms a strong case for their exclusion or separate analysis to prevent them from distorting the analysis of the remaining data, which presumably follows a normal distribution.

Please note that when interpreting these results, it’s important to consider the context in which the data were collected. If the outliers represent measurement or recording errors, removing them may be justified. But if they represent rare, yet possible, observations within the study’s domain, further investigation into their cause may provide valuable insights. In either case, recognizing and appropriately handling outliers is crucial for maintaining the integrity and validity of the data analysis process. Maybe I am too boring, but I will keep repeating that.

IV. When to Use Grubbs’ Test

Grubbs’ Test is ideal when:

  1. You have a relatively small dataset.
  2. You suspect a single outlier, two outliers on opposite tails, or two outliers on the same tail.
  3. Your data is approximately normally distributed (or assumed to be as such).

Caveats and Considerations

  • Sample Size: Grubbs’ Test is sensitive to sample size. With very large datasets, even normal variations might appear as outliers.
  • Single Outlier Detection: This test is designed to detect one outlier at a time. For multiple outliers, repeated application of the test is necessary.

Keep in mind that applying Grubbs’ test multiple times to the same dataset (after removing outliers) can increase the chance of a Type I error (incorrectly rejecting the null hypothesis). Therefore, it’s essential to adjust the significance level accordingly or use alternative methods suitable for multiple outlier detection.
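For instance, a simple and conservative way to keep the overall error rate in check when you plan up to k sequential tests is a Bonferroni-style adjustment (the variable names below are illustrative):

# Bonferroni-style adjustment for k planned sequential Grubbs' tests
k <- 3                       # maximum number of outliers you intend to test for
alpha <- 0.05                # desired overall (family-wise) significance level
alpha_per_test <- alpha / k  # compare each individual p-value against this threshold
alpha_per_test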

V. Conclusion

Grubbs’ Test is a powerful tool for identifying outliers, especially in small datasets. It plays a crucial role in ensuring the integrity of your statistical analysis, leading to more accurate and reliable conclusions. It is a simple, quick, and reliable technique when you suspect one or two outliers. Remember, outlier detection is an essential step in data preprocessing.

VI. A few words of appreciation for 2023

Thank you to all of my readers for your continued interest in and support of this blog. When I started this journey in January (first post), my goal was to immerse myself more deeply in the economics and data community. Also, I wanted to develop myself as an author. With your help, I’ve been able to connect with over 300 like-minded individuals.

I am also grateful to the publication channels of DevGenius Team and Dmytro Iakubovskyi who have invited me to share my articles with a broader audience.

A special thanks to the team at Medium.com for their unwavering support of this platform and its authors!

I wish every one of you the very best: health, luck, fun, and joy. Happy New Year!

Author’s AI-generated image

Please clap 👏 and subscribe if you want to support me. Thanks! ❤️‍🔥
