Outlier detection in R: Tukey Method or why you need “box and whiskers”

Dima Diachkov
Published in Data And Beyond
7 min read · Jan 21, 2024

A warm welcome to anyone who reads this article, and thank you for your interest!

Today, I’m super excited to write about something that might just change the way you look at data: the Tukey Method for spotting those pesky outliers. Remember when we talked about Grubbs’ Test recently? Well, Tukey’s approach is a WHOLE different ball game, and it’s perfect for anyone who’s just starting to get their hands dirty with data.

So once again we are going to discuss how one can identify outliers, as we did for Grubbs' test. Today the focus is on Tukey's method, which is widely used in exploratory data analysis and has become a standard approach for identifying outliers in statistical and data analysis practice.

By the way, don't forget to subscribe to my blog (if you haven't done it before), so you can find out more about data analysis and the treatment of outliers in R. I truly appreciate your interest, and it helps me find the motivation to post more often.

Tukey’s method: The IQR logic

John Tukey was a super-smart statistician who came up with a cool trick. Unlike Grubbs’ Test, which zooms in on the most extreme values, Tukey’s method looks at the data’s spread using the Interquartile Range (IQR). It’s less about hunting down a single weird value and more about understanding your data’s overall story.

BUT: I am not sure that Tukey invented this exact rule… At least I have not found a paper where he suggested using the IQR to identify outliers. While Tukey developed a range of techniques and tests throughout his career, the most direct outlier-related tool he proposed is the box plot (also known as a box-and-whisker plot), which he introduced.

The box plot is a graphical representation that depicts the distribution of a data set based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Outliers are identified using the interquartile range (IQR), which is the difference between the third and first quartiles (Q3 - Q1). Tukey defined outliers as points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, providing a clear, quantitative criterion for identifying data points that differ markedly from the bulk of the data. The coefficient of 1.5 is not a constant and can be adjusted for specific applications under the respective assumptions, but market practice, so to speak, is to use 1.5.
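To make the criterion concrete, here is a minimal sketch in base R (the toy vector and variable names are my own) that computes the two fences by hand:

```r
# Toy data: nine typical values plus one suspicious high value
x <- c(48, 49, 50, 50, 51, 52, 53, 54, 55, 90)

q1 <- quantile(x, 0.25)  # first quartile
q3 <- quantile(x, 0.75)  # third quartile
iqr <- q3 - q1           # interquartile range

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Any point outside the fences is flagged as an outlier
x[x < lower_fence | x > upper_fence]  # only 90 falls outside
```

The same logic scales unchanged to real data sets; only the vector `x` changes.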

The Intricacies of IQR

So, let’s dive a bit deeper into the IQR, shall we? The Interquartile Range is like the middleweight champion of your data set. It focuses on the central bulk of your data, cutting out the top 25% and the bottom 25%. What you’re left with is the core, the most typical values of your distribution. This is the range where the real action happens, and the outliers? They’re like the fans who couldn’t get seats in this central zone.

Let’s imagine that we have some distribution of an observed variable. As in the Grubbs’ Test, we will add one right-side outlier on our own.

# Make sure you have ggplot2 installed
# install.packages("ggplot2")
library(ggplot2)

# Generate normal data
set.seed(123) # Setting seed for reproducibility
normal_data <- rnorm(100, mean = 50, sd = 5)

# Attach an outlier on the right
right_outlier <- c(normal_data, 100)

# Plot simple chart
plot(right_outlier)
Output for the code above

The outlier here is pretty obvious.

Now, you calculate the first and third quartiles (Q1 and Q3). Subtract Q1 from Q3, and voila, you have the IQR. With it, you have all you need to test the data for outliers.


# computing IQR and outliers
Q1 <- quantile(right_outlier, 0.25)
Q3 <- quantile(right_outlier, 0.75)
IQR <- Q3 - Q1
multiplier <- 1.5
lower_bound <- Q1 - (IQR * multiplier)
upper_bound <- Q3 + (IQR * multiplier)

# Create a dataframe for ggplot
data_frame <- data.frame(value = right_outlier)

The lower and upper bounds, as you can see, are based on the Q1 and Q3 borders, from which we establish the range of normal data (within IQR * multiplier, 1.5 in this case); anything outside this range is considered an outlier.

Let's have a look at the boxplot we created.

# Create a boxplot
ggplot(data_frame, aes(x = factor(1), y = value)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, color = "blue") +
  labs(title = "Boxplot with Outliers", x = "Data Set", y = "Values") +
  theme_minimal()
Output for the code above

Okay, so the ends of the whiskers are our borders, within which we are confident about the data. Everything outside them is suspicious. But we do see more than one outlier, right? The point at the bottom is suspicious as well, even though we did not plant it there. It is just random. Let's highlight the outliers to be sure.

# Create a scatter plot to show outliers
ggplot(data_frame, aes(x = seq_along(value), y = value)) +
  geom_point(aes(color = (value < lower_bound | value > upper_bound)), size = 3) +
  scale_color_manual(values = c("black", "red")) +
  geom_hline(yintercept = lower_bound, linetype = "dashed", color = "red") +
  geom_hline(yintercept = upper_bound, linetype = "dashed", color = "red") +
  labs(title = "Scatter Plot with Outliers Highlighted", x = "Index", y = "Values") +
  theme_minimal()
Output for the code above

So indeed: just by coincidence, we have identified more than one outlier. What luck :-) Who are they?

# Print the outliers
cat("Identified outliers are:", right_outlier[right_outlier < lower_bound | right_outlier > upper_bound], "\n")
Output for the code above
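As a sanity check, base R also ships `boxplot.stats()`, which applies the same 1.5 * IQR whisker rule (its `coef` argument defaults to 1.5). Note that it computes quartiles via Tukey's hinges rather than `quantile()`, so the two approaches can occasionally disagree at the margin, but on this data it should flag the same points:

```r
set.seed(123)  # same data as above
right_outlier <- c(rnorm(100, mean = 50, sd = 5), 100)

# $out holds the points that lie beyond the whiskers
boxplot.stats(right_outlier)$out
```

This is a handy one-liner when you want the flagged values without building the bounds yourself.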

Can we do it with another dataset? Sure we can. Let's add more fictional outliers and test it: six straightforward or not-so-straightforward outliers.

# Attach completely different outliers 
new_outliers <- c(normal_data, 100, 95, 81, 10, -1, -25)

# Plot simple chart
plot(new_outliers)
Output for the code above

Now it is time to look at them through the IQR prism and Tukey’s method.

# computing IQR and outliers
Q1 <- quantile(new_outliers, 0.25)
Q3 <- quantile(new_outliers, 0.75)
IQR <- Q3 - Q1
multiplier <- 1.5
lower_bound <- Q1 - (IQR * multiplier)
upper_bound <- Q3 + (IQR * multiplier)

# Create a dataframe for ggplot
data_frame <- data.frame(value = new_outliers)
# Create a boxplot
ggplot(data_frame, aes(x = factor(1), y = value)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, color = "blue") +
  labs(title = "Boxplot with MORE Outliers", x = "Data Set", y = "Values") +
  theme_minimal()
Output for the code above

So all of them are spotted! All six lie outside the whiskers. Let's put it in another format.

# Create a scatter plot to show more outliers
ggplot(data_frame, aes(x = seq_along(value), y = value)) +
  geom_point(aes(color = (value < lower_bound | value > upper_bound)), size = 3) +
  scale_color_manual(values = c("black", "red")) +
  geom_hline(yintercept = lower_bound, linetype = "dashed", color = "red") +
  geom_hline(yintercept = upper_bound, linetype = "dashed", color = "red") +
  labs(title = "Scatter Plot with Outliers Highlighted", x = "Index", y = "Values") +
  theme_minimal()
Output for the code above

So all six are tracked down. The robustness of this method is often impressive, but of course real-world distributions are rarely this tidy. However, a simple log transform can often make the method work well on skewed real-world data, and rescaling may also help.
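As an illustration of that point (my own sketch, assuming roughly log-normal data), the raw-scale fences flag a large chunk of the long right tail, while the same rule applied on the log scale flags far fewer points:

```r
set.seed(42)
skewed <- rlnorm(500, meanlog = 0, sdlog = 1)  # right-skewed data

# Helper: logical mask of Tukey-rule outliers
tukey_flags <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

sum(tukey_flags(skewed))       # many tail points flagged on the raw scale
sum(tukey_flags(log(skewed)))  # far fewer after a log transform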

Limitations

So… Tukey's rule is pretty straightforward. You take the IQR and multiply it by a factor, typically 1.5 (for mild outliers) or, say, 3 (for extreme outliers). These factors are like the bouncers of your data club; they decide who's too far out to be let in. You subtract this product from Q1 to get the lower bound and add it to Q3 to get the upper bound. Anything outside these bounds is considered an outlier.
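A quick sketch of both factors on the six-outlier data from earlier (the helper function is my own wrapper around the same rule):

```r
set.seed(123)
new_outliers <- c(rnorm(100, mean = 50, sd = 5), 100, 95, 81, 10, -1, -25)

# Helper: logical mask of Tukey-rule outliers for a given factor k
tukey_flags <- function(x, k) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

new_outliers[tukey_flags(new_outliers, 1.5)]  # mild fence: flags the planted points
new_outliers[tukey_flags(new_outliers, 3)]    # extreme fence: only the farthest points remain
```

The factor 3 widens the fences considerably, so borderline points like 81 or 10 may pass, while truly extreme ones like 100 and -25 still get flagged.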

But here’s the catch: not every point outside these bounds is an unnecessary distraction. Sometimes, they’re the most interesting part of your data, like that one friend who always has the most unbelievable stories. They’re outliers, sure, but they make life (and your data) far more interesting.

Conclusion

When you apply the Tukey Method in R, it's like giving your data a thorough health check. You're not just looking for symptoms (outliers); you're also getting a feel for the overall well-being (distribution). It's a powerful method because it's robust: the quartiles don't get thrown off by the very outliers it's trying to detect. Good luck with it, and feel free to comment if you have used it before and have something to share.

Please clap 👏 and subscribe if you want to support me. Thanks!❤️‍🔥

