Outlier Detection in R: Hampel Filter for Time Series

Published in Data And Beyond, Feb 2, 2024

… or the method you have probably never heard of. Maybe I am wrong, but this method is widely used and, at the same time, highly underrated. So we are going to close this gap today.

In the field of outlier detection, there are still many tips and tricks worth knowing. Just like we dissected Grubbs’ Test and the Tukey Method, it’s time to see how the Hampel Filter can help us clean our data.

For those who have been following my journey through the intricacies of statistical analysis, make sure you’re subscribed to my blog. It’s your support that fuels my deep dives into the exciting world of data.

The Hampel Filter: A Time Series Sanitizer

Time series data can be tricky; it’s a sequence of data points indexed in time order, often fraught with noise, seasonality, and yes — those pesky outliers. Whether you’re tracking stock prices, monitoring weather patterns, or analyzing web traffic, outliers can throw a wrench in your analysis. And here comes the Hampel Filter, your temporal data’s sanitizer.

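The Hampel Filter slides a window along the series and compares each point with the median of its neighbors. A point is flagged as an outlier when it deviates from that median by more than a set number of scaled median absolute deviations (MAD), a measure of spread that is far less affected by extreme values than the standard deviation. That makes it particularly useful when the data is skewed or contains several unusual points that would distort the mean and standard deviation.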
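To make the robustness claim tangible, here is a minimal sketch in base R. The series and the three spikes are illustrative assumptions chosen to resemble the example used later in this post; the variable names are mine.

```r
# Illustrative data (an assumption for this sketch): 120 noisy points around 50.
set.seed(123)
clean <- rnorm(120, mean = 50, sd = 5)

# Same data with three large spikes added at arbitrary positions.
spiked <- clean
spiked[c(30, 60, 90)] <- spiked[c(30, 60, 90)] + c(50, 75, 100)

# The standard deviation is inflated by the spikes ...
sd_clean  <- sd(clean)
sd_spiked <- sd(spiked)

# ... while the MAD hardly reacts. stats::mad() multiplies the raw MAD by
# 1.4826 by default, so it is comparable to the sd for normal data.
mad_clean  <- mad(clean)
mad_spiked <- mad(spiked)

print(round(c(sd_clean = sd_clean, sd_spiked = sd_spiked), 2))
print(round(c(mad_clean = mad_clean, mad_spiked = mad_spiked), 2))
```

Three bad points out of 120 are enough to inflate the standard deviation, while the MAD stays close to its original value. This is why a MAD-based threshold keeps working even when the outliers are already in the data.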

Application

Step 1: Loading libraries and raw data

Here, we’re gearing up our R environment with the libraries we need: pracma for the Hampel function, ggplot2 for plotting, and forecast for any time series magic we might need later. The set.seed function ensures that our results are reproducible. We create a synthetic time series, time_series_data, simulating monthly data over ten years.

library(pracma)
library(ggplot2)
library(forecast)

set.seed(123)
time_series_data <- ts(rnorm(120, mean = 50, sd = 5), frequency = 12)

plot(time_series_data)

This is what our data looks like. It currently has no outliers, so the next step is to add them.

Step 2: Introducing Chaos

What’s a good detective story without a twist? Here we introduce our outliers, the spikes, into the time series. These represent sudden, unexpected events that could skew our analysis. In the chart below, there are three jumps for us to tackle in the outlier identification exercise.

spikes <- c(50, 75, 100) 
time_series_data[c(30, 60, 90)] <- time_series_data[c(30, 60, 90)] + spikes

plot(time_series_data)

Step 3: Hampel Filter

The hampel function is our detective. We apply it to our data, and it returns a list whose y element holds the cleaned series (flagged points are replaced by the median of their window), which we then wrap back into a time series object.

filtered_series <- ts(hampel(time_series_data, k=6)$y, frequency = 12)
plot(filtered_series)

Here you go: the cleaned time series is ready.
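If you want to know which points were touched, not just see the cleaned line, the list returned by hampel also has an ind element with the positions of the flagged observations. Here is a small sketch that rebuilds the series from the steps above so it runs on its own; the names result and flagged are mine.

```r
library(pracma)

# Rebuild the series from Steps 1 and 2 so this snippet is self-contained.
set.seed(123)
time_series_data <- ts(rnorm(120, mean = 50, sd = 5), frequency = 12)
time_series_data[c(30, 60, 90)] <- time_series_data[c(30, 60, 90)] + c(50, 75, 100)

# hampel() returns a list: $y is the corrected series, $ind the flagged positions.
result <- hampel(time_series_data, k = 6)
flagged <- result$ind
print(flagged)

# Side-by-side view of what was replaced and by which value.
print(data.frame(index    = flagged,
                 original = as.numeric(time_series_data)[flagged],
                 replaced = as.numeric(result$y)[flagged]))
```

Checking ind is a quick sanity test: positions 30, 60 and 90 should be in the list, and anything else that shows up deserves a second look before you trust the cleaned series.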

But what is this k argument? In time series analysis, when using the Hampel filter, the k parameter determines the number of elements on either side of the current data point to consider when calculating the median and median absolute deviation (MAD). In other words, k is the half-width of the window: the filter examines 2k + 1 points at a time.
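To see what the window does, here is a hand-rolled version of the rule in base R. Treat it as a sketch of my understanding of the method, not as pracma's source code: the function name hampel_sketch, the toy vector, and the choice to leave the first and last k points untouched are assumptions of this example.

```r
# Minimal Hampel-style filter (illustrative sketch, not the pracma source).
# x:  numeric vector, k: half-width of the window, t0: threshold in scaled MADs.
hampel_sketch <- function(x, k, t0 = 3) {
  x <- as.numeric(x)
  n <- length(x)
  y <- x
  flagged <- integer(0)
  if (n < 2 * k + 1) return(list(y = y, ind = flagged))

  # Assumption of this sketch: the first and last k points have no full
  # window, so they are left as they are.
  for (i in (k + 1):(n - k)) {
    window <- x[(i - k):(i + k)]   # k points before, the point itself, k after
    centre <- median(window)
    spread <- mad(window)          # 1.4826 * median(|window - centre|)
    if (abs(x[i] - centre) > t0 * spread) {
      y[i] <- centre               # replace the outlier with the local median
      flagged <- c(flagged, i)
    }
  }
  list(y = y, ind = flagged)
}

# Toy example: a flat-ish series with one obvious spike at position 6.
x <- c(10, 11, 9, 10, 12, 60, 11, 10, 9, 11, 10)
res <- hampel_sketch(x, k = 3)
print(res$ind)
print(res$y)
```

In the toy example only the value 60 is flagged, and it is replaced by the median of its seven-point window. The neighbors of the spike survive because one extreme value in a window barely moves the median or the MAD.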

But is it that important? The choice of k is crucial because it defines the locality of the filtering process. A small k value means the filter is very sensitive to local fluctuations, which can be good for catching very short-term anomalies but may also lead to a higher rate of false positives, especially in noisy data. A larger k value makes the filter less prone to react to small, local changes in the data, so it is better suited to identifying larger, more pronounced outliers.

Choosing the right value for k often depends on the specific characteristics of your time series, such as:

Frequency of Data Points: If your data is sampled at high frequencies (e.g., minute-by-minute), a larger k might be necessary to avoid reacting to what is normal short-term volatility.
Expected Duration of Outliers: If you expect outliers to occur over longer periods, a larger k can help ensure these sustained shifts are recognized as outliers.
Seasonality and Trends: If your data exhibits strong seasonal patterns or trends, you might need to adjust k to avoid these patterns being mistaken for outliers.

In the absence of domain-specific knowledge, a common approach is to start with a value of k that reflects the typical cycle of the data. For example, with monthly data that has a yearly cycle, you might start with k = 6, so that the window spans roughly one year: each point is compared with the 6 observations before it and the 6 after it (13 points in total).
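A quick way to get a feel for k is to run the filter with a few values and count how many points each one flags. The sketch below rebuilds the series from the earlier steps; the candidate values 3, 6 and 12 and the name summary_k are my own choices for illustration.

```r
library(pracma)

# Rebuild the series from Steps 1 and 2 so this snippet is self-contained.
set.seed(123)
time_series_data <- ts(rnorm(120, mean = 50, sd = 5), frequency = 12)
time_series_data[c(30, 60, 90)] <- time_series_data[c(30, 60, 90)] + c(50, 75, 100)

# Candidate half-widths (illustrative choices): quarter, half-year, full year.
k_values <- c(3, 6, 12)

summary_k <- data.frame(
  k           = k_values,
  window_size = 2 * k_values + 1,   # points examined around each observation
  n_flagged   = sapply(k_values, function(k) {
    length(hampel(time_series_data, k = k)$ind)
  })
)
print(summary_k)
```

If the count changes a lot between neighboring values of k, the extra flags are probably ordinary noise rather than real anomalies. The hampel function also has a t0 argument (default 3) that sets how many scaled MADs a point must deviate before it is flagged, so raising t0 is another way to make the filter more conservative.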

Step 4: Show the result of Hampel filtration

Evidence is key. We convert our original and filtered time series into data frames to prepare them for visualization. Each frame has a ‘Type’ to distinguish between the original and filtered data.

time_series_df <- data.frame(Time = as.numeric(time(time_series_data)), 
                             Value = as.numeric(time_series_data),
                             Type = "Original")
filtered_series_df <- data.frame(Time = as.numeric(time(filtered_series)), 
                                 Value = as.numeric(filtered_series),
                                 Type = "Filtered")

It’s time for the before-and-after comparison. Using ggplot2, we plot both the original and filtered data. The result is a stark visual comparison that shows just how effectively the Hampel Filter has cleaned up our time series.

combined_df <- rbind(time_series_df, filtered_series_df)


ggplot(combined_df, aes(x = Time, y = Value, color = Type)) +
  geom_line() +
  labs(title = "Time Series Data: Original vs Hampel Filtered", x = "Time", y = "Value") +
  theme_minimal() +
  theme(legend.position = "bottom")

As you can see, we have successfully identified the three introduced shocks and even caught one more random fluctuation between years 3 and 6 (around 4.5).

So, we’ve created a synthetic time series dataset, introduced some outliers, and then applied the Hampel Filter. The resulting plot shows both the original and the filtered series, allowing you to see the impact of the filter.

A Note on Seasonality

What makes the Hampel Filter especially suited for time series data is its temporal sensitivity. It respects the inherent order and flow of time series data, cleaning up data points that are out of tune with the temporal melody.

But I strongly recommend that you understand your data first, before applying any filter. When dealing with seasonal time series data, applying the Hampel Filter can be a bit more nuanced. You’ll want to ensure that the seasonality isn’t mistaken for an outlier. A savvy approach is to seasonally adjust your data before applying the filter or to set the filter’s parameters with the seasonal pattern in mind.
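One way to do the seasonal adjustment is to decompose the series with stl from base R, run the filter on the remainder only, and then put the pieces back together. This is a sketch under my own assumptions, not the only recipe: the sine-shaped seasonal series, the spike positions, and the use of robust = TRUE are choices made for this example.

```r
library(pracma)

# Hypothetical seasonal series (not the data used above):
# a yearly sine cycle plus noise, with three spikes added.
set.seed(123)
months <- 1:120
seasonal_data <- ts(50 + 10 * sin(2 * pi * months / 12) + rnorm(120, sd = 2),
                    frequency = 12)
spike_positions <- c(30, 60, 90)
seasonal_data[spike_positions] <- seasonal_data[spike_positions] + c(50, 75, 100)

# Split the series into seasonal + trend + remainder.
# robust = TRUE limits the influence of the spikes on the decomposition.
fit <- stl(seasonal_data, s.window = "periodic", robust = TRUE)
components <- fit$time.series

# Apply the Hampel filter to the remainder, where the seasonality is gone.
res <- hampel(as.numeric(components[, "remainder"]), k = 6)

# Recombine the untouched seasonal and trend parts with the cleaned remainder.
cleaned <- ts(as.numeric(components[, "seasonal"]) +
              as.numeric(components[, "trend"]) +
              res$y,
              frequency = 12)

print(res$ind)
```

Because the filter only sees the remainder, regular seasonal peaks are not candidates for removal, and the cleaned series keeps its yearly shape.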

Conclusion

The Hampel Filter offers a robust way to detect outliers, especially useful in datasets where the outliers can heavily influence the mean and standard deviation. It’s a tool that adds to the robustness of your data exploration, complementing other methods such as Grubbs’ Test and the Tukey Method.

Please clap 👏 and subscribe if you want to support me. Thanks!❤️‍🔥

Stay tuned for more insights into the world of data analysis. Until next time, keep your datasets clean and your analyses robust!