Anomaly Detection Techniques Part II

DP6 Team
Published in DP6 US
Jan 17, 2019

Introduction

Continuing the Anomaly Detection Techniques series, we’re going to talk a bit more about the types of anomalies that can be found in the digital marketing area. We will also discuss an improvement on the previously presented technique: the modified Z-score.

According to Filliben (2013), an outlier may indicate a case where incorrect data has been collected due to errors. Once detected, such points should be excluded from the analysis being done.

In the context of Digital Marketing, for example, an abnormal peak on a Friday in November may seem like an anomaly. However, when considering the seasonality of the business, the abnormal Friday may be Black Friday. Thus, this high volume ceases to be an anomaly.

Data collection projects regularly involve Google Tag Manager, Tag configuration, Data architecture, Landing Pages, Media Pixels, among others. There are several variables that involve human action and are therefore susceptible to errors.

Who doesn’t remember a case where a page was altered and its collection compromised? Or where the flow was changed without tagging, or where a tag was configured without input from the business team and is no longer collecting what it should? Or even when the media pixel is behaving suspiciously?

Errors are a fact of life and anomaly detection techniques help us to identify them quickly and avoid compromising decision-making.

In Part I we discussed the method of detecting anomalies using the Z-score. It measures how far a point lies from the mean of a group of points, in units of their standard deviation. We also saw that this method is itself influenced by outliers, i.e. it is not robust: extreme values shift the mean and inflate the standard deviation used to judge them. This can be a problem in certain business contexts. Let’s look at a variation of the method that handles this well: the Modified Z-score.
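To make the lack of robustness concrete, here is a minimal sketch using only Python's standard library. The session counts are made up for illustration, and the `summarize` helper is our own name, not part of any library. It compares the mean, standard deviation and median of one week of data before and after a single bad reading is added.

```python
from statistics import mean, median, pstdev

# Hypothetical daily session counts (illustrative values, not real data).
clean = [100, 102, 98, 101, 99, 103, 97]
# The same week plus one day where a tracking error inflates the count.
with_outlier = clean + [1000]


def summarize(values):
    """Return the statistics the Z-score and its robust variant rely on."""
    return {
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation
        "median": median(values),
    }


before = summarize(clean)
after = summarize(with_outlier)

# Print each statistic before and after the bad reading is added.
for name in ("mean", "std", "median"):
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

A single bad point more than doubles the mean and inflates the standard deviation by two orders of magnitude, while the median barely moves. This is the property the Modified Z-score takes advantage of.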

Modified Z-score

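According to Seo (2002), the modified Z-score ($M_i$) resolves the Z-score problem by using the median and the median absolute deviation (MAD) instead of the mean and standard deviation. This approach was presented by Iglewicz and Hoaglin (1993).

The median absolute deviation is given by:

$$\mathrm{MAD} = \mathrm{median}\left(|x_i - \tilde{x}|\right)$$

where $\tilde{x}$ is the median of the sample. The modified Z-score of each point is then:

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}$$

Iglewicz and Hoaglin (1993) suggest labeling a point as a potential outlier when $|M_i| > 3.5$.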

Below we show an example in Python of a base of sessions with outliers deliberately inserted and compare the effectiveness of outlier detection.
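A minimal sketch of such a comparison follows. It is not necessarily the exact script behind the graphs: the session counts are invented, the function names are our own, and the Z-score cutoff of 3 is an assumed, commonly used value. The constant 0.6745 and the 3.5 cutoff follow the recommendation of Iglewicz and Hoaglin (1993).

```python
from statistics import mean, median, pstdev

Z_THRESHOLD = 3.0           # assumed cutoff for the classic Z-score
MODIFIED_Z_THRESHOLD = 3.5  # cutoff suggested by Iglewicz and Hoaglin (1993)


def z_scores(values):
    """Classic Z-score: distance from the mean in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def modified_z_scores(values):
    """Modified Z-score: distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the points equal the median; the score is undefined.
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


def flag(scores, threshold):
    """Return the positions whose absolute score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Hypothetical daily sessions: a stable base around 100 with four
# deliberately inserted outliers at positions 5, 11, 17 and 21.
sessions = [100, 102, 98, 101, 99, 400, 103, 97, 100, 102, 98, 420,
            101, 99, 100, 104, 96, 390, 100, 101, 99, 410, 102, 98]

z_flagged = flag(z_scores(sessions), Z_THRESHOLD)
modified_flagged = flag(modified_z_scores(sessions), MODIFIED_Z_THRESHOLD)

print("Z-score flags:         ", z_flagged)
print("Modified Z-score flags:", modified_flagged)
```

With this data the four inserted values inflate the standard deviation so much that none of them reaches a Z-score of 3, while the median and MAD stay anchored to the stable base and all four are flagged by the Modified Z-score.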

Below we have two graphs with the generated data and the inserted anomalies. In the first one (in red), the Z-score method fails to detect the outliers. In contrast, the Modified Z-score (highlighted in blue) is able to detect them.

We now know that the modified Z-score identifies outliers that the Z-Score does not. Does that mean one method is better than the other?

We must be careful when working with outliers. It is always good to use more than one detection method and to analyze whether the identified cases are in accordance with your business.

In addition, we must be on the lookout for the effects of masking and swamping. According to Chiang (2007), the masking effect occurs when an outlier is not identified because other outliers are present. The swamping effect is the opposite case: a point is incorrectly identified as an outlier because it is in a “symmetric” data set.
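The masking effect can be illustrated with a short sketch. The data is invented, `z_score_outliers` is a hypothetical helper of our own, and the cutoff of 3 is an assumption. A single spike is flagged by the Z-score, but once three more spikes are added, they hide each other.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=3.0):
    """Return the positions whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical stable base of 20 daily session counts around 100.
baseline = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98,
            101, 99, 100, 104, 96, 100, 101, 99, 102, 98]

# One spike on its own stands out clearly from the mean.
one_spike = baseline + [400]

# Four spikes inflate the mean and standard deviation and mask each other.
four_spikes = baseline + [400, 420, 390, 410]

print("Flagged with one spike:  ", z_score_outliers(one_spike))
print("Flagged with four spikes:", z_score_outliers(four_spikes))
```

The lone spike at position 20 is flagged, yet the same value goes undetected once it is accompanied by similar ones. This is one reason to cross-check results with a robust method.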

A common example of swamping occurs when we detect abnormal behavior that turns out to be a planned event (as in the Black Friday example given above). Such a point should not be considered an outlier and must be removed from the list of detected anomalies.

Conclusion

With data collection and tagging, it is normal to have errors. However, what generates added value to the business is the rapid detection of the problem, followed by action to remedy it. In this sense, the techniques for detecting anomalies help us greatly.

In this post we presented the Modified Z-score, which, in contrast to the Z-score, is a robust detection method: because it uses the median as the basis of its calculation, it is far less influenced by outliers. Therefore, during exploratory data analysis, when faced with a distribution with high dispersion, we suggest the use of the Modified Z-score method.

However, this does not mean that it is a better model. It is merely a new perspective of analysis that must be taken into account when applied to the business.

In the next post, we will cover another technique used for the detection of outliers: STL, a time series decomposition technique.

Profile of the author: Jaime | Graduated in Computer Engineering at Unicamp, currently working at DP6 in Digital Analytics.

Profile of the author: André Tocci | Graduated in Business Administration at FAAP, currently working at DP6 in Digital Analytics.
