Anomaly Detection in Google Analytics — A New Kind of Alerting
Google Analytics has rolled out a new kind of alerting feature: anomaly detection. In this blog post I will go in depth on this new feature and how you can benefit from it.
What is an anomaly?
Before looking at the Google Analytics interface, let’s first examine what an anomaly is.
An anomaly, or outlier, is simply defined as something that does not conform to the expected.
Anomalies are frequently mentioned in data analysis when observations in a dataset do not conform to an expected pattern.
An example of this could be a sudden drop in sales for a business, an outbreak of a disease, credit card fraud or similar, where something is not conforming to what was expected.
Anomaly detection, also known as outlier detection, is about identifying those observations that are anomalous.
Frequent applications of anomaly detection
Before looking at the actual applications of anomaly detection, one must first understand how anomaly detection works with different kinds of data sets.
For the understanding of anomaly detection in Google Analytics, let us look at anomaly detection like this:
Anomaly detection for time series
Time series data are observations recorded over a period of time, so every observation in your dataset has a timestamp. With time series, an anomaly detection algorithm uses historical data to identify observations that do not conform to the expected. This is the type of anomaly detection that Google uses for Google Analytics.
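To make the idea concrete, here is a minimal sketch of time series anomaly detection. It is not what Google does: it simply compares each value with the mean and standard deviation of the values just before it. The function name, the 7-day window, the threshold of 3 and the pageview numbers are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=7, threshold=3.0):
    """Return the indices of values that deviate strongly from recent history.

    A value is flagged when it lies more than `threshold` standard deviations
    away from the mean of the `window` values that precede it.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: a z-score is undefined, so skip it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily pageviews with one obvious spike on the eighth day.
daily_pageviews = [100, 102, 98, 101, 99, 103, 100, 250, 101, 99]
anomalies = flag_anomalies(daily_pageviews)
print(anomalies)
```

Only the spike is flagged here. Note that the spike then inflates the standard deviation for the following days, which is one reason real tools use more robust models than this.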
Anomaly detection for non-time series
By contrast, anomaly detection doesn’t have to be applied in a time series context. Instead you could be looking for observations in your dataset that fall outside one or more clusters.
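As a rough sketch of the non-time series case, the snippet below treats the whole dataset as a single cluster and flags points that lie unusually far from its centre. The data, the function name and the factor of 3 are assumptions for illustration only, and a real clustering approach would handle several clusters rather than one.

```python
from math import dist
from statistics import median


def outliers_from_centre(points, factor=3.0):
    """Return points lying far from the coordinate-wise median of all points.

    A point counts as an outlier when its distance to the centre is more
    than `factor` times the median distance of all points to the centre.
    """
    centre = tuple(median(coords) for coords in zip(*points))
    distances = [dist(point, centre) for point in points]
    cutoff = factor * median(distances)
    return [p for p, d in zip(points, distances) if d > cutoff]


# Hypothetical orders as (items bought, amount spent). There is no timestamp.
orders = [(2, 20), (3, 25), (2, 22), (3, 24), (2, 21), (40, 900)]
outliers = outliers_from_centre(orders)
print(outliers)
```

The median is used for the centre instead of the mean so that the outlier itself does not drag the centre towards it.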
There are many other ways to categorise anomalies. DataScience.com nicely lays out a more common explanation of the three main types of anomalies:
Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent.”
Contextual anomalies: The abnormality is context specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is trying to copy data from a remote machine to a local host unexpectedly, an anomaly that would be flagged as a potential cyber attack.
And finally, when you are looking at methods for conducting anomaly detection with machine learning, you can distinguish between supervised and unsupervised techniques.
Anomaly detection for a sizeable range of applications
There is no doubt that anomaly detection can be applied to all kinds of data analysis.
Anomaly detection is frequently used for applications such as:
- Credit card fraud: Is someone misusing a credit card?
- Server room monitoring: Is the temperature suddenly rising?
- Business metrics monitoring: Are sales in California dropping?
- And many other applications.
The greatest challenge of anomaly detection
The single biggest challenge in anomaly detection is detecting the observations that are truly anomalous.
Essentially, anomaly detection is a kind of machine learning technology as it tries to predict anomalous observations. Getting this right isn’t easy.
But how do you evaluate how well your anomaly detection tool performs?
You use a confusion matrix (don’t worry, it isn’t confusing at all).
A confusion matrix looks at how well the model performed in terms of predicting anomalous observations.
In this case 100 of the predicted observations were actually anomalous. These are called true positives, and you want to maximise them. 10 observations are categorised as false positives: observations that were predicted as anomalous, but weren’t. These you want to minimise, otherwise you get annoyed with false alerts. False negatives are actual anomalies you should have identified, but didn’t. And finally, true negatives are observations you correctly didn’t identify as anomalous events.
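If you want to see how the four cells are counted, here is a small sketch. The labels in `actual` and `predicted` are made up for illustration, and the helper names are my own. Precision and recall are two common ways to summarise the matrix: precision tells you how many alerts were real, and recall tells you how many real anomalies you caught.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion matrix cells from two lists of booleans.

    `actual[i]` is True when observation i really was anomalous, and
    `predicted[i]` is True when the model flagged it as anomalous.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for was_anomaly, was_flagged in zip(actual, predicted):
        if was_flagged and was_anomaly:
            counts["tp"] += 1  # flagged, and really anomalous
        elif was_flagged and not was_anomaly:
            counts["fp"] += 1  # flagged, but normal (a false alert)
        elif not was_flagged and was_anomaly:
            counts["fn"] += 1  # missed anomaly
        else:
            counts["tn"] += 1  # correctly left alone
    return counts


def precision(tp, fp):
    """Share of flagged observations that were truly anomalous."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of true anomalies that were flagged."""
    return tp / (tp + fn)


# Hypothetical labels for six observations.
actual = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
counts = confusion_counts(actual, predicted)
print(counts)

# With the 100 true positives and 10 false positives mentioned above,
# precision is 100 / 110, or about 0.91.
print(precision(100, 10))
```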
Alerting in Google Analytics
Almost since Google launched Google Analytics, it has been possible to set up alerts.
These alerts, however, are quite static in the way you can customise them:
The problem with these alerts is that they have to be maintained. For instance, a 10% change in traffic might be a lot one month, while during a growth stage of your business it might be very little.
It gets even worse when you use absolute values, such as alerting when visits increase by more than 50 week-over-week.
If you have set up these kinds of alerts, chances are pretty high that you have experienced false positives.
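To illustrate why static rules need maintenance, here is a small sketch of the "more than 50 visits week-over-week" rule applied to a site that grows steadily. The traffic numbers and the function name are invented for the example, and this is not how Google Analytics implements its custom alerts.

```python
def static_alert(previous, current, max_increase=50):
    """Fire when visits grow by more than a fixed absolute number."""
    return current - previous > max_increase


# Hypothetical weekly visits for a site growing roughly 10% every week.
weekly_visits = [400, 440, 484, 532, 585, 644]

alerts = [
    static_alert(previous, current)
    for previous, current in zip(weekly_visits, weekly_visits[1:])
]
print(alerts)
```

The growth rate is about the same every week, yet the rule stays silent at first and then fires every week once the site is big enough. Nothing unusual happened. The threshold simply went stale.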
However, the days when these static alerts were necessary are over.
Anomaly detection in Google Analytics
Now, on to what everybody has been waiting for:
The new Anomaly Detection feature in Google Analytics.
Some time ago, Google introduced its Analytics Intelligence alerts, which let you know when its machine learning algorithms detect anything that might be of value to you.
In line with these Analytics Intelligence alerts, Google recently rolled out a new feature that automatically notifies you when it detects anomalous observations in your Google Analytics data.
It looks like this:
As you can see in the graph in the image, Google has detected 3 anomalous observations in this time series (marked with red dots).
Furthermore, you can see what Google considers expected according to historic data. They call this the forecasted value. It comes with two forecasted bounds, and everything between them is what Google would categorise as “normal observations”.
Google describes their anomaly detection feature this way:
First, Intelligence selects a period of historic data to train its forecasting model. For detection of daily anomalies, the training period is 90 days. For detection of weekly anomalies, the training period is 32 weeks.
Then, Intelligence applies a Bayesian state space-time series model to the historic data to forecast the value of the most recent observed datapoint in the time series.
Finally, Intelligence flags the datapoint as an anomaly using a statistical significance test with p-value thresholds based on the amount of data in the reporting view.
Let’s break it down a little bit:
period of historic data to train its forecasting model
This means that Google looks at historic data to create the forecasted value and the upper and lower bounds seen in the screenshot from Google Analytics.
Bayesian state space-time series model
This is the specific statistical model used to produce the forecast. There are many different kinds of models, and Google has chosen this one.
statistical significance test with p-value thresholds
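This is to ensure that a flagged observation is unlikely to be ordinary random variation. A p-value measures how surprising an observation is given the forecast, and a small p-value means the deviation is statistically significant.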
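The three steps can be sketched in a few lines of Python. To be very clear, this is not Google's model: instead of a Bayesian state space-time series model, I use the plain mean of the training period as the forecast and assume normally distributed noise. The function name, the `alpha` threshold of 0.05 and the pageview data are all assumptions for illustration. Only the 90-day training period comes from Google's description.

```python
from statistics import NormalDist, mean, stdev


def check_latest(history, observed, alpha=0.05):
    """Forecast the latest value from history and test whether it is anomalous.

    Stand-in technique (an assumption, not Google's): the forecast is the
    mean of the training period and the noise is assumed to be normal.
    """
    forecast = mean(history)
    model = NormalDist(forecast, stdev(history))

    # Forecasted bounds: the range that covers (1 - alpha) of expected values.
    lower = model.inv_cdf(alpha / 2)
    upper = model.inv_cdf(1 - alpha / 2)

    # Two-sided p-value: how likely is a value at least this far from the forecast?
    tail = model.cdf(observed)
    p_value = 2 * min(tail, 1 - tail)

    return {
        "forecast": forecast,
        "lower": lower,
        "upper": upper,
        "p_value": p_value,
        "is_anomaly": p_value < alpha,
    }


# Hypothetical 90 days of daily pageviews with a mild weekly pattern.
history = [100 + (day % 7) for day in range(90)]

spike = check_latest(history, observed=180)
normal_day = check_latest(history, observed=103)
print(spike["is_anomaly"], normal_day["is_anomaly"])
```

A value far outside the bounds gets a tiny p-value and is flagged, while a value close to the forecast is not. A proper state space model would also account for trend and seasonality, which this sketch ignores.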
Google Analytics only does anomaly detection on time series data, and at the moment there is no option to control which metrics/dimensions to monitor.
How can you use anomaly detections in Google Analytics?
You might ask, how can I use these anomaly detections?
I would recommend keeping an eye out for these kinds of alerts, as they might hold some pretty valuable insights into your business.
In the example above you can see how a page has had a large increase in pageviews. This could just as well have been the opposite: a decrease. Wouldn’t you want to know if your top performing pages had a sudden decrease in pageviews? Or if your e-commerce store had a large decrease in revenue? It could also be that your bounce rate increases out of the blue.
These kinds of anomaly detections can provide you with great insights that you would never have discovered with the static alerts mentioned earlier, simply because you might not have set the alert up in the first place.
My hopes for the anomaly detection feature in Google Analytics are that it will become more advanced.
Being able to select which metrics/dimensions to monitor (as a minimum) and reviewing the forecasted bounds for all metrics are among the many things on my wish list, as I think there is a great deal of insight to be found in your data.