Reducing False Anomaly Detection Alerts on Google Analytics Time Series Data with BigQuery ML ARIMA Plus

Rana Fahim Hashemi · Published in Badal-io
Apr 11, 2023 · 13 min read

Introduction

Anomaly detection refers to the process of identifying and flagging unusual patterns and outliers that deviate from the norm and may warrant further investigation. These include irregularities such as website errors, data breaches, or ineffective marketing campaigns.

A key challenge with anomaly detection is the high frequency of false alerts, often caused by the absence of an automated tuning system. In other words, a false alert occurs when the anomaly detection system flags an event as anomalous when it is in fact a normal event or a benign anomaly. These false alerts can be costly, as they waste time and resources by requiring manual review or investigation.

In this blog post, we present an anomaly detection framework for user behavior time series built from Google Analytics event data, demonstrating a solution for detecting actual anomalies. The framework identifies anomalous events based on event counts at different levels of granularity. To achieve this, BigQuery ML ARIMA Plus is used, which takes the trend and seasonality of the data into account. All models and configurations can be generalized to other types of user behavior time series data, as designed in this GitHub package.

Google Analytics and Time Series Data

Anomaly detection is a feature offered by BigQuery ML that uses machine learning algorithms to automatically identify significant changes in website traffic or user behavior.

Google Analytics data provides insights into various aspects of website performance and user behavior, usually structured in tables and consisting of multiple dimensions and metrics. Dimensions, such as page URL, user location, and device type, categorize data, while metrics, such as page views, bounce rate, and session duration, quantify specific aspects of user interactions. Time is a crucial component in Google Analytics data, as it allows for tracking and analyzing changes in user behavior and website performance over a specified period.

Typical time series data, on the other hand, is a collection of data points indexed in time order, providing a historical record of observations taken sequentially at regular or irregular intervals. This data is often visualized as line charts or bar graphs, highlighting trends, seasonal patterns, or cyclic fluctuations over time. Time series analysis helps forecast future values based on historical data, enabling data-driven decision-making and optimization across a wide range of applications. In this blog, we work with time series event data to predict anomalies on a day-to-day basis. The raw data consists of three columns: the collector timestamp, the unique identifiers, and the event types that the identifiers are grouped by.

Logic and Workflow

To detect anomalies, four major steps are taken, which are described in detail in the sections that follow:

  1. Data aggregation and wrangling: Event count aggregations in different levels of temporal granularity and major data cleaning/wrangling steps
  2. ML Modeling & Forecasting: Training and forecasting several Machine Learning models per event
  3. Automated Model Selection Pipeline Per Event: Extracting features from the forecast properties as an indicator of model practicality or quality, and developing a model selection pipeline for each event
  4. Alerting Dashboard: Building an alerting dashboard
Figure 1- Major Steps of Anomaly Detection Workflow

The models are fully integrated with dbt, scheduled and executed against BigQuery through Airflow, and then visualized in Looker. While dbt is a powerful tool for data transformation and modeling, it does not natively support BigQuery ML model creation and inference functions. To overcome this limitation, the dbt_ml package can be used to develop the ML models.

Figure 2- Main Technologies Associated with the Project

Data Aggregation and Wrangling

Having multiple levels of temporal granularity for event count aggregation is beneficial for training ML models. By having counts aggregated at different levels, the variety of models per event is increased, leading to a higher chance of assigning a more accurate model to each event. This approach takes into account the different frequencies at which events may occur, which can result in varying event counts over time. Here, event counts for each event are aggregated within four different levels of granularity: every 4, 8, 12 and 24 hours.
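As a rough illustration, this multi-level aggregation can be expressed in a single BigQuery query. This is a sketch only: the table and column names (`project.dataset.events`, `event_type`, `event_ts`) are placeholders, not the project's actual schema.

```sql
-- Illustrative only: table and column names are placeholders.
SELECT
  event_type,
  bucket_hours,
  -- floor each timestamp to the start of its 4/8/12/24-hour bucket
  TIMESTAMP_SECONDS(
    DIV(UNIX_SECONDS(event_ts), bucket_hours * 3600) * bucket_hours * 3600
  ) AS bucket_start,
  COUNT(*) AS event_count
FROM `project.dataset.events`
CROSS JOIN UNNEST([4, 8, 12, 24]) AS bucket_hours
GROUP BY event_type, bucket_hours, bucket_start
```

Unnesting the array of window sizes produces all four granularities in one pass, so each event ends up with four candidate series.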

Resetting the launch date for each event to the date when the event count surpasses a minimum value is crucial in order to avoid training on low counts, which can lead to unstable and unreliable forecast patterns. This allows for a more accurate representation of the event’s behavior and provides better results when making predictions. Here, the launch date for each event is reset to the timestamp when the event first reaches the count of 50.
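One way this reset could be implemented, assuming a hypothetical `event_counts` table holding the aggregated counts from the previous step, is to anchor each event at the first bucket whose count reaches 50 and drop everything before it:

```sql
-- Illustrative only: assumes an aggregated `event_counts` table with
-- event_type, bucket_start, and event_count columns.
WITH launch AS (
  SELECT
    event_type,
    -- first bucket in which the event reaches a count of 50
    MIN(IF(event_count >= 50, bucket_start, NULL)) AS launch_ts
  FROM `project.dataset.event_counts`
  GROUP BY event_type
)
SELECT c.*
FROM `project.dataset.event_counts` AS c
JOIN launch USING (event_type)
WHERE c.bucket_start >= launch.launch_ts   -- drop the low-count ramp-up period
```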

It is important to perform outlier detection and replacement on the training set to ensure accurate modeling. Training ML models on outliers can result in misleading forecast patterns and incorrect statistical analysis. Outlier detection and replacement helps to ensure that the models are trained on a representative sample of the data, leading to more reliable and accurate predictions. Here, an adjusted version of the interquartile range (IQR) method is used to detect and replace outliers. The steps to use this method are:

  1. Calculate the first (25th) and third (75th) quartiles (Q1 and Q3) of the data.
  2. Calculate the IQR by subtracting Q1 from Q3.
  3. Define the lower and upper bounds for outliers as Q1 − 4.5 × IQR and Q3 + 4.5 × IQR, respectively. The multiplier 4.5 was chosen based on experiments with historical data.
  4. Any data points outside these bounds are considered outliers, and are replaced with the bound itself.

This method captures the extremely out-of-range data points in the training set.
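A minimal sketch of the capping step in BigQuery SQL, using APPROX_QUANTILES to estimate Q1 and Q3 per event; the table name `training_counts` is a placeholder:

```sql
-- Illustrative only: caps training-set values at the adjusted IQR bounds.
WITH quartiles AS (
  SELECT
    event_type,
    APPROX_QUANTILES(event_count, 4)[OFFSET(1)] AS q1,
    APPROX_QUANTILES(event_count, 4)[OFFSET(3)] AS q3
  FROM `project.dataset.training_counts`
  GROUP BY event_type
)
SELECT
  t.event_type,
  t.bucket_start,
  -- replace points outside [Q1 - 4.5*IQR, Q3 + 4.5*IQR] with the bound itself
  LEAST(GREATEST(t.event_count, q.q1 - 4.5 * (q.q3 - q.q1)),
        q.q3 + 4.5 * (q.q3 - q.q1)) AS event_count
FROM `project.dataset.training_counts` AS t
JOIN quartiles AS q USING (event_type)
```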

ML Modeling and Forecasting

The ARIMA PLUS model in BigQuery ML can be used for time series forecasting, and its TIME_SERIES_ID_COL option allows multiple models to be trained and tested simultaneously. This is useful here because we want to train and test a model for every event at the same time: by setting TIME_SERIES_ID_COL, we can efficiently analyze the behavior of many events and make more informed predictions.
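For reference, a hedged sketch of what such a multi-series training statement might look like; the project, dataset, and column names are assumptions, and the options used in the original pipeline may differ.

```sql
-- Illustrative only: names and options are placeholders.
CREATE OR REPLACE MODEL `project.dataset.event_count_arima`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'bucket_start',
  time_series_data_col = 'event_count',
  time_series_id_col = 'event_type',   -- one series (and model) per event
  horizon = 30                         -- forecast up to 30 points ahead
) AS
SELECT event_type, bucket_start, event_count
FROM `project.dataset.training_counts`
```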

The ARIMA PLUS model in BigQuery ML also comes with the ML.DETECT_ANOMALIES inference function, which can be used for anomaly detection in time series data. ML.DETECT_ANOMALIES calculates an anomaly probability for each timestamp based on the actual time series data, the predicted time series data, and the variance from model training. If the anomaly probability exceeds the specified threshold (anomaly_prob_threshold), the actual data point is flagged as anomalous. The threshold determines the size of the interval used to identify anomalies: a higher threshold results in a larger interval. This provides a convenient way to detect unusual or unexpected behavior in time series data. Here, thresholds are customized by experimenting with historical data.
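A minimal example of the inference call; the 0.95 threshold below is purely illustrative, since thresholds are tuned per event on historical data, and the table names are placeholders.

```sql
-- Illustrative only: threshold and table names are placeholders.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `project.dataset.event_count_arima`,
  STRUCT(0.95 AS anomaly_prob_threshold),
  (SELECT event_type, bucket_start, event_count
   FROM `project.dataset.recent_counts`)
)
```

The output includes the forecast bounds and an is_anomaly flag per row, which downstream steps build on.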

The choice of time interval for the training data can also affect the forecast. If the training data covers only a short period, the model may not capture longer-term trends and patterns; if it covers too long a period, the model may fail to capture more recent and relevant changes in the data. Here, the training intervals are two months, one month, and half a month of training data.

Automated Model Selection Pipeline

Selecting the best ARIMA PLUS model for each event is crucial for accurate and reliable forecasts. This is why it is important to evaluate the performance of different models in terms of the number of anomalies detected and the root-mean-square deviation (RMSD) of the bounds; the latter can be used as a measure of the accuracy of the bounds generated by the ARIMA PLUS model. Here, the best model is defined as the one that produces the minimum number of anomalies in the forecast set. In cases where more than one model has the same minimal number of anomalies for a specific event, the one with the minimum RMSD is chosen. The diagram below shows the main stages of the model/configuration selection pipeline:

Figure 3- Model Selection Pipeline
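A simple way this selection rule could be expressed in SQL, assuming a hypothetical `model_metrics` table with one row per event and configuration holding the forecast-set anomaly count and the RMSD of the bounds:

```sql
-- Illustrative only: `model_metrics`, `config_id`, `anomaly_count`, and
-- `bounds_rmsd` are assumed names.
SELECT event_type, config_id
FROM (
  SELECT
    event_type,
    config_id,
    ROW_NUMBER() OVER (
      PARTITION BY event_type
      ORDER BY anomaly_count ASC, bounds_rmsd ASC   -- fewest anomalies, then lowest RMSD
    ) AS rank
  FROM `project.dataset.model_metrics`
)
WHERE rank = 1
```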

In some cases, the lower bound predicted by the ARIMA PLUS model may end up negative, which is not meaningful for events that can only have non-negative counts (e.g., user interactions, purchases, etc.). To account for this, it is common practice to reset the lower bound to a small positive value so that the model produces meaningful and feasible predictions, and to avoid the issues negative predictions can cause, such as instability in the model or unreliable forecast patterns.
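The clamping itself is a one-line transformation; the floor value of 1 and the table and column names below are assumptions for illustration.

```sql
-- Illustrative only: floor value and names are placeholders.
SELECT
  event_type,
  bucket_start,
  GREATEST(lower_bound, 1.0) AS lower_bound,   -- never let the lower bound go negative
  upper_bound
FROM `project.dataset.forecast_bounds`
```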

Forecasting time series data requires careful consideration of when the events in question started and the amount of historical data available for each event. In many cases, newly launched events may have unstable trends or patterns that can make forecasts unreliable, so it’s important to wait for the events to stabilize and collect sufficient data before attempting to generate forecasts. Here, events launched within the last 30 days are ignored for alerting purposes.

This stage is followed by choosing the configuration with the least number of anomalies and the minimum RMSD. This can ensure that the chosen models are robust and reliable. The chosen configurations for each event form the control table.

Finally, combining the chosen models and their properties with the original forecasts to create a base table for alerting is an important step in the anomaly detection process. This allows us to quickly and easily monitor the actual event counts against the forecasted values, identify potential anomalies or deviations, and trigger alerts when necessary.

Figure 4- A Control Table Consisting of two Events
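A sketch of how the alerting base table could be assembled, joining the control table to the forecasts and excluding events launched in the last 30 days; all table and column names here are placeholders.

```sql
-- Illustrative only: all names are placeholders.
SELECT
  f.event_type,
  f.bucket_start,
  f.event_count,
  f.lower_bound,
  f.upper_bound,
  f.is_anomaly
FROM `project.dataset.forecasts` AS f
JOIN `project.dataset.control_table` AS c
  ON f.event_type = c.event_type
 AND f.config_id = c.config_id   -- keep only each event's chosen configuration
WHERE c.launch_ts <= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)  -- skip recently launched events
```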

Alerting Dashboard

Figure 5 shows what the alerting notification looks like. Alerts are set up on a table representing the anomalous events from yesterday, and they can simply be scheduled for delivery to a Slack channel. Once the alerts are received, the responsible individuals should look into the issue and determine potential solutions if needed.

Figure 5- Alerting Notification on Slack. Alerts are set up on a table representing the anomalous events from yesterday; therefore, if the current date is Feb 9, the user will be notified about the anomalies from Feb 8.
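The query behind such an alert can be as simple as a filter on yesterday's anomalous rows; `alerting_base` and its columns are assumed names standing in for the table built in the previous section.

```sql
-- Illustrative only: `alerting_base` and its columns are assumed names.
SELECT event_type, bucket_start, event_count, lower_bound, upper_bound
FROM `project.dataset.alerting_base`
WHERE is_anomaly
  AND DATE(bucket_start) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
```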

To provide a more in-depth view of the anomalies, line charts have been created to display the fluctuations in event counts over time. Anomalies are indicated by scattered red points on the chart (as depicted in Figure 6).

Figure 6- Line Chart of Event Counts per App-Event (Anomalies are depicted in scattered red points)

Results

We have extracted the configurations associated with one specific event to demonstrate the performance of this framework. From the table displayed in Figure 7, configuration 23 was selected as it had the lowest count of anomalies (and minimum RMSD).

Figure 7- All Configurations associated with an Event Instance. The chosen configuration is highlighted by the red rectangle.

Across all configured events, the dynamic anomaly detection system reduced the average number of alerts by 90%, confirming that alerts were raised almost exclusively for actual anomalies, which were then acted upon.

Ultimately, this 90% reduction in anomaly alerts translated into an approximate 71% reduction in issues on the Google Analytics time series data over a four-month span, as demonstrated in Figure 8. The figure presents the distribution of issues during this period for 20+ web and mobile apps. The pie chart in Figure 9 shows that a majority of the issues were linked to Android events; the system addressed these issues, resulting in a notable decrease in Android-related bugs by the end of the four-month period.

Figure 8- Histogram of Monthly Issues over the last 4 Months of 2022
Figure 9- Issues in the last 4 Months of 2022 by App

Challenges

Minimizing false alerts is a common challenge in time series forecasting projects. To address this, multiple steps were taken, such as:

  • Aggregating event counts at multiple levels of temporal granularity
  • Resetting launch dates for events to the date when event counts surpass a minimum value
  • Outlier detection and replacement in the training set
  • Experimenting with different ARIMA PLUS model parameters, such as the anomaly_prob_threshold, to customize the thresholds
  • Experimenting with several training time intervals
  • Ignoring recent events (launched within the last 30 days) for alerting purposes
  • Selecting the ARIMA PLUS model with the least number of anomalies and the minimum RMSD

By taking these steps, the goal was to produce reliable and accurate forecasts while minimizing the number of false alerts generated. Despite these techniques, it is still possible to receive false positive alerts due to fluctuations in user behavior or infrequent events that fall below the reset lower bounds. To address this, an interactive muting system can be put in place, allowing users to manually and temporarily mute specific alerts they know to be false positives. The goal is to balance false positives and false negatives so that disturbance to users is minimized while anomalies that may indicate a real issue are still detected.

The project faced other challenges as well, including unreliable forecast patterns caused by outliers (consider Black Friday spikes as an example). Outlier detection and replacement using the IQR method was pursued to address this issue, and the method can be adjusted to reduce the impact of extreme values on model training. Moreover, lower bounds dropping to negative values would prevent true positives from being captured, leading to false negatives; the solution was to reset the bounds to a small positive value. Additionally, the interval between the bounds widens as the forecast horizon increases, making forecasts less accurate. This happens because uncertainty in the forecast values grows with the horizon: the further out the prediction, the harder it is to predict future values with certainty, and the interval between the bounds widens to account for this. To address this, limiting the forecast time intervals based on the desired level of granularity is suggested.

Enhancing Anomaly Detection: Implementing a Feedback Loop for Improved Model Training

As previously stated, a prevalent challenge in anomaly detection modeling is the training of models on anomalies, which can lead to incorrect predictions and false alerts in the forecast set. This problem can be addressed through a feedback loop that identifies anomalies in past forecasts and replaces them with more plausible values in the current training sets. The process is illustrated in Figure 10. To accomplish this, forecasts should be configured incrementally, using historical anomalies to fine-tune current training sets with each iteration. Data points flagged as anomalous in the training set are substituted with predictions from earlier forecasts. This approach can be further improved by validating flagged anomalies via user feedback from the user interface.

Figure 10- Anomaly Detection Feedback Loop Framework
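One way the substitution step of the feedback loop could be expressed, under the assumption that past forecasts are stored alongside their anomaly flags (`past_forecasts` and `forecast_value` are hypothetical names):

```sql
-- Illustrative only: `past_forecasts` and `forecast_value` are assumed names.
SELECT
  t.event_type,
  t.bucket_start,
  -- swap points flagged as anomalous in earlier forecasts for the forecast value
  IF(p.is_anomaly, p.forecast_value, t.event_count) AS event_count
FROM `project.dataset.training_counts` AS t
LEFT JOIN `project.dataset.past_forecasts` AS p
  USING (event_type, bucket_start)
```

Running this before each incremental retraining keeps anomalous history from contaminating subsequent training sets.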

This supplementary feature can be combined with the outlier replacement framework described in the Data Aggregation and Wrangling section to tackle the problem of anomaly-based training. The outlier replacement framework employs a statistical method that addresses extreme outliers, replacing them with the computed boundaries (Q1 − 4.5 × IQR and Q3 + 4.5 × IQR). In contrast, the anomaly detection feedback considers other anomalies, substituting them with historical predictions. While the former is more statistically dependable, particularly when flagged anomalies have not been user-verified, the latter covers a broader spectrum of anomalies replaced with more plausible values. Going forward, the combined approach of outlier replacement and the anomaly detection feedback loop is referred to as the Integrated Anomaly Resolution Framework.

Figure 11 presents a comparison of incremental forecasts for a particular event, both with and without the implementation of the integrated system. The dataset includes the latest forecast for each data point at 10-day forecast intervals; consequently, no data points after January 29 are present in any training set.

In Figure 11, it is evident that the data points between January 26 and January 29 (illustrated by the green line chart) are outside the expected range and should be identified as anomalies, whereas the remaining data points fall within the acceptable range and should not be considered anomalous. The lower image reflects this accurately (with anomalies represented by scattered red points), while the upper image fails to produce the anticipated results. This figure demonstrates that training models on anomalies can lead to two main issues:

  1. False negatives: As observed in the upper image, the data points between January 27 and January 29 are incorrectly labeled as non-anomalous because the boundaries (illustrated by the blue line charts) were shifted upwards after being trained on the anomalous spikes on January 26.
  2. False positives: Conversely, the data points on January 30 and 31 are mistakenly flagged as anomalous because they fall outside the erroneously shifted bounds, which were influenced by the anomalous surges.

In the absence of the integrated system, identifying anomalies in the forecast set also becomes dependent on the position of data points within the set. For instance, data points from February 1 onward are not falsely flagged as anomalies, while those on January 30 and 31 are, due to their proximity to the anomalous training set data points. This means that the effectiveness of anomaly detection may vary with the specific placement of data points rather than relying on a comprehensive and systematic approach, compromising the accuracy and consistency of detection without the Integrated Anomaly Resolution Framework.
Figure 11- Incremental Forecasts before and after Implementation of Integrated Anomaly Resolution Framework (Forecast Interval = 10 days)

Conclusion

Anomaly detection systems can help businesses identify unexpected events or patterns in their data that might indicate fraud, errors, or other issues. These systems can be particularly valuable in industries such as marketing, finance, healthcare, and cybersecurity, where even small anomalies can have significant consequences.

By incorporating an automated tuning system, the anomaly detection system can adapt to changes in data patterns and adjust its algorithms to optimize detection accuracy. This can help businesses identify anomalies more quickly and accurately, reducing the risk of false positives or false negatives.

The business value of an anomaly detection system with an automated tuning system includes:

  1. Early detection of anomalies: Anomaly detection systems can detect unusual patterns and behaviors in data that might otherwise go unnoticed. This can help businesses respond quickly to potential issues before they escalate into larger problems.
  2. Improved efficiency: An automated tuning system can optimize the anomaly detection system’s algorithms, reducing the need for manual intervention and saving time for data analysts.
  3. Cost savings: By detecting issues early and automating the tuning process, businesses can save money on investigations, remediation, and potential legal fees.
  4. Improved customer satisfaction: Anomaly detection systems can help businesses identify issues that might affect customer experience or satisfaction, allowing them to take action to address these issues and improve customer relationships.

The provided framework has successfully demonstrated the value of accurate, early anomaly detection in addressing issues related to Google Analytics time series data.
