Above the noise: detecting anomalies in normal signals
At EmpathyBroker we use a variety of Key Performance Indicators (KPIs) to evaluate and monitor the performance of search. We need to be able to monitor KPI changes across our customer base and be ready to react to any anomaly, as these changes could indicate issues that need to be corrected: site malfunctions, misconfigurations, software updates not performing as expected, etc.
One of the most important KPIs for us is findability: a key measurement we use to assess how easily users can navigate the search engine results and essentially find what they’re looking for. Findability, defined and computed in the back-end by our team of data engineers, is used as an indicator of the behaviour and success of the search engine. It’s calculated at fixed time intervals, and the findability values for a site form a time series that we need to continuously analyze in order to define actionable alerts whenever signal anomalies appear.
Anomaly detection in an arbitrary time series is an open-ended problem: the better you can understand and model the signal, the more accurate, and less prone to false alarms, your alerts will be. Most of the improvements will come iteratively as data is gathered and the short and long-term behaviour of the KPIs is better understood, meaning insights can be obtained to understand what constitutes an actionable alert.
To bootstrap this process we take the simplest possible starting point: we will assume that the findability values follow a normal distribution that is stationary; in other words, its characteristics don’t change with time.
Characteristics of a normally distributed signal
Many physical processes are normally distributed. Even if an underlying process is not normally distributed, averaging it, for example as we compute the findability of a site for a certain time interval, should result in a variable that is approximately normally distributed (see the Central Limit Theorem). This is what makes the normal distribution a good starting point to characterize a signal.
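As a quick illustration (a simulation sketch, not EmpathyBroker code; the event counts and scale are made up), averaging many samples of a strongly skewed variable already produces a nearly symmetric, bell-shaped result:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# A deliberately non-normal underlying process: exponentially distributed
# "per-event" values, heavily skewed to the right.
events = rng.exponential(scale=1.0, size=(10_000, 200))

# Averaging 200 events per interval (e.g. per hour) gives one value per interval.
hourly_means = events.mean(axis=1)

# The skew of the averages is far smaller than that of the raw events,
# i.e. the averaged signal is much closer to a normal distribution.
print(skew(events.ravel()))   # ~2 for an exponential distribution
print(skew(hourly_means))     # close to 0
```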
A normal, or Gaussian, distribution (Fig. 1) is characterized by its mean and its standard deviation. For a set of normally distributed values, their sample mean and standard deviation are good, easy-to-calculate estimators of the underlying normal parameters.
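As a minimal sketch (using the illustrative parameters μ = 26.1 and σ = 2.4 from the series discussed below), the sample estimates recover the underlying parameters well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical signal: 1,600 hourly values drawn from N(mu=26.1, sigma=2.4).
sample = rng.normal(loc=26.1, scale=2.4, size=1600)

mu_hat = sample.mean()          # estimate of the mean
sigma_hat = sample.std(ddof=1)  # estimate of the standard deviation

print(f"mean ~ {mu_hat:.2f}, std ~ {sigma_hat:.2f}")  # close to 26.1 and 2.4
```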
It’s sometimes useful to view a probability distribution in terms of the probability or likelihood of a value: for a normal distribution, the further a value is from the mean, the less likely it is. A normal distribution is also symmetric about its mean: the chance of finding a value above the mean is the same as that of finding one below it (see Fig. 1).
For example, if we assume that IQs have a mean value of 100 and are normally distributed with a standard deviation of 15, it means that about 68% of people will be within one standard deviation of the mean, that is 85–115. There is about a 95% chance that a randomly chosen person will be within two standard deviations (70–130) and approximately a 99.7% chance that they will be within three standard deviations (55–145).
When dealing with different distributions it’s common to talk about thresholds in terms of standard deviations (σ), this way it’s possible to apply the same criteria to different normally distributed data without the need to handle units or specific numbers. Thresholds of ±3σ (99.7% of data within the threshold) have been chosen as guidelines to detect anomalies in this article.
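These percentages can be read directly from the normal cumulative distribution function; for example, with SciPy:

```python
from scipy.stats import norm

# Probability mass within ±k standard deviations of the mean for a normal
# distribution: about 68.3%, 95.4% and 99.7% for k = 1, 2, 3.
for k in (1, 2, 3):
    inside = norm.cdf(k) - norm.cdf(-k)
    print(f"±{k} sigma: {inside:.1%} inside, {1 - inside:.2%} outside")

# Applied to the IQ example (mean 100, std 15), the ±1 sigma band is 85-115, etc.
```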
Findability as a normally distributed time series
As mentioned, findability will be the signal used to check for anomalies. As a first step, we take the findability of a site, measured hourly, and check that it is normally distributed.
The above graph (Fig. 2) shows a series of findability values measured every hour from 1st August 2018 to 7th October 2018. Despite very small variations it’s largely stationary, with a constant mean (μ = 26.1) and standard deviation (σ = 2.4).
A histogram of the data (Fig. 3) reveals the symmetry characteristic of normally distributed data. There is a wide range of software tools to test for normality; for this series we used SciPy’s normaltest, a Python implementation of D’Agostino and Pearson’s test.
The resulting p-value for this distribution is 0.76, so we can treat the time series in Fig. 2 as normal; conventionally, a p-value above 0.05 means we cannot reject the hypothesis that the data is normally distributed.
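In code the check is only a couple of lines; here a synthetic series stands in for the real findability values (the variable name and parameters are illustrative):

```python
import numpy as np
from scipy.stats import normaltest

# Stand-in for the real series: ~1,600 hourly findability values.
rng = np.random.default_rng(1)
findability = rng.normal(loc=26.1, scale=2.4, size=1600)

statistic, p_value = normaltest(findability)

# A p-value above 0.05 means we cannot reject the hypothesis that the
# data is normally distributed (the series in Fig. 2 gives p ~ 0.76).
print(f"p-value = {p_value:.2f}")
```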
Anomaly detection on normally distributed signals
An anomaly is a value unlikely to happen. When an unlikely value occurs it may be an indication that something needs to be fixed, demonstrating the importance of having an anomaly detection system.
To decide whether a value is an anomaly or not, a threshold has to be set. We will use the characterization of the normal distribution that we carried out previously as a guideline in defining these thresholds. Feasible thresholds, however, require a compromise: an excessively low threshold produces many alerts, while an excessively high threshold may miss anomalous data.
The findability series in Fig. 3 contains over 1,600 data points, so a 3σ threshold would produce around five alerts in a two-month period from signal fluctuations alone (Fig. 4). Because of the shape of the distribution, the number of alerts drops very quickly as the threshold is raised: with a 4σ threshold, the expectation is less than one random alert per year. Naturally, signals calculated on a slower time base allow for sharper alerts: a 3σ threshold on a daily signal also has an expectation of about one random alert per year.
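Under the normal assumption, the expected number of purely random alerts is just the two-sided tail probability times the number of samples; a quick sketch reproducing the figures above:

```python
from scipy.stats import norm

def expected_random_alerts(sigma_threshold, n_samples):
    """Expected number of samples beyond ±k sigma from noise alone."""
    tail_prob = 2 * norm.sf(sigma_threshold)  # two-sided tail probability
    return tail_prob * n_samples

print(expected_random_alerts(3, 1600))       # ~4-5 alerts in two months of hourly data
print(expected_random_alerts(4, 24 * 365))   # ~0.5, i.e. under one alert per year
print(expected_random_alerts(3, 365))        # ~1 alert per year on a daily signal
```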
Another powerful technique is to check for anomalies on consecutive occurrences; the chance of two or more successive values randomly exceeding the threshold drops significantly (while most real anomalies will have longer timescales). Two consecutive values over 3σ have roughly a 1 in 100,000 chance of happening, while the chance of four values exceeding 3σ in a 24-hour period is about 1 in 50,000 (which would mean a random alert every six years).
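Assuming independent samples, these combined probabilities follow from simple multiplication and the binomial tail; a rough sketch (the exact figures depend on whether one- or two-sided tails and sliding windows are used):

```python
from scipy.stats import binom, norm

# Probability of a single value falling outside ±3 sigma (two-sided tail).
p = 2 * norm.sf(3)  # ~0.0027

# Two consecutive values outside the threshold, assuming independence:
# of the order of 1 in 100,000.
print(p ** 2)

# More generally, the chance of at least k exceedances among n samples
# (e.g. within a 24-hour window) can be read from the binomial tail;
# the exact figure depends on how the window and the tail are defined.
print(binom.sf(3, 24, p))  # P(at least 4 of 24 hourly values exceed the threshold)
```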
Anomaly detection example
To check how our simple anomaly detection algorithm performs, we use a 3σ threshold over 10 days’ worth of data (Fig. 5). While in the previous sections we compared the signal to a static threshold, here we take into consideration the real-time nature of any alert system and allow for a threshold that adapts to changing conditions, but more slowly than the signal.
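One possible way to implement such a slowly adapting threshold is a trailing rolling window; the sketch below is illustrative rather than our production implementation (the window length, threshold and synthetic data are assumptions):

```python
import numpy as np
import pandas as pd

def rolling_anomalies(series: pd.Series, window: int = 240, k: float = 3.0) -> pd.Series:
    """Flag values further than k rolling standard deviations from the rolling mean.

    The mean and standard deviation are computed over a trailing window
    (e.g. 240 hours = 10 days), so the threshold adapts to slow changes
    in the signal but not to sudden jumps.
    """
    mean = series.rolling(window, min_periods=window).mean().shift(1)
    std = series.rolling(window, min_periods=window).std().shift(1)
    return (series - mean).abs() > k * std

# Example with a synthetic hourly signal whose level drops partway through.
rng = np.random.default_rng(7)
values = np.concatenate([rng.normal(26.1, 2.4, 24 * 20), rng.normal(20.0, 2.4, 24 * 10)])
signal = pd.Series(values)
alerts = rolling_anomalies(signal)
print(alerts.sum(), "points flagged")
```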
Unfortunately, the findability for this customer does not behave normally: even prior to the software update we see more alarms than we would expect from random fluctuations alone. This departure from normality is not entirely unexpected, especially on hourly timescales: for example, different findability on mobile and desktop, coupled with different access profiles throughout the day, could introduce systematic effects that make the data less normal.
Even in this scenario, an alert system that flags a few consecutive or near-consecutive 3σ threshold violations would pick up the change in findability on or about 1st September, as expected. The optimization of the different parameters (threshold level and window length, number of samples required for an alert, etc.) depends very much on both the signal characteristics and the desired sensitivity to the different anomalies that can occur.
Outlook
Normality provides valuable insight into a signal’s range of future values. By considering the context of a time series and setting thresholds accordingly, it’s possible to assess the extent to which a value is anomalous. Furthermore, assuming successive data points are independent, the probability of finding consecutive anomalies drops considerably, improving the significance of detected anomalies.
On its own, knowledge of stationary, normal series may not seem very useful, since real signals often come bundled with trends and cycles that dominate the data. However, there are techniques to remove those patterns from complex signals and isolate the stationary, normally distributed component, usually called noise. This noise can then be studied and used to model predictions, define uncertainty bands or detect anomalies, as illustrated in this article.