Time Series Anomaly Detection: Uncovering Clues to Mysterious Cyber Activities

Georgian Impact Blog · Nov 4, 2022

By: Benjamin Ye

[Image: "Anomaly on the Time Series Express," with a train and data in the background. Credit: Adapted from Agatha Christie's Murder on the Orient Express]

Introduction

There has been a murder on the Orient Express! Among the passengers, detective Hercule Poirot must find the mastermind behind it all!

In many detective stories, time plays a crucial role in finding the eventual suspect. Point-in-time information alone is not enough to base a conjecture on; in Agatha Christie's novel, it is only the context given by the passengers' personal histories that lets detective Poirot piece together the clues and solve the case.

In the world of data science, a similar kind of detective work plays out in the form of Time Series Anomaly Detection. The premise is simple: given a history of observations, how do we find the times when something unusual happened?

Before answering that question, let’s understand why we care about finding anomalies in the first place.

Why Anomalies Matter

Detectives, in the real world and in books, dig into the unusual and the anomalous because they often contain clues to some malicious act. This holds true for data scientists as well — finding anomalies goes hand in hand with cybersecurity tasks such as intrusion and fraud detection.

Imagine a hacker takes control of your bank account and starts withdrawing money. The hacker will certainly leave a trail of clues when you examine the history of your account balance. A steady decrease in the balance giving way to a sudden, violent drawdown is a sure sign that something is wrong. Time to call the bank!

If companies can detect these sudden shifts in behavior at scale, they can take steps to preemptively address potential security issues. In turn, customers will feel (and be) safer and have fewer headaches.

Georgian & Cybersecurity

At Georgian, we believe in a magical customer experience, and nothing is more magical than strong, scalable cybersecurity solutions that make the internet safer. And like magic, when it works, you don't even notice! The magic of simple yet robust cybersecurity is why we invest in companies that make it possible to deploy turnkey solutions while ensuring maximum security.

Some of our portfolio companies working to make this happen include Noname Security and DefenseStorm. We collaborated with both to enhance their products via time series anomaly detection.

Noname Security

Noname Security provides a suite of products that proactively protects API endpoints from attacks. As part of its security solution, Noname monitors API usage in real time and automatically detects anomalous usage patterns.

To achieve performance at scale, in both prediction accuracy and computation speed, we explored the latest low-overhead time-series techniques that can monitor streaming data from millions of endpoints concurrently.

DefenseStorm

DefenseStorm provides financial institutions with a unified solution for cyber compliance, security, and fraud detection as consumers increasingly turn to banks' online offerings. One of DefenseStorm's offerings is PatternScout, which learns common access patterns and alerts cybersecurity analysts when it detects significant deviations from the baseline.

To enhance PatternScout's capabilities, we incorporated a lightweight time series outlier detection model. The addition of temporal data greatly increased the precision of PatternScout's alerts, while the model's low computation cost proved essential when scaling across multiple ports and customers.

Time Series Anomaly Detection 101

In many situations, a rule-based anomaly detection system is enough. But companies that must detect anomalies at scale, across millions of data sources and many kinds of anomalous situations (like Georgian's customers), need a general-purpose detection system. The good news: there is a wide body of methods, from classical statistics to state-of-the-art transformer models, that accomplishes just that. In the following sections and future blog posts, we'll take a cursory look at time series anomaly detection: its lexicon, methodology, and recent research.

Anomaly Types

Before diving into the different approaches, it’s useful to get to know different types of anomalies that occur in time series. Lai et al. (2021) proposed the following classification scheme.

First, we have Point anomalies — anomalies that persist for only one period. This anomaly is characterized by a sudden jump. In Point Global anomalies, the jump brings the point outside a range defined by the whole time series. On the other hand, for Point Contextual anomalies, the jump only brings the point outside a range defined by the neighboring points.

The other type of anomaly is called Pattern anomaly — this family of anomalies is defined by a deviation in patterns (hence the name). This type of anomaly can last for multiple periods.

Shapelet anomalies are defined by a shift in the waveform; Seasonal anomalies are defined by a shift in frequency; and Trend anomalies are defined by a shift in trend.

For a concrete definition of these anomalies, we encourage the reader to check out the paper.
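To build intuition, here's a small synthetic example (our own illustration, not code from the paper) that injects each anomaly type into a noisy sine wave:

```python
# Illustrative only: inject each of Lai et al.'s anomaly types into a sine wave.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=t.size)

# Point global: a spike far outside the range of the whole series.
series[100] += 5.0

# Point contextual: 0.9 is within the series' overall range, but the
# neighbouring points sit near the trough (~ -1), so it's locally anomalous.
series[238] = 0.9

# Shapelet: the waveform changes (sine -> square) for a stretch.
series[280:300] = np.sign(np.sin(2 * np.pi * t[280:300] / 50))

# Seasonal: the frequency shifts (period 50 -> 10) for a stretch.
series[320:340] = np.sin(2 * np.pi * t[320:340] / 10)

# Trend: the mean starts drifting upward.
series[360:] += 0.05 * np.arange(series[360:].size)
```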

Anomaly Detection: the General Approach

The task of anomaly detection can be boiled down into two steps: anomaly scoring and thresholding.

The anomaly scoring function ϕ: Rⁿ → Rⁿ casts a time series into a series of scores, where higher values signify a higher degree of anomalousness. For most models, the scale of the anomaly scores is arbitrary and holds no inherent meaning (unlike, say, the output of a classifier, which is the probability of belonging to a certain class). This necessitates a thresholding function that turns the raw scores into useful true-or-false predictions.

The thresholding function τ: Rⁿ → Rⁿ generates a series of thresholds such that if ϕₜ > τₜ, the data at timestamp t is labeled an anomaly.
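To make the two-step recipe concrete, here's a minimal sketch. The ϕ and τ below are toy stand-ins (a median-deviation scorer and a flat three-sigma threshold); real systems would plug in the models discussed later:

```python
import numpy as np

def phi(x: np.ndarray) -> np.ndarray:
    """Toy scoring function: absolute deviation from the series median."""
    return np.abs(x - np.median(x))

def tau(scores: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Toy thresholding function: a flat k-sigma line over the scores."""
    return np.full_like(scores, scores.mean() + k * scores.std())

def detect(x: np.ndarray) -> np.ndarray:
    scores = phi(x)
    return scores > tau(scores)  # True where timestamp t is flagged anomalous
```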

To use anomaly scoring models, it is often necessary to cast the time series into a series of subsequences via a rolling window. Model performance can vary greatly with the chosen window size. A common heuristic is to have the window span at least one full period of the time series (if it has one); a thorough review and evaluation of methods for finding the periodicity can be found in this paper. The window size can also be tuned with hyperparameter optimization tools such as Optuna.
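As a sketch of the windowing step, NumPy's sliding_window_view casts a series into overlapping subsequences (the window size of 4 here is arbitrary):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(10.0)            # a toy series of length 10
window = 4                     # ideally spans at least one full period
subsequences = sliding_window_view(x, window)  # shape (7, 4): overlapping windows
```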

Anomaly Scoring

Recall that the formal definition of a time series is an ordered set

X = {x₀, x₁, …, xₙ₋₁, xₙ}

such that

xₜ = f(x₀, x₁, …, xₜ₋₁) + ϵ

where ϵ denotes noise.

From the above, we can see that every point in a time series is defined by the values that precede it. Unlike tabular data, whose rows can be shuffled without losing information, a time series cannot be reordered: the ordering of the values encodes the relationship f(⋅). In this respect, time series resemble image data, where one can't switch pixels around.

The main families of anomaly scoring models are prediction-based, reconstruction-based, and distance-based approaches. Each of these methods exploits the definition of time series in its own way.

Prediction-based models take advantage of the fact that there exists a relationship f(⋅). If we can figure out what that signal-generating function is, then we know what constitutes normal behaviour, and large deviations from the model's predictions mark the anomalies.
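Here's a hedged sketch of the idea, using a simple least-squares AR(p) model as a stand-in for f(⋅); production systems would more likely use ARIMA, exponential smoothing, or a neural forecaster. The function name ar_scores is ours, not from any library:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ar_scores(x: np.ndarray, p: int = 10) -> np.ndarray:
    """Fit AR(p) by least squares; score = absolute one-step-ahead residual."""
    lags = sliding_window_view(x[:-1], p)  # row t holds (x_{t-p}, ..., x_{t-1})
    targets = x[p:]                        # the value each row should predict
    coef, *_ = np.linalg.lstsq(lags, targets, rcond=None)
    residuals = np.abs(targets - lags @ coef)
    return np.concatenate([np.zeros(p), residuals])  # pad the warm-up period
```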

The intuition for reconstruction-based models is similar. If every point relates to previous ones in some way (as opposed to being pure noise), then we can train a model that learns a low-dimensional latent representation and reconstructs each subsequence from that representation. If the reconstruction looks very different from the input, that's an indication the input is anomalous.
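As an illustration, the sketch below uses PCA as the low-dimensional representation; deep-learning variants swap in an autoencoder but score the same way, by reconstruction error. The function name pca_recon_scores is our own:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pca_recon_scores(x: np.ndarray, window: int = 32, k: int = 3) -> np.ndarray:
    """Score each subsequence by its error under a rank-k PCA reconstruction."""
    subs = sliding_window_view(x, window)   # (n - window + 1, window)
    mean = subs.mean(axis=0)
    centered = subs - mean
    # Top-k principal directions learned from the (mostly normal) subsequences.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                          # (k, window)
    recon = centered @ basis.T @ basis + mean
    return np.linalg.norm(subs - recon, axis=1)
```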

Distance-based models use the fact that the values in a time series are numeric, so distances between subsequences can be meaningfully computed. If the distance between one subsequence and the others is large, that subsequence is likely an anomaly.
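A minimal distance-based sketch, scoring each subsequence by its mean distance to its k nearest neighbours (the same intuition that underlies matrix-profile-style methods). The brute-force pairwise computation below is fine for illustration but quadratic in the series length:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def knn_scores(x: np.ndarray, window: int = 32, k: int = 5) -> np.ndarray:
    """Score each subsequence by its mean distance to its k nearest neighbours."""
    subs = sliding_window_view(x, window)
    # Pairwise Euclidean distances between all subsequences.
    dists = np.linalg.norm(subs[:, None, :] - subs[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)        # exclude trivial self-matches
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)
```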

Lastly, we also have to consider multivariate time series data, for example, readings from multiple sensors. In that case, the data not only exhibit temporal dependency (the current reading correlates with past values) but may also exhibit inter-variable dependencies, where sensor A's readings correlate with sensor B's. To detect anomalies that arise from shifts in inter-variable correlation, we have to employ models that natively support multivariate data.

[Image credit: Schmidl et al. 2022]
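To illustrate one kind of inter-variable anomaly, the sketch below tracks the rolling correlation between two synthetic, hypothetical sensors and flags windows where it drifts far from its baseline. This is a toy check, not a full multivariate model:

```python
# Toy inter-variable check: flag windows where the correlation between
# two sensors breaks down. All sensor data here is synthetic.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(1)
sensor_a = rng.normal(size=500)
sensor_b = 0.8 * sensor_a + 0.2 * rng.normal(size=500)  # normally correlated
sensor_b[300:350] = rng.normal(size=50)                 # correlation breaks here

def rolling_corr(a: np.ndarray, b: np.ndarray, window: int = 50) -> np.ndarray:
    """Pearson correlation of a and b over each rolling window."""
    wa = sliding_window_view(a, window)
    wb = sliding_window_view(b, window)
    wa = wa - wa.mean(axis=1, keepdims=True)
    wb = wb - wb.mean(axis=1, keepdims=True)
    return (wa * wb).sum(axis=1) / (
        np.linalg.norm(wa, axis=1) * np.linalg.norm(wb, axis=1)
    )

corr = rolling_corr(sensor_a, sensor_b)
anomalous = np.abs(corr - corr.mean()) > 3 * corr.std()  # True where correlation shifts
```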

Next Up…

That's it for our first post in a series on time series anomaly detection. In this post, we walked through its use cases and how we have helped our customers implement such systems. We then covered the general approach to detecting anomalies in time series.

In the next post, we will give a more thorough walkthrough of the different methods. We will dive into different anomaly detection algorithms, and peek into what they do under the hood. We will also introduce several techniques that transform anomaly scores into useful predictions. Lastly, we will introduce a neat toolkit that makes it easy to experiment with different anomaly models and thresholding methods.

Stay Tuned!

