Introduction to Anomaly Detection in Time-Series Data and K-Means Clustering

Introduction to anomaly detection and time-series data

Bora Kizil
The Startup
5 min read · Oct 30, 2020


Incredible amounts of data are created and collected at all times across a variety of sectors. Autonomous vehicles continuously receive information about their surroundings from sensors to navigate the roads; large antennas emit radio transmissions at fixed frequencies to millions around the globe; major banking corporations track transactions to prevent fraudulent behaviour. Data collected from such a source at regular intervals forms a dataset known as a time-series. These datasets have traditionally been used for statistical analysis but are increasingly used for machine learning purposes. A key area in which time-series are crucial is anomaly detection.

Figure 1 — The evolution of COVID-19 cases over a month can be considered a time-series

Data collected from a source for tracking purposes usually follows expected patterns, considered to be ordinary behaviour. For example, the evolution of the sea level due to tides is expected to be sinusoidal. However, this regular state can be disrupted by an exceptional event, which translates into an unusual pattern unlike any other in the dataset. Such an event is an anomaly. Depending on the nature of the time-series and the anomaly, different features (i.e. properties of the observed event) are used to detect it.

Anomalies can appear as a collection of data points that deviate from the trend; in this case, they are called outliers. Contrary to classic statistical analysis, outliers cannot simply be eliminated from the dataset on the assumption of experimental flaws or random noise, since they could indicate serious anomalies.

Figure 2 — Example of an outlier on a basic graph

In the case of univariate series (a simple time-series with only one parameter varying over time), such as the one depicted in Figure 2, a simple graph is enough to observe outliers.
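As a quick illustration, plotting the series is often all it takes in the univariate case. The sketch below uses synthetic data with one injected outlier; the values and the outlier position are invented for the example.

```python
# A univariate series with one injected outlier: a plain plot reveals it.
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(100)
series = np.sin(t / 5) + np.random.default_rng(1).normal(0, 0.1, size=100)
series[60] = 4.0  # the injected outlier

plt.plot(t, series, label="series")
plt.scatter([60], [series[60]], color="red", zorder=3, label="outlier")
plt.legend()
plt.show()
```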

Other methods have to be used in the case of multivariate series, that is, when we wish to model and study the interactions among several variables. Machine learning provides useful methods, such as:

  • Local Outlier Factor (LOF). Outliers are detected by their local density, which is expected to be low in comparison to that of their neighbours. This is done by choosing a number of immediate neighbours k and determining the distance between a particular point and its k-th nearest neighbour. Then, the local reachability density of each point is calculated (intuitively, how tightly packed a point's neighbourhood is). From this, we can deduce the local outlier factor for every point. Generally, a point is flagged as an outlier if its LOF is noticeably higher than 1. This method easily detects unique and obvious outliers but shows its limits when detecting outliers close to a cluster, or groups of outliers (see the sketch after this list).
  • One-Class SVM. For a given dataset where data points belong to one particular class, the goal is to determine whether a new data point belongs to this class or not (in which case it is an outlier). To this end, data points belonging to this one class are separated from the rest using hyperplanes. If the points are scattered on a plane, they are separated by lines; if they are scattered in three-dimensional space, they are separated by planes. In practice, the data usually lives in n-dimensional space, where the separating hyperplanes have n−1 dimensions.
  • LSTM autoencoder. This neural network method learns to reconstruct its input from the data it was trained on. The reconstruction error is then evaluated, and if it is above a certain threshold, an anomaly is detected (sketched a little further below).
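The first two methods are available in scikit-learn. The sketch below runs them on synthetic data; the dataset and the hyperparameters (n_neighbors, nu, gamma) are illustrative choices, not prescriptions.

```python
# LOF and One-Class SVM on synthetic 2-D data with a few planted outliers.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_normal = rng.normal(size=(200, 2))     # one dense "normal" cluster
X_out = rng.uniform(-6, 6, size=(5, 2))  # a handful of scattered outliers
X_all = np.vstack([X_normal, X_out])

# Local Outlier Factor: compares each point's local density to that of
# its k nearest neighbours; -1 marks points with unusually low density.
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X_all)
print("LOF flagged:", int(np.sum(lof_labels == -1)), "points")

# One-Class SVM: fit a boundary around the normal class only, then ask
# whether new points fall inside it (1) or outside it (-1).
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X_normal)
print("One-Class SVM flagged:", int(np.sum(ocsvm.predict(X_out) == -1)), "points")
```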

The three methods above are examples of, respectively, unsupervised, semi-supervised and deep learning models.
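The LSTM autoencoder takes a few more lines than the other two. Below is a minimal sketch using Keras; the window shape, layer sizes, epoch count and percentile-based threshold are all illustrative assumptions, and the training data is a random stand-in for real windows of normal behaviour.

```python
# A minimal LSTM autoencoder: compress each window, reconstruct it, and
# flag windows whose reconstruction error is unusually large.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 30, 1  # assumed window shape

model = keras.Sequential([
    layers.LSTM(16, input_shape=(timesteps, n_features)),   # encoder
    layers.RepeatVector(timesteps),          # repeat the encoding over time
    layers.LSTM(16, return_sequences=True),  # decoder
    layers.TimeDistributed(layers.Dense(n_features)),  # per-step output
])
model.compile(optimizer="adam", loss="mse")

# Stand-in for windows of normal data, shape (n_windows, 30, 1).
X_train = np.random.default_rng(0).normal(size=(500, timesteps, n_features))
model.fit(X_train, X_train, epochs=5, batch_size=64, verbose=0)

# Reconstruction error per window; flag the worst ones as anomalies.
errors = np.mean((model.predict(X_train, verbose=0) - X_train) ** 2, axis=(1, 2))
threshold = np.percentile(errors, 99)  # one possible threshold choice
print("anomalous windows:", np.where(errors > threshold)[0])
```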

In the following section, we will describe the process of detecting outliers with another unsupervised learning algorithm: K-means clustering. This method is simple to implement and accessible thanks to its clear visual representation.

K-means clustering

This method looks at the data points in the set and groups those that are similar (e.g. through Euclidean distance) into a predefined number K of clusters. A threshold value can be added to detect anomalies: if the distance between a data point and its nearest centroid is greater than the threshold value, then it is an anomaly. A typical K-Means Clustering algorithm using Euclidean distance follows these steps (a code sketch follows Figure 3):

  1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments.
  2. Iterate until the cluster assignments stop changing:
     a. For each of the K clusters, compute the cluster centroid. The k-th cluster centroid is the vector of the p feature means for the observations in the k-th cluster.
     b. Assign each observation to the cluster whose centroid is closest in Euclidean distance, unless that distance is greater than the threshold value, in which case the observation is flagged as an anomaly.

Figure 3 — Visual representation of K-Means Clustering with no threshold for K=2
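Here is a minimal sketch of this threshold-based variant using scikit-learn's KMeans on synthetic data; K, the data and the percentile used for the threshold are illustrative assumptions.

```python
# K-means anomaly detection: fit K clusters, then flag points that are
# unusually far from their nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),  # cluster 1
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(100, 2)),  # cluster 2
    [[10.0, -5.0]],                                        # a planted anomaly
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from each point to the centroid of its assigned cluster.
nearest_centroids = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(X - nearest_centroids, axis=1)

threshold = np.percentile(distances, 99)  # one possible threshold choice
print("anomalous indices:", np.where(distances > threshold)[0])
```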

This method isn’t ideal, however. The main difficulty lies in choosing K, since the data in a time-series is always changing and different values of K might be ideal at different times. Besides, in more complex scenarios where there are both local and global outliers, many outliers might fly under the radar and simply be assigned to a cluster. Another risk is that several similar anomalies form a cluster of their own, which would mean new anomalies of that kind would be considered part of the normal dataset.

Graphs and clustering techniques are great ways to visually observe anomalies, but many more methods exist. An intuitive one is the use of min-max bounds: a regular data pattern can be considered to be contained between a minimal and a maximal value, and if a data point exceeds this min-max interval, it may be regarded as an anomaly. Another important method uses the derivative feature: if the data pattern changes much faster (or slower) than usual, this can indicate an anomalous event. The key is to use complementary methods, since anomalies detected through a change of variation might still pass the min-max test, and vice versa. Both checks are sketched below.
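A minimal sketch of both checks on a synthetic sine-like series; the bounds and the rate limit are illustrative assumptions.

```python
# Min-max and derivative checks on a synthetic univariate series.
import numpy as np

t = np.linspace(0, 4 * np.pi, 400)
series = np.sin(t)
series[150] = 3.0   # a spike far outside the expected range
series[100] += 0.8  # a sudden jump that stays inside the range

# Min-max rule: flag points outside the expected [min, max] band.
lo_bound, hi_bound = -1.2, 1.2
print("min-max:", np.where((series < lo_bound) | (series > hi_bound))[0])

# Derivative rule: flag points where the series changes much faster
# than usual (here, faster than 5x the median step-to-step change).
diffs = np.abs(np.diff(series))
rate_limit = 5 * np.median(diffs)
print("derivative:", np.where(diffs > rate_limit)[0] + 1)
```

Note how the jump at index 100 stays inside the min-max band and is only caught by the derivative check, while the spike at index 150 trips both.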

Though many methods exist to detect unusual events in a time-series dataset, they are intrinsically unsupervised techniques. An outlier detected through K-Means Clustering might actually not be an anomaly; in this case, human input is required to teach the algorithm whether such exceptional events should be flagged or ignored. This is the basis of practical anomaly detection: man-machine cooperation to transition from unsupervised learning to supervised learning, an approach that can be applied in a variety of sectors.

Bora Kizil

Co-founder at Ezako (www.ezako.com), the time-series solutions company. We help our clients with anomaly detection, labeling and forecasting problems.