Characteristics of Anomaly Detection Problem

Iurii Katser
Published in Product AI
Sep 21, 2021

A fairly complete and deep survey on the Anomaly Detection (AD) problem and its different aspects is available here. In this article, we will introduce and briefly describe various characteristics of the AD problem.

Processing type: There are off-line and on-line processing types.

· The off-line type applies when the whole data set is available at once. Because the algorithm sees all the data, it can in principle find the optimal solution.

· The on-line type applies when data points arrive one at a time (in real time) or in batches (subsets of points), and the start of each anomaly (a changepoint) must be detected as soon as it occurs.
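
The difference between the two processing types can be illustrated with a minimal sketch. It assumes a simple z-score detector, which is only a stand-in chosen for illustration and is not a method proposed in this article; the names `offline_zscores` and `OnlineZScore` and the sample values are hypothetical. The off-line version uses statistics of the whole series, while the on-line version scores each point using only the points seen before it.

```python
import math
import statistics


def offline_zscores(values):
    """Off-line: the whole series is available, so use global statistics."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) / std for v in values]


class OnlineZScore:
    """On-line: keep a running mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def score_then_update(self, value):
        # Score the new point against the points seen so far
        if self.n < 2:
            score = 0.0  # not enough history to estimate a spread
        else:
            std = math.sqrt(self.m2 / self.n)
            score = abs(value - self.mean) / std if std > 0 else 0.0
        # Then fold the new point into the running statistics
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return score


# Hypothetical sensor readings with one abnormal value at index 5
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 15.0, 10.1, 9.8]

offline_scores = offline_zscores(stream)

detector = OnlineZScore()
online_scores = [detector.score_then_update(v) for v in stream]
```

In this sketch both variants give their highest score to the abnormal reading, but the on-line detector does so the moment the reading arrives, without waiting for the rest of the series.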

Data: Although data is often classified into structured, semi-structured, and unstructured types (details here), it is more convenient to consider data that has already been pre-processed and transformed into a form ready for ML. In this case, classifying data by modality is more useful because AD techniques for different data types often differ significantly.

· Tabular: This is data structured into rows, each of which contains information about an individual object. Each row has the same number of columns (some values may be missing), and the columns hold the property values of the object described by the row.

· Time series: This is univariate or multivariate data observed sequentially over time. In a special case, the data is observed at pre-determined, equally spaced time intervals (such as yearly, quarterly, monthly, or hourly). Generally, time series data is a special case of tabular data, often with a timestamp index.

· Audio: This is a special case of time series data in which sound is recorded sequentially. More details about what sound and audio are can be found here.

· Images: An image is usually a tensor or multidimensional array in which two dimensions (rows and columns) represent spatial coordinates (the x and y axes) and the stored values represent the intensity or gray level of each pixel. Color images add a third dimension for the color channels.

· Video: This is usually a combination of audio and a time series of images (each instance is of the image type).

· Text: These are words, either separate or combined into phrases, sentences, and texts.

Details about data types from a machine learning perspective can be found here.
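
As a rough illustration of how some of these modalities relate to each other, the sketch below builds tiny in-memory examples using only built-in Python containers. All names and values (`tabular`, `time_series`, `image`, `video_frames`, the pump readings) are hypothetical and chosen only for illustration; real pipelines would typically use dedicated array or data-frame libraries.

```python
from datetime import datetime, timedelta

# Tabular: one row per object, the same columns in every row
# (None marks a missing value)
tabular = [
    {"pump_id": "A", "pressure": 2.1, "temperature": 40.0},
    {"pump_id": "B", "pressure": 2.3, "temperature": None},
]

# Time series: tabular-like rows indexed by equally spaced timestamps
start = datetime(2021, 9, 21, 0, 0)
time_series = {
    start + timedelta(hours=i): {"pressure": p}
    for i, p in enumerate([2.1, 2.2, 2.0, 2.4])
}

# Grayscale image: a rows x columns grid whose values are pixel intensities
image = [
    [0, 128, 255],
    [64, 192, 32],
]

# Video (audio track omitted): a time-ordered sequence of images
video_frames = [image, image]
```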

Modes by data labels: With respect to data labels, modes can be divided into supervised, semi-supervised, and unsupervised. Data labels assign each data point to the normal class or to an anomaly class (or to one of several anomaly classes).

· Supervised training mode needs labeled data points for both normal and anomaly classes.

· Semi-supervised mode needs labeled data points only for the normal (anomaly-free) class.

· Unsupervised methods are the most commonly used because they do not require labeled training data. These methods often rely on the assumption that anomalous instances are much rarer than normal ones.
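
The three modes differ mainly in which labels the detector is allowed to see. The sketch below makes that concrete on a toy one-dimensional data set. The detectors themselves (a midpoint threshold, a mean and standard deviation rule, and a median/MAD rule) are simple stand-ins assumed for illustration, not methods taken from this article, and all names, data values, and the default `k` multipliers are hypothetical.

```python
import statistics

normal_train = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8]  # points labeled normal
anomaly_train = [15.0, 14.5]                       # points labeled anomalous
unlabeled = normal_train + anomaly_train           # same points, labels hidden

# Supervised: both classes are labeled, so learn a boundary between them.
# Here: a threshold halfway between the two class means.
supervised_threshold = (
    statistics.fmean(normal_train) + statistics.fmean(anomaly_train)
) / 2


def supervised_predict(x):
    return x > supervised_threshold


# Semi-supervised: only the normal class is labeled, so model "normal"
# and flag anything that falls far from it.
mu = statistics.fmean(normal_train)
sigma = statistics.pstdev(normal_train)


def semi_supervised_predict(x, k=3.0):
    return abs(x - mu) > k * sigma


# Unsupervised: no labels at all. Robust statistics (median and median
# absolute deviation) work only if anomalies are rare in the data.
med = statistics.median(unlabeled)
mad = statistics.median([abs(v - med) for v in unlabeled])


def unsupervised_predict(x, k=5.0):
    return abs(x - med) > k * mad
```

On this toy data all three detectors flag 15.0 and 14.5 and accept the remaining points, but only the supervised one needed examples of anomalies to do so.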

Output of AD Algorithms: There are two main output types of AD algorithms:

· Scores: The AD algorithm outputs a degree of abnormality for each data instance. This allows the abnormality threshold to be set flexibly at the post-processing stage.

· Labels: The AD algorithm outputs a label or class (normal/anomalous) for each data instance.
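
The relation between the two output types can be shown in a few lines: labels are what remains after a threshold is applied to scores. The function name `scores_to_labels`, the score values, and the two thresholds below are hypothetical and serve only as an illustration.

```python
def scores_to_labels(scores, threshold):
    """Turn abnormality scores into labels: 1 = anomalous, 0 = normal."""
    return [1 if s > threshold else 0 for s in scores]


# Hypothetical abnormality scores produced by some AD algorithm
scores = [0.1, 0.3, 0.2, 4.8, 0.4, 0.2, 2.9, 0.1]

# The same scores give different labels depending on the chosen threshold
strict_labels = scores_to_labels(scores, threshold=4.0)
loose_labels = scores_to_labels(scores, threshold=2.0)
```

With the strict threshold only the instance scored 4.8 is labeled anomalous; with the loose one the instance scored 2.9 is flagged as well. An algorithm that outputs labels directly has already made this choice internally.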

Anomaly type: Anomalies are often divided into point, contextual and collective.

  • If a single data instance displays unexpected behavior relative to the rest of the data, then that instance can be called a point anomaly.
  • If a single data instance displays unexpected behavior relative to some part of the surrounding data (its context), then that instance can be called a contextual anomaly.
  • If a set of data instances displays unexpected behavior relative to the rest of the data, then that set of instances can be called a collective anomaly.
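
The three anomaly types above can be told apart on toy data. The sketch below is only an illustration under simple assumptions: the data values, thresholds, and names are hypothetical, and the rules used (a global z-score, a per-context median, and a zero-variance window) are deliberately naive stand-ins rather than recommended detectors.

```python
import statistics

# Point anomaly: one value is far from all the others
series = [10, 11, 10, 50, 11, 10]
mean = statistics.fmean(series)
std = statistics.pstdev(series)
point_anomalies = [
    i for i, v in enumerate(series) if abs(v - mean) / std > 2.0
]

# Contextual anomaly: 24 degrees is ordinary during the day
# but unexpected at night
readings = [
    ("day", 25), ("night", 10), ("day", 26), ("night", 11),
    ("day", 24), ("night", 24), ("day", 25), ("night", 9),
]
context_median = {
    ctx: statistics.median([v for c, v in readings if c == ctx])
    for ctx in ("day", "night")
}
contextual_anomalies = [
    i for i, (ctx, v) in enumerate(readings)
    if abs(v - context_median[ctx]) > 5
]

# Collective anomaly: every value is individually normal (0 or 1),
# but a run of identical values breaks the alternating pattern
signal = [0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1]
window = 4
collective_anomaly_starts = [
    i for i in range(len(signal) - window + 1)
    if statistics.pvariance(signal[i:i + window]) == 0
]
```

Note that the night reading of 24 would pass the global check used for point anomalies, and no single value in the stuck run of ones is unusual on its own. This is why contextual and collective anomalies need detectors that look at context or at groups of instances.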

Application domain: Depending on the specific industry or application, anomalies can be classified into various types. These types usually reflect the different ways in which anomalies arise and imply that different AD techniques and domain heuristics should be used. However, once the task is stated as a mathematical problem, anomalies from different domains can often be handled in much the same way. For example, sensor network anomalies, cyber-intrusions, and industrial faults are all addressed with the same AD techniques when the data is of the time series type.


Iurii Katser
Product AI

Lead DS | Ph.D. alumnus | Researcher | Lecturer. Time-series analysis, Anomaly detection, Industrial data processing