Algorithm Selection for Anomaly Detection

Sahil Garg
Published in Analytics Vidhya
Jun 2, 2020

Anomalies can be defined as observations that deviate sufficiently from most observations in the data set to suggest they were generated by a different, not normal, generative process; they differ so much from other observations as to arouse suspicion. In brief, anomalies are rare and significantly different observations within a data set.

Anomaly detection algorithms are now used in many application domains, including intrusion detection, fraud detection, data leakage prevention, data quality, and surveillance & monitoring. These applications vary widely: some require very fast, near real-time detection, whereas others require very high detection performance because the cost of missing an anomaly is high. Anomaly detection techniques are most commonly used to detect fraud, where malicious attempts or transactions often differ from the majority of nominal cases. Outlined below are the different types of anomalies:

Point anomaly: a single anomalous instance in a larger data set.

Collective anomaly: If an anomalous situation is represented as a set of many instances, this is called a collective anomaly.

Contextual anomaly: a point that appears normal on its own but turns out to be an anomaly once a given context (for example, time or location) is taken into account.

Anomaly Detection Algorithms

The solution to anomaly detection can be framed with all three types of machine learning methods (Supervised, Semi-supervised and Unsupervised), depending on the type of data available. Supervised learning algorithms can be used for anomaly detection when anomalies are already known and labelled data is available. These methods become expensive when the labelling has to be done manually. Classification algorithms that can handle imbalanced classes, such as Support Vector Machines (SVM) or Artificial Neural Networks (ANN), can be used for supervised anomaly detection.
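
As an illustration of the supervised setting, the sketch below trains a classifier on labelled data while re-weighting the rare anomaly class; the data and the roughly 2% anomaly rate are placeholder assumptions, not values from the article.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))            # placeholder feature matrix
y = (rng.rand(1000) < 0.02).astype(int)   # placeholder labels: roughly 2% anomalies (class 1)

# class_weight="balanced" re-weights the rare anomaly class to counter the imbalance
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
predictions = clf.predict(X)              # 1 = predicted anomaly, 0 = normal
```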

Semi-supervised anomaly detection uses labelled data consisting only of normal observations, without any anomalies. The basic idea is that a model of the normal class is learned, and any deviation from that model is flagged as an anomaly. Popular algorithms: Auto-Encoders, Gaussian Mixture Models, Kernel Density Estimation.
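
A minimal semi-supervised sketch along these lines, assuming a training set X_normal that contains only normal behaviour: fit a Kernel Density Estimate to it and flag new points whose density under the normal model is unusually low. All data and thresholds below are placeholders.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X_normal = rng.normal(size=(500, 2))      # placeholder training data: normal behaviour only
X_new = rng.normal(size=(100, 2))         # placeholder new observations to score

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_normal)
log_density = kde.score_samples(X_new)    # low density under the normal model = suspicious
threshold = np.percentile(kde.score_samples(X_normal), 1)
anomalies = log_density < threshold
```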

Unsupervised learning methods are most commonly used to detect anomalies; the following sections outline the major families of algorithms and specific algorithms that can be used for anomaly detection.

K-Nearest Neighbor (kNN): kNN is a neighbor-based method that can be used to identify outliers. For each data point, the whole data set is examined to extract the k items with the most similar feature values: these are the k nearest neighbors (NN). The data point is then classified as anomalous if the majority of its nearest neighbors were previously classified as anomalous; in the fully unsupervised variant, the distance to the k-th nearest neighbor itself is used as the anomaly score.

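A minimal sketch of the unsupervised kNN variant with scikit-learn, scoring each point by its distance to its k-th nearest neighbor; the feature matrix X and the 95th-percentile threshold are placeholder assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

# ask for k+1 neighbours because each training point is its own nearest neighbour
nn = NearestNeighbors(n_neighbors=6).fit(X)
distances, _ = nn.kneighbors(X)
scores = distances[:, -1]                             # distance to the 5th real neighbour
anomalies = scores > np.percentile(scores, 95)        # flag the most isolated points
```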

Local Outlier Factor (LOF): Local Outlier Factor is a density-based method designed to find local anomalies. For each data point, the k nearest neighbors are computed; then, using this neighborhood, the local density is computed as the Local Reachability Density (LRD). Finally, the LOF score is computed by comparing the LRD of a data point with the LRDs of its previously computed neighbors: points whose local density is much lower than that of their neighbors receive high LOF scores.

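A minimal sketch with scikit-learn's LocalOutlierFactor; the feature matrix X and the contamination rate are placeholder assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                # -1 = anomaly, 1 = normal
scores = -lof.negative_outlier_factor_     # higher score = lower relative local density
```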

Connectivity-Based Outlier Factor (COF): COF differs from LOF in how the density of the data points is computed, since it also considers links between data points. To this end, the method adopts a shortest-path approach that calculates a chaining distance using a minimum spanning tree of the neighborhood.

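If the PyOD library is available, its COF implementation follows the usual PyOD fit/score interface, as sketched below; the data and contamination rate are placeholder assumptions.

```python
import numpy as np
from pyod.models.cof import COF            # assumes the PyOD library is installed

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

cof = COF(n_neighbors=20, contamination=0.05)
cof.fit(X)
labels = cof.labels_                       # 1 = anomaly, 0 = normal
scores = cof.decision_scores_              # higher score = more anomalous
```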

K-Means: K-means clustering is a popular clustering algorithm that groups data points into k clusters by their feature values. The score of each data point is calculated as the distance to the centroid of its cluster; data points that are far from the centroid of their cluster are labeled as anomalies.

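A minimal sketch with scikit-learn's KMeans, using the distance to the assigned centroid as the anomaly score; X, the number of clusters and the threshold are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
dist_to_centroid = km.transform(X).min(axis=1)        # distance to the nearest (assigned) centroid
anomalies = dist_to_centroid > np.percentile(dist_to_centroid, 95)
```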

Robust Principal Component Analysis (rPCA): Principal component analysis is a commonly used technique for detecting subspaces in data sets. It also serves as an anomaly detection technique, in that deviations from the normal subspaces may indicate anomalous instances. Once the principal components are determined, the major components show global deviations from the majority of the data, whereas the minor components can indicate smaller local deviations.

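A minimal sketch of the idea using ordinary PCA (not the robust variant), scoring points by their reconstruction error after projecting onto the main subspace; X, the number of components and the threshold are placeholder assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

pca = PCA(n_components=2).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))       # project onto the subspace and back
recon_error = ((X - X_hat) ** 2).sum(axis=1)          # deviation from the normal subspace
anomalies = recon_error > np.percentile(recon_error, 95)
```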

One-Class SVM: The one-class Support Vector Machine algorithm aims at learning a decision boundary around the data points. It can be used for unsupervised anomaly detection: the one-class SVM is trained with the data set, and each data point is then classified according to its normalized distance from the determined decision boundary.

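A minimal sketch with scikit-learn's OneClassSVM; X and the nu parameter (the expected anomaly fraction) are placeholder assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)
labels = oc_svm.predict(X)                 # -1 = anomaly, 1 = normal
scores = -oc_svm.decision_function(X)      # signed distance from the boundary, higher = more anomalous
```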

Isolation Forest: Isolation Forest structures data points as nodes of an isolation tree, assuming that anomalies are rare events with feature values that differ a lot from those of expected data points. Anomalies are therefore more susceptible to isolation than expected data points, ending up closer to the root of the tree rather than in the deeper leaves. It follows that a data point can be isolated and then scored according to its average path length from the root across a forest of random trees.

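A minimal sketch with scikit-learn's IsolationForest; X and the contamination rate are placeholder assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(X)
labels = iso.predict(X)                    # -1 = anomaly, 1 = normal
scores = -iso.score_samples(X)             # short average path length -> high anomaly score
```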

Angle-Based Outlier Detection (ABOD): ABOD targets high-dimensional spaces, using the variance of the angles between a data point and pairs of other points as the anomaly score. The angle-based approach provides a good alternative for identifying outliers in high-dimensional spaces, where distance-based measures deteriorate.

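If the PyOD library is available, its (fast) ABOD implementation can be used as sketched below; the data, neighborhood size and contamination rate are placeholder assumptions.

```python
import numpy as np
from pyod.models.abod import ABOD          # assumes the PyOD library is installed

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

abod = ABOD(n_neighbors=10, contamination=0.05)       # fast variant restricted to k neighbours
abod.fit(X)
labels = abod.labels_                      # 1 = anomaly, 0 = normal
scores = abod.decision_scores_             # derived from the variance of angles, higher = more anomalous
```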

Mahalanobis Distance: the Mahalanobis method relies solely on the distance space to flag outliers. The Mahalanobis distance is suitable for anomaly detection tasks targeting multivariate data sets composed of a single Gaussian-shaped cluster. The model parameters are the mean and inverse covariance matrix of the data.

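A minimal sketch with scikit-learn's EmpiricalCovariance, which estimates the mean and covariance and exposes a mahalanobis() method; X and the threshold are placeholder assumptions (a robust estimator such as MinCovDet can be swapped in for contaminated data).

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

cov = EmpiricalCovariance().fit(X)          # estimates the mean and covariance of the data
scores = np.sqrt(cov.mahalanobis(X))        # mahalanobis() returns squared distances
anomalies = scores > np.percentile(scores, 95)
```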

Neural networks such as Self-Organizing Maps (SOM) and the Grow When Required (GWR) network: these are reconstruction-based, non-parametric neural networks that fit a graph of adaptive topology lying in the input space to the data set. Severe outliers and dense clouds of outliers are correctly identified with this technique.
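
The sketch below only illustrates the reconstruction-error idea behind these networks: it fits a small set of prototypes with a simple online update and scores each point by its distance to the nearest prototype. It is not a full SOM or GWR implementation (no topology, neighborhood function or node growth), and all data and hyper-parameters are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                          # placeholder feature matrix
n_protos, lr, epochs = 16, 0.1, 20

# initialise prototypes from randomly chosen training points
protos = X[rng.choice(len(X), n_protos, replace=False)].copy()
for _ in range(epochs):
    for x in X[rng.permutation(len(X))]:
        bmu = np.argmin(((protos - x) ** 2).sum(axis=1))   # best matching unit
        protos[bmu] += lr * (x - protos[bmu])              # move it towards the sample

# anomaly score = distance to the closest prototype (reconstruction error)
scores = np.sqrt(((X[:, None, :] - protos[None, :, :]) ** 2).sum(-1)).min(axis=1)
anomalies = scores > np.percentile(scores, 95)
```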

Gaussian Mixture Models: a Gaussian Mixture Model (GMM) fits a given number of Gaussian distributions to a data set. The model is trained using the Expectation-Maximization algorithm, which iteratively maximizes a lower bound of the likelihood; points with a low likelihood under the fitted mixture are flagged as anomalies. Assessing the number of components of the mixture by data exploration can be complex.
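
A minimal sketch with scikit-learn's GaussianMixture (fitted with EM), flagging the least likely points under the mixture; X, the number of components and the 5% cut-off are placeholder assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).normal(size=(500, 5))   # placeholder feature matrix

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)   # EM fitting
log_likelihood = gmm.score_samples(X)       # per-sample log-likelihood under the mixture
anomalies = log_likelihood < np.percentile(log_likelihood, 5)  # flag the least likely 5%
```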

Auto-Encoders: an autoencoder is a special type of neural network that reproduces its input values at its output through a compressed internal representation. The key idea is to train one or more autoencoders to learn the normal behavior of the data and, after training, use the reconstruction error to identify abnormal conditions or anomalies.
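
A minimal sketch of a reconstruction-error autoencoder using Keras (assuming TensorFlow is installed); the data, layer sizes and training settings are placeholder assumptions.

```python
import numpy as np
import tensorflow as tf

X = np.random.RandomState(0).normal(size=(1000, 20)).astype("float32")  # placeholder "normal" data

# small symmetric autoencoder: 20 -> 8 -> 3 -> 8 -> 20
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(20, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, X, epochs=20, batch_size=32, verbose=0)   # learn to reconstruct normal data

recon = model.predict(X, verbose=0)
scores = ((X - recon) ** 2).mean(axis=1)               # reconstruction error as anomaly score
anomalies = scores > np.percentile(scores, 95)
```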

Comparative evaluation of Algorithms

Anomaly detection algorithm selection is a complex activity with multiple considerations: type of anomaly, data available, performance, memory consumption, scalability and robustness.

Performance metrics for anomaly detection models are mainly based on the Boolean anomaly/normal labels assigned to each data point; common metrics are Precision, Recall, F-score, Accuracy and AUC. The research paper Quantitative Comparison of Unsupervised Anomaly Detection Algorithms for Intrusion Detection compares the relative performance of algorithm families against these metrics.
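
These metrics can be computed with scikit-learn once ground-truth labels are available for evaluation; the labels and scores below are tiny placeholder arrays for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# placeholder ground truth (1 = anomaly) and model outputs
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.2, 0.15, 0.9, 0.3, 0.7, 0.05, 0.2])
y_pred = (scores > 0.5).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, scores))
```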

Training & prediction time: scalability and resource consumption of a model can be judged by the computation and prediction time required by the different methods as the data set size and dimensionality increase. The paper A comparative evaluation of outlier detection algorithms: Experiments and analyses compares the training and prediction times of the algorithms as the number of features grows.

Robustness & scalability: many anomaly detection methods suffer from the curse of dimensionality. It is important to check the resistance of each algorithm to the curse of dimensionality, for example by keeping a fixed level of background noise while increasing the data set dimensionality. The same paper compares the average precision of the algorithms as the number of features grows.

Memory usage: several algorithms have high memory requirements, so the choice of algorithm should be made carefully, with due consideration of the available hardware and scalability requirements. The same paper also compares the memory usage of the algorithms as the number of features grows.

Similarly, another paper, A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data, published findings from a comparative evaluation of anomaly detection algorithms on publicly available data sets. The following table contains recommendations from this paper, in which 19 different unsupervised anomaly detection algorithms are evaluated on 10 different data sets from multiple application domains.

Legend: Accuracy refers to the performance measure; Deterministic to the stability of scoring; Sensitivity to the model's sensitivity to its parameters; Speed to the computation time on larger data sets. Ratings: "--" represents very bad, "o" represents average, "++" represents very good.

References:

Quantitative Comparison of Unsupervised Anomaly Detection Algorithms for Intrusion Detection
A comparative evaluation of outlier detection algorithms: Experiments and analyses
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data