Anomaly Detection Techniques

Shabari Girish
Published in Nerd For Tech · Apr 23, 2021

Uniqueness is appreciated in LIFE but not in DATA

Anomaly detection is a method for recognizing irregular patterns that do not conform to expected behavior. These abnormalities are also known as outliers. The approach has many applications.

Applications

  1. Fraud Detection: detecting fraudulent usage of credit cards and loan applications
  2. Fault Diagnosis: monitoring engineering processes to detect faults in equipment or finished products in the production line
  3. Time Series Analysis: identifying variations in parameter trends over time, for example stock performance or sales data
  4. Medical Conditions: identifying medical conditions from the study of molecular structures, MRI images, etc.

With all the use cases, one main question is:

How to actually detect anomalies?

While we could detect anomalies manually, as our brains are really good at this, it is rarely the best approach: we cannot keep eyeballs on the charts all the time.

So the next go-to method is more automatic detection: an alerting system based on statistical rules or static thresholds, which still have to be adjusted manually.

Anomaly Detection with AWS CloudWatch

While this reduces human dependency and notifies users about anomalies, it has the drawback of producing false negatives and false positives.

This brings us to a considerably more robust method for anomaly detection: machine learning.


Our next segment covers anomaly detection with machine learning. There are three main approaches to detecting anomalies.

  1. Determining outliers without any prior information about the data (or its anomalies). This is analogous to unsupervised clustering.
  2. A supervised machine learning problem with usable labels for both the normal and the anomalous class. Since anomalies are expected to be very few, this can be treated as binary classification with imbalanced data.
  3. Learning only from the labels of the normal class, which makes it a semi-supervised approach.

Rebalancing the imbalanced data will not always serve the purpose of anomaly detection, so we need other methods.

I am choosing 5 algorithms from multiple categories (Linear, Proximity-based, Probabilistic, Outlier Ensembles) to explain how they perform anomaly detection.

Angle-Based Outlier Detector (ABOD)

The ABOD algorithm is based on the variance of angles between the difference vectors of data objects in the dataset. As a result, it suffers less from the ‘curse of dimensionality’ than pure distance-based methods.


For the outlying point P in the diagram above, the angle between the vectors PX and PY is noticeably smaller than the angles formed at points such as Q and R. The angles seen from the most distant data points are smaller than those seen from points inside the cluster, and, more importantly, the variance of those angles (over all pairs of other data points) is smaller for a far-away point than for a nearby one. A data point with low angle variance is therefore treated as an outlier. Angles are also more stable than distances in high dimensions. In practice, each angle is weighted by the distance between the points, so distance information is still taken into account.
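Below is a minimal sketch of running ABOD with PyOD; the synthetic data, contamination level, and n_neighbors value are illustrative assumptions, not taken from the original post.

```python
# Sketch: Angle-Based Outlier Detection with PyOD (illustrative parameters).
from pyod.models.abod import ABOD
from pyod.utils.data import generate_data

# Toy 2-D data with 10% outliers (assumed values for illustration).
X_train, y_train = generate_data(n_train=2000, train_only=True,
                                 n_features=2, contamination=0.1,
                                 random_state=42)

# 'fast' ABOD approximates the angle variance using only the k nearest neighbors.
clf = ABOD(contamination=0.1, method='fast', n_neighbors=10)
clf.fit(X_train)

labels = clf.labels_            # 0 = inlier, 1 = outlier (on the training data)
scores = clf.decision_scores_   # higher score = more abnormal (PyOD convention)
```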

K-Nearest Neighbors Detector

KNN is a supervised ML algorithm often used in data science for classification problems (and sometimes for regression). It is one of the most basic and commonly used algorithms, with strong use cases.

The underlying principle is that similar observations lie close together, while outliers are usually solitary observations that sit farther away from the cluster of similar observations.

Although KNN is a supervised ML algorithm, when it comes to anomaly detection it takes a completely unsupervised approach based on a distance threshold.
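A minimal sketch of PyOD's KNN detector, again on assumed synthetic data and with illustrative parameter values:

```python
# Sketch: KNN-based outlier detection with PyOD (parameter values are assumptions).
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

X_train, y_train = generate_data(n_train=2000, train_only=True,
                                 n_features=2, contamination=0.1,
                                 random_state=42)

# method='largest' scores each point by the distance to its k-th nearest neighbor;
# 'mean' and 'median' aggregate over all k neighbors instead.
clf = KNN(contamination=0.1, n_neighbors=5, method='largest')
clf.fit(X_train)

outlier_labels = clf.labels_           # 0 = inlier, 1 = outlier
outlier_scores = clf.decision_scores_  # larger distance -> more likely an outlier
```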

Isolation Forest

Like every other tree-ensemble method, Isolation Forest is based on decision trees. Partitions are created in these trees by first selecting a feature at random and then selecting a split value at random between the minimum and maximum value of the chosen feature.

To construct a branch of the tree, a random feature is chosen first. Next, a random split value is chosen (between that feature's minimum and maximum). If an observation's value for this feature is lower than the split, it follows the left branch, otherwise the right branch. This process continues until the point is isolated or the maximum depth is reached.
Outliers are less frequent and more distinct in their values than normal observations (they lie further away from the regular observations in the feature space).
This is why, under such random partitioning, they end up closer to the root of the tree (shorter average path length).
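A minimal sketch using PyOD's IForest wrapper; the data and parameters are illustrative assumptions:

```python
# Sketch: Isolation Forest via PyOD's IForest (illustrative parameters).
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data

X_train, y_train = generate_data(n_train=2000, train_only=True,
                                 n_features=2, contamination=0.1,
                                 random_state=42)

# Points isolated after fewer random splits (shorter average path length)
# receive higher anomaly scores.
clf = IForest(n_estimators=100, contamination=0.1, random_state=42)
clf.fit(X_train)

labels = clf.labels_           # 0 = inlier, 1 = outlier
scores = clf.decision_scores_  # higher = shorter average path = more anomalous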

Histogram-Based Outlier Detection (HBOS)

Histogram-Based Outlier Score assumes feature independence and measures the degree of deviation by building histograms. For multivariate anomaly detection, a histogram is computed for each feature individually, scored separately, and the per-feature scores are combined (averaged) at the end.
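A minimal sketch with PyOD's HBOS; the bin count and data are assumptions for illustration:

```python
# Sketch: Histogram-Based Outlier Score with PyOD (illustrative parameters).
from pyod.models.hbos import HBOS
from pyod.utils.data import generate_data

X_train, y_train = generate_data(n_train=2000, train_only=True,
                                 n_features=2, contamination=0.1,
                                 random_state=42)

# A histogram with n_bins is built per feature; per-feature density estimates
# are combined into one outlier score under the feature-independence assumption.
clf = HBOS(n_bins=10, contamination=0.1)
clf.fit(X_train)

labels = clf.labels_
scores = clf.decision_scores_  # higher = lower estimated density = more anomalous
```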

One Class Support Vector Machine (OCSVM)

A One-Class Support Vector Machine is an unsupervised learning algorithm trained only on ‘normal’ data, the negative examples in our case. It learns the boundary around these points and is thus able to classify points beyond that boundary as, you guessed it, outliers.
Tuning an unsupervised learning algorithm is hard, and the One-Class SVM is no exception. The nu parameter roughly corresponds to the proportion of outliers you expect to see, while the gamma parameter smooths the contour lines of the decision boundary.
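A minimal sketch with PyOD's OCSVM wrapper; the nu and gamma values below are assumptions, not tuned settings:

```python
# Sketch: One-Class SVM with PyOD (nu and gamma values are assumptions).
from pyod.models.ocsvm import OCSVM
from pyod.utils.data import generate_data

X_train, y_train = generate_data(n_train=2000, train_only=True,
                                 n_features=2, contamination=0.1,
                                 random_state=42)

# nu ~ expected fraction of outliers; gamma controls how tightly the RBF
# boundary wraps around the training data.
clf = OCSVM(kernel='rbf', nu=0.1, gamma='auto', contamination=0.1)
clf.fit(X_train)

labels = clf.labels_           # 0 = inside the learned boundary, 1 = outside
scores = clf.decision_scores_
```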

Let's CODE:

I am using this amazing library PyOD for anomaly detection. It is a comprehensive and scalable Python toolkit for outlier detection in multivariate data. All credits to Zhao, Y., Nasrullah, Z. and Li, Z., 2019. PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of machine learning research (JMLR), 20(96), pp.1–7.

After installing and updating PyOD, it's time to import some packages and modules.
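A sketch of the setup; this is a representative set of imports for the detectors discussed in this post, not the exact list from the original code:

```python
# Installation (shell): pip install -U pyod

# Imports used throughout this walkthrough (representative, assumed set).
from pyod.models.abod import ABOD
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS
from pyod.models.ocsvm import OCSVM
from pyod.utils.data import generate_data, evaluate_print
```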

I am going to randomly generate data with outliers using the generate_data() function in the PyOD library. With the code below I create 2,000 records of random data with 2 features. The outlier fraction gives the proportion of outliers in the generated data; since it is 0.1 here, 200 of the records are outliers.
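A minimal sketch of that step; train_only=True and the random_state are assumptions added for reproducibility:

```python
# Sketch: 2000 synthetic records, 2 features, 10% outliers.
from pyod.utils.data import generate_data

outlier_fraction = 0.1
X_train, y_train = generate_data(n_train=2000, train_only=True,
                                 n_features=2,
                                 contamination=outlier_fraction,
                                 random_state=42)

print(X_train.shape)        # (2000, 2)
print(int(y_train.sum()))   # 200 records carry the outlier label 1
```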

Next, I set up the chosen algorithms with some default parameters.
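One plausible way to set this up is a dictionary of detectors, all sharing the same contamination value used when generating the data; the exact selection and naming here are assumptions:

```python
# Sketch: the detectors discussed in this post with (mostly) default parameters.
from pyod.models.abod import ABOD
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS
from pyod.models.ocsvm import OCSVM

outlier_fraction = 0.1  # same fraction used when generating the data

classifiers = {
    'Angle-Based Outlier Detector (ABOD)': ABOD(contamination=outlier_fraction),
    'K-Nearest Neighbors Detector (KNN)': KNN(contamination=outlier_fraction),
    'Isolation Forest': IForest(contamination=outlier_fraction, random_state=42),
    'Histogram-Based Outlier Detection (HBOS)': HBOS(contamination=outlier_fraction),
    'One-Class SVM (OCSVM)': OCSVM(contamination=outlier_fraction),
}
```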

Now moving on to the model implementation.

First, we fit each model and then predict labels on the training data. These predicted labels are compared with the actual labels to evaluate the model. The number of errors (both false positives and false negatives) in detecting the outliers is computed for each model. Evaluation metrics such as AUROC and P@n are computed using evaluate_print() from PyOD.
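A minimal sketch of that loop, continuing from the X_train, y_train, and classifiers dictionary defined above:

```python
# Sketch: fit each detector, count label errors, and print AUROC / P@n.
from pyod.utils.data import evaluate_print

for clf_name, clf in classifiers.items():
    clf.fit(X_train)

    y_pred = clf.labels_             # binary labels assigned to the training data
    y_scores = clf.decision_scores_  # raw outlier scores

    n_errors = (y_pred != y_train).sum()
    print(f'{clf_name}: {n_errors} errors (false positives + false negatives)')

    # Prints ROC AUC and precision @ rank n for this detector.
    evaluate_print(clf_name, y_train, y_scores)
```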

Full code can be found here.

Evaluation metrics:

Ohhhh!

We have many more models for anomaly detection in PyOD. I guess we can try neural network-based auto-encoders…


I will consider it in my future work. Thank you for sticking around.

References:

https://blogs.oracle.com/ai-and-datascience/post/introduction-to-anomaly-detection

https://www.dfki.de/fileadmin/user_upload/import/6431_HBOS-poster.pdf

https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
