Approaches to Anomaly Detection

Paul Truong
SafetyCulture Engineering
3 min read · Jan 20, 2020

Anomaly detection, also called outlier detection, is widely used to detect and/or remove anomalous observations in fraud detection, surveillance, medical diagnosis, data cleanup, and predictive maintenance. This branch of machine learning is well established in classical domains such as banking, finance, medical and pharmaceutical science, and telecommunications, and has also become a key component of many Internet of Things (IoT) applications for monitoring and predicting system failures.

Anomalies (outliers) can be caused by human error, instrument error, natural deviation in populations, fraudulent behaviour, unexpected changes in behaviour, or system faults. What is known about the anomalies in the data (e.g. labels, distribution) strongly influences the choice of approach for building a detection system. Depending on how anomalies are captured in the data, three approaches to modelling strategy and algorithms can be considered [V. J. Hodge et al].

Unsupervised clustering

For data without prior knowledge, in particular data not pre-labelled as normal or abnormal, an unsupervised learning approach should be applied. This approach assumes that the data follows a static distribution that can be described by a statistical model, and flags as outliers the data points whose values fall outside the accepted range of that distribution.

Figure: Fundamental steps in the unsupervised approach

The most popular ML algorithms for this approach include K-means clustering, distribution-based techniques (e.g. Gaussian/Elliptic Envelope), Isolation Forest (an ensemble of decision trees), and One-class Support Vector Machine (SVM).
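As a rough illustration of the unsupervised idea (describe the data with a statistical model, then flag values outside its accepted range), here is a minimal sketch using only the Python standard library. It is not one of the algorithms listed above: it swaps in a simple median and median-absolute-deviation (MAD) model for one-dimensional data. The function names `fit_range` and `flag_outliers`, the cut-off `k=3.0`, and the synthetic sensor readings are all assumptions made for the example.

```python
import random
import statistics


def fit_range(values, k=3.0):
    """Estimate an accepted range as median +/- k * scaled MAD (no labels needed)."""
    center = statistics.median(values)
    # Median absolute deviation; the factor 1.4826 rescales it so that it is
    # comparable to a standard deviation when the data is roughly Gaussian.
    mad = statistics.median([abs(v - center) for v in values])
    spread = 1.4826 * mad
    return center - k * spread, center + k * spread


def flag_outliers(values, low, high):
    """Return one boolean per value: True if it falls outside [low, high]."""
    return [not (low <= v <= high) for v in values]


random.seed(0)
# Synthetic, unlabelled readings (assumption): mostly around 50, plus two
# injected anomalies at the end of the list.
readings = [random.gauss(50.0, 2.0) for _ in range(500)]
readings += [75.0, 20.0]

low, high = fit_range(readings)
flags = flag_outliers(readings, low, high)
print(f"accepted range: {low:.1f} .. {high:.1f}, flagged: {sum(flags)}")
```

The median and MAD are used here rather than the mean and standard deviation because the outliers themselves are part of the unlabelled data and would otherwise distort the fitted range. For multivariate data, library implementations such as scikit-learn's `EllipticEnvelope`, `IsolationForest`, or `OneClassSVM` would be the more usual choice.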

Supervised classification

This approach models both normality and abnormality, and therefore requires pre-labelled data tagged as normal or abnormal (or even as specific known types of abnormal behaviour). Any supervised machine learning algorithm can be used in this scenario, so many consider it a regular classification problem rather than an anomaly detection one.
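To make the "regular classification problem" framing concrete, the sketch below trains nothing more than a k-nearest-neighbours vote on pre-labelled points. The classifier choice, the helper name `knn_predict`, the two synthetic features, and the label strings are assumptions for illustration; any supervised algorithm could take its place.

```python
import math
import random
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


random.seed(1)
# Hypothetical pre-labelled data: two numeric features per observation,
# with abnormal examples deliberately much rarer than normal ones.
normal = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
abnormal = [(random.gauss(6, 1), random.gauss(6, 1)) for _ in range(20)]
points = normal + abnormal
labels = ["normal"] * len(normal) + ["abnormal"] * len(abnormal)

print(knn_predict(points, labels, (0.2, -0.5)))
print(knn_predict(points, labels, (6.3, 5.8)))
```

Because abnormal examples are usually a small minority of the labelled data, overall accuracy can look good even when most anomalies are missed, so per-class metrics such as precision and recall tend to be more informative in this setting.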

Semi-supervised detection

This approach models only normality. It requires either pre-classified data marked as normal or the assumption that the training data contains only normal data (applicable only when abnormalities are rare events in the dataset as a whole). In the semi-supervised process, the normal pattern is first taught to a supervised model, and an unsupervised method is then applied to induce the boundary of normality. For time-series data, forecasting methods (e.g. regression techniques, ARMA, LSTM) are often used to learn the normal data in the supervised step. The semi-supervised approach is favourable when normal data is plentiful but abnormal data is very hard to obtain, as in fault detection domains.

Figure: Fundamental steps in the semi-supervised approach
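The two steps of the semi-supervised process can be sketched for a time series as follows. This is a simplified illustration under stated assumptions, not a production design: the supervised step is a first-order autoregressive fit (a much simpler relative of ARMA), the boundary of normality is taken as the mean plus four standard deviations of the training residuals, and the helper names `fit_ar1`, `residuals`, and `simulate`, along with the synthetic data, are hypothetical.

```python
import random
import statistics


def fit_ar1(series):
    """Supervised step: least-squares fit of x[t] = a * x[t-1] + b."""
    xs, ys = series[:-1], series[1:]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def residuals(series, a, b):
    """Absolute one-step forecast errors; entry i is the error at series[i + 1]."""
    return [abs(y - (a * x + b)) for x, y in zip(series[:-1], series[1:])]


def simulate(n):
    """Generate a synthetic 'normal' series (assumption made for the example)."""
    out = [0.0]
    for _ in range(n - 1):
        out.append(0.8 * out[-1] + random.gauss(0, 0.5))
    return out


random.seed(2)
train = simulate(1000)  # assumed to contain normal behaviour only
a, b = fit_ar1(train)

# Unsupervised step: derive the boundary of normality from the spread of the
# forecast errors on the normal-only training data.
train_res = residuals(train, a, b)
threshold = statistics.fmean(train_res) + 4 * statistics.stdev(train_res)

test = simulate(200)
test[120] += 8.0  # injected fault
anomalies = [i + 1 for i, r in enumerate(residuals(test, a, b)) if r > threshold]
print(f"a={a:.2f}, b={b:.2f}, threshold={threshold:.2f}, anomalies={anomalies}")
```

Only normal data is needed to fit both the forecaster and the threshold, which is what makes the approach practical when faults are too rare to collect as labelled examples. Note that a single fault can also raise the forecast error at the following time step, because the faulty value becomes the input to the next prediction.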

Conclusion

No universal methodology for anomaly detection exists. The data-driven approaches above provide a good starting point for developing a solution that meets the detection goals. A full solution, however, requires serious consideration of the distribution model (based on the data distribution), attribute types (feature engineering approaches), scalability (size of the data and its evolution over time), speed (is real-time detection needed?), accuracy targets, and model storage capacity.
