Anomaly Detection in High-Dimensional Categorical Unlabeled Data

Although simple, traditional clustering models (distance- or density-based) can be applied in some situations, using them for anomaly detection in real-world applications is rarely useful or wise.


Key Takeaways from this post

  • Understand the key challenges in applying popular ML models to unlabeled (unsupervised) anomaly detection
  • Understand which models can and cannot be applied
  • Identify ways to overcome those challenges

Challenges with Categorical Data

Most machine learning models and frameworks (Scikit-learn, TensorFlow, PyTorch) work on numerical values, while most categorical data (e.g., January, February, Sunday, Monday) comes in the form of strings (though not necessarily).

In order to use these machine learning models, the usual approach is to convert the categorical data into numeric form, e.g., label-encoded or one-hot-encoded data.

Let's consider a small sample of categorical data with three features, each of them categorical.

One-hot encoding (dummy-variable columns) of such data produces one 0/1 column per category value, as in the sketch below:
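A minimal pandas sketch (the feature names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample: three categorical features, all string-valued
df = pd.DataFrame({
    "month":   ["January", "February", "January", "March"],
    "weekday": ["Sunday", "Monday", "Sunday", "Monday"],
    "status":  ["OK", "OK", "FAIL", "OK"],
})

# One-hot encoding: one 0/1 dummy column per (feature, value) pair
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
# ['month_February', 'month_January', 'month_March',
#  'weekday_Monday', 'weekday_Sunday', 'status_FAIL', 'status_OK']
```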

Challenges with Distance/Density-Based Models

Common distance/density-based unsupervised models include k-means clustering, Gaussian mixture models, and DBSCAN.

The main issue with categorical values is that the distance between any two values is not well-defined and therefore not meaningful. Even statistical measures like the mean and standard deviation (which form the basis of many distribution-based models) cannot be applied to categorical values.
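A quick sketch of why this breaks down (illustrative values only): label encoding invents an ordering, and one-hot encoding makes every pair of distinct values equidistant, so neither yields a meaningful distance:

```python
import numpy as np

# Label encoding imposes an arbitrary ordering on the categories:
labels = {"January": 0, "February": 1, "December": 11}
print(abs(labels["January"] - labels["December"]))  # 11 -- an artifact of the encoding

# One-hot encoding removes the false ordering, but then every pair of
# distinct categories is exactly sqrt(2) apart, so distance carries no signal:
jan, feb, dec = np.eye(3)
print(np.linalg.norm(jan - feb), np.linalg.norm(jan - dec))  # both ~1.414
```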

Solution

A method that worked particularly well (tried on a real-world dataset) for dealing with categorical values in the context of unsupervised anomaly detection is CompreX, proposed in this paper [1]. Its main idea is to use the minimum description length (MDL) principle from information theory. The intuition behind this principle is that the observations in a dataset can be compressed and encoded with a minimum coding length; observations that do not compress well do not fit that dataset.

MDL Formal Definition

Formally, MDL can be stated as follows:
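In its standard two-part form (the notation here follows the usual textbook convention rather than the paper's exact symbols), MDL picks the model that minimizes the combined cost of the model and of the data encoded with its help:

```latex
M^{*} = \arg\min_{M \in \mathcal{M}} \left[ L(M) + L(D \mid M) \right]
```

where L(M) is the number of bits needed to describe the model M, and L(D | M) is the number of bits needed to encode the data D given M.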

Intuitively, the statement above means that MDL tries to find the best model: the one that minimizes the overall encoding length of the data plus the encoding length of the model itself. The term "model" is used loosely here; in this particular context it refers to the encoding of features and their values in the form of code tables, using Shannon entropy to calculate the optimum code lengths.
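As a rough illustration (this is not the paper's code-table construction, just the Shannon code-length idea it builds on), the optimal code length for a value with empirical probability p is -log2(p) bits:

```python
import math
from collections import Counter

def code_lengths(values):
    """Shannon-optimal code length, in bits, for each distinct value: -log2(p)."""
    counts = Counter(values)
    n = len(values)
    return {v: -math.log2(c / n) for v, c in counts.items()}

print(code_lengths(list("AAAAAAAB")))
# {'A': 0.19..., 'B': 3.0} -- the frequent value gets a much shorter code
```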

This can be used effectively to identify observations that do not compress well under this encoding and to classify them as anomalies.

One good aspect of this model is that the data doesn't need to be converted to numerical values: it can be applied directly to categorical (string-labelled) values. It also deals with correlations between the categorical features in the data.

To illustrate how the model works, we can look at the code-table example in the CompreX paper [1]: more frequent patterns get short code lengths and therefore small anomaly scores. An anomaly score is simply the total encoding length of a data point.
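A toy sketch of the scoring idea (simplified: CompreX builds joint code tables over groups of correlated features, whereas this treats each feature independently):

```python
import math
from collections import Counter

def anomaly_scores(rows):
    """Score each row as the sum of its per-feature code lengths, in bits."""
    n_features, n_rows = len(rows[0]), len(rows)
    freqs = [Counter(row[j] for row in rows) for j in range(n_features)]
    return [
        sum(-math.log2(freqs[j][row[j]] / n_rows) for j in range(n_features))
        for row in rows
    ]

data = [("Jan", "Sun"), ("Jan", "Sun"), ("Jan", "Sun"), ("Dec", "Mon")]
print(anomaly_scores(data))
# [0.83, 0.83, 0.83, 4.0] -- the rare ("Dec", "Mon") row scores highest
```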

Code

Code can be found at https://github.com/HamedMP/CompreX. However, it may need slight changes (I found some issues with my dataset), and performance could be optimized when the data has many features and data points.

Conclusion

I hope this post has been helpful, and I would like to hear any suggestions, or about use cases where this model has not worked. The limitation of this model is that it only works on categorical features. I believe research is under way on combining continuous and categorical features to solve unsupervised anomaly detection.

I have also tried other approaches, such as autoencoders scored by reconstruction error, but in most of these I found many false positives, rendering them ineffective.

References

  1. Akoglu, Tong, Vreeken, Faloutsos, "Fast and Reliable Anomaly Detection in Categorical Data" (CompreX): https://eda.mmci.uni-saarland.de/pubs/2012/comprex-akoglu,tong,vreeken,faloutsos.pdf
  2. Dummy variables in regression: https://stattrek.com/multiple-regression/dummy-variables.aspx
  3. Minimum description length: https://www.sciencedirect.com/topics/computer-science/minimum-description-length
  4. Shannon entropy: https://arxiv.org/ftp/arxiv/papers/1405/1405.2061.pdf


Originally published at http://datassience.wordpress.com on December 12, 2021.
