Swamping and Masking in Anomaly detection: How Subsampling in Isolation Forests helps mitigate this ?
Isolation Forests achieve high anomaly detection performance with high efficiency with a very small subsampling size unlike most of the anomaly detection algorithms in which larger the sampling size better is the detection performance.
One of my recent tasks at work was to identify fraud patterns related to returns and refund of online purchase of items. The test dataset I was dealing with was close to 2 million customers and 40 features. The features available in the data included statistics related to value and volume of refund transactions, return product categories, reasons behind returns, distance between shipping and billing addresses, payment channels used for orders, orders and returns timestamp and few others. I tried unsupervised learning techniques like K-Means Clustering and DBSCAN (Density Based Clustering) and it was taking more than 20 mins to converge on my 16GB Ram MacbookPro.
Then I attempted Isolation Forests on the same dataset, and it was 10X faster. However, we will not talk much about the execution speed here. I will discuss how Isolation Forests converges faster with real time performance statistics in another blog. Apart from the speed of execution, I also observed high detection performance. This made me explore more on this algorithm.
There are already few articles/blogs on the working of the Isolation Forest algorithm. You can read more about it here and here. However, the subsampling feature which helps in reducing Swamping and Masking is not very much talked about. I have also included a brief explanation of Isolation Forest Algorithm in brief in section 3 of this blog.
In the sections below, I will introduce the problem of Swamping and Masking in Anomaly detection and how the Isolation Forest setup helps in dealing with these problems effectively. Most of the concepts explained in this blog are based on the original Isolation Forest Research Paper by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou.
1) What is Anomaly Detection?
Anomaly Detection (also outlier detection) is the identification of rare items, events or observations.
2) What is Swamping and Masking?
Anomalies are the rare events, and this makes it very difficult to label these with high accuracy. Swamping is the phenomenon of labelling normal events as anomalies.
When clustering algorithms are used, the data points belonging to different clusters gets merged into one cluster, if the number of segments (including outlier segments) in the dataset is not known. This causes the outlier cluster merged to a cluster with normal data points. Basically, the outliers are not detected. This is called Masking. Swamping and Masking are more common when dataset size is large. This will be evident from the example shared in the next section. Any anomaly detection technique should be robust against swamping and masking.
3) Isolation Forest Algorithm in Brief
Isolation Forest builds an ensemble of trees on the dataset based on the random cutoffs on the features selected at random. These trees isolate every point and the average path length by all the trees to isolate the points helps in identifying the anomalies. Shorter the average path, more anomalous the data instance is. Let us take an example of a dataset with two features x1 and x2. Isolation Forest follows basic principle of decision trees and will create splits on the values of x1 and x2 to define(isolate) the points. It can be observed from the below figure that a normal point N is isolated by having multiple splits on both the features x1 and x2. The path length is long.
However, an anomaly point A is isolated by just one split on x1.
Hence, a higher isolation score is assigned to A compared to N. I will try to discuss more on the calculation of the Isolation score in another blog in future.
The Average Path Lengths (Average number of splits required to isolate a point) converge with increase in the number of trees. In the above figure, the Anomaly Point (A) has shorter average path length compared to Normal point (N)
4) Swamping and Masking handled with Subsampling!!
Most of the anomaly detection algorithms work well with large number of data instances. Large training data helps in profiling the normal instances better and the instances not aligned to these profiles are labelled as outliers.
Isolation forests are not required to grow the tree to isolate every normal data instance as the anomaly data points are isolated much earlier than the normal data instances. Normal data instances constitute the majority of the training data points.
Isolation forests works best when small sampling size is used. When the sample size is large, the normal instances are too close to the anomalies. This is evident from the explanation below.
The number of partitions required by each isolation tree to isolate the anomalous instances increases when the sample size is large. This increases the average path lengths for all(anomalous and non-anomalous) the instances. It is more difficult and time taking in distinguishing the anomalous data points.
This causes Swamping (labelling normal instances as anomalies) and Masking ( existence of too many anomalies) . Using large number of records for identifying anomalies is the primary reason behind Swamping and Masking. Isolation Forest grows the individual trees on different subsamples of the data.
1) The subsamples used have less densely populated data points making it easier for detecting the anomalous points. There are very less number of data points which are neither clear cut anomalous nor non-anomalous. As the subsampling is random, the probability of excluding most of the points in this category is high making distinction between these two classes difficult .
2) Also, as each tree uses different subsamples, each individual tree also identifies different sets of anomalies.
As seen in the above figure, this subsample has two sets and one standalone anomalous instance. It was only possible to isolate the anomalous instances by removing the clutter and mix of normal and anomalous instances. The number of subsamples to be considered to grow every tree becomes a tuning parameter here in addition to number of trees. I will try to share the realtime performance of the algorithm with different sub samples in one of my next blogs.
I hope you got a good intuition about isolation forest algorithm and how it achieves high anomaly detection performance reducing swamping and masking by leveraging the subsampling technique.
5) References:
https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest
https://en.wikipedia.org/wiki/Anomaly_detection
http://www.matias-ck.com/files/papers/Extended_Isolation_Forest.pdf