Alone in the Woods: Anomaly Detection using Isolation Forests

Carmen Riddel
SFU Professional Computer Science
10 min read · Feb 4, 2020

By: Archana Subramanian, Asha Tambulker, Carmen Riddel and Shreya Shetty

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit {sfu.ca/computing/pmp}.

What are Anomalies and How do we Detect them?

The word anomaly means unusual; an unexpected change within data or an event that does not conform to the expected data pattern is considered an anomaly. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions [1, 2].

Example: Let us take a simple day-to-day scenario of a coffee shop to understand this better. Footage from the security cameras inside the shop is analyzed regularly, so we can determine the number of people in the shop at any given time. Suddenly, the number of customers spikes and is at its highest from 4 to 5 pm. Even though this is a positive sign for the coffee shop, it is still an anomaly, since the trend differs from the rest of the day. The owner can leverage this information to increase the coffee supply during these hours to meet demand.


Noise: Anomalies sometimes act as noise in the data set; that is, any undesirable or uninteresting factors that arise due to random errors. In many cases, we need to get rid of noise before proceeding with analysis [2].

BEWARE: Excessive noise in data may lead to incorrect predictions.

Outliers: These are points that differ significantly from the rest of the data set. Detecting such unusual patterns improves the accuracy with which we can track changes in the trends of the domain under analysis and helps flag risky situations [3]. Examples of outliers include unusual, identifiable patterns in MRI scans that help detect symptoms of disease, fraudulent transactions, etc.

There are several areas where detecting anomalies can help identify fraudulent or abnormal activity [2, 3]:

  • In the banking sector — anomalies in card/online transactions can help detect theft of card/credentials.
  • In science — the identification of an anomalous point in space may lead to novel discoveries and theories.
  • In computer networks — to detect abnormal traffic patterns that indicate unauthorized access. It is also used to detect and classify spam email messages.
  • In the medical sector — anomaly detection is critical in areas such as radiation oncology, where errors are rare but can be fatal when they do occur. It is also used to identify tumours and to trace patient logs over time.

Isolation Forests

There are many ways to determine whether a given data point is an anomaly. Many anomalies can be detected with cluster analysis, by looking for the small micro-clusters they form. Here, we concentrate on the isolation forest — an anomaly detection algorithm that works on an unlabelled data set under the assumption that most of the instances in the data set are normal [4].

The isolation forest method is similar in principle to the well-known random forest. Like any tree ensemble method, isolation forests are built from decision trees; points that can be separated from the rest of the data with few splits are more likely to be anomalies [5]. We will discuss two different isolation forest algorithms: traditional and extended.

Why Isolation Forest?

  • Other anomaly detection methods such as support vector machines, decision trees, k-nearest neighbours and logistic regression require a labelled data set to form a classifier; finding a labelled data set can be difficult for some use cases. Isolation forests are generally used in an unsupervised manner.
  • Isolation forests use random partitioning which makes it easier to discover anomalies.
  • Isolation forests require only a few splits to separate anomalies from normal observations, compared to other methods that rely on distance and density measures.
  • Isolation forests have linear time complexity with a low constant and a small memory requirement, eliminating the major computational cost of distance calculation found in all distance- and density-based methods.
  • Isolation forests work nicely in a multi-dimensional feature space [4].

Applications of Isolation Forests

  • Automated detection of abusive content such as spam or fake engagement [6].
  • Detecting network traffic and identifying system intrusion in an organization [6].
  • Anomaly detection in time-series data [2].
  • Credit card fraud detection [3].
  • Determining anomalies in data center structure/schema [6].

Like other anomaly detection methods, isolation forests may suffer from swamping and/or masking. Swamping is the phenomenon in which normal values in the data set are classified as anomalies; masking is the opposite, where anomalous points are classified as normal. Both of these minor setbacks can be overcome by sub-sampling: each tree is built from a small random sample of the training data rather than the full set, and with a small sampling size the algorithm can determine with greater accuracy whether a point is normal or an anomaly [4].

How Does it Work?

The algorithm builds an ensemble of isolation trees (iTrees) for the given data set and anomalies are those instances which have short average path lengths on the iTrees. An iTree is a proper binary tree, where each node in the tree has zero or two daughter nodes [4].

Procedure for Each Tree using the Traditional Method

  1. Two features are chosen at random (in the 2-D illustration below, one for the x-axis and one for the y-axis).
  2. A single data point is chosen at random.
  3. The data are split at a random value between the minimum and maximum of one of the selected features.
  4. The tree branches according to the chosen point’s position relative to the split: if the point is below or to the left of the split (or equal to it), the tree branches left; if it is above or to the right of the split, it branches right.
  5. Partitioning is repeated recursively until all instances are isolated or, if specified, a maximum tree height is reached.
[Figure from [4]: partitioning required to isolate a normal point xᵢ versus an anomalous point x₀]

In the figure above, the authors choose two points:

xᵢ — A normal (inlier) point which requires more partitions to be isolated.
x₀ — An anomaly which requires fewer partitions to be isolated.

The number of partitions required to isolate a point is equivalent to the path length from the root node to the terminating node (or to the node at which the maximum allowed height, if provided, is reached). Shorter paths indicate anomalies, while longer paths indicate normal points [4, 7].
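To make this concrete, here is a small Python sketch (not code from the original post) of the random, axis-parallel partitioning a single tree performs; it returns the number of splits needed before a query point is isolated:

```python
import numpy as np

def isolation_path_length(X, x, height=0, max_height=10, rng=np.random.default_rng(0)):
    """Recursively partition X with random axis-parallel splits and return
    the number of splits made before the query point x is isolated."""
    if len(X) <= 1 or height >= max_height:
        return height
    feature = rng.integers(X.shape[1])           # choose a feature at random
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                 # this subset cannot be split any further
        return height
    split = rng.uniform(lo, hi)                  # random split value between min and max
    below = X[:, feature] < split
    X_next = X[below] if x[feature] < split else X[~below]
    return isolation_path_length(X_next, x, height + 1, max_height, rng)

# An outlying point is usually isolated in far fewer splits than a typical one.
data = np.random.default_rng(1).normal(size=(256, 2))
print(isolation_path_length(data, np.array([8.0, 8.0])))  # anomaly: usually a short path
print(isolation_path_length(data, data[0]))                # typical point: usually a longer path
```

Averaging this path length over many trees and normalizing it, as described in the next section, yields the anomaly score.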

[Figure from [8]: isolation tree paths for an anomalous point (red) and a normal point (blue)]

In the above figure, the height of the point along the red path is 3, indicating that the point was isolated quickly; it is considered anomalous. In contrast, the blue path shows the trajectory of a normal point.

Calculating Path Length and Anomaly Scores

Since iTrees have an equivalent structure to a binary search tree (BST), the estimation of average h(x) for external node terminations is the same as an unsuccessful search in a BST [4].

Given a set of n data points/instances, the average path length of an unsuccessful search, c(n), is:

c(n) = 2H(n − 1) − 2(n − 1)/n

where H(n − 1) is the harmonic number, estimated by ln(n − 1) + 0.5772156649 (Euler’s constant); c(n) is used to normalize h(x).

The anomaly score s of an instance x is calculated as:

s(x, n) = 2^(−E(h(x)) / c(n))

where E(h(x)) is the average of h(x) over a collection of isolation trees.

when E(h(x)) → 0, s → 1 (anomaly);
when E(h(x)) → n − 1, s → 0 (normal);
when E(h(x)) → c(n), s → 0.5 (neither anomaly nor normal).
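As a quick numerical illustration (again a sketch, not code from the original post), c(n) and the anomaly score can be computed in a few lines of Python:

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n instances."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + EULER_GAMMA       # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

print(anomaly_score(2.0, 256))     # short average path -> high score (likely anomaly)
print(anomaly_score(c(256), 256))  # average path equal to c(n) -> score of exactly 0.5
```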

What Score Constitutes an Anomaly?

The original isolation forest publication [4] offers these words of wisdom:

  • If the score is very close to 1, the point is very likely an anomaly.
  • A score much smaller than 0.5 indicates a normal observation.
  • If all scores are close to 0.5, then the data set likely does not have any anomalies.

This is quite vague and subjective. Your cut-off point may depend on the distribution of your data, but with a high-dimensional set, this may be difficult to determine. Scikit-learn’s isolation forest function includes a modifiable contamination parameter to indicate what proportion of data points should be classified as anomalies. The default is 0.1 [9]. At the end of the day, the threshold you use really depends on your use case.
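For reference, a minimal scikit-learn sketch looks like the following; the data here are placeholders, and the contamination value is the knob you would tune for your use case:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder data: 500 unlabelled observations with 4 features
rng = np.random.RandomState(42)
X = rng.normal(size=(500, 4))

# contamination sets the proportion of points that will be flagged as anomalies
clf = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
clf.fit(X)

labels = clf.predict(X)            # +1 for normal points, -1 for anomalies
scores = clf.decision_function(X)  # lower (more negative) values are more anomalous
print((labels == -1).sum(), "points flagged as anomalies")
```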

Why Extended Isolation Forests?

  • Decision boundaries in traditional isolation forests are axis-parallel (vertical or horizontal lines in two dimensions). This leaves regions that contain many branch cuts but only one or a few observations, which may mask some anomalies as normal (these are referred to as ghost regions) [8].
An example of ghost regions. The lighter coloured zones in the top right and bottom left quadrants of the rightmost plot indicate areas where data points would likely be classified as normal, even though we see in the leftmost plot that no data are present there. From [8]
  • Extended isolation forests mitigate the ghost region issue by splitting the data with hyperplanes that have random slopes (a small sketch of such a split follows below). A hyperplane has dimension one less than the dimension of the data. This is difficult to visualize in high dimensions, but a 2-D example is shown below [8].
  • Variance between scores tends to be lower with extended isolation forests compared to traditional ones, making it easier to detect anomalies [8].
An example of a 2-dimensional hyperplane among 3-dimensional data. From (https://jovian.ml/aakashns/02-linear-regression)
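As a rough illustration of such a random-slope cut (a sketch only, not the EIF library’s internal code), a single split in the extended method can be expressed with a random normal vector and a random intercept point:

```python
import numpy as np

def random_hyperplane_split(X, rng=np.random.default_rng(0)):
    """Split the data with a randomly oriented hyperplane, as in the
    extended isolation forest, instead of an axis-parallel cut."""
    normal = rng.normal(size=X.shape[1])                    # random slope (direction of the cut)
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))   # random point within the data range
    left = (X - intercept) @ normal <= 0                    # side of the hyperplane for each point
    return X[left], X[~left]
```

Zeroing out all but one component of the normal vector recovers an axis-parallel cut, which is why an extension level of 0 behaves like the traditional algorithm.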

In 2016, a traditional isolation forest function was introduced in the scikit-learn package [9]. Extended isolation forest functions were developed shortly thereafter, with the ability to change the hyperplane dimension in hopes of minimizing the presence of ghost regions.

Case Study

Here is an example of the extended isolation forest algorithm using the EIF implementation [11]. The original version of EIF was reported to be quite slow; however, the newest one, released in November 2019, was converted to C++ using Cython and is now a lot faster. The EIF package is very easy to use; all you need to run it are NumPy and Cython.

We tested out EIF on a simple data set of breast cancer masses with labelled anomalies (benign/non-cancerous mass = normal, malignant/cancerous = anomaly) and 9 features including mass radius, smoothness and concavity [10]. This data set has a large proportion of anomalies, but it nicely demonstrates EIF.

First, we loaded our data set into a Pandas DataFrame.
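The loading step looked roughly like the sketch below; the file name and column names are placeholders (the ODDS copy of this data set is distributed as a .mat file, so your own loading code may differ):

```python
import pandas as pd

# Hypothetical CSV export of the Breast Cancer Wisconsin data set:
# nine feature columns plus a 0/1 label column named "class"
df = pd.read_csv("breast_cancer_wisconsin.csv")
X = df.drop(columns=["class"]).values  # features used to build the forest
y = df["class"].values                 # 0 = benign/normal, 1 = malignant/anomaly
print(df.shape)
```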

We then implemented EIF [11]. The ‘ExtensionLevel’ parameter indicates the dimension of the hyperplane to use, as discussed above. We can use anything from 0 to 8 (number of data set features minus 1). A level of 0 would be akin to using the traditional isolation forest function (remember those straight, perpendicular lines partitioning our data?). We used the maximum value (8) for this example. The ‘ntrees’ parameter refers to the number of trees our forest will consist of. 100 is a common choice as it was found that path lengths tend to converge at or before this point [4]. Lastly, ‘sample_size’ refers to how many samples we wish to use to build each tree. 256 is the most common choice, but as we have a small data set, we are using 64.
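Our call looked roughly like the sketch below; the exact constructor signature may differ between EIF releases, so treat it as illustrative rather than definitive:

```python
import numpy as np
import eif as iso

# X: the feature matrix loaded above; EIF expects a contiguous float array
X = np.ascontiguousarray(X, dtype=np.float64)

# ExtensionLevel=8 gives fully extended (random-slope) splits; 0 would mimic the traditional forest
forest = iso.iForest(X, ntrees=100, sample_size=64, ExtensionLevel=8)

# Higher scores correspond to points that were isolated quickly, i.e. likely anomalies
scores = forest.compute_paths(X_in=X)
```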

The plot below shows the anomaly score on the x-axis and the labelled class on the y-axis (0 for benign/normal, 1 for malignant/anomaly). As we can see, anomalous points generally have higher scores, which is what we were hoping for. As we are working with a labelled data set, we know exactly how many anomalies there are (239). We extracted the 239 records which had the highest score. Of these, approximately 95% were labelled as anomalies in the data set.

Class 1 = Anomaly, Class 0 = Normal
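The comparison against the labels takes only a few lines (a sketch reusing the hypothetical y and scores arrays from the snippets above):

```python
import numpy as np

n_anomalies = int(y.sum())                   # 239 labelled anomalies in this data set
top_idx = np.argsort(scores)[-n_anomalies:]  # records with the highest anomaly scores
precision = y[top_idx].mean()                # fraction of those that are truly labelled anomalies
print(f"{precision:.0%} of the top-scoring records are labelled anomalies")
```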

The results are by no means perfect; both swamping and masking were observed, as some normal points received higher scores than certain anomalies and some anomalous points received scores within the range of the normal points.

In addition, the author of this post found disappointing results when using EIF on a larger data set.

A Note on Feature Selection

Although isolation forests are beneficial for working with high-dimensional data sets, we cannot ignore the importance of feature selection. In many real-world use cases, we may need to select features relevant to our expected result, instead of training a model on all available features. Feature selection is important as it reduces the complexity of the model, making it easier to train and improving model performance.

Conclusion

Even if anomalies are unusual or undesirable, they reveal a lot of important information which can help predict and prevent hazardous events or take advantage of profitable ones. Though there may be several algorithms to detect anomalies, isolation forests are simple and effective. The extended isolation forest algorithm is far from perfect; however, its efficiency and efficacy on high-dimensional data make it an important player in the anomaly detection game. Perhaps we will see even more improvements upon it in the future.

References

[1] https://blogs.oracle.com/datascience/introduction-to-anomaly-detection
[2] https://towardsdatascience.com/anomaly-detection-with-isolation-forest-visualization-23cd75c281e2
[3] Ounacer, S., et al. (2018, December). Using Isolation Forest in anomaly detection: the case of credit card transactions. Periodicals of Engineering and Natural Sciences, 6(2), 394–400.
[4] Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008, December). Isolation forest. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on (pp. 413–422). IEEE.
[5] https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
[6] https://engineering.linkedin.com/blog/2019/isolation-forest
[7] https://quantdare.com/isolation-forest-algorithm/
[8] Hariri, S., Kind, M.C., & Brunner, R.J. (2018, April). Extended Isolation Forest. University of Illinois at Urbana-Champaign.
[9] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
[10] http://odds.cs.stonybrook.edu/breast-cancer-wisconsin-original-dataset
[11] https://github.com/sahandha/eif
