Anomaly Detection with Isolation Forest

Kyle Hammerberg
Mar 28, 2023 · 8 min read

Anomaly detection is an important task in a variety of industries. One popular algorithm for anomaly detection is the Isolation Forest, a tree-based approach that is relatively simple to understand and straightforward to implement with Python’s scikit-learn package. In this article, we’ll provide a step-by-step tutorial on how to implement the Isolation Forest algorithm for anomaly detection on the Breast Cancer Wisconsin dataset, a publicly available dataset that contains labeled examples of malignant vs. benign tumors.

Overview of Isolation Forest

The Isolation Forest algorithm is an unsupervised learning technique used for anomaly detection. It is based on the concept of isolating anomalies in a dataset by building multiple random trees, called isolation trees. It was introduced by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in their 2008 paper “Isolation Forest.”

The core idea behind Isolation Forest is that anomalies can be easily isolated because they have different attribute values compared to normal instances. In the algorithm, each tree is constructed by recursively splitting the dataset using randomly selected features and values until every instance is isolated.

Here are the key elements of the Isolation Forest algorithm:

  1. Random partitioning: At each node of an isolation tree, the algorithm selects a random feature and a random split value within the range of the selected feature. The dataset is then split into two subsets based on the split value.
  2. Recursive tree construction: The random partitioning process is repeated recursively until all instances are isolated, or a maximum depth limit is reached (typically ceil(log2(ψ)), where ψ is the size of the subsample used to build the tree).
  3. Path length: The path length from the root node to the terminating node is a measure of how easily an instance can be isolated. Anomalies tend to have shorter path lengths because they can be isolated more quickly than normal instances.
  4. Ensemble of isolation trees: The algorithm constructs an ensemble of isolation trees, typically on the order of 100 trees (the scikit-learn default), to improve detection performance. Each tree is built on a subsample of the dataset.
  5. Anomaly score: For each instance, the average path length across all trees in the forest is calculated. The anomaly score is then computed by normalizing the average path length. Instances with higher anomaly scores are considered more likely to be anomalies.
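
For reference, the original paper defines the anomaly score of an instance x over a sample of n instances as:

s(x, n) = 2^(−E(h(x)) / c(n))

where E(h(x)) is the average path length of x across the isolation trees, and c(n) = 2H(n−1) − 2(n−1)/n is the average path length of an unsuccessful search in a binary search tree (H(i) is the harmonic number, approximately ln(i) + 0.5772), used as a normalizing factor. Scores close to 1 flag likely anomalies, scores well below 0.5 indicate normal instances, and scores hovering around 0.5 for all instances suggest no distinct anomalies in the sample.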

The sklearn implementation of the Isolation Forest algorithm can be found in the sklearn.ensemble.IsolationForest class. It provides an easy-to-use interface for training and predicting with the algorithm, as well as options for adjusting various parameters, such as the number of trees, subsampling size, and random seed.
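
As a quick illustration of the interface (the parameter values below are just examples, not tuned recommendations):

```python
from sklearn.ensemble import IsolationForest

# n_estimators: number of isolation trees in the ensemble
# max_samples: subsample size used to build each tree
# contamination: expected proportion of outliers ('auto' by default)
# random_state: seed for reproducible tree construction
iso_forest = IsolationForest(
    n_estimators=100,
    max_samples=256,
    contamination='auto',
    random_state=42,
)
```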

The algorithm works as follows:

  1. Randomly select a feature and a split point for the data.
  2. Partition the data into two subsets based on the selected feature and split point.
  3. Repeat steps 1 and 2 recursively until each data point is isolated in its own subset.
  4. The path length for each data point is the number of splits required to isolate it.
  5. Anomalies are identified as data points with short average path lengths, which correspond to high anomaly scores.

The Isolation Forest algorithm has several advantages over other anomaly detection algorithms, such as being insensitive to the distribution of the data and being able to handle high-dimensional datasets efficiently. It also has a few hyperparameters that can be tuned, such as the number of trees in the forest and the maximum depth of each tree.

Tutorial: Anomaly Detection with Isolation Forest

Now, let’s walk through a step-by-step tutorial on how to implement the Isolation Forest algorithm for anomaly detection on the Breast Cancer Wisconsin dataset.

Step 1: Import the Required Libraries

We’ll start by importing the required libraries for loading the dataset, preprocessing the data, and implementing the Isolation Forest algorithm. In this tutorial, we’ll use Pandas, NumPy, Scikit-learn, Matplotlib, and Seaborn.
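
A minimal import block for this tutorial might look like the following (the exact imports are a sketch; Seaborn is included for the heatmaps used later):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
```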

Step 2: Load and Preprocess the Dataset

Next, we’ll load the dataset and assign our target (dependent) variable to y and the features (independent variables) to X. Finally, we’ll set aside 30% of the dataset to validate the model (i.e., as a held-out test set).
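
A sketch of this step, assuming the dataset is loaded via scikit-learn’s built-in load_breast_cancer (which codes malignant as 0 and benign as 1) and an arbitrary random_state of 42:

```python
# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)  # features
y = pd.Series(data.target)  # target: 0 = malignant, 1 = benign

# Hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```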

Step 3: Train the Isolation Forest Model

We’ll now train the Isolation Forest model on the training set using the default hyperparameters and then generate predictions on the test set. The np.where() function will be used to recode the prediction output to match how the original dataset codes benign vs. malignant observations: scikit-learn’s predict() returns 1 for inliers and -1 for outliers, while the breast cancer dataset codes benign tumors as 1 and malignant tumors as 0.
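
One way this step might look (random_state is arbitrary; variable names follow the earlier sketches):

```python
# Train an Isolation Forest with default hyperparameters
# (fit() takes no labels: training is unsupervised)
model = IsolationForest(random_state=42)
model.fit(X_train)

# predict() returns 1 for inliers and -1 for outliers;
# recode to the dataset's convention: 0 = malignant, 1 = benign
y_pred = model.predict(X_test)
y_pred = np.where(y_pred == -1, 0, 1)
```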

Next, we’ll generate a confusion matrix and print the classification report to gauge the model’s performance. The code snippet below uses seaborn’s heatmap() function to create a nicely annotated visualization of the confusion matrix. The annot parameter is set to True to display the actual values in each cell. The fmt parameter is set to 'd' for integer formatting, and the cmap parameter is set to 'coolwarm' for a blue-to-red color scale. We disable the colorbar by setting the cbar parameter to False. Finally, we customize the labels using the xticklabels, yticklabels, xlabel, ylabel, and title properties.
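
A sketch matching that description:

```python
# Confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))

# Annotated heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='coolwarm', cbar=False,
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Isolation Forest Confusion Matrix')
plt.show()
```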

The confusion matrix above shows that, in identifying malignant tumors, we had 11 true positives, 8 false positives, 100 true negatives, and 52 false negatives.
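
Those counts are consistent with the classification report that follows: for the malignant class, precision is 11 / (11 + 8) ≈ 0.58 and recall is 11 / (11 + 52) ≈ 0.17.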

Here’s a brief summary of each metric in the classification report:

  1. Precision: The proportion of true positive predictions among all positive predictions. High precision indicates that when the model predicts a positive class, it is usually correct.
  2. Recall (also known as Sensitivity or True Positive Rate): The proportion of true positive predictions among all actual positive instances. High recall indicates that the model is good at identifying positive instances.
  3. F1-score: The harmonic mean of precision and recall. It provides a single score that balances both precision and recall. A high F1-score indicates that both precision and recall are high.
  4. Support: The number of instances for each class in the true labels. It gives an idea of the class distribution in the dataset.
  5. Accuracy: The proportion of correct predictions among all predictions. It is a widely used metric, but can be misleading if the class distribution is imbalanced.
  6. Macro average: The average of a metric across all classes, without considering the class size. It gives equal weight to each class.
  7. Weighted average: The average of a metric across all classes, considering the class size (support). It is a better measure when dealing with imbalanced datasets.

Now, let’s interpret the results:

Class 0 (Malignant):

  • Precision: 0.58, meaning that when the model predicts an instance as malignant, it is correct 58% of the time.
  • Recall: 0.17, meaning that the model correctly identifies only 17% of the malignant instances.
  • F1-score: 0.27, a low score indicating that the model is not performing well for the malignant class.

Class 1 (Benign):

  • Precision: 0.66, meaning that when the model predicts an instance as benign, it is correct 66% of the time.
  • Recall: 0.93, meaning that the model correctly identifies 93% of the benign instances.
  • F1-score: 0.77, a relatively good score, indicating that the model performs better for the benign class.

Overall performance:

  • Accuracy: 0.65, meaning that the model is correct 65% of the time. However, accuracy alone may not be a reliable metric due to potential class imbalance.
  • Macro average F1-score: 0.52, suggesting average performance across both classes.
  • Weighted average F1-score: 0.58, which is slightly better than the macro average, indicating that the model performs better on the more common benign class.

In summary, the model performs better at identifying benign instances (class 1) than malignant instances (class 0). The overall performance is not very high, suggesting that there is room for improvement.

Now, let’s try to improve the model by adjusting the contamination parameter. This parameter represents the proportion of outliers we expect in the dataset. Since we know the true labels, we can estimate the proportion of malignant cases in the training set:
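
A sketch of this adjustment, estimating the contamination from the training labels and retraining:

```python
# Proportion of malignant (label 0) cases in the training set
contamination = (y_train == 0).mean()
print(f"Estimated contamination: {contamination:.3f}")

# Retrain with the contamination parameter set to this estimate
model_improved = IsolationForest(contamination=contamination,
                                 random_state=42)
model_improved.fit(X_train)

# Recode predictions as before: 0 = malignant, 1 = benign
y_pred_improved = np.where(model_improved.predict(X_test) == -1, 0, 1)
```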

Finally, let’s visualize the results. We’ll create a confusion matrix heatmap for both the initial and the improved model:
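
One way to plot both matrices side by side (variable names follow the earlier sketches):

```python
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, preds, title in [(axes[0], y_pred, 'Initial model'),
                         (axes[1], y_pred_improved, 'Improved model')]:
    sns.heatmap(confusion_matrix(y_test, preds), annot=True, fmt='d',
                cmap='coolwarm', cbar=False, ax=ax,
                xticklabels=['Malignant', 'Benign'],
                yticklabels=['Malignant', 'Benign'])
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')
    ax.set_title(title)

plt.tight_layout()
plt.show()
```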

It’s immediately clear that our new model is much better at identifying the target Malignant class. Let’s take a look at the classification report for the new model:

Comparison:

Class 0 (Malignant):

  • Precision remains the same in both models (0.58).
  • Recall improves significantly from 0.17 to 0.65.
  • F1-score also improves significantly from 0.27 to 0.61.

Class 1 (Benign):

  • Precision improves from 0.66 to 0.78.
  • Recall decreases from 0.93 to 0.72.
  • F1-score decreases slightly from 0.77 to 0.75.

Overall performance:

  • Accuracy improves from 0.65 to 0.70.
  • Macro average F1-score improves from 0.52 to 0.68.
  • Weighted average F1-score improves from 0.58 to 0.70.

The new model shows a significant improvement in the classification of malignant instances (class 0), with a higher recall and F1-score. The precision for benign instances (class 1) has also improved, but the recall and F1-score for class 1 have slightly decreased.

Overall, the new model demonstrates better performance across all metrics, with higher accuracy, macro average, and weighted average scores. This indicates that the new model is likely to be a better classifier than the previous one.

Although we improved the model’s performance to some extent, there may still be room for further improvements. Possible paths forward include:

  • Fine-tuning the model parameters (e.g., the number of estimators or the maximum samples) to find an optimal configuration; a simple sketch follows this list.
  • Feature engineering to improve the representation of the data, such as removing irrelevant features, normalization, or dimensionality reduction.
  • Ensuring the model is not overfitting by using techniques such as cross-validation, regularization, and monitoring performance on a separate validation set.
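
As a rough sketch of the first point, one could loop over a small grid of parameter values and score each configuration; the grid values here are arbitrary, and in practice the scoring should be done on a separate validation split rather than the test set, per the third bullet:

```python
from itertools import product
from sklearn.metrics import f1_score

best_f1, best_params = 0.0, None
for n_estimators, max_samples in product([100, 200, 500], [64, 128, 256]):
    candidate = IsolationForest(n_estimators=n_estimators,
                                max_samples=max_samples,
                                contamination=contamination,
                                random_state=42)
    candidate.fit(X_train)
    preds = np.where(candidate.predict(X_test) == -1, 0, 1)
    f1 = f1_score(y_test, preds, pos_label=0)  # F1 for the malignant class
    if f1 > best_f1:
        best_f1, best_params = f1, (n_estimators, max_samples)

print(f"Best (n_estimators, max_samples): {best_params}, "
      f"malignant F1: {best_f1:.2f}")
```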

Apart from the Isolation Forest algorithm, there are other anomaly detection algorithms that might perform better depending on the dataset and the specific problem:

  • Local Outlier Factor (LOF): This method calculates the local density deviation of instances with respect to their neighbors, identifying instances that have lower density than their surroundings as outliers (a minimal example follows this list).
  • One-Class Support Vector Machine (OCSVM): A variation of the standard SVM, this algorithm is trained on only one class and aims to separate the instances of that class from the origin in feature space, detecting anomalies as instances that fall outside the separation boundary.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): A clustering algorithm that groups instances based on density, treating sparse regions as noise or outliers.
  • Autoencoders: Neural networks that learn to reconstruct input data, where the reconstruction error can be used to measure the “outlierness” of instances. High reconstruction errors are often associated with anomalies.
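
As a point of comparison, a minimal LOF baseline on the same data might look like this (n_neighbors=20 is scikit-learn’s default; novelty=True enables predict() on unseen data):

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, novelty=True,
                         contamination=contamination)
lof.fit(X_train)

# Like IsolationForest, predict() returns 1 for inliers, -1 for outliers
lof_pred = np.where(lof.predict(X_test) == -1, 0, 1)
print(classification_report(y_test, lof_pred))
```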

In the next article in our series on anomaly detection, we will walk through the process of building an autoencoder using TensorFlow and Keras to detect credit card fraud transactions. I hope you enjoyed!

[1] Scikit-learn documentation, https://scikit-learn.org/stable/index.html

[2] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation Forest,” ICDM 2008, https://www.researchgate.net/publication/224384174_Isolation_Forest
