Anomaly detection with isolation forest in scikit-learn

Published in

MLthinkbox

8 min read · Aug 14, 2022

This post will provide a quick summary of the key details to get started with isolation forests in scikit-learn and can serve as a platform for your future research.

Introduction to the isolation forest algorithm

Anomaly detection is a process of finding unusual or abnormal data points in a dataset. It is an important technique for monitoring and preventing fraud, as well as for detecting errors in data. Unfortunately anomalies can be very hard to detect given that: 1) they typically occur infrequently, 2) they can have no clear patterns of identification, and 3) abnormalities can be distinctly different even within the same dataset.

An isolation forest is one of the most popular algorithms for anomaly detection. The general idea is that anomalies (outliers) can be more easily separated (isolated) from the rest of the dataset because they are few in number and have feature values that differ from the bulk of the data. In essence the algorithm turns the problem on its head: rather than building a profile of normal data and flagging whatever deviates from it, it isolates the abnormal data points directly. Because it does not need labels (abnormalities are generally hard to label), it is an unsupervised algorithm.

The algorithm works by building an ensemble of randomized decision trees (isolation trees). Each tree is grown on a random subsample of the data by repeatedly picking a feature at random and then a random split value between that feature's minimum and maximum. The splits are axis-parallel (try to visualize dividing lines being drawn through the dataset, each perpendicular to one feature axis). The trees are not merged into a single final tree; instead, the number of splits needed to isolate a point is averaged across the forest, and points that need only a few splits to be isolated receive higher anomaly scores. For more details on isolation forest, see the scikit-learn documentation.
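To make the "few splits means anomalous" idea concrete, here is a minimal sketch on made-up data (the toy points and variable names are assumptions for illustration, not part of the credit card example). It fits a forest on a Gaussian blob plus one far-away point and compares their scores.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 500 "normal" points around the origin plus one obvious outlier (toy data)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_outlier = np.array([[8.0, 8.0]])
X_toy = np.vstack([X_normal, X_outlier])

toy_forest = IsolationForest(n_estimators=100, random_state=0).fit(X_toy)

# score_samples returns the negated anomaly score: lower means more abnormal
scores = toy_forest.score_samples(X_toy)
outlier_score = scores[-1]
median_normal_score = np.median(scores[:-1])

print(f"outlier score: {outlier_score:.3f}")
print(f"median score of normal points: {median_normal_score:.3f}")
```

The far-away point should come out with the lowest score in the set, since it tends to be cut off from the rest after only a handful of random splits.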

Why use isolation forest for anomaly detection?

A summary of the benefits of using isolation forest for anomaly detection is as follows:

It is a popular algorithm with strong community documentation.
It has linear time complexity with arguably low computational resource requirements.
It does not use any distance or density measures, which helps to keep the algorithm comparatively fast.
It seems to work well with high-dimensional problems that may have a large number of irrelevant attributes.

Potential limitations in using isolation forest

One noted limitation is that the algorithm can struggle to identify local anomalies (points that are unusual only relative to their immediate neighbourhood), so detection quality can be affected. More details on this issue can be found here.
Given that it is an unsupervised algorithm it will be hard to know if you are indeed correctly identifying abnormalities, which will necessitate the support of key stakeholders within the problem domain.

Algorithm requirements and input parameters

The key requirements for this project are to have numpy, pandas, and scikit-learn installed on your machine. Additionally, for the purposes of visually exploring the data, we will use matplotlib, seaborn and plotly express.

To get started, let's inspect the parameters of the isolation forest algorithm using the Python help function.

from sklearn.ensemble import IsolationForest
help(IsolationForest)

Important parameters of the algorithm are:

n_estimators: the number of trees (estimators), i.e. how big the forest is.
contamination: the fraction of the dataset that contains abnormal instances, e.g. 0.1 or 10%. The algorithm is very sensitive to this parameter and it is incredibly useful to have an initial hunch on the expected rate of occurrence of abnormalities.
max_samples: the number of samples to draw from the training set to train each isolation tree.
max_features: the number of features to draw from the training set to train each tree. Note that scikit-learn's IsolationForest has no max depth parameter; the depth of each tree is capped automatically based on max_samples, so a smaller max_samples gives shallower trees and makes things faster.
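The sketch below sets these parameters explicitly on synthetic data (the random data and variable names are assumptions for illustration). It shows that the share of points flagged on the training data follows the contamination value closely.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 5))  # synthetic data, no real anomalies

forest = IsolationForest(
    n_estimators=100,    # number of trees in the forest
    max_samples=256,     # rows drawn to build each tree
    contamination=0.1,   # expected share of anomalies
    max_features=1.0,    # use all features for every tree
    random_state=42,
)
forest.fit(X_demo)

# predict returns -1 for anomalies and 1 for normal points
flagged_fraction = np.mean(forest.predict(X_demo) == -1)
print(f"fraction flagged: {flagged_fraction:.3f}")
print(f"samples used per tree: {forest.max_samples_}")
```

Even though this data contains no real anomalies, roughly 10% of it is flagged. This is why a sensible prior for contamination matters so much.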

Code example: credit card fraud detection

In this example we will analyze a credit card transaction dataset for anomalies using the isolation forest algorithm. The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds = 1) accounts for 0.172% of all transactions. It can be accessed from here, courtesy of Kaggle.

Import primary modules

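import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Show floats with two decimals in DataFrame output
pd.set_option("display.float_format", "{:.2f}".format)
# Jupyter notebook magic; omit when running as a plain script
%matplotlib inline
sns.set_style("whitegrid")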

Load data and inspect columns

data = pd.read_csv("creditcard.csv")
data.head()

data['Class'].value_counts()

The dataset has 31 columns consisting of 30 potential features and 1 label column “Class” (anomalies=1). We also confirm that the class contains a very low number of anomalies (492).

Time feature exploration

The dataset contains a time column (in seconds) which could potentially influence how we approach the problem. There appears to be no visual relationship in the time distributions between fraudulent and non-fraudulent transactions.

plt.figure(figsize=(14, 12))

# Left panel: time distribution of fraudulent transactions
plt.subplot(2, 2, 1)
data[data.Class == 1].Time.hist(bins=35, color='blue', alpha=0.6, label="Fraudulent Transaction")
plt.legend()

# Right panel: time distribution of non-fraudulent transactions
plt.subplot(2, 2, 2)
data[data.Class == 0].Time.hist(bins=35, color='blue', alpha=0.6, label="Non-Fraudulent Transaction")
plt.legend()

Use PCA to try and visualize abnormalities

Using PCA we can visualize high-dimensional problems in a lower-dimensional plane. From the scatter matrix we can see that the anomalies tend to occupy specific regions within the scatters, especially for components PC1 vs PC3. What we would want to see at the end of this exercise is that the isolation forest identifies similar regions once trained.

import plotly.express as px
from sklearn.decomposition import PCA

n_components = 3
pca = PCA(n_components=n_components)
# Fit PCA on the features only, so the label column does not leak into the projection
components = pca.fit_transform(data.drop(columns='Class'))
total_var = pca.explained_variance_ratio_.sum() * 100

labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Class'
fig = px.scatter_matrix(
    components,
    color=data['Class'],
    dimensions=range(n_components),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()

Data preparation

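Additional transformations such as log normalization and min-max scaling are not required for isolation forests. Each split value is drawn between a feature's minimum and maximum, so rescaling a feature should have minimal influence on the number of 'algorithmic slices' required to isolate data points. Non-linear transformations such as a log transform can shift which points look isolated on heavily skewed features, so treat them as optional rather than as a prerequisite.

Splitting the data into train, test and validation sets is typically unnecessary for an isolation forest, given that in most cases there will be no labelled anomaly data on which to judge accuracy. In this example we will simply compare the predictions with the actual label for the entire dataset.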

X = data.drop('Class', axis=1)
y = data.Class

from sklearn.ensemble import IsolationForest

Iforest = IsolationForest(max_samples=100,
                          random_state=1111,
                          contamination=0.05,
                          max_features=1.0,
                          n_estimators=100,
                          verbose=1,
                          n_jobs=-1)
Iforest.fit(X)

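The algorithm identified 14,241 data points as abnormal (approx. 5% of the 284,807 transactions). This matches the contamination input parameter by construction, because contamination sets the score threshold below which points are labelled as anomalies.

y_pred = Iforest.predict(X)
# Map scikit-learn's -1 (anomaly) / 1 (normal) to 1 / 0 to match the Class label
y_pred_adjusted = [1 if x == -1 else 0 for x in y_pred]
sum(y_pred_adjusted)
# Out: 14241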

Evaluation

Given that we are focused on anomaly detection it is important to measure algorithm effectiveness based on precision or recall rather than accuracy.

For evaluation of the algorithm we will calculate macro-averaged metrics: precision, recall and F-measure are computed for each class from its true positives, false negatives and false positives, and then averaged without weighting across the two classes. Sklearn has inbuilt functions to support computation of precision, recall, F-measure and support for each class.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label a negative sample as positive.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. F1 = 2 * (precision * recall) / (precision + recall)
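As a quick check of the three definitions above, here is a small worked example on hand-made labels (the arrays are made up for illustration). It computes each metric from the raw counts so the results can be compared against scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hand-made labels: 4 true anomalies, 5 points flagged by a hypothetical model
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # flagged and truly anomalous
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # flagged but actually normal
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # anomalous but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))  # 0.6 0.75 0.667
```

Calling precision_score, recall_score and f1_score on the same arrays should return the same three values.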

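The function returns its results in the order (precision, recall, F-score, support). The macro-averaged recall of roughly 0.89 is an encouraging result and suggests that most of the anomalous data is being isolated by the algorithm. The macro-averaged precision, however, is only about 0.51. With 14,241 points flagged and only 492 labelled frauds, precision on the anomaly class cannot exceed 492 / 14,241, or about 3.5%.

from sklearn.metrics import precision_recall_fscore_support

precision_recall_fscore_support(y, y_pred_adjusted, average='macro')
# Out: (0.5140217751041775, 0.8862351802914121, 0.5148736994136962, None)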
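Because macro averaging blends the majority and minority classes, it can help to look at the per-class values as well. The sketch below uses made-up labels that mimic a heavily imbalanced problem (1% anomalies, 5% of points flagged). These are assumptions for illustration, not the credit card results.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Synthetic labels: 10 anomalies out of 1000 points
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# Hypothetical predictions: 50 points flagged, 8 of them true anomalies
y_hat = np.zeros(1000, dtype=int)
y_hat[:8] = 1      # 8 true positives
y_hat[10:52] = 1   # 42 false positives

# average=None returns one value per class (index 0 = normal, 1 = anomaly)
per_class_p, per_class_r, per_class_f, support = precision_recall_fscore_support(
    y_true, y_hat, average=None
)
macro_p, macro_r, macro_f, _ = precision_recall_fscore_support(
    y_true, y_hat, average="macro"
)

print("anomaly-class precision:", round(per_class_p[1], 3))
print("anomaly-class recall:", round(per_class_r[1], 3))
print("macro precision:", round(macro_p, 3))
print("macro recall:", round(macro_r, 3))
```

In this toy setup the macro precision looks moderate only because the near-perfect precision on the normal class is averaged with a low precision on the anomaly class.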

As we can see from the confusion matrix the model falsely predicts a large proportion of the values to be abnormalities, despite this not being the case according to the labels.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y, y_pred_adjusted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot()
plt.show()
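If you prefer numbers to a plot, the four cells of a binary confusion matrix can be unpacked directly. This is a minimal sketch on made-up labels. In scikit-learn's layout, rows are the true classes and columns are the predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = np.array([0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 1, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 2 2 1 1
```

Here fp counts normal points that were flagged as anomalies. This is the cell that grows when contamination is set too high.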

Interestingly, the algorithm identified very specific regions as abnormalities. While this may appear to be an overly aggressive allocation of abnormalities, it seems reasonable.

data['predictions'] = Iforest.predict(X)
# Map scikit-learn's -1 (anomaly) / 1 (normal) to 1 / 0
data['predictions'] = np.where(data['predictions'] == -1, 1, 0)

n_components = 3
pca = PCA(n_components=n_components)
# Project the features only, leaving out the label and prediction columns
components = pca.fit_transform(data.drop(columns=['Class', 'predictions']))
total_var = pca.explained_variance_ratio_.sum() * 100
labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Prediction'
fig = px.scatter_matrix(
    components,
    color=data['predictions'],
    dimensions=range(n_components),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()

The algorithm appears to focus on regions of lower density with high variance, which is exactly what we were expecting.

Isolation forest sensitivity to contamination parameter

I confirmed the sensitivity of the algorithm to the contamination factor after conducting a few consecutive modelling runs. A contamination factor of 2% (lower than the 5% used above, though still well above the 0.172% fraud rate in the labels) yields a reasonable precision while keeping the number of false positives down.
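A sweep like the one described can be sketched as follows. The data here is synthetic (a Gaussian bulk with 1% planted anomalies) and all names are assumptions for illustration, so the numbers will not match the credit card runs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(7)

# 1980 inliers around the origin and 20 planted anomalies far from the bulk
X_inliers = rng.normal(0.0, 1.0, size=(1980, 4))
X_anomalies = rng.uniform(6.0, 8.0, size=(20, 4))
X_syn = np.vstack([X_inliers, X_anomalies])
y_syn = np.r_[np.zeros(1980, dtype=int), np.ones(20, dtype=int)]

results = {}
for contamination in (0.01, 0.02, 0.05, 0.10):
    model = IsolationForest(contamination=contamination, random_state=7).fit(X_syn)
    y_flag = (model.predict(X_syn) == -1).astype(int)  # 1 = flagged as anomaly
    results[contamination] = {
        "flagged": int(y_flag.sum()),
        "precision": precision_score(y_syn, y_flag),
        "recall": recall_score(y_syn, y_flag),
    }
    print(contamination, results[contamination])
```

As contamination rises above the true anomaly rate, more points are flagged. Recall can only stay flat or improve, while precision drops because the extra flags are mostly false positives.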

So what next?

Having a good grasp of expected rates of abnormalities is really important in order to get a good balance between model precision and minimizing false positive rates.

Next steps would probably involve deploying the model in a production environment. The algorithm would begin to detect anomalies and model assumptions could be refined through continuous testing.

Combining isolation forest with other unsupervised anomaly detection algorithms in an ensemble might be the best way to go; however, further research is required on my part. Hope this helps and good luck with your research.

You can access the code example here.

References: