Isolation Forest Anomaly Detection — Identify Outliers

Young Yoon
12 min read · Mar 8, 2022


Outline

In this post, I explain and implement Isolation Forest. My goal is to explain the algorithm in plain English so that non-technical readers can understand it.

This post includes the following topics:

  • Why and how to look for outliers.
  • How Isolation Forest works.
  • The benefits and drawbacks of Isolation Forest.
  • The implementation of Isolation Forest in Python.

You can find my code on GitHub. If you have any comments or suggestions, email me at y.s.yoon@berkeley.edu.

Let’s get started!

Introduction

I detect outliers using the Isolation Forest method. I use US public firm data, which are also used in my UC Berkeley Haas PhD Dissertation (Yoon 2022). Although I detect anomalies (outliers) to treat them before I conduct data analyses, the anomaly detection technique can be applied to many business settings, such as detecting fraudulent credit card spending.

Figure 1 shows US public firms’ features (characteristics) in two dimensions. The goal of this post is to detect the outliers, as shown in red in Figure 2.

Why and how to look for outliers

Many machine learning algorithms and regression models are susceptible to outliers. An outlier is a data point that deviates significantly from the other points. Unless outliers are properly handled, the inferences drawn from statistical models may not be reliable.

There are many popular methods to detect outliers, such as the Z-Score and Interquartile Range (IQR) methods. These methods are effective when the underlying data follows a normal distribution (a distribution where most data points are close to the mean and become less frequent as the distance from the mean increases). However, if the data is not normally distributed, these methods may incorrectly classify normal observations as outliers. The Isolation Forest method, on the other hand, is non-parametric, which simply means that we don’t have to make assumptions about how the underlying data is distributed.

Furthermore, the Z-Score and Interquartile Range methods identify outliers at the variable level. If you have reason to believe that multiple variables interact with each other to create outliers, these methods will not be able to detect them. For example, an SAT score of 1350/1600 (90th percentile) does not seem to be an outlier by itself. However, if we introduce another dimension, age, and find that a 12-year-old scored 1350/1600, then this observation is likely an outlier within the sub-sample of 12-year-olds. Unlike single-variable outlier detection methods, Isolation Forest detects outliers in multi-dimensional space.
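
To make this concrete, here is a minimal sketch on synthetic data (hypothetical ages and scores, not from my sample) of how a point that looks ordinary on each variable separately can still be flagged once the two features are considered jointly:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical test takers aged 10-18 whose SAT scores rise with age
age = rng.uniform(10, 18, size=1000)
sat = 500 + 60 * age + rng.normal(0, 50, size=1000)
X = np.column_stack([age, sat])

# Add one 12-year-old scoring 1550: neither value is extreme on its own,
# but the combination sits far from the age-score relationship
X = np.vstack([X, [12.0, 1550.0]])

# Flag roughly the top 1% most anomalous points
isf = IsolationForest(contamination=0.01, random_state=0).fit(X)
print(isf.predict(X)[-1])            # typically -1 (flagged as an outlier); 1 means normal
print(isf.decision_function(X)[-1])  # more negative = more anomalous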

Isolation Forest

Isolation Forest is a tree ensemble method of detecting anomalies first proposed by Liu, Ting, and Zhou (2008). Unlike other methods that first try to understand the normal points and classify abnormal points as anomalies, Isolation Forest explicitly isolates anomalies.

Anomalies have two characteristics: they are distant from normal points, and there are only a few of them. The Isolation Forest algorithm exploits these two characteristics.

Plain English

Isolation Forest randomly cuts a given sample until a point is isolated. The intuition is that outliers are relatively easy to isolate. Take a look at the following GIF.

It took only four random cuts to isolate the red point, which is clearly an outlier.

Now, take a look at the next GIF, which attempts to cut the sample until the yellow point (normal point) is isolated.

This time, the algorithm took a lot more cuts.

As you can infer from the above, a data point is likely an outlier if it can be isolated only with a few random sample cuts.

Step by Step

Here are the steps of the Isolation Forest algorithm.

First, the algorithm creates an isolation tree by going through the following steps:
[1] Randomly select a sub-sample (scikit-learn's default: 256 instances/data points per tree)
[2] Select a point to isolate.
[3] Randomly select a feature (i.e., variable) from the set of features X.
[4] Randomly select a threshold between the minimum and the maximum value of the feature x.
[5] If the data point's value is less than the threshold, it flows down the left branch of the tree; if it is greater, it flows down the right branch. In other words, the threshold becomes the new maximum (or minimum) of the feature's range for the next iteration.
[6] Repeat steps 3 through 5 until the point is isolated or until a pre-defined max number of iterations is reached.
[7] Record the number of times steps 3 through 5 were repeated (this count is the point's path length).
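
To make the loop concrete, here is a rough Python sketch of steps [2] through [7]: given a sub-sample X and a point to isolate, it counts the random cuts needed. This is a simplification, not scikit-learn's implementation; a real isolation tree is built once over the whole sub-sample and then queried.

import numpy as np

def isolation_path_length(X, point, max_depth=50, rng=None):
    """Count the random splits needed to isolate `point` within the sample X."""
    rng = np.random.default_rng() if rng is None else rng
    depth = 0
    while X.shape[0] > 1 and depth < max_depth:
        j = rng.integers(X.shape[1])             # [3] pick a random feature
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:                             # cannot split on a constant feature
            break
        t = rng.uniform(lo, hi)                  # [4] pick a random threshold
        # [5] keep only the side of the split that contains the point
        X = X[X[:, j] < t] if point[j] < t else X[X[:, j] >= t]
        depth += 1                               # [7] count the cut
    return depth

Averaging this count over many random trees gives E[h(x)] in the score below: outliers need only a few cuts, while normal points need many.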

Prediction process: after building a collection of such trees (scikit-learn's default is 100 trees), Isolation Forest assigns each point the following anomaly score:

s(x, n) = 2^(-E[h(x)] / c(n))

where E[h(x)] is the average path length for firm x across the trees (the average number of cuts needed to isolate it) and c(n) is the average path length of an unsuccessful search in a binary search tree built on n points, which is used to normalize the score. Scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal points.
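
As a numerical sketch of this score, using the normalizing constant c(n) defined in Liu, Ting, and Zhou (2008):

import numpy as np

def c(n):
    """Average path length of an unsuccessful binary-search-tree search over n points."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + np.euler_gamma) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values close to 1 indicate anomalies."""
    return 2.0 ** (-avg_path_length / c(n))

# For a sub-sample of 256 points, a firm isolated after ~5 cuts on average
# looks anomalous, while one that needs ~12 cuts looks normal
print(anomaly_score(5, 256))    # roughly 0.71
print(anomaly_score(12, 256))   # roughly 0.44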

Benefits and drawbacks of using Isolation Forest

Benefits

As I noted above, Isolation Forest does not assume normal distribution and is able to detect outliers at a multi-dimensional level. More importantly, Isolation Forest is computationally efficient: the algorithm has a linear time complexity with a low constant and a low memory requirement. Therefore, it scales well to large data sets.

According to Liu, Ting, and Zhou (2008), Isolation Forest performs better than Random Forest, especially in large data sets.

Drawbacks

As I will discuss more in the implementation step, the Isolation Forest algorithm requires us to pick the percentage of anomalies in the dataset. Thus, we need to have at least some idea of the proportion of anomalies in our data.
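
As an aside, recent versions of scikit-learn also accept contamination='auto', which determines the threshold as in the original paper rather than from a user-supplied proportion; a minimal sketch:

from sklearn.ensemble import IsolationForest

# Let scikit-learn pick the threshold as in the original paper
isf_auto = IsolationForest(contamination='auto', random_state=0)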

Second, axis-parallel splits create some artificial normal regions. I won’t go into details, but this issue is addressed by the follow-up study Hariri, Kind, and Brunner (2021). And here are more resources: GitHub and blog. I will post a blog on this topic when I get a chance.

Implementation of Isolation Forest to Detect Outliers in Python (Scikit-learn)

1. Preparation

1.1. Input data

The data “sample_detect_outliers.csv” contains US public firms’ features that are related to leases (my dissertation is on leases).

1.2. Output data

The following lines of code output an indicator variable that equals 1 if the firm (i.e., observation) is an outlier and 0 otherwise. For more details on the steps after I identify outliers, please see my dissertation, Yoon (2022).

1.3. Feature (i.e., variable) definition

  • lag_lease: prior year’s lease activity
  • lag_market_value: prior year’s market capitalization of the stock
  • lag_dividend: an indicator value that equals 1 if the company paid any dividends in the prior year, and 0 otherwise
  • lag_loss: an indicator value that equals 1 if the company reported negative profits in the prior year
  • lag_cash: prior year’s cash balance
  • lag_tax_rate: effective tax rate in the prior year
  • lag_big4_auditor: an indicator value that equals 1 if the company hired a Big 4 auditor in the prior year

1.4. Import libraries and the data set

# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Import the data set
sample_with_outliers = pd.read_csv('sample_detect_outliers.csv')

2. Conduct an Exploratory Data Analysis (EDA)

2.1. Show the first 5 entries in the data

# First 5 entries
sample_with_outliers.head()

2.2. The # of entries and the # of features

There are 6,279 entries and 9 variables = 1 identifier + 1 label (i.e., outcome variable) + 7 firm features. The second variable, lease, is the label.

# (Number of entries, Number of features)
print(sample_with_outliers.shape)
(6279, 9)

2.3. Empirical distributions and histograms

There are many interesting observations that are worth noting. First, there are no missing values as indicated by counts that all equal 6,279. I discuss other interesting observations in the next section.

# Show empirical distributions
sample_with_outliers.describe()
# Show histograms - all variables except for the identifier
sample_z_score = sample_with_outliers.drop(columns='identifier')
sample_z_score.hist(bins = 20, figsize =(20, 10))
plt.show()

2.4. Drop variables without outliers

The above two panels show that lag_loss and lag_big4_auditor are indicator variables and thus do not have outliers.

In addition, the histogram suggests that lag_tax_rate also does not have outliers. During the sample period (2016–2019), the corporate tax rate was reduced from 35% to 21%. Two big spikes in the histogram around 35% and 21% and all other values smaller than 35% make economic sense.

Therefore, I drop these three variables that do not have outliers.

# Drop identifier, lag_loss, lag_big4_auditor, and lag_tax_rate
var = ['identifier', 'lag_loss', 'lag_big4_auditor', 'lag_tax_rate']
sample_z = sample_with_outliers.drop(columns=var)

3. Z-Score — Detect and Remove Outliers

Before implementing Isolation Forest, I will first attempt to detect outliers using the Z-score method.

As indicated in a previous section, the Z-Score method is effective in addressing outliers for data points that follow a normal distribution.

The Z-score indicates the distance of a data point from the mean as the number of standard deviations. The formula is as follows:

z = (x − μ) / σ

where μ is the sample mean and σ is the sample standard deviation of the variable.

I am going to assume that observations with a Z-score below -3 or above 3 (i.e., more than 3 standard deviations away from the mean, roughly 0.3% of a normally distributed sample) are outliers.

3.1. lag_market_value — Identify and remove outliers

The histogram above shows that lag_market_value follows a normal distribution. To detect outliers, I first write a function to print the upper limits and lower limits of the Z-Score.

# Create a function to report the limits of the Z-Score
def print_z_score_limits(df, column_name):
    """ Print the upper and lower limits of the Z-score """

    # Compute the limits
    upper_limit = df[column_name].mean() + 3 * df[column_name].std()
    lower_limit = df[column_name].mean() - 3 * df[column_name].std()

    # Round and return the limits
    upper_limit = round(upper_limit, 2)
    lower_limit = round(lower_limit, 2)
    print_this = ("Variable Name: " + column_name +
                  " | Upper limit: " + str(upper_limit) +
                  " | Lower limit: " + str(lower_limit))
    return print_this
# Print the upper and lower limits
print_z_score_limits(sample_z, "lag_market_value")
'Variable Name: lag_market_value | Upper limit: 12.82 | Lower limit: 2.43'

It turns out that all of the values (N=6,279) are within the boundary values of 2.43 and 12.82. Thus, none of the observations are trimmed.

# Keep observations within the limits (i.e., remove outliers)
sample_z = sample_z[(sample_z['lag_market_value'] >= 2.43) & (sample_z['lag_market_value'] <= 12.82)]
print(sample_z.shape)
(6279, 5)

I’ll drop lag_market_value since an outlier treatment is not necessary for this feature.

# Drop lag_market_value
sample_z = sample_z.drop(columns=['lag_market_value'])

3.2. Log transformation of other variables

Going back to the histograms above, we can see that lease, lag_lease, lag_dividend, and lag_cash are all significantly right-skewed. In this case, the Z-Score method or many other popular outlier detection methods such as the Interquartile Range (IQR) method won’t do any good. To address this issue, I conduct log transformations on these variables to see if I can describe them with a normal distribution.

The histogram also shows that the four variables have many zeros (or very small values), which make economic sense. Therefore, I will only look for outliers on the right-hand side of the distribution.

First, I will replace zeros with NaNs. This is okay because zeros will not be considered outliers.

# Replace zeros with NaNs
sample_z['lag_dividend'] = sample_z['lag_dividend'].replace([0],np.NaN)
sample_z['lease'] = sample_z['lease'].replace([0],np.NaN)

Next, I perform the log transformations and plot histograms.

# Create a function to conduct log transformation
def log_transformation_function(df, column_name):
    """ Conduct a log transformation of a variable """
    # Replace the values with log-transformed values
    df[[column_name]] = df[[column_name]].apply(np.log)

# Conduct log transformation on all the variables
for column in sample_z:
    log_transformation_function(sample_z, column)

# Plot histograms
sample_z.hist(bins = 20, figsize =(20, 10))
plt.show()

3.3. Other variables — Identify and remove outliers

The distributions now look much more like normal distributions. Once again, I will use the Z-Score to identify outliers.

First, I report the Z-Score upper limits for each variable.

# Print the upper and lower limits
for column in sample_z:
    print(print_z_score_limits(sample_z, column))
Variable Name: lease | Upper limit: 3.55 | Lower limit: -3.64
Variable Name: lag_lease | Upper limit: 3.42 | Lower limit: -3.66
Variable Name: lag_dividend | Upper limit: -0.8 | Lower limit: -6.86
Variable Name: lag_cash | Upper limit: 1.97 | Lower limit: -6.76

Next, I report the maximum values of each variable.

# Print the maximum values
print("MAXIMUM VALUES")
print(round(sample_z.max(),2))
MAXIMUM VALUES
lease 3.54
lag_lease 2.76
lag_dividend -1.59
lag_cash -0.04
dtype: float64

All of the maximum values are smaller than the upper limits. Thus, none of the variables seems to have outliers.

4. Isolation Forest — Multi-dimensional Outlier Detection

Although the data points do not seem to have outliers at the variable level, there could be outliers at a multi-dimensional level. Therefore, I employ Isolation Forest to detect outliers.

4.1. Setup

I begin by dropping the identifier from the original sample.

sample_isf = sample_with_outliers.drop(columns='identifier')

4.2. Conduct Principal Component Analysis (PCA)

I conduct PCA to reduce the firm feature dimensions from 7 to 2. Note that this step is not necessary, because Isolation Forest works fine with many dimensions. Regardless, I reduce the dimensions to visualize the outlier points in my data.

# Standardize features
sample_scaled = StandardScaler().fit_transform(sample_isf)
# Define dimensions = 2
pca = PCA(n_components=2)
# Conduct the PCA
principal_comp = pca.fit_transform(sample_scaled)
# Convert to dataframe
pca_df = pd.DataFrame(data = principal_comp, columns = ['principal_component_1', 'principal_component_2'])
pca_df.head()

4.3. Train the model and make predictions

As indicated before, we need to pre-define the outlier frequency (the contamination parameter). After experimenting with the data, I decided to use 4%.

# Train the model
isf = IsolationForest(contamination=0.04)
isf.fit(pca_df)
# Predictions
predictions = isf.predict(pca_df)

4.4. Extract predictions and isolation scores

# Extract scores
pca_df["iso_forest_scores"] = isf.decision_function(pca_df)
# Extract predictions
pca_df["iso_forest_outliers"] = predictions
# Describe the dataframe
pca_df.describe()

Let’s replace “-1” with “Yes” and “1” with “No”.

# Replace "-1" with "Yes" and "1" with "No"
pca_df['iso_forest_outliers'] = pca_df['iso_forest_outliers'].replace([-1, 1], ["Yes", "No"])
# Print the first 5 firms
pca_df.head()

4.5. Plots

Plot the firms in the 2-dimensional space in the following order.
[1] All firms
[2] Normal Firms vs. Outlier Firms
[3] Isolation Forest Scores

# Create a function to plot firms on the 2-dimensional space
def plot_firms(dataframe, title, color=None):
    """ Plot firms on the 2-dimensional space """

    # Generate a scatter plot
    fig = px.scatter(dataframe, x="principal_component_1", y="principal_component_2",
                     title=title, color=color)

    # Layout
    fig.update_layout(
        font_family='Arial Black',
        title=dict(font=dict(size=20, color='red')),
        yaxis=dict(tickfont=dict(size=13, color='black'),
                   titlefont=dict(size=15, color='black')),
        xaxis=dict(tickfont=dict(size=13, color='black'),
                   titlefont=dict(size=15, color='black')),
        legend=dict(font=dict(size=10, color='black')),
        plot_bgcolor='white'
    )

    return fig
# Need to import renderers to view the plots on GitHub
import plotly.io as pio
# Plot [1] All firms
plot_firms(pca_df, "Figure 1: All Firms").show("png")
# [2] Normal Firms vs. Outlier Firms
plot_firms(dataframe=pca_df, title="Figure 2: Normal Firms vs. Outlier Firms", color='iso_forest_outliers').show("png")
# [3] Isolation Forest Scores
plot_firms(dataframe=pca_df, title="Figure 3: Isolation Forest Scores", color='iso_forest_scores').show("png")

4.6. Observations

A few observations are in order.

According to Figure 2, most of the outer points are identified as outliers. These flagged firms match the two outlier characteristics noted earlier: they are distant from the normal points and there are only a few of them.
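
As a quick sanity check (a minimal sketch using the column created above), the share of flagged firms should line up with the 4% contamination rate chosen earlier:

# Count outlier vs. normal firms; with contamination=0.04 and 6,279 firms,
# roughly 4% (about 250 firms) should be labeled "Yes"
print(pca_df['iso_forest_outliers'].value_counts())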

5. Export and conclude

# Add the identifiers back to the dataframe with the outlier labels
pca_df = pd.concat([sample_with_outliers['identifier'], pca_df], axis=1)
# Print the first 5 firms
pca_df.head()
# Export the sample as a csv file
pca_df.to_csv('outliers_detected.csv')

References

[1] Liu, Ting, and Zhou (2008) Isolation Forest
[2] Fuertes (2018) Isolation forest: the art of cutting off from the world
[3] Lewinson (2018) Outlier Detection with Isolation Forest
[4] Akshara (2021) Anomaly detection using Isolation Forest — A Complete Guide
