Anomaly Detection Using Isolation Forest Algorithm

Saurabh Singh
Published in Analytics Vidhya · Aug 8, 2020

Let’s start with a simple game: “Find the odd one out”.
I will show you an image and you have to identify the odd object in it.

Ready?

[Image: Find the odd one out]

That must have been easy. The green lemon is the odd one out.
Let’s try the next one:

[Image: Find the odd one out (emojis)]

I bet this one was more difficult.

You probably took more time to spot the emoji with the closed mouth.

This odd one out is nothing but an anomaly (outlier): a point that is different from the others. For a small dataset we have no difficulty identifying the anomaly, but as the scale increases, both humans and machines struggle to spot it.

What is an Anomaly and how to identify it?

Anomalies are data points that are few and different: their patterns exhibit characteristics that differ from those of normal data points.

Examples of anomalies:
1) Fraudulent use of credit cards typically shows up as an anomaly, since the fraudster may attempt transactions at new merchants or for unusually high amounts.
2) An attack on a computer network, etc.

Approaches to detect anomalies:

Three fundamental approaches to detect anomalies are based on:

  1. Density
  2. Distance
  3. Isolation.

We will look at them in more detail below.

Approach based on Density and Distance

Most existing methods for anomaly detection built on density or distance work as follows:
1) Replicator Neural Network (RNN): a feed-forward multi-layer perceptron trained to reconstruct normal instances well; instances that are poorly reconstructed are deemed anomalies.
2) One-class SVM: it finds the smallest region that contains most of the normal data points; points outside this region are deemed anomalies (see the sketch after this list).
3) Clustering-based methods: they construct a profile of normal instances and then flag as anomalies the instances that do not conform to that profile.
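To make the contrast with isolation concrete later on, here is a minimal, hypothetical sketch of such a boundary-based detector using scikit-learn's OneClassSVM on synthetic data (the toy data and parameter values are illustrative, not from the article):

# Hypothetical illustration of a boundary-based detector (one-class SVM),
# shown only to contrast with the isolation-based approach discussed later.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # dense "normal" cluster
outliers = rng.uniform(low=-6, high=6, size=(10, 2))     # sparse, far-away points
X = np.vstack([normal, outliers])

# nu roughly bounds the fraction of points allowed outside the learned region
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)
pred = ocsvm.predict(X)          # +1 = inside the region (normal), -1 = outside (anomaly)
print("flagged as anomalies:", (pred == -1).sum())

The learned region covers the dense cluster, so the sparse, far-away points fall outside it and are flagged.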

Assumptions behind density- and distance-based anomaly detection:

For the density measure: ‘Normal points occur in dense regions, while anomalies occur in sparse regions’, e.g. RNN and one-class SVM.
For the distance measure: ‘A normal point is close to its neighbors, while an anomaly is far from its neighbors’, e.g. clustering-based methods.

However, the anomaly detection ability of these methods is usually a ‘side-effect’ or by-product of an algorithm originally designed for a purpose other than anomaly detection, which leads to the following drawbacks:
a) Too many false alarms, e.g. normal instances identified as anomalies, or too few anomalies being detected.
b) Many existing methods are constrained to low-dimensional data and small data sizes because of the legacy of their original algorithms.

Also, the assumptions behind the density and distance measures can be violated: high density and short distance do not always imply normal instances; likewise, low density and long distance do not always imply anomalies.

To tackle these drawbacks of density- and distance-based anomaly detection, we have the Isolation Forest.

Isolation Forest

Let’s understand in detail what isolation forest is and how it can be helpful in identifying the anomaly.

Isolation: the term means ‘separating an instance from the rest of the instances’. Since anomalies are ‘few and different’, they are more susceptible to isolation.

What is Isolation forest?
The Isolation Forest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

It is an unsupervised algorithm and therefore it does not need labels to identify the outlier/anomaly.

It follows these steps (a toy sketch of the first two steps is given after the list):

  1. The data is partitioned randomly and recursively, and each partitioning is represented as a tree (an isolation tree); the collection of such trees forms the forest. This is the training stage, where the user sets the subsample size and the number of trees.
  2. A leaf is reached once the recursive partitioning finishes. The path length needed to reach an outlier is expected to be far shorter than that for normal data (see the figure).
  3. The path lengths are averaged over the trees and normalized to calculate the anomaly score.
  4. Whether a point is an outlier is judged on the basis of this score.
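To make the partitioning idea concrete, here is a minimal, hypothetical sketch in plain Python (the function name, toy data, and constants are mine, not from the article): it recursively picks a random feature and a random split value, follows the side containing a chosen point, and counts how many splits are needed to isolate it.

# Toy sketch: isolate a single point with random splits and count the path length.
# Real implementations build many trees and average the path lengths
# (see the scikit-learn usage later in the article).
import numpy as np

def isolation_path_length(X, x, rng, depth=0, max_depth=50):
    """Number of random splits needed before x is alone in its partition."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randint(X.shape[1])                      # random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)                            # random split value
    side = X[:, feature] < split
    X_next = X[side] if x[feature] < split else X[~side]   # follow x's side
    return isolation_path_length(X_next, x, rng, depth + 1, max_depth)

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0]]])  # one obvious outlier

avg = lambda x: np.mean([isolation_path_length(X, x, np.random.RandomState(s)) for s in range(50)])
print("avg path length, normal point:", avg(X[0]))
print("avg path length, outlier     :", avg(X[-1]))        # expected to be much shorter

On data like this, the far-away point is typically isolated in only a handful of splits, while points inside the dense cluster need many more; this difference is exactly the signal the anomaly score is built on.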

[Figure: an anomaly is isolated with far fewer random partitions than a normal point]

Isolation Forest detects anomalies purely based on the concept of isolation, without employing any distance or density measure, which makes it fundamentally different from the methods above.

Isolation Forest exploits subsampling to achieve low, linear time complexity and a small memory requirement. Subsampling also helps it cope with the swamping and masking effects (swamping: normal instances wrongly flagged because they lie close to anomalies; masking: a cluster of anomalies concealing its own presence).

It takes advantage of two quantitative properties of anomalies:
1) They are the minority consisting of few instances
2) They have attribute-values that are very different from those of normal instances.
In other words, anomalies are ‘few and different’, which makes them more susceptible to isolation.

Isolation Forest builds an ensemble of isolation trees for a given data set; anomalies are the instances that have short average path lengths on these trees.
In the scoring of the original paper, a score close to 1 indicates an anomaly and a score well below 0.5 indicates a normal point; scikit-learn's decision_function flips and shifts this convention, so positive values correspond to normal points and negative values to anomalies (that is the convention used in the code below).
Isolation Forest has a linear time complexity with a small constant and a minimal memory requirement.
Isolation Forest is built specifically for Anomaly Detection.
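For reference, the anomaly score defined in the original paper can be written down in a few lines; below is a small, self-contained sketch (the plugged-in path lengths are purely illustrative):

# Anomaly score from the Isolation Forest paper: s(x, n) = 2 ** (-E_h / c(n)),
# where E_h is the average path length of x over the trees and c(n) is the
# average path length of an unsuccessful BST search, used for normalization.
import math

def c(n):
    """Expected path length of an unsuccessful search in a BST with n points."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # H(n-1) ≈ ln(n-1) + Euler–Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    return 2.0 ** (-avg_path_length / c(n))

# Illustrative numbers for a subsample of 256 points:
print(anomaly_score(avg_path_length=2.0, n=256))   # short path  -> high score (anomalous)
print(anomaly_score(avg_path_length=12.0, n=256))  # longer path -> around 0.5 (normal)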

By now you should have a good understanding of the Isolation Forest and its advantages over distance- and density-based algorithms.

Let’s get our hands dirty and identify anomalies using Isolation Forest in Python.

Data description:
The data set we are using is the HTTP subset of the KDD Cup 1999 data.
After restricting it to HTTP traffic (as done in the code below), it contains 623,091 connection records drawn from seven weeks of network traffic.
The dataset is already preprocessed and contains 41 features describing the individual TCP connections, along with content and traffic features.
Each connection is labeled as either normal or as an attack, with exactly one specific attack type.

Problem definition:
Identify whether a connection is normal or an attack.

Python Code :

# Loading the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
# Loading the data
cols = ["duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent",
        "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
        "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
        "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
        "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
        "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
        "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate"]
kdd_data = pd.read_csv("…./Issolation_forest_data/kddcup.data/kddcup.data.corrected", sep=",", names=cols + ["label"], index_col=None)
# Viewing the data
kdd_data.head()
# Keeping only HTTP connections and dropping the now-constant service column
kdd_data = kdd_data[kdd_data["service"] == "http"]
kdd_data = kdd_data.drop("service", axis=1)
cols.remove("service")
kdd_data.shape

Output:

(623091, 41)

## Encoding the categorical columns
from sklearn.preprocessing import LabelEncoder
encs = dict()
kdd_data_1 = kdd_data.copy()  # .sample(frac=1)
for c in kdd_data_1.columns:
    if kdd_data_1[c].dtype == "object":
        encs[c] = LabelEncoder()
        kdd_data_1[c] = encs[c].fit_transform(kdd_data_1[c])
#### Implementing Isolation Forest
from sklearn.ensemble import IsolationForest
#### Splitting the data into train, validation and test sets
X_train, y_train = kdd_data_1[cols][:300000], kdd_data_1["label"][:300000].values
X_valid, y_valid = kdd_data_1[cols][300000:500000], kdd_data_1["label"][300000:500000].values
X_test, y_test = kdd_data_1[cols][500000:], kdd_data_1["label"][500000:].values

Isolation Forest parameters:
n_estimators: the number of trees to use. The paper suggests 100 trees, because the path lengths usually converge well before that.
max_samples: the number of samples to draw when building a single tree. This parameter is called sub-sampling in the paper, and the authors suggest max_samples=256, since it generally provides enough detail to perform anomaly detection across a wide range of data.
contamination: the expected proportion of outliers in the data set. It is used when fitting to define the threshold on the decision function. I will show you how to pick this threshold later.

### Building the isolation forest
iso_Forest = IsolationForest(n_estimators=100, max_samples=256, contamination=0.2, random_state=2018)
## Fitting the model
iso_Forest.fit(X_train)
#### Scoring the validation set and plotting the anomaly score distribution
scores = iso_Forest.decision_function(X_valid)
plt.figure(figsize=(12, 8))
plt.hist(scores, bins=50);

From the plot we can clearly see a cluster of scores below -0.2. Hence we can treat scores lower than -0.2 (i.e. short average path lengths) as anomalies.

from sklearn.metrics import roc_auc_score
print("AUC: {:.1%}".format(roc_auc_score((-0.2 < scores), y_valid == list(encs["label"].classes_).index("normal."))))

Output :

AUC: 99.0%

That is pretty good. Now let’s check on the test dataset.

### Getting scores for the test set
scores_test = iso_Forest.decision_function(X_test)
print("AUC: {:.1%}".format(roc_auc_score((-0.2 < scores_test),
                                         y_test == list(encs["label"].classes_).index("normal."))))

Output :

AUC: 99.3%

It gives us good accuracy in identifying anomalies on the test dataset as well.

We have now successfully implemented anomaly detection using Isolation Forest in Python.
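If you prefer hard labels over raw scores, scikit-learn's predict method applies the threshold implied by the contamination parameter set earlier; a short usage sketch, continuing from the variables above, would be:

# predict() returns +1 for inliers and -1 for outliers, using the threshold
# derived from the contamination parameter set at construction time.
labels_test = iso_Forest.predict(X_test)
print("connections flagged as anomalous:", (labels_test == -1).sum())

# Alternatively, apply the hand-picked threshold from the histogram:
manual_flags = scores_test < -0.2
print("flagged with manual threshold   :", manual_flags.sum())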
