Day 23: 30 Days Machine Learning Projects Challenge

Anomaly Detection Using Isolation Forest and Local Outlier Factor

Abbas Ali
3 min read · Mar 25, 2024

Hey there!👋

Today, I was again learning about outliers.

I used two different outlier detection techniques to identify outliers in a sample NumPy array. This article aims to show you, in a very simple way, how to implement these techniques to spot the outlier in your data.

I am not gonna go into depth about the techniques, because I don't fully understand them myself yet. I am just gonna show you the code, and you will be able to understand it very easily.

Let’s get into the code.

I hope you already know what an outlier is. If you don't, go and check my 30-day challenge library; in one of my articles I have clearly explained what it is.

30 Days Machine Learning Projects Challenge

First, let’s import the necessary libraries.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
import matplotlib.pyplot as plt

As you can see from the above code, the two techniques that we are going to use to detect the outliers are Isolation Forest and Local Outlier Factor. We need NumPy to create sample data points, and we need Matplotlib to visualize our data points.

# Generate sample data: one extreme point (the anomaly) plus 100 normal points
np.random.seed(42)
X = np.array([[15.0, 15.0]])  # assumed starting point: an obvious outlier far from the cluster
X = np.concatenate([X, np.random.normal(loc=0, scale=3, size=(100, 2))])

The above code creates 100 random data points in 2 dimensions, plus one extreme point that will act as our outlier. Just print X and check out how it looks.
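
A quick sanity check (continuing from the setup above, where the extreme point was seeded as row 0):

print(X.shape)  # (101, 2): the seeded outlier plus 100 normal points
print(X[:3])    # the first row is the extreme point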

Now, let’s see what our data looks like in a scatter plot.

# Plot the original data points
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c='blue', label='Original Data')
plt.title('Original Data Points')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()
Can you see the anomaly (outlier)?

So, these are our data points.

Now, let’s use the first technique (Isolation Forest) to identify the outlier, and also visualize it.

# Apply Isolation Forest
clf = IsolationForest(n_estimators=20, contamination=0.001)
clf.fit(X)
outliers_isf = clf.predict(X)

# Plot outliers detected by Isolation Forest
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c='blue', label='Inliers')
plt.scatter(X[outliers_isf == -1, 0], X[outliers_isf == -1, 1], c='red', label='Outliers (Isolation Forest)')
plt.title('Outliers Detected by Isolation Forest')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()

The n_estimators parameter builds 20 isolation trees, and the contamination parameter sets the expected proportion of outliers in the data, which controls the decision threshold. If you increase contamination to 0.01, the model will flag more points as outliers. Just play with it and you will start to understand how it works.
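
To see what contamination is actually moving, you can inspect the anomaly scores the fitted model assigns; here is a minimal check using scikit-learn's score_samples (lower means more anomalous), continuing from the code above:

# contamination only decides where the cutoff falls in this score distribution
scores = clf.score_samples(X)
print(scores.min(), scores.mean())  # the outlier should have the lowest score
print((outliers_isf == -1).sum(), 'point(s) flagged at contamination=0.001')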

Let’s see the scatter plot produced by the Isolation Forest technique.

Red is the outlier!

Nice!

If you increase the contamination value, a few of the more spread-out blue dots will also be flagged as outliers.
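
Here is a minimal sketch of that experiment (random_state is my addition, so the run is repeatable):

# Re-fit with a looser contamination and count how many points get flagged
clf_loose = IsolationForest(n_estimators=20, contamination=0.01, random_state=42)
flagged = clf_loose.fit_predict(X)
print((flagged == -1).sum(), 'points flagged at contamination=0.01')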

Let’s see how Local Outlier Factor performs.

# Apply Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.001)
outliers_lof = lof.fit_predict(X)

# Plot outliers detected by Local Outlier Factor
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c='blue', label='Inliers')
plt.scatter(X[outliers_lof == -1, 0], X[outliers_lof == -1, 1], c='red', label='Outliers (LOF)')
plt.title('Outliers Detected by Local Outlier Factor')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()
Same result!
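
If you want to peek at why LOF agrees, the fitted model exposes a score per point through scikit-learn's negative_outlier_factor_ attribute; a quick check, continuing from the code above:

# Scores close to -1 are normal; strongly negative scores mean the point is
# far less dense than its neighbors, i.e. an outlier
print(lof.negative_outlier_factor_.min())
print((outliers_lof == -1).sum(), 'point(s) flagged by LOF')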

That’s it.

This is what I did today. I hope it was helpful.

P.S. You can connect with me on X.

X: AbbasAli_X

YouTube: Abbas_Ali
