Supervised Anomaly Detection in Python

Yashwanth Reddy
4 min read · May 12, 2023


  1. Supervised Anomaly Detection: This method requires a dataset labeled with both normal and anomalous samples, which is used to build a predictive model that classifies future data points. Commonly used algorithms include supervised neural networks, Support Vector Machines, and the K-Nearest Neighbors classifier. A minimal sketch of this setup follows below.
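
To make the supervised setting concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier on labeled data. The dataset here is synthetic and purely illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic labeled data: class 1 is the (rare) anomaly class
X, y = make_classification(n_samples=1000, n_features=2,
                           n_informative=2, n_redundant=0,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Train a supervised classifier on the labeled samples
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_tr, y_tr)
print('Test accuracy:', clf.score(X_te, y_te))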

Step 1: Importing the required libraries

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers

Step 2: Creating the synthetic data
# Generating a random dataset with two features
X_train, y_train = generate_data(n_train=300, train_only=True,
                                 n_features=2)

# Setting the percentage of outliers
outlier_fraction = 0.1

# Storing the outliers and inliers in separate numpy arrays
X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
n_inliers = len(X_inliers)
n_outliers = len(X_outliers)

# Separating the two features
f1 = X_train[:, [0]].reshape(-1, 1)
f2 = X_train[:, [1]].reshape(-1, 1)
Step 3: Visualising the data

# Create a meshgrid (used later to plot the decision surface)
xx, yy = np.meshgrid(np.linspace(-10, 10, 200),
                     np.linspace(-10, 10, 200))

# Scatter plot of the two features
plt.scatter(f1, f2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Step 4: Training and evaluating the model

# Training the classifier (PyOD's KNN detector is fitted on X only;
# the labels are used here solely for evaluation)
clf = KNN(contamination=outlier_fraction)
clf.fit(X_train)

# Raw anomaly scores for the training points
# (negated so that lower scores mean more anomalous)
scores_pred = clf.decision_function(X_train) * -1

# Predicted labels: 0 for inliers, 1 for outliers
y_pred = clf.predict(X_train)

# Counting the number of errors
n_errors = (y_pred != y_train).sum()
print('The number of prediction errors is ' + str(n_errors))
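
Beyond a raw error count, PyOD also ships a small helper, evaluate_print, which reports ROC AUC and precision @ rank n given the true labels and the detector's raw scores:

from pyod.utils.data import evaluate_print

# ROC AUC and precision @ rank n on the training data,
# computed from the unnegated scores stored by fit()
evaluate_print('KNN', y_train, clf.decision_scores_)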

Step 5: Visualising the predictions

# Threshold value to decide whether a
# datapoint is an inlier or an outlier
threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)

# The decision function calculates the raw
# anomaly score for every point on the grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)

subplot = plt.subplot(1, 1, 1)

# Fill a blue colormap from the minimum anomaly
# score up to the threshold value
subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10),
                 cmap=plt.cm.Blues_r)

# Draw a red contour line where the anomaly
# score equals the threshold
a = subplot.contour(xx, yy, Z, levels=[threshold],
                    linewidths=2, colors='red')

# Fill orange where the anomaly score ranges
# from the threshold to the maximum score
subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')

# Scatter plot of inliers with white dots
# (generate_data places the outliers last, so slicing works)
b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],
                    c='white', s=20, edgecolor='k')

# Scatter plot of outliers with black dots
c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],
                    c='black', s=20, edgecolor='k')
subplot.axis('tight')

subplot.legend(
    [a.collections[0], b, c],
    ['learned decision function', 'true inliers', 'true outliers'],
    prop=matplotlib.font_manager.FontProperties(size=10),
    loc='lower right')

subplot.set_title('K-Nearest Neighbours')
subplot.set_xlim((-10, 10))
subplot.set_ylim((-10, 10))
plt.show()

Second approach: Isolation Forest

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report

# Load data (assuming a 0/1 "label" column: 0 = normal, 1 = anomaly)
data = pd.read_csv("data.csv")

# Split data into training and testing sets
train_data = data[:800]
test_data = data[800:]

# Train the Isolation Forest model on the features only
model = IsolationForest(n_estimators=100, contamination=0.1)
model.fit(train_data.drop("label", axis=1))

# Make predictions on the test data; IsolationForest returns
# +1 for inliers and -1 for outliers, so map to the 0/1 labels
raw_pred = model.predict(test_data.drop("label", axis=1))
predictions = np.where(raw_pred == -1, 1, 0)

# Evaluate the model's performance
print(classification_report(test_data["label"], predictions))

In this example, we use the Isolation Forest algorithm, a popular method for anomaly detection. We load data from a CSV file, split it into training and testing sets, and train the Isolation Forest on the training features. We then map the model's +1/-1 output onto the 0/1 labels and evaluate its performance with a classification report.
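
If a continuous anomaly score is more useful than a hard 0/1 flag, IsolationForest also exposes decision_function (lower values mean more anomalous). A brief sketch, reusing model and test_data from the block above:

# Continuous anomaly scores for the test rows
scores = model.decision_function(test_data.drop("label", axis=1))

# Flag the 10% most anomalous points, mirroring contamination=0.1
cutoff = np.percentile(scores, 10)
flags = np.where(scores <= cutoff, 1, 0)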

Note that in supervised anomaly detection, we need labeled data to train the model. This means that we need to have examples of both normal and anomalous data so that the model can learn to distinguish between them. In the example above, we assume that the “label” column in the CSV file indicates whether each data point is normal or anomalous.
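
Because labels are available here, a fully supervised alternative is to train an ordinary classifier on them directly. A minimal sketch, assuming the same data.csv layout with a 0/1 "label" column:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

data = pd.read_csv("data.csv")
train_data, test_data = data[:800], data[800:]

# Fit the classifier directly on the labeled training rows
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(train_data.drop("label", axis=1), train_data["label"])

# Predictions are already 0/1, so no mapping is needed
preds = clf.predict(test_data.drop("label", axis=1))
print(classification_report(test_data["label"], preds))

With heavily imbalanced labels, setting class_weight='balanced' (or resampling the training data) is usually worth trying.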

Another approach: One-Class SVM

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import OneClassSVM

# Load the data (again assuming a 0/1 "label" column)
data = pd.read_csv('data.csv')

# Split the data into features and labels
features = data.drop('label', axis=1)
labels = data['label']

# Train the One-Class SVM model
model = OneClassSVM()
model.fit(features)

# Predict: OneClassSVM returns +1 for inliers and -1 for
# outliers, so map the output onto the 0/1 labels
predictions = np.where(model.predict(features) == -1, 1, 0)

# Calculate and print the accuracy of the model
accuracy = np.mean(predictions == labels)
print('Accuracy:', accuracy)

# Plot the predictions over the first two features
plt.figure()
plt.scatter(features.iloc[:, 0], features.iloc[:, 1],
            c=predictions, cmap='bwr')
plt.show()

This code first loads the data into a Pandas DataFrame and splits it into features and labels. The One-Class SVM is trained on the features, its +1/-1 output is mapped onto the 0/1 labels, and the accuracy against the true labels is calculated and printed. Finally, the predictions are plotted over the first two features.

In this example, the One-Class SVM classifies each point as either normal or anomalous. Note that the code above fits the model on the full dataset; in the usual one-class setup, the model is trained on normal data only, so that anything falling outside the learned region is flagged as anomalous. The scatter plot shows how the model separates the normal and anomalous points.
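
To match that description, a small variation fits the model only on the rows labeled normal and then scores the whole dataset. This sketch reuses features, labels, and the imports from the block above (nu=0.1 is an assumed tuning choice, not from the original):

# Fit only on the rows labeled normal (label == 0)
normal_features = features[labels == 0]
model = OneClassSVM(nu=0.1)  # nu bounds the training error fraction; assumed value
model.fit(normal_features)

# Score the full dataset against the learned "normal" region
predictions = np.where(model.predict(features) == -1, 1, 0)
print('Accuracy:', np.mean(predictions == labels))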

Supervised anomaly detection is a powerful technique for identifying outliers in data. The technique can be used to detect fraud, identify cyber attacks, and prevent equipment failures.



👉 Check out my daily newsletter to learn something new about Python and Data Science every day.