Anomaly Detection with Unsupervised Machine Learning

Hiraltalsaniya
Simform Engineering
Dec 22, 2023

Detecting Outliers and Unusual Data Patterns with Unsupervised Learning

In an era of big data, anomaly detection has become a crucial capability for unlocking hidden insights and ensuring data integrity. This blog dives into the world of unsupervised machine learning techniques to detect outliers efficiently without labeled data.

We introduce key anomaly detection concepts, demonstrate anomaly detection methodologies and use cases, compare supervised and unsupervised models, and provide a step-by-step implementation guide using DBSCAN in Python.

What is an anomaly?

An anomaly is basically something that’s unusual, doesn’t fit the usual pattern, or stands out because it’s different in a specific category or situation. To explain it simply, let’s look at some clear examples:

  • Think about a collection of smartphones, mostly from Samsung, and then there’s an iPhone. The iPhone is an anomaly because it’s a different brand.
  • Imagine you have a bunch of pens, but one of them is a fancy fountain pen instead of a regular ballpoint pen. That fountain pen is an anomaly because it’s not like the others.

What is anomaly detection?

Anomaly detection is a technique used to identify data points that are significantly different or “outliers” when compared to the majority of the data in a dataset.

It relies on historical data or established knowledge to determine what falls within the usual range, and it plays a crucial role in ensuring the quality and security of data in various domains.

Example of anomaly detection in server logs:

Normal behavior:

  • Website traffic follows a regular pattern.
  • Requests per minute show a predictable trend, with slight increases during peak hours.

Anomaly:

  • Suddenly, there is an unusual, significant surge in traffic.
  • This spike in requests per minute is an anomaly in the server logs.
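
To make this concrete, here is a minimal sketch of the server-log example. The data is synthetic (a Poisson request-count series with an injected surge, not real logs), and the rolling 3-sigma threshold is just one simple way to flag such a spike:

# Synthetic per-minute request counts with an injected traffic surge
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
requests = pd.Series(rng.poisson(lam=100, size=1440))  # one day of per-minute counts
requests.iloc[700:705] = 900                           # the unusual surge

# Flag minutes that jump far above a rolling baseline (simple 3-sigma rule)
baseline = requests.rolling(window=60, min_periods=1).mean()
spread = requests.rolling(window=60, min_periods=1).std()
anomalous = requests > baseline + 3 * spread

print(requests[anomalous])  # the surge around minute 700 is flagged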

Anomaly detection use cases

Here are some diverse applications of anomaly detection using machine learning:

  1. Event detection in sensor networks
  2. Manufacturing quality control
  3. Healthcare monitoring
  4. Social media monitoring
  5. Fraud detection
  6. Network intrusion detection
  7. Insurance claim analysis
  8. Cybersecurity threat detection
  9. Identity theft
  10. Traffic monitoring
  11. Data breaches
  12. Video surveillance

Dr. Thomas Dietterich and his team at Oregon State University (2018) describe three settings for anomaly detection:

  1. Supervised Anomaly Detection: In this setting, the model is trained on a labeled dataset in which each data point is explicitly marked as either normal or anomalous. The model learns the characteristics of both classes and uses that knowledge to classify new, unseen data. Supervised anomaly detection is effective when you have a reliable labeled dataset for training, and it suits scenarios where anomalies are relatively easy to define and identify.
    Common algorithms for this setting:
    - Bayesian networks
    - k-nearest neighbors (KNN)
    - Decision trees
  2. Clean Anomaly Detection: In the clean setting, the training data is known to contain only normal examples, with no anomalies mixed in. The model learns what normal looks like from this clean data and flags significant deviations from those patterns in new data. This setting suits applications where trustworthy normal data is easy to collect, such as fraud detection in financial transactions or quality control in manufacturing.
  3. Unsupervised Anomaly Detection: Unsupervised anomaly detection occurs when there are no labeled anomalies in the training data, and the model needs to identify anomalies without prior knowledge of what constitutes an anomaly. The model’s task is to find data points that deviate significantly from the majority of the data, making it suitable for cases where anomalies are rare or poorly understood.
    Common algorithms for this setting (a minimal sketch follows this list):
    - K-means
    - One-class support vector machine
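
As a minimal illustration of the unsupervised setting, here is a sketch using scikit-learn’s One-Class SVM on synthetic, unlabeled data. The dataset and parameter choices below are our own, not part of the taxonomy above:

# Unsupervised setting: no labels, two planted outliers among normal points
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[6, 6], [-7, 5]]])

# nu roughly controls the fraction of points allowed outside the learned boundary
model = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X)
pred = model.predict(X)  # +1 = inlier, -1 = outlier
print(X[pred == -1])     # the planted outliers should appear here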

Here are some common approaches to anomaly detection:

  1. Statistical methods:
    Z-Score/Standard Score: Measures how many standard deviations a data point is from the mean; points that fall far from the mean are considered anomalies (a quick sketch follows this list).
    Percentiles: Identifies anomalies based on percentiles or quantiles, where values below or above a certain threshold are considered outliers.
  2. Machine learning algorithms:
    Isolation Forest: An ensemble method that builds random trees to isolate points; anomalies tend to be isolated in fewer splits, making them efficient to find.
    One-Class SVM: A support vector machine (SVM) model trained to classify data points as normal or outliers.
    K-Nearest Neighbors (KNN): Assigns an anomaly score based on the distance to the K nearest neighbors, with distant points being potential anomalies.
    Autoencoders: Neural networks that learn a compressed representation of data; a high reconstruction error can signal an anomaly.
  3. Clustering methods:
    DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters data points based on their density; points that do not belong to any cluster are considered outliers.
    K-Means Clustering: Data points that sit far from every well-defined cluster center may be considered anomalies.
  4. Time-series analysis:
    Moving Averages: Identifies anomalies as deviations from a moving average or exponential moving average.
    Seasonal Decomposition: Decomposes a time series into trend, seasonal, and residual components; anomalies are often detected in the residual component.
  5. Proximity-based approaches:
    Mahalanobis Distance: Measures the distance of data points from the center of the data distribution while accounting for correlations between features.
    Local Outlier Factor (LOF): Compares the local density around a data point to that of its neighbors; points in noticeably sparser regions than their neighbors are flagged as outliers.
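
As a quick illustration of the first approach, here is a minimal z-score sketch on synthetic data. The |z| > 3 cutoff is a common rule of thumb, not a universal threshold:

# Z-score outlier detection on synthetic data with two planted outliers
import numpy as np

rng = np.random.default_rng(42)
data = np.append(rng.normal(loc=50, scale=5, size=200), [95, 2])

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]  # rule of thumb: |z| > 3
print(outliers)                        # the planted values 95 and 2 are flagged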

Let’s dive a bit deeper into how DBSCAN works with a simple analogy

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clever way to find unusual or outlier data points in a group of data. Imagine you have a bunch of points on a map, and you want to find the weird ones that don’t really fit into any group.

Here’s how DBSCAN works:

Step 1: Select a starting point

  • Begin by randomly selecting a data point from your dataset.

Step 2: Define a radius (Epsilon) and a minimum number of points (Min_Samples)

Specify two important values:

  • Epsilon (a radius around the selected point).
  • Min_Samples (the minimum number of data points that must fall within this radius to form a cluster).

Step 3: Check neighboring points

  • Examine all data points within the defined radius (Epsilon) around the selected point.

Step 4: Form a cluster

  • If there are at least as many data points within the Epsilon radius as specified by Min_Samples, consider the selected point and these nearby points as a cluster.

Step 5: Expand the cluster

  • Now, for each point within this newly formed cluster, repeat the process. Check for nearby points within the Epsilon radius.
  • If additional points are found, add them to the cluster. This process continues iteratively, expanding the cluster until no more points can be added.

Step 6: Identify outliers (noise)

  • Any data points that are not included in any cluster after the expansion process are labeled as outliers or noise. These points do not belong to any cluster.

Imagine you have a field with a bunch of people scattered around, and you want to organize a game of tag. Some people are standing close together, and others are standing alone. DBSCAN helps you identify two things:

  1. Groups of Players: It starts by picking a person, any person, and puts an imaginary hula hoop around them (this is like setting a maximum distance). It then checks how many other people are inside that hula hoop. If there are enough (more than a number you decide in advance), it forms a group. Then, for each person inside that group, it repeats the hula-hoop check and adds anyone it finds, continuing until there is no one left to add. This group is like a team of players playing tag.
  2. Lonely Players: Anyone who doesn’t end up in a group is the outlier, or the “lonely player.” These are the people who don’t belong to any team; in data terms, they are the outliers.
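
Before turning to scikit-learn, here is a minimal from-scratch sketch of the loop described in Steps 1 to 6 above. It is for illustration only (the function and variable names are our own), and a brute-force neighbor search like this would be far too slow for large datasets:

import numpy as np

def dbscan_sketch(X, eps, min_samples):
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise/outlier until proven otherwise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):
        # Step 3: all points within the Epsilon radius of point i
        return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

    for i in range(n):                 # Step 1: pick an unvisited starting point
        if visited[i]:
            continue
        visited[i] = True
        seeds = neighbors(i)
        if len(seeds) < min_samples:   # not dense enough; stays labeled as noise
            continue
        labels[i] = cluster_id         # Step 4: form a cluster
        while seeds:                   # Step 5: expand the cluster
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = neighbors(j)
                if len(j_neighbors) >= min_samples:
                    seeds.extend(j_neighbors)
            if labels[j] == -1:
                labels[j] = cluster_id
        cluster_id += 1
    return labels                      # Step 6: points still labeled -1 are outliers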

To apply DBSCAN for outlier detection in Python using Scikit-Learn, we begin by importing the necessary libraries and modules, as follows:

Step 1: Import necessary libraries

  • The code starts by importing the required Python libraries, including NumPy for numerical operations, Matplotlib for data visualization, and the DBSCAN class from scikit-learn for implementing the DBSCAN algorithm.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

Step 2: Create a synthetic dataset

# Create a synthetic dataset with normal and anomalous data points
n_samples = 300
X, y = make_blobs(n_samples=n_samples, centers=2, random_state=42, cluster_std=1.0)
anomalies = np.array([[5, 5], [6, 6], [7, 7]])
  • In this step, a synthetic dataset is generated to illustrate the concept. The make_blobs function produces two clusters of normal data points; a few isolated anomalies are then added by hand.
  • n_samples determines the total number of data points, and the centers parameter specifies the number of clusters (2, in this case).
  • The anomalies variable is an array of manually created anomalous data points.

Step 3: Combine normal and anomalous data

# Combine the normal data and anomalies
X = np.vstack([X, anomalies])
  • The normal data and anomalies are combined into a single dataset represented by the X array using np.vstack.

Step 4: Visualize the dataset

# Visualize the dataset
plt.scatter(X[:, 0], X[:, 1], c='b', marker='o', s=25)
plt.title("Synthetic Dataset")
plt.show()
  • The code plots the dataset to provide a visual representation. It uses Matplotlib to create a scatter plot in which every point is drawn as a blue circle.
  • The resulting plot shows the two clusters, with the manually added anomalies sitting as a few isolated points away from them.

Step 5: Apply DBSCAN for anomaly detection

# Apply DBSCAN for anomaly detection
dbscan = DBSCAN(eps=1, min_samples=41)
labels = dbscan.fit_predict(X)

  • DBSCAN is applied for anomaly detection using the DBSCAN class from scikit-learn. The parameters eps (epsilon) and min_samples control the algorithm's behavior.
  • The eps parameter sets the radius within which points are considered neighbors.
  • The min_samples parameter defines the minimum number of points required to form a cluster.
  • The code then fits the DBSCAN model to the dataset using fit_predict to obtain cluster labels for each data point.
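
Choosing eps by eye can be hit-or-miss. A common heuristic (not part of the original walkthrough) is the k-distance plot: sort every point’s distance to its k-th nearest neighbor and look for the “elbow” of the curve, which suggests a reasonable eps. Here k is set to match min_samples:

# k-distance plot: a common heuristic for picking eps
from sklearn.neighbors import NearestNeighbors

k = 41
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)      # distances to each point's k nearest neighbors
k_dist = np.sort(distances[:, -1])   # distance to the k-th neighbor, sorted ascending
plt.plot(k_dist)
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to 41st nearest neighbor")
plt.title("k-distance plot for choosing eps")
plt.show()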

Step 6: Identify anomalies

# Anomalies are considered as points with label -1
anomalies = X[labels == -1]
  • Anomalies are identified by finding data points labeled as -1. These points do not belong to any cluster and are considered outliers or anomalies.

Step 7: Visualize the anomalies

# Visualize the anomalies
plt.scatter(X[:, 0], X[:, 1], c='b', marker='o', s=25)
plt.scatter(anomalies[:, 0], anomalies[:, 1], c='r', marker='x', s=50, label='Anomalies')
plt.title("Anomaly Detection with DBSCAN (Anomalies Outside Clusters)")
plt.legend()
plt.show()
  • The code plots the anomalies found by DBSCAN in red crosses on top of the original data points.
  • This visualization helps to highlight the anomalies detected by the algorithm.

Step 8: Print the identified anomalies

# Print the identified anomalies
print("Identified Anomalies:")
print(anomalies)
  • The code concludes by printing the coordinates of the identified anomalies, allowing you to see the specific data points classified as anomalies by the DBSCAN algorithm.

By following these steps, you can effectively identify anomalies with DBSCAN and visualize the results.

Conclusion

DBSCAN is a valuable tool for anomaly detection, offering a data-driven approach to uncovering outliers in complex datasets. By following the step-by-step guide and code provided in this blog post, you can integrate DBSCAN into your own data analysis projects, enhance your anomaly detection capabilities, and make more informed decisions based on the unique insights that outliers can provide.

Follow Simform Engineering to keep yourself updated with the latest trends in the technology horizon. Follow us: Twitter | LinkedIn
