Clustering by DBSCAN (Density-Based Spatial Clustering of Applications with Noise) Clearly Explained with Coding in Python

Shrimanta Satpati
8 min read · Dec 7, 2023


Clustering is a technique in machine learning and data analysis that involves grouping similar data points together based on certain features or characteristics. The goal of clustering is to find natural groupings within a dataset without explicit supervision, meaning the algorithm tries to discover patterns in the data without being explicitly told what to look for.

But the most popular clustering techniques, such as K-Means (assumes spherical clusters), Gaussian Mixture Models (GMM) (assume a normal distribution within each cluster), and Hierarchical Clustering (high computational complexity), suffer from the following limitations.

  1. They require substantial domain knowledge to determine the input parameters (e.g., the number of clusters). (K-Means, GMM)
  2. They are unable to discover clusters of arbitrary shape (non-spherical or non-Gaussian). (K-Means, GMM)
  3. They have difficulty handling noisy data or outliers and are sensitive to the choice of distance metric. (Hierarchical Clustering)
DBSCAN handles arbitrarily shaped clusters.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm is able to overcome these limitations.

  • It infers the number of clusters from the data.
  • It can discover clusters of arbitrary shape.
  • It is robust to outliers and works with any distance metric.

DBSCAN relies on a density-based notion of clusters, which is designed to discover clusters of arbitrary shape: it defines clusters as contiguous regions of high density. It can be used for anomaly/outlier detection as well.

Below we provide the preliminaries needed to understand the various aspects of the DBSCAN model.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

DBSCAN: Preliminaries

The DBSCAN technique has two parameters:

- eps (ε, epsilon): The radius of the neighborhood around a data point p.

- min_samples: The minimum number of data points we want in a neighborhood to define a cluster.

For each instance, the algorithm counts how many instances are located within a small distance epsilon (ε) of it. This region is called the instance's ε-neighborhood.
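To make this concrete, here is a minimal sketch (on made-up random data, not the article's dataset) that counts each point's ε-neighbors with plain NumPy:

import numpy as np

# Made-up 2-D data, purely for illustration
rng = np.random.default_rng(42)
points = rng.random((100, 2))
eps = 0.1

# Pairwise Euclidean distances (n x n), then count neighbors within eps;
# each point's count includes the point itself (distance 0)
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
neighbor_counts = (dists <= eps).sum(axis=1)
print(neighbor_counts[:10])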

The DBSCAN clustering technique: core (red), border (yellow), and noise (blue) points.

The DBSCAN algorithm involves three different types of points.

1. Core Points

A point p is a core point if at least min_samples points are within distance ε (eps) of it, including p itself. That is, min_samples is the minimum number of neighbors a given point must have in order to be classified as a core point.

Point A and the other red points are core points, because the area surrounding each of them within an ε radius contains at least 4 points (including the point itself). The clusters are built around the core points.

So, by adjusting the min_samples parameter, we can fine-tune how dense our clusters must be. All instances in the neighborhood of a core point belong to the same cluster. This neighborhood may include other core points; therefore, a long sequence of neighboring core points forms a single cluster (the red points).

2. Border Points

The yellow points are non-core, or border, points: they are within the ε radius of a core point but do not themselves have the minimum number of neighbors.

3. Noise/Anomaly/Outlier Points

Any instance that is neither a core point nor close enough to a cluster to be density-reachable from a core point is considered noise/an anomaly/an outlier (the blue point).
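These three definitions can be sketched directly in code. The following is a minimal illustration on made-up data (NearestNeighbors is used only for the radius queries); it is not how Scikit-Learn's DBSCAN is implemented internally:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_demo = rng.random((200, 2))  # made-up data
eps, min_samples = 0.1, 5

nn = NearestNeighbors(radius=eps).fit(X_demo)
neighborhoods = nn.radius_neighbors(X_demo, return_distance=False)

# Core: at least min_samples points (itself included) within eps
is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])
core_set = set(np.flatnonzero(is_core))

# Border: not core, but a core point lies in its eps-neighborhood
is_border = np.array([
    (not is_core[i]) and any(j in core_set for j in nb)
    for i, nb in enumerate(neighborhoods)
])

# Noise: neither core nor border
is_noise = ~(is_core | is_border)
print(is_core.sum(), "core,", is_border.sum(), "border,", is_noise.sum(), "noise")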

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Time-Complexity

DBSCAN visits each point of the dataset, possibly multiple times (e.g., as a candidate for different clusters).

Without an index, for each point it scans every other point in the dataset, computing the distance and checking it against ε.

i. Thus, the worst-case run-time complexity is O(n²).

ii. If an indexing structure is used, a neighborhood query executes in O(log n).

iii. This gives an overall average run-time complexity of O(n log n).
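As a rough sketch of why an index helps, the following uses Scikit-Learn's KDTree on made-up data; Scikit-Learn's DBSCAN picks such an index automatically via its algorithm parameter ('auto', 'ball_tree', 'kd_tree', or 'brute'):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(1)
data = rng.random((10_000, 2))  # made-up data

tree = KDTree(data)                        # build the index once
ind = tree.query_radius(data[:1], r=0.05)  # one neighborhood query, no full scan
print(len(ind[0]), "neighbors within r of the first point")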

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Coding

In this notebook we use Scikit-Learn’s DBSCAN object to discover clusters in arbitrarily shaped data.

import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn import metrics

Creating a Synthetic Dataset

To create arbitrary shaped clusters, we use Scikit-Learn’s “datasets.make_circles” function. It creates a large circle containing a smaller circle in 2D.

The following two parameters are used by the make_circles function:

- factor: Scale factor between the inner and outer circle (a value between 0 and 1).

- noise: Standard deviation of Gaussian noise added to the data.

X, y = make_circles(n_samples=1000, noise=0.05, factor=0.6)

plt.figure(figsize=(12, 6))
plt.scatter(X[:, 0], X[:, 1], s=5)

plt.title("Two Arbitrary Shaped Clusters", fontsize=16)
plt.xlabel("$x_1$", fontsize=14)
plt.ylabel("$x_2$", fontsize=14, rotation=0)
plt.show()

How do we set these two parameters of DBSCAN?

eps:

- Very small value: There wouldn’t be enough points in a region to form a cluster, which would make most of the points outliers.

- Very large value: The majority of the points would fall into one cluster, and there would be almost no outliers. More often than not, a smaller value is preferred for eps.

min_samples:

- Very large value: More points would be needed to form a cluster, which would leave a major chunk of the points as outliers.

- Very small value: Clusters would form even from points that should have been outliers. One common heuristic for picking eps is shown below.
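The heuristic (not used in the original notebook, so treat it as an optional aid) is the k-distance plot: sort every point's distance to its k-th nearest neighbor, with k = min_samples, and look for the "elbow" in the curve.

import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 5  # the intended min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)     # column 0 is the point itself (distance 0)
k_dist = np.sort(distances[:, -1])  # distance to the k-th neighbor, sorted

plt.figure(figsize=(8, 4))
plt.plot(k_dist)
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to %d-th nearest neighbor" % k)
plt.title("k-distance plot: the elbow suggests a reasonable eps")
plt.show()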

dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)

Output: DBSCAN(algorithm='auto', eps=0.05, leaf_size=30, metric='euclidean', metric_params=None, min_samples=5, n_jobs=None, p=None)

Cluster Labels

The labels of all the instances are now available in the labels_ instance variable. We see that some instances have a cluster index equal to -1, which means that they are considered anomalies by the algorithm.

- The indices of the core instances are available in the core_sample_indices_ instance variable.

- The core instances themselves are available in the components_ instance variable.

#print(dbscan.labels_)
np.unique(dbscan.labels_)

Output: array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51])
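A quick way to summarize these labels is to count the clusters and the noise points (a small sketch using the fitted model above; label -1 marks noise):

# Count clusters (excluding the -1 noise label) and noise points
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
n_noise = int(np.sum(dbscan.labels_ == -1))
print(n_clusters, "clusters,", n_noise, "noise points")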

Clustering: Observation and Plot the Clusters

Notice that 52 clusters are created (labels 0 through 51), which is clearly incorrect.

We use the following function to plot the clusters discovered by DBSCAN.

def plot_dbscan(dbscan, X, size, show_xlabels=True, show_ylabels=True):
    # Boolean masks for the three point types
    core_mask = np.zeros_like(dbscan.labels_, dtype=bool)
    core_mask[dbscan.core_sample_indices_] = True
    anomalies_mask = dbscan.labels_ == -1
    non_core_mask = ~(core_mask | anomalies_mask)

    cores = dbscan.components_
    anomalies = X[anomalies_mask]
    non_cores = X[non_core_mask]

    # Core points: large colored circles with a small star on top
    plt.scatter(cores[:, 0], cores[:, 1],
                c=dbscan.labels_[core_mask], marker='o', s=size, cmap="Paired")
    plt.scatter(cores[:, 0], cores[:, 1], marker='*', s=20,
                c=dbscan.labels_[core_mask])
    # Anomalies: red crosses
    plt.scatter(anomalies[:, 0], anomalies[:, 1], c="r", marker="x", s=100)
    # Border (non-core) points: small dots colored by cluster
    plt.scatter(non_cores[:, 0], non_cores[:, 1],
                c=dbscan.labels_[non_core_mask], marker=".")

    if show_xlabels:
        plt.xlabel("$x_1$", fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)
    plt.title("eps={:.2f}, min_samples={}".format(dbscan.eps, dbscan.min_samples),
              fontsize=14)
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(X, dbscan.labels_))

plt.figure(figsize=(10, 6))
plot_dbscan(dbscan, X, size=100)
plt.show()

Output: Silhouette Coefficient: 0.170

Improve the Quality of Clustering

- Fine-tune how dense our clusters must be by adjusting the min_samples parameter.

- Control the number of clusters by the ε parameter.

In the above figure we have so many clusters because the epsilon (ε) parameter was very small. I.e., the radius of the neighborhoods around a data point was very small.

Below we increase epsilon (ε) to reduce the number of clusters.

dbscan2 = DBSCAN(eps=0.1, min_samples=5)
dbscan2.fit(X)

Output: DBSCAN(algorithm='auto', eps=0.1, leaf_size=30, metric='euclidean', metric_params=None, min_samples=5, n_jobs=None, p=None)

np.unique(dbscan2.labels_)

Output: array([-1, 0, 1])

print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, dbscan2.labels_))

plt.figure(figsize=(10, 6))

plot_dbscan(dbscan2, X, size=600, show_ylabels=False)

plt.show()

Output: Silhouette Coefficient: -0.055

labels = dbscan2.labels_

print(len(labels))

print(X.shape)

plt.figure(figsize=(12, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=5, cmap='viridis');

Output: 1000

(1000, 2)

Observation

We see that by increasing epsilon (ε), DBSCAN is able to discover the two clusters that were in the original dataset.

Increase the Density of the Clusters

We can increase the density of the clusters by adjusting the min_samples parameter.

We increase the min_samples parameter in the above model.

dbscan3 = DBSCAN(eps=0.1, min_samples=12)
dbscan3.fit(X)

Output: DBSCAN(algorithm='auto', eps=0.1, leaf_size=30, metric='euclidean', metric_params=None, min_samples=12, n_jobs=None, p=None)

np.unique(dbscan3.labels_)

Output: array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, dbscan3.labels_))

plt.figure(figsize=(10, 6))

plot_dbscan(dbscan3, X, size=600, show_ylabels=False)

plt.show()

Output: Silhouette Coefficient: 0.014

Observation

We observe that when we try to increase the density of the previous two clusters by raising min_samples, DBSCAN instead splits them into a larger number of smaller clusters (and more outliers).

Thus, we should fine-tune the eps and min_samples parameters to discover the optimal number of clusters.

Limitations

Due to its distance-based calculations, DBSCAN suffers from the curse of dimensionality. It also cannot cluster datasets well when clusters have large differences in density.

This is because no single min_samples-eps combination can then be chosen appropriately for all clusters.

- DBSCAN also struggles to separate clusters of similar density when there is significant overlap between their distributions.

DBSCAN is a powerful technique for both clustering and anomaly detection.
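For anomaly detection, the instances labeled -1 can simply be extracted; a one-line sketch using the dbscan2 model fitted above:

# Instances DBSCAN labeled as noise can be treated as anomalies
anomalies = X[dbscan2.labels_ == -1]
print(anomalies.shape)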

Conclusion

In this article, I provided a comprehensive explanation of the DBSCAN clustering algorithm and highlighted its advantages over other clustering algorithms. It is also worth noting an improved, more recent variant called HDBSCAN, which combines Hierarchical Clustering with traditional DBSCAN. HDBSCAN is recognized for its enhanced speed and accuracy, surpassing the performance of DBSCAN, particularly on clusters of varying density.
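As a pointer, here is a minimal, hedged sketch of HDBSCAN on the same two-circle data. It assumes scikit-learn >= 1.3, which ships sklearn.cluster.HDBSCAN (the standalone hdbscan package offers a similar API); min_cluster_size=10 is an illustrative choice, not a tuned value:

# Assumes scikit-learn >= 1.3 (sklearn.cluster.HDBSCAN)
from sklearn.cluster import HDBSCAN

hdb = HDBSCAN(min_cluster_size=10)  # illustrative parameter, not tuned
hdb.fit(X)
print(np.unique(hdb.labels_))       # -1 marks noise, as in DBSCAN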

I hope you enjoyed this article. In case you have any queries, let me know in the comments below.
