DBSCAN — Overview, Example, & Evaluation

Tara Mullin
3 min read · Jul 10, 2020


DBSCAN Overview

Clustering is an unsupervised learning technique used to group data based on similar characteristics when no pre-specified group labels exist. This technique is used for statistical data analysis across many fields.

There are various algorithms that can be used to perform clustering. Recently, I experimented with a clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN is a density-based clustering algorithm used to identify clusters of varying shape and size within a data set (Ester et al. 1996).

Advantages of DBSCAN over other clustering algorithms:

  1. DBSCAN does not require a pre-determined number of clusters.
  2. DBSCAN identifies outliers as noise, instead of forcing them into a cluster.
  3. DBSCAN is more flexible about the size and shape of clusters than partitioning methods such as K-means. It can identify clusters that differ in size and shape from one another, which makes it more useful for messy, real-life data (the sketch below shows a toy comparison).
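
To make the third point concrete, here is a minimal sketch (not from the original post) comparing the two algorithms on scikit-learn's "two moons" toy data set; the eps and min_samples values are illustrative, not a recommendation.

import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-spherical clusters
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# K-means assumes roughly spherical clusters and splits each moon
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows the density of the points and recovers each moon intact
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print(np.unique(dbscan_labels))  # cluster labels, plus -1 for any noise points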

DBSCAN Parameters

DBSCAN has two parameters. The first is ε, epsilon (“eps”), which defines the maximum distance between two points for one to be considered in the neighborhood of the other. The second is minimum samples (“MinPts”), which defines the minimum number of data points required to form a distinct cluster. So, MinPts is the minimum number of neighbors a point must have within a radius of eps for a cluster to form around it. You can find a post I wrote about DBSCAN parameter estimation here.
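
To build intuition for how these parameters interact, here is a small illustrative sketch (again not from the original post; the toy data and eps values are made up) that sweeps eps and reports how the clustering changes.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three well-separated blobs of synthetic data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

for eps in (0.2, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # -1 is DBSCAN's noise label
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")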

Example model

I recently built my own DBSCAN model. I chose DBSCAN primarily because you don’t need to specify the number of clusters. I didn’t know how many groups existed within the data set and identifying this number was one of the main points of my project. I used Scikit Learn to run the DBSCAN algorithm. See sklearn.cluster.DBSCAN documentation here.

from sklearn import cluster

I had previously estimated the DBSCAN parameters (more detail here) as MinPts = 20 and ε = 225. The data passed to DBSCAN should be array-like or a sparse matrix with shape (n_samples, n_features).
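
For context, data_array in the code below is assumed to be a NumPy array of feature columns pulled from the original dataframe; the column names here are hypothetical placeholders.

# Hypothetical feature columns; substitute the ones in your own dataframe
feature_cols = ['feature_1', 'feature_2']
data_array = df[feature_cols].to_numpy()  # shape: (n_samples, n_features)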

# Parameters estimated beforehand (see the post linked above)
eps = 225
min_samples = 20

# Fit the model and return a cluster label for every row of data_array
dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples)
clustering_labels = dbscan.fit_predict(data_array)

Append the results (clustering_labels) to the original dataframe, i.e. the dataframe the data array was derived from.

df['labels'] = clustering_labels
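
A quick sanity check I find useful at this point (my own addition, not in the original post): count the points per cluster. DBSCAN reserves the label -1 for noise points.

# Points per cluster; the -1 row, if present, counts the noise points
print(df['labels'].value_counts())

n_clusters = df['labels'].nunique() - (1 if -1 in df['labels'].values else 0)
print(f"DBSCAN found {n_clusters} clusters")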

Evaluate Model Performance — Mean Silhouette Coefficient

If the true cluster labels are unknown, as was the case with my data set, the model itself must be used to evaluate performance. An example of this type of evaluation is the Silhouette Coefficient.

The Silhouette Coefficient is bounded between -1 and 1. The best value is 1, the worst is -1. A higher score indicates that the model has better-defined, denser clusters. Values close to 0 indicate overlapping clusters, while negative values usually indicate that data points have been assigned to the wrong clusters. A paper about silhouettes can be found here.

Two scores are used to calculate the silhouette coefficient:

  • a: The average distance between a data point and all other points in its own cluster.
  • b: The average distance between a data point and all points in the nearest cluster it does not belong to.
For a single point, the Silhouette Coefficient is s = (b − a) / max(a, b); the model's score is the mean of s over all data points.
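
As a sketch of what that formula computes (my own illustration, with made-up points):

import numpy as np

def silhouette_point(point, same_cluster, nearest_cluster):
    # same_cluster holds the *other* points in the point's own cluster
    a = np.mean([np.linalg.norm(point - q) for q in same_cluster])
    b = np.mean([np.linalg.norm(point - q) for q in nearest_cluster])
    return (b - a) / max(a, b)

# The point sits close to its own cluster and far from the nearest one
s = silhouette_point(np.array([0.0, 0.0]),
                     [np.array([0.1, 0.0]), np.array([0.0, 0.2])],
                     [np.array([3.0, 3.0]), np.array([3.0, 4.0])])
print(round(s, 3))  # close to 1, i.e. well clustered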

I used Scikit Learn to calculate the Silhouette Coefficient. Documentation here.

from sklearn import metrics
# Score on the feature array, not the dataframe that now holds the label column
metrics.silhouette_score(data_array, clustering_labels)
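
One caveat worth adding (my note, not from the original post): DBSCAN's noise points carry the label -1, and silhouette_score will treat them as one more cluster, which can drag the score down. A hedged sketch of excluding them first:

import numpy as np

# Noise points (label -1) are not a real cluster; drop them before scoring
labels = np.asarray(clustering_labels)
mask = labels != -1
score = metrics.silhouette_score(data_array[mask], labels[mask])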

My model received a Silhouette Coefficient of 0.461. This is a moderate score, suggesting that the clusters are reasonably well defined, without heavy overlap or widespread mislabeling.

