DBSCAN Parameter Estimation Using Python

Tara Mullin
Jul 10, 2020 · 3 min read


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised machine learning technique used to identify clusters of varying shape in a data set (Ester et al. 1996). Another post I wrote goes into what DBSCAN is and when to use it. You can find it here. This post will focus on estimating DBSCAN’s two parameters (the short snippet after the list shows where they plug into scikit-learn’s DBSCAN):

  1. Minimum samples (“MinPts”): the minimum number of points required to form a cluster
  2. ε (epsilon or “eps”): the maximum distance two points can be from one another while still belonging to the same cluster
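
For reference, here is a minimal sketch of where these two parameters plug into scikit-learn’s DBSCAN estimator. The parameter values are placeholders only, and dataset stands in for your (n_samples, n_features) feature array:

from sklearn.cluster import DBSCAN

# placeholder values; the rest of this post walks through estimating them
# "dataset" stands in for your (n_samples, n_features) feature array
db = DBSCAN(eps=0.5, min_samples=4)
labels = db.fit_predict(dataset)  # points labeled -1 are treated as noise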

Minimum Samples (“MinPts”)

There is no automatic way to determine the MinPts value for DBSCAN. Ultimately, the MinPts value should be set using domain knowledge and familiarity with the data set. From some research I’ve done, here are a few rules of thumb for selecting the MinPts value (applied in the short sketch after this list):

  • The larger the data set, the larger the value of MinPts should be
  • If the data set is noisier, choose a larger value of MinPts
  • Generally, MinPts should be greater than or equal to the dimensionality of the data set
  • For 2-dimensional data, use DBSCAN’s default value of MinPts = 4 (Ester et al., 1996).
  • If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim is the dimensionality of your data set (Sander et al., 1998).
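
In code, these rules boil down to something like the sketch below. This is an illustration only: dataset is assumed to be a NumPy array of shape (n_samples, n_features), and the result should still be checked against domain knowledge.

# rule-of-thumb starting point for MinPts based on dimensionality
dim = dataset.shape[1]
min_pts = 4 if dim <= 2 else 2 * dim
print("Suggested MinPts:", min_pts)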

Epsilon (ε)

After you select your MinPts value, you can move on to determining ε. One technique to automatically determine the optimal ε value is described in this paper. This technique calculates the distance between each point and its k nearest neighbors, where k = the MinPts value you selected. The k-distances (in the example below, the distance to each point’s kth nearest neighbor) are then plotted in ascending order on a k-distance graph. You’ll find the optimal value for ε at the point of maximum curvature: the “elbow” where the slope of the graph changes most sharply.

Example

Recently, I worked with a data set with 10 dimensions. Following the rules above, my minimum samples value should be at least 10. Ultimately, I decided to go with 20 (2*dim), as suggested in papers 1 and 2 listed under resources at the end of this post.

After I selected my MinPts value, I used NearestNeighbors from Scikit-learn (documentation here) to calculate the distance from each point to its nearest neighbors. The one parameter you need to define is n_neighbors, which in this case is the value you chose for MinPts.

Step 1: Import Libraries

import numpy as np
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt

Step 2: Calculate the distance from each point in the data set to its 20 nearest neighbors (my selected MinPts value).

# fit on the data set, then query the same points; note that each point is
# returned as its own nearest neighbor at distance 0
neighbors = NearestNeighbors(n_neighbors=20)
neighbors_fit = neighbors.fit(dataset)
distances, indices = neighbors_fit.kneighbors(dataset)

Step 3: Sort the k-distance values (each point’s distance to its farthest neighbor from Step 2) in ascending order and plot

# keep each point's largest neighbor distance (its k-distance) and sort ascending
distances = np.sort(distances[:, -1])
plt.plot(distances)
plt.show()

For my data set, the sorted distances produce a k-distance elbow plot that looks like this:

Points sorted by distance to the 20th nearest neighbor

The ideal value for ε will be equal to the distance value at the “crook of the elbow”, i.e. the point of maximum curvature. This is the point of diminishing returns: below the elbow, most points lie within that distance of their kth nearest neighbor, so they sit in dense regions; above the elbow, the k-distances climb steeply, which means those points are increasingly isolated and are better treated as noise. A much smaller ε would fragment dense regions into many small clusters and extra noise points, while a much larger ε would start merging genuinely distinct clusters together.
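
If you would rather not pick the elbow by eye, a knee-detection utility can estimate the point of maximum curvature for you. The sketch below uses the third-party kneed package, which is my own suggestion rather than part of the original workflow (install it with pip install kneed); it takes the sorted k-distances from Step 3:

from kneed import KneeLocator

# "distances" is the sorted array of k-distances from Step 3
kneedle = KneeLocator(
    x=range(len(distances)),
    y=distances,
    curve="convex",         # the sorted k-distance curve bends upward
    direction="increasing", # distances are in ascending order
)
print("Estimated eps:", kneedle.knee_y)  # distance value at the elbow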

Zoom in on “elbow” optimization point, points sorted by distance to the 20th nearest neighbor

Zooming in on my k-distance plot, it looks like the optimal value for ε is around 225. I ended up looping through combinations of MinPts and ε values slightly above and below the values estimated here to find the model of best fit.
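
To make that last step concrete, here is a rough sketch of the kind of parameter sweep I mean. The candidate values and the scoring criterion are assumptions for illustration (silhouette score on the non-noise points is one common way to compare clusterings; what counts as the best fit ultimately depends on your data and goals), and dataset is assumed to be a NumPy array:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

best = None
for eps in [200, 225, 250]:            # candidates around the elbow estimate
    for min_samples in [15, 20, 25]:   # candidates around the MinPts estimate
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(dataset)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            continue  # silhouette score needs at least two clusters
        mask = labels != -1  # score only the non-noise points
        score = silhouette_score(dataset[mask], labels[mask])
        if best is None or score > best[0]:
            best = (score, eps, min_samples)

print("Best (score, eps, min_samples):", best)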

Resources that helped me estimate the parameters for my DBSCAN model and write this post:

  1. Sander, Ester, Kriegel & Xu (1998), “Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications”: https://link.springer.com/article/10.1023%2FA%3A1009745219419
  2. Schubert, Sander, Ester, Kriegel & Xu (2017), “DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN”: http://www.ccs.neu.edu/home/vip/teach/DMcourse/2_cluster_EM_mixt/notes_slides/revisitofrevisitDBSCAN.pdf
  3. Datanovia, “DBSCAN: Density-Based Clustering Essentials”: https://www.datanovia.com/en/lessons/dbscan-density-based-clustering-essentials/
  4. Towards Data Science, determining the optimal value for epsilon (eps) in DBSCAN: https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc
