How does the DB-SCAN algorithm work?

ishika chatterjee · Published in Analytics Vidhya · Oct 6, 2019 · 9 min read

The word cluster means a group of similar objects that behave in a similar way. Clustering is one of the most important problems in unsupervised learning: when no output variable is given, we still need to understand the structure of the data, and that is where clustering algorithms come into play.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed in 1996 by Ester, Kriegel, Sander, and Xu, researchers from the database and data mining community.

Now before we begin, let’s understand the core concept of DB-SCAN.

Given a dataset of points (dataset D = {xi}), we have to partition the points into dense regions, which we will call clusters, and sparse regions, which may contain noise.

A sparse region is one where the density of points is low; a dense region is one where the density of points is high.

Let's cover a few basic concepts of DBSCAN:

eps: specifies how close points should be to each other to be considered a part of a cluster. It means that if the distance between two points is lower or equal to this value (eps), these points are considered to be neighbors.

minPoints: the minimum number of points required to form a dense region. For example, if we set minPoints to 5, then we need at least 5 points to form a dense region. In scikit-learn this parameter is called min_samples.

There are three types of points in this algorithm: noise points, core points, and border points. Let's discuss each briefly:

Core point:

  • A point is a core point if it has at least minPoints points (including itself) within an eps radius around it.
  • A core point always belongs to a dense region.
  • For example, a point ‘p’ is a core point if ‘p’ has ≥ minPoints points in an eps radius around it.

Border point:

  • A point is said to be a border point if it has fewer than minPoints points within eps, but lies in the neighborhood of a core point.
  • For example, ‘p’ is a border point if ‘p’ is not a core point (i.e. ‘p’ has < minPoints points in its eps radius), but ‘p’ belongs to the neighborhood of some core point ‘q’.
  • That is, p ∈ neighborhood of q and distance(p, q) ≤ eps.

Noise point:

  • A noise point is any point that is neither a core point nor a border point.
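These three definitions can be checked directly with a few lines of NumPy. The following is a small sketch on toy 2-D points; the function name and the data are illustrative, not from the post:

```python
import numpy as np

def classify_points(X, eps, min_points):
    """Label each row of X as core, border, or noise."""
    # Pairwise Euclidean distances (brute force, fine for toy data)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps                      # the eps-ball includes the point itself
    core = neighbors.sum(axis=1) >= min_points    # at least minPoints inside eps
    # Border: not core, but within eps of some core point
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    noise = ~core & ~border                       # everything else
    return core, border, noise

# Toy data: a dense group, one attached border point, one far-away point
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [2, 1], [10, 10]], dtype=float)
core, border, noise = classify_points(X, eps=1.2, min_points=4)
# core: the first five points; border: [2, 1]; noise: [10, 10]
```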

Density Edge:

If a and b are both core points and distance(a, b) ≤ eps, then we can connect the vertices a and b in a graph and call that edge a “density edge”.

Density Connected Points:

Two points a and b are said to be density-connected if both a and b are core points and there exists a path of density edges connecting point a to point b.

DBSCAN algorithm:

So, given all these terms and ideas, let’s go into the core of the DBSCAN algorithm to see how it works.

Algorithm:

Step 1: ∀ xi ∈ D, label each point as a core, border, or noise point.

Step 2: Eliminate all noise points (they belong to sparse regions and do not belong to any cluster).

Step 3: For every core point p that has not yet been assigned to a cluster

a) Create a new cluster with the point p and

b) add all the points that are density-connected to p.

Step 4: Assign each border point to the cluster of its closest core point.
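The four steps above can be sketched as a minimal pure-NumPy implementation. This uses brute-force distances and illustrative names; production libraries use spatial indexes instead:

```python
import numpy as np

def dbscan(X, eps, min_points):
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_points for nb in neighbors])  # step 1

    labels = np.full(n, -1)    # steps 1-2: everything starts as noise (-1)
    cluster = 0
    for i in range(n):         # step 3: grow a cluster from each free core point
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster      # step 4: border points join here too
                    if core[k]:
                        stack.append(k)      # only core points keep expanding
        cluster += 1
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [2, 1], [10, 10]], dtype=float)
labels = dbscan(X, eps=1.2, min_points=4)
# one cluster of six points (the last of them a border point), one noise point
```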

How the algorithm works:

First we choose two parameters, a positive number epsilon and a natural number minPoints. We then begin by picking an arbitrary point in our dataset. If there are more than minPoints points within a distance of epsilon from that point, (including the original point itself), we consider all of them to be part of a “cluster”. We then expand that cluster by checking all of the new points and seeing if they too have more than minPoints points within a distance of epsilon, growing the cluster recursively if so.

Eventually, we run out of points to add to the cluster. We then pick a new arbitrary point and repeat the process. Now, it’s entirely possible that a point we pick has fewer than minPoints points in its epsilon ball, and is also not a part of any other cluster. If that is the case, it’s considered a “noise point” not belonging to any cluster.

(There’s a slight complication worth pointing out: say minPoints = 4, and you have a point with three points in its epsilon ball, including itself. Say the other two points belong to two different clusters, and each has 4 points in its epsilon ball. Then both clusters will “fight over” the original point, and it’s arbitrary which of the two it ends up in: since DBSCAN considers the points in an arbitrary order, the point can end up in either cluster on different runs. This kind of point is known as a “border point”.)

When does DBSCAN work well?

  • It can handle Noise very well.
  • DB-SCAN can handle clusters of different shapes and sizes.

When not!

  • DBSCAN is powerful and well suited to many applications, but no single algorithm always works; every pro has its cons.
  • If your dataset has varying densities, DBSCAN tends to fail, because a single (eps, minPoints) pair cannot fit all regions.
  • It is quite sensitive to its hyperparameters: a slight change in eps or minPoints can lead to a drastic change in the outcome.
  • Density-based notions do not work well in high-dimensional data, where distances become less meaningful, so we should avoid applying it directly to, for example, raw text data.

Time and Space complexity:

Time Complexity: O(n log n) with a spatial index; O(n²) in the worst case.

Space Complexity: O(n)

Implementation:

We can use one of the many available libraries; in particular, scikit-learn provides a DBSCAN implementation.
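For instance, a minimal scikit-learn usage sketch on toy data (the eps and min_samples values here are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight toy blobs plus one far-away outlier
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
              [50.0, 50.0]])
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
# each blob gets its own cluster id; the outlier is labeled -1 (noise)
```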

Example with code:

Example of DBSCAN Algorithm with Scikit-Learn:

To see one realistic example of the DBSCAN algorithm, I have used Canada weather data for the year 2014 to cluster weather stations. First, let's load the data —

The data-frame consists of 1341 rows and 25 columns; to understand what the column names represent, let's take a look at the most important features below —

Since I wanted to use the different temperatures as the main features for a first attempt at clustering the weather stations, let's first drop the rows that contain NaN values in the columns ‘Mean Temperature (Tm)’, ‘Minimum Temperature (Tn)’, and ‘Maximum Temperature (Tx)’.

After dropping the rows containing NaN values in the above-mentioned columns, we are left with 1255 samples. That is roughly 7% data loss, but since we still have more than 1000 samples, let's go ahead with the clustering.
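The dropping step might look like this, sketched here on a tiny mock data-frame with the same short column names, since the original data file is not included in the post:

```python
import numpy as np
import pandas as pd

# Mock frame using the short column names from the post ('Tm', 'Tn', 'Tx')
weather_df = pd.DataFrame({
    'Tm': [5.0, np.nan, 7.2],
    'Tn': [-1.0, 2.0, np.nan],
    'Tx': [10.0, 12.0, 14.0],
})
weather_df = weather_df.dropna(subset=['Tm', 'Tn', 'Tx'])
# only the first, fully populated row survives;
# on the real data, 1255 of 1341 rows remain
```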

Since we want to do spatial clustering and view the clusters in a map projection, along with the different temperatures (‘Tm’, ‘Tn’, ‘Tx’), ‘Lat’ and ‘Long’ should also be taken as features. Here, I have used the Basemap Toolkit, a library for plotting 2D data on maps in Python. As the Basemap documentation puts it, “Basemap does not do any plotting on its own, but provides the facilities to transform coordinates to one of 25 different map projections”. One of the important properties of Basemap is that calling a Basemap class instance with latitude/longitude arguments (in degrees, as in our data-frame) returns x/y map projection coordinates.

Let's plot the weather stations using Basemap to get familiar with it. Start by importing the necessary libraries —

from mpl_toolkits.basemap import Basemap
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from PIL import Image
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14,10)

We are ready to call the Basemap class now —
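A sketch of what that call might look like; the projection, resolution, and corner parameters follow the description below, while the exact corner coordinates for Canada are illustrative assumptions, not the author's values:

```python
from mpl_toolkits.basemap import Basemap

# Mercator projection, low resolution; corner coordinates are illustrative
my_map = Basemap(projection='merc', resolution='l',
                 llcrnrlon=-140., llcrnrlat=40.,   # lower-left corner
                 urcrnrlon=-50., urcrnrlat=75.)    # upper-right corner
my_map.drawcoastlines()
my_map.drawcountries()
# land-sea mask with land in orange and ocean in sky blue
my_map.drawlsmask(land_color='orange', ocean_color='skyblue')
```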

Let me explain the code block in brief. I started off calling a Basemap class instance with the Mercator projection (projection='merc') and low resolution (resolution='l'); the boundaries of the map domain are given by the 4 parameters llcrnrlon, llcrnrlat, urcrnrlon, and urcrnrlat, where llcrnrlon denotes the longitude of the lower-left corner of the selected map domain, and so on. drawcoastlines and drawcountries do exactly what their names suggest, while drawlsmask draws a land-sea mask as an image, with the land and ocean colors set to orange and sky blue. Latitude and longitude are converted to x/y map projection coordinates using the command below —

xs, ys = my_map(np.asarray(weather_df.Long), np.asarray(weather_df.Lat))

These map projection coordinates will be used as features to cluster the data points spatially along with the temperatures. First let’s take a look below the weather stations —

Weather Stations in Canada, plotted using Basemap.

Clustering the Weather Data (Temperatures & Coordinates as Features)

For clustering data, I’ve followed the steps shown in scikit-learn demo of DBSCAN.

Choosing temperatures (‘Tm’, ‘Tx’, ‘Tn’) and x/y map projections of coordinates (‘xm’, ‘ym’) as features and, setting ϵ and MinPts to 0.3 and 10 respectively, gives 8 unique clusters (noise is labeled as -1). Feel free to change these parameters to test how much clustering is affected accordingly.
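The clustering step might be sketched as follows. The data here is a synthetic stand-in with two well-separated groups; only the column names and the ϵ = 0.3, MinPts = 10 setting come from the post:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather data-frame: two groups of 50 "stations"
rng = np.random.default_rng(0)
features = ['Tm', 'Tx', 'Tn', 'xm', 'ym']
data = np.vstack([
    rng.normal(0.0, 0.2, size=(50, 5)),
    rng.normal(10.0, 0.2, size=(50, 5)),
])
weather_df = pd.DataFrame(data, columns=features)

# Standardize so that eps is comparable across features, then cluster
X = StandardScaler().fit_transform(weather_df[features])
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
n_clusters = len(set(labels) - {-1})   # noise is labeled -1
```

On the real weather data this setting yields 8 clusters, as described above; on this toy stand-in it simply recovers the two groups.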

Let’s visualize these clusters using Basemap —

8 Unique Clusters in Canada Based on Few Selected Features in the Weather Data. ϵ and MinPts set to 0.3 and 10 Respectively.

Finally, I included precipitation (‘P’) in the features and repeated the same clustering steps with ϵ and MinPts set to 0.5 and 10. We see some differences from the previous clustering, which gives us an idea of how hard it is to cluster unlabeled data, even with DBSCAN, when we lack domain knowledge.

4 Unique Clusters in Canada Based on Selected Features (now included precipitation compared to previous case) in the Weather Data. ϵ and MinPts set to 0.5 and 10 Respectively.

You can try to repeat the process including some more features, or, change the clustering parameters, to get a better overall knowledge.

Conclusion:

To conclude this post, we went through some basic concepts of the DBSCAN algorithm and tested it by clustering weather stations in Canada. The detailed code and all the pictures are available on my GitHub. A direct comparison with K-Means clustering can be made to understand the differences between these algorithms even better. Hopefully, this will help you get started with one of the most widely used clustering algorithms for unsupervised problems.

References :

Wikipedia: https://en.wikipedia.org/wiki/DBSCAN

Sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Density-Based Clustering Methods; Gao, J., Associate Professor, University at Buffalo (presentation slides).
