Unsupervised Learning: Hierarchical Clustering and DBSCAN

A gentle introduction with an implementation on a customer dataset

Alifia C Harmadi
Analytics Vidhya
7 min read · Aug 15, 2021


Source: Google Images

There are lots of methods in machine learning to group data points for further analysis based on similarity. As data scientists or data analysts, we know this approach as clustering. The most commonly used clustering algorithm is K-means. It is popular because it is simple and powerful: it computes centroids and iterates until it finds optimal centroids to group the data points, with the number of groups represented by K.
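For reference, this is roughly what K-means looks like in scikit-learn (a toy sketch I am adding here for comparison, not part of the customer analysis):

#A minimal K-means sketch on toy data; K is chosen up front
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [10, 2], [10, 4]])
kmeans = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit(X)
print(kmeans.labels_)           # which cluster each point belongs to
print(kmeans.cluster_centers_)  # the final centroids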

Yet, I am not going to discuss K-means here. I am going to explain two other clustering algorithms: Hierarchical Clustering and DBSCAN. Some of you might already know these two algorithms, but for me they are something new and I am obsessed with them. Anyway, let’s start with the first one!!

Hierarchical Clustering

This algorithm is as popular as the K-means algorithm. In K-means clustering, we need to define the number of clusters, K, before doing the clustering, while in Hierarchical Clustering the algorithm itself suggests potential numbers of clusters for us to choose from, using a tree-shaped diagram called a dendrogram.

What is Dendrogram?

A dendrogram is a diagram that shows the hierarchical relationship between objects. By using it, we can decide on the number of clusters to use in the next steps.

Source: Towards Data Science
Source: Google Images

We can see that there are two variables in the last diagram. The first one is the data points themselves (sample index) and the second one is distance, which here represents Euclidean distance. From the diagram, we can see that overall there is one big cluster. However, if we cut the dendrogram at a distance of 60, there would be only three clusters (the green one, the red one containing numbers 3, 4 and 2, and the last red one). We should cut at some distance because if the distance is too big, the data points within a cluster might have no real similarity.
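If you want to see what “cutting” the dendrogram means in code, here is a small illustrative sketch (toy data and a toy threshold, not the article’s dataset) using SciPy’s fcluster:

#Illustrative sketch: cut a dendrogram at a chosen distance to obtain flat clusters
import numpy as np
import scipy.cluster.hierarchy as sch

X = np.random.rand(20, 2)                          # toy data, stand-in for real features
linkage_matrix = sch.linkage(X, method = 'ward')   # same linkage used for the dendrogram
labels = sch.fcluster(linkage_matrix, t = 1.0, criterion = 'distance')  # cut at distance 1.0
print(np.unique(labels))                           # the resulting cluster ids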

There are two hierarchical clustering techniques:

1. Agglomerative Hierarchical clustering

It is a bottom-up approach: initially, each data point is considered a cluster of its own, and the most similar data points or clusters are merged in further iterations until one cluster, or K clusters, are formed.

Source: Geeks for Geeks

2. Divisive Hierarchical clustering (DIANA)

In contrast, DIANA is a top-down approach: it assigns all of the data points to a single cluster and then splits the cluster into the two least similar clusters. We proceed recursively on each cluster until there is one cluster for each observation.

Source: Geeks for Geeks

Implementation

1. Visualise the dendrogram of our data
#Check the dendrogram
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

plt.figure(figsize = (12, 8))
#X_pc is assumed to be the customer data prepared earlier (e.g. after PCA)
dendrogram = sch.dendrogram(sch.linkage(X_pc, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Distance')
plt.show()
Dendrogram Diagram

2. Use Agglomerative Clustering to Cluster the data
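The exact code for this step is not reproduced here, but a minimal sketch with scikit-learn’s AgglomerativeClustering, assuming X_pc is the customer data prepared earlier and that the dendrogram suggested five clusters (a placeholder number), could look like this:

#Cluster the data with Agglomerative Clustering (Ward linkage, as in the dendrogram)
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')  # 5 is a placeholder value
y_pc_hc = hc.fit_predict(X_pc)                                  # cluster label per customer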

3. Append the clustering result to a new dataset
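Again, only a sketch of what this step might look like (the column names are assumptions; X_pc and y_pc_hc come from the previous steps):

#Put the components and their cluster labels into one DataFrame
import pandas as pd

X_pc_hc = pd.DataFrame(X_pc, columns = ['PC1', 'PC2', 'PC3'])  # assuming three components
X_pc_hc['cluster'] = y_pc_hc                                    # append the cluster labels
X_pc_hc.head()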

Result:

Clustering Visualisation

Why Hierarchical Clustering?

Not having to pre-define the number of clusters gives it quite an edge over algorithms such as K-means. Yet, it does not work well when we have a huge amount of data.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Now that we have already talked about the partitioning method (K-means) and hierarchical clustering, we are going to talk about density-based spatial clustering of applications with noise (DBSCAN). Most clustering algorithms look for spherical-shaped clusters and are severely affected by the presence of noise and outliers in the data. Thus, the clustering result is sometimes not very good for real-life data, which contains a lot of noise.

Source: Ryan Wingate

Here, DBSCAN comes to solve this issue. Clusters are dense regions in the data space, separated by regions of lower point density. DBSCAN is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.

DBSCAN Parameters

1. Eps

Eps is the maximum radius of the neighbourhood. It defines the neighbourhood around a data point, i.e. if the distance between two points is lower than or equal to ‘eps’, they are considered neighbours. One way to find the eps value is based on the k-distance graph.

Source: Google Images with some edits by the Author

If the eps value is too small, then a large part of the data will be considered outliers. If it is very large, then the clusters will merge and the majority of the data points will end up in the same cluster.
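A sketch of that k-distance graph using scikit-learn’s NearestNeighbors (illustrative; X_pc is assumed to be the prepared data and k is usually set to MinPts):

#Plot the k-distance graph to help pick eps; look for the "elbow"
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 5                                              # usually set to MinPts
nbrs = NearestNeighbors(n_neighbors = k).fit(X_pc)
distances, _ = nbrs.kneighbors(X_pc)               # distances to the k nearest neighbours
k_distances = np.sort(distances[:, -1])            # distance to the k-th neighbour, sorted
plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to k-th nearest neighbour')
plt.show()                                         # eps is roughly the distance at the elbow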

2. MinPts

MinPts is the minimum number of points in the Eps-neighbourhood (within the eps radius) of a point. The larger the dataset, the larger the value of MinPts that must be chosen. A rule of thumb is to derive minPts from the number of dimensions D in the data set: minPts >= D + 1. For 2D data, take minPts = 4.

Source: Geeks for Geeks

Core: a point that has at least MinPts points (including itself) within distance eps.

Border: a point that is not a core point but has at least one core point within distance eps.

Noise: a point that is neither a core nor a border point. It is just like an outlier: it has fewer than MinPts points within distance eps and no core point nearby.

Those are the 3 types of data points that we have in this algorithm.
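To make these three types concrete, here is a small illustrative sketch (toy data, not the customer dataset) that labels each point as core, border or noise using scikit-learn’s DBSCAN:

#Illustrative sketch: classify each point as core, border or noise
import numpy as np
from sklearn.cluster import DBSCAN

X_toy = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9],  # a dense blob
                  [1.5, 1.4],                                       # near the blob
                  [8.0, 8.0]])                                      # far away
db = DBSCAN(eps = 0.5, min_samples = 3).fit(X_toy)

core_mask = np.zeros(len(X_toy), dtype = bool)
core_mask[db.core_sample_indices_] = True           # indices of the core points
for i, (is_core, label) in enumerate(zip(core_mask, db.labels_)):
    kind = 'core' if is_core else ('noise' if label == -1 else 'border')
    print(i, kind)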

Implementation

1. Fit the data to the DBSCAN algorithm
#Use DBSCAN; the label -1 means noise/outliers
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps = 10, min_samples = 5)
y_pc_db = dbscan.fit_predict(X_pc)
y_pc_db

Result:

array([-1,  0, -1,  0, -1,  0, -1, -1, -1,  0, -1, -1, -1,  0, -1,  0, -1,
0, -1, -1, -1, 0, -1, 0, -1, 0, -1, -1, -1, 0, -1, 0, -1, -1,
-1, 0, -1, 0, -1, 0, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, -1, 2, -1, 2, 1, 2, -1, 2, -1, 2, -1, 2, -1, 2,
-1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, 3, 2, 3,
2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, 3, 2, 3, -1,
3, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1,
-1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])

You can notice that the clustering result contains values of -1. These are the noise points in the dataset.
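If you want to count the noise points, a quick check (not part of the original code) is:

#Count the points labelled as noise (-1)
import numpy as np
print(np.sum(y_pc_db == -1))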

2. Append the clustering result to a new dataset
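A sketch of what this step might look like (the column names and plotting details are assumptions; X_pc and y_pc_db come from the steps before):

#Append the DBSCAN labels and draw a 3D scatter plot of the clusters
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

X_pc_db = pd.DataFrame(X_pc, columns = ['PC1', 'PC2', 'PC3'])  # assuming three components
X_pc_db['cluster'] = y_pc_db

fig = plt.figure(figsize = (10, 8))
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(X_pc_db['PC1'], X_pc_db['PC2'], X_pc_db['PC3'], c = X_pc_db['cluster'])
plt.show()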

Result:

3D Clustering Visualisation

Remember that we had -1 values before; those blue points are the -1 values (the noise). Here, we have four clusters. We can check whether this is true using the code below.

target = X_pc_db['cluster']
print(target.nunique()) # number of clusters

Result:

4

Why DBSCAN?

It can automatically detect the number of clusters based on your input data and parameters. DBSCAN can handle noise and outliers: all the outliers are identified and marked without being assigned to any cluster. Therefore, DBSCAN can also be used for anomaly detection (outlier detection). However, this algorithm still struggles with heavily cluttered datasets.

I am planning to explain another clustering algorithm which combines these two methods, hierarchical clustering and DBSCAN. Can you imagine how powerful that algorithm will be?

Let me give you a hint: the name of this algorithm starts with H and has the word DBSCAN in it. H stands for Hierarchical and DBSCAN is for, you know, the DBSCAN algorithm. So, it will not be published very soon, as I need to get in the mood to write it. But hopefully, you have something to take away after reading this explanation of Hierarchical Clustering and DBSCAN.


A philomath & data engineer. Passionate about ML, DL & data storytelling.