Hierarchical Clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Introduction
Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled data into clusters; it is also known as hierarchical cluster analysis (HCA).
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as a dendrogram. The results of K-means clustering and hierarchical clustering may sometimes look similar, but the two methods work differently. In particular, hierarchical clustering does not require the number of clusters to be predetermined, as the K-means algorithm does.
The hierarchical clustering technique has two approaches:
- Agglomerative: A bottom-up approach in which the algorithm starts by treating every data point as its own cluster and keeps merging the closest clusters until only one cluster is left.
- Divisive: The reverse of the agglomerative algorithm; it is a top-down approach that starts with all data points in a single cluster and recursively splits it.
Agglomerative Hierarchical Clustering
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data into clusters, it follows the bottom-up approach: the algorithm treats each data point as a single cluster at the beginning and then starts combining the closest pairs of clusters. It does this until all the clusters are merged into a single cluster that contains the entire dataset.
This hierarchy of clusters is represented in the form of the dendrogram.
How Does Agglomerative Hierarchical Clustering Work?
Step-1: Create each data point as a single cluster. Let’s say there are N data points, so the number of clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left (a minimal code sketch of this merge loop is shown after these steps).
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram and cut it to divide the data into the clusters required by the problem.
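The merge loop in these steps can be sketched in a few lines of Python. The snippet below is only a minimal illustration on a tiny made-up dataset using single linkage; the points and the single_linkage helper are illustrative and are not part of the tutorial's dataset.
import numpy as np

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
clusters = [[i] for i in range(len(points))]  # Step 1: every point starts as its own cluster

def single_linkage(a, b):
    # Smallest pairwise distance between a member of cluster a and a member of cluster b
    return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

while len(clusters) > 1:  # Steps 2-4: keep merging the two closest clusters
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda p: single_linkage(clusters[p[0]], clusters[p[1]]))
    print("Merging", clusters[i], "and", clusters[j])
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]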
Measuring the distance between two clusters
As we have seen, the distance between the two closest clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and the chosen measure decides the rule for merging. These measures are called linkage methods. Some of the popular linkage methods are given below (a small code sketch comparing them follows the list):
Single Linkage: The shortest distance between the closest points of the two clusters.
Complete Linkage: The distance between the two farthest points of two different clusters. It is a popular linkage method, as it tends to form tighter clusters than single linkage.
Average Linkage: The distance between every pair of points, one from each cluster, is summed and divided by the number of pairs to give the average distance between the two clusters. It is also one of the most popular linkage methods.
Centroid Linkage: The distance between the centroids of the two clusters.
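To make these definitions concrete, the short sketch below computes all four measures for two tiny made-up clusters (the coordinates are illustrative only):
import numpy as np

cluster_a = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9]])
cluster_b = np.array([[4.0, 4.2], [4.5, 3.8]])

# All pairwise Euclidean distances between points of the two clusters
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

print("Single linkage  :", pairwise.min())   # closest pair of points
print("Complete linkage:", pairwise.max())   # farthest pair of points
print("Average linkage :", pairwise.mean())  # mean over all pairs
print("Centroid linkage:", np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))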
Working of the Dendrogram in Hierarchical Clustering
The dendrogram is a tree-like structure that records each merge step the HC algorithm performs. In the dendrogram plot, the y-axis shows the Euclidean distances between the data points, and the x-axis shows all the data points of the given dataset.
The working of the dendrogram can be explained with a two-part diagram: the left part shows how clusters are created by agglomerative clustering, and the right part shows the corresponding dendrogram.
- As discussed above, the data points P2 and P3 combine first and form a cluster; correspondingly, a link connecting P2 and P3 is drawn in the dendrogram. Its height is decided by the Euclidean distance between the two data points.
- In the next step, P5 and P6 form a cluster, and the corresponding link is drawn. It is higher than the previous one, as the Euclidean distance between P5 and P6 is slightly greater than that between P2 and P3.
- Again, two new links are created: one that combines P1, P2, and P3, and another that combines P4, P5, and P6.
- At last, the final dendrogram is created that combines all the data points together.
Let’s start implementing this using Python in Google Colab.
Import libraries
import pandas as pd
import numpy as np

# Upload Mall_Customers.csv from the local machine into the Colab session
from google.colab import files
uploaded = files.upload()
Load the dataset
import io

# Read the uploaded CSV bytes into a pandas DataFrame
train_data = pd.read_csv(io.StringIO(uploaded['Mall_Customers.csv'].decode('utf-8')))
Displaying the first 5 records from the dataset
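The first five records can be displayed with pandas' head() method; this assumes the file has the standard Mall_Customers.csv columns (CustomerID, Gender/Genre, Age, Annual Income (k$), Spending Score (1-100)).
train_data.head()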
Determine the X values
We take the annual income and spending score columns as features.
X = train_data.iloc[:, 3:].values  # columns 3 and 4: Annual Income (k$) and Spending Score (1-100)
X
Finding the optimal number of clusters using Dendrogram
Now we will find the optimal number of clusters using a dendrogram for our model. For this, we are going to use the SciPy library, as it provides a function that directly builds and plots the dendrogram for us.
from matplotlib import pyplot as mtp
import scipy.cluster.hierarchy as shc

# Build the linkage matrix with Ward's method and plot the dendrogram
dendro = shc.dendrogram(shc.linkage(X, method="ward"))
mtp.title("Dendrogram Plot")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()
Using this dendrogram, we will now determine the optimal number of clusters for our model. For this, we look for the longest vertical line that is not crossed by any horizontal line and imagine a horizontal cut through it; the number of vertical lines the cut intersects is the number of clusters. Here that gives 5 clusters, and we will train the model with this value in the next step.
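If you want to see the cut on the plot, you can redraw the dendrogram with a horizontal threshold line. The threshold value used below (150) is only an illustrative guess; read the actual value off your own plot so that the line crosses exactly five vertical lines.
dendro = shc.dendrogram(shc.linkage(X, method="ward"))
mtp.axhline(y=150, color='red', linestyle='--')  # illustrative threshold; adjust from your own plot
mtp.title("Dendrogram Plot with Cut")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()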
Applying hierarchical clustering model
from sklearn.cluster import AgglomerativeClustering
hc= AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
y_pred= hc.fit_predict(X)
y_pred
The AgglomerativeClustering model takes the following parameters:
- n_clusters=5: It defines the number of clusters; we have taken 5 here because it is the optimal number found above.
- affinity='euclidean': The metric used to compute the linkage (see the note after this list for newer scikit-learn versions).
- linkage='ward': It defines the linkage criterion; here we have used the 'ward' linkage, the popular method we already used for creating the dendrogram. Ward linkage merges the pair of clusters that gives the smallest increase in total within-cluster variance.
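Note that newer scikit-learn releases have renamed the affinity parameter of AgglomerativeClustering to metric, and recent versions no longer accept affinity at all. If the call above raises an error on your version, the equivalent is:
hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')  # metric replaces affinity
y_pred = hc.fit_predict(X)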
Visualization
mtp.scatter(X[y_pred == 0, 0], X[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
mtp.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
mtp.scatter(X[y_pred == 2, 0], X[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
mtp.scatter(X[y_pred == 3, 0], X[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
mtp.scatter(X[y_pred == 4, 0], X[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
DBSCAN clustering
Fundamentally, all clustering methods use the same approach: first we calculate similarities (or distances), and then we use them to group the data points into clusters. Here we will focus on the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) method.
Density-Based Clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes from a large amount of data that contains noise and outliers.
Algorithm Parameters
The DBSCAN algorithm uses two parameters:
- minPts: The minimum number of points (a threshold) clustered together for a region to be considered dense.
- eps (ε): A distance measure that will be used to locate the points in the neighborhood of any point.
These parameters can be understood if we explore two concepts called Density Reachability and Density Connectivity.
Reachability in terms of density establishes a point to be reachable from another if it lies within a particular distance (eps) from it.
Connectivity, on the other hand, involves a transitivity based chaining-approach to determine whether points are located in a particular cluster. For example, p and q points could be connected if p->r->s->t->q, where a->b means b is in the neighbourhood of a.
There are three types of points after the DBSCAN clustering is complete:
- Core — A point that has at least minPts points (including itself) within distance eps of itself.
- Border — A point that has at least one core point within distance eps, but fewer than minPts points within distance eps of itself.
- Noise — A point that is neither a core nor a border point: it has fewer than minPts points within distance eps and is not reachable from any core point. (A small code sketch of these rules follows the list.)
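As a rough sketch of these definitions (not scikit-learn's implementation), the short function below labels each point of a small 2-D array as core, border, or noise using brute-force neighbourhood counts; classify_points is just an illustrative helper name.
import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbour_counts = (dists <= eps).sum(axis=1)  # counts include the point itself
    core = neighbour_counts >= min_pts
    # Border: not a core point, but within eps of at least one core point
    border = ~core & ((dists <= eps) & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise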
How does it work?
- Pick an arbitrary data point p as your first point.
- Mark p as visited.
- Extract all points present in its neighborhood (up to eps distance from p), and call this set nb.
- If nb contains at least minPts points, then:
a. Consider p as the first point of a new cluster.
b. Consider all points within eps distance of p (the members of nb) as other points in this cluster.
c. Repeat step b for every point in nb that is itself a core point, expanding the cluster.
Otherwise (nb has fewer than minPts points), label p as noise.
- Repeat steps 1-5 until the entire dataset has been labelled, i.e. the clustering is complete.
After the algorithm is executed, we should ideally have the dataset separated into a number of clusters, plus some points labelled as noise that do not belong to any cluster. A brute-force sketch of this labelling loop is shown below.
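The labelling loop above can be written down directly. The function below (dbscan_sketch is a made-up name) is a brute-force illustration of the listed steps, not the optimized implementation used by scikit-learn; it returns -1 for noise and 0, 1, 2, ... as cluster ids.
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)                       # -1 means noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    cluster_id = -1
    for p in range(n):                            # step 1: pick an arbitrary unvisited point
        if visited[p]:
            continue
        visited[p] = True                         # step 2: mark p as visited
        nb = list(np.where(dists[p] <= eps)[0])   # step 3: its eps-neighbourhood
        if len(nb) < min_pts:
            continue                              # else branch of step 4: leave p as noise
        cluster_id += 1
        labels[p] = cluster_id                    # step 4a: p starts a new cluster
        while nb:                                 # steps 4b-4c: expand through the neighbourhood
            q = nb.pop()
            if labels[q] == -1:
                labels[q] = cluster_id
            if not visited[q]:
                visited[q] = True
                q_nb = np.where(dists[q] <= eps)[0]
                if len(q_nb) >= min_pts:          # q is itself a core point, keep expanding
                    nb.extend(q_nb)
    return labels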
Example
These point types may be better explained with a visualization. In the illustration described here, minPts is 4. The red points are core points because there are at least 4 points within their surrounding area of radius eps; this area is shown by the circles in the figure. The yellow points are border points because they are reachable from a core point but have fewer than 4 points within their own neighbourhood. Reachable means being in the surrounding area of a core point. The points B and C have only two points (including the point itself) within their neighbourhood, i.e. the surrounding area of radius eps. Finally, N is an outlier because it is not a core point and cannot be reached from any core point.
Implementation using Python
Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
Applying DBSCAN algorithm
X, y = make_blobs(n_samples=500, n_features=2,
centers=4, cluster_std=1,
center_box=(-10.0, 10.0),
shuffle=True, random_state=1)
X = StandardScaler().fit_transform(X)  # standardize the features so eps is on a comparable scale
y_pred = DBSCAN(eps=0.3, min_samples=30).fit_predict(X)  # points labelled -1 are treated as noise
plt.scatter(X[:,0], X[:,1], c=y_pred)
print('Number of clusters: {}'.format(len(set(y_pred[np.where(y_pred != -1)]))))
print('Homogeneity: {}'.format(metrics.homogeneity_score(y, y_pred)))
It is clear that 4 clusters are detected, as expected. The black points are outliers or noise according to the DBSCAN model. Since we generated the dataset ourselves and know the ground truth, we can use metrics like the homogeneity score (checking whether each cluster contains only members of a single class) and the completeness score (checking whether all members of a class are assigned to the same cluster).
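The completeness score mentioned above comes from the same metrics module that already provides the homogeneity score:
print('Completeness: {}'.format(metrics.completeness_score(y, y_pred)))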
I hope you now have a better idea of hierarchical clustering and DBSCAN.
Thanks for reading, I hope you enjoyed it!