Unsupervised Learning: Hierarchical Clustering and DBSCAN
A Gentle Introduction with an Implementation on a Customer Dataset
There are many methods in machine learning for grouping data points by similarity for further analysis. As data scientists or data analysts, we know this task as clustering. The most commonly used clustering algorithm is K-Means. It is so popular because it is simple and powerful: it computes centroids and iterates until it finds optimal centroids that group the data points into a number of groups represented by K.
Yet, I am not going to discuss K-Means here. I am going to explain two other clustering algorithms: Hierarchical Clustering and DBSCAN. Some of you might already know these two algorithms, but for me they are something new and I am fascinated by them. Anyway, let's start with the first one!
Hierarchical Clustering
This algorithm is also popular, like K-Means. In K-Means clustering we need to define the number of clusters, K, before doing the clustering, while in Hierarchical Clustering the algorithm itself finds the potential numbers of clusters for us to choose from, using a tree-shaped diagram called a dendrogram.
What is Dendrogram?
A dendrogram is a diagram that shows the hierarchical relationship between objects. By reading it, we can choose the number of clusters to use in the next steps.
We can see that there are two axes in the diagram above. The first is the data points themselves (the sample index) and the second is distance, which here represents Euclidean distance. From the diagram, we can see that overall there is one big cluster. However, if we cut at a distance of 60, there would be only three clusters (the green one, the red one containing points 3, 4 and 2, and the remaining red one). We should cut at some distance because if the distance is too big, the data points within a cluster might have no real similarity.
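This "cut the tree at a distance" idea can be done in code as well as by eye. The sketch below uses a toy dataset (a hypothetical stand-in for the article's `X_pc`) and SciPy's `fcluster` to cut a Ward linkage at a chosen distance threshold; the threshold itself always depends on your data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for the article's X_pc: three well-separated blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal((0, 0), 1, (20, 2)),
               rng.normal((10, 0), 1, (20, 2)),
               rng.normal((0, 10), 1, (20, 2))])

# Build the same Ward linkage used for the dendrogram, then "cut" the
# tree at a chosen distance; every merge above the cut is discarded
Z = linkage(X, method='ward')
labels = fcluster(Z, t=15, criterion='distance')  # threshold depends on data
print(np.unique(labels))
```

Every merge in the dendrogram higher than `t` is undone, so the number of flat clusters is exactly the number of vertical lines the cut crosses.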
There are two hierarchical clustering techniques:
1. Agglomerative Hierarchical clustering
It is a bottom-up approach: initially, each data point is considered a cluster of its own, and in each iteration the most similar data points or clusters are merged, until one cluster (or K clusters) is formed.
2. Divisive Hierarchical clustering (DIANA)
In contrast, DIANA is a top-down approach: it assigns all of the data points to a single cluster and then splits that cluster into the two least similar clusters. We proceed recursively on each cluster until there is one cluster for each observation.
Implementation
1. Visualise the dendrogram of our data
#Check the dendrogram
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt
plt.figure(figsize = (12, 8))
dendrogram = sch.dendrogram(sch.linkage(X_pc, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Distance')
plt.show()
2. Use Agglomerative Clustering to cluster the data
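A minimal sketch of this step, using a toy dataset as a hypothetical stand-in for the article's `X_pc` (the real feature matrix is not shown here). `n_clusters` would come from reading the dendrogram:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in for the article's X_pc features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (20, 2)) for c in [(0, 0), (6, 0), (0, 6)]])

# n_clusters chosen from the dendrogram; ward minimises within-cluster variance
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
y_hc = hc.fit_predict(X)
print(np.bincount(y_hc))  # points per cluster
```

`fit_predict` returns one integer label (0 to n_clusters-1) per row, which we can attach to the original data in the next step.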
3. Append the result of the clustering to a new dataset
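One way this step could look, assuming the data lives in a pandas DataFrame (the column names below are made up for illustration; the article's real customer columns may differ):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Hypothetical customer features; the article's real columns may differ
df = pd.DataFrame({'income':   [15, 16, 80, 82, 40, 42],
                   'spending': [80, 78, 10, 12, 50, 48]})

labels = AgglomerativeClustering(n_clusters=3).fit_predict(df)
df_hc = df.copy()
df_hc['cluster'] = labels  # append the cluster label as a new column
print(df_hc)
```

Copying the frame before appending keeps the original data untouched, which is handy if you want to compare several clusterings side by side.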
Result:
Why Hierarchical Clustering?
Not having to pre-define the number of clusters gives it quite an edge over algorithms such as K-Means. However, it does not work well when we have a huge amount of data.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Now that we have talked about a partitioning method (K-Means) and hierarchical clustering, we are going to talk about density-based spatial clustering of applications with noise (DBSCAN). Most clustering algorithms look for spherical-shaped clusters and are severely affected by the presence of noise and outliers in the data. Thus, their results are sometimes not very good on real-life data, which contains lots of noise.
Here, DBSCAN comes to solve this issue. Clusters are dense regions in the data space, separated by regions of lower point density. DBSCAN is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
DBSCAN Parameters
1. Eps
Eps is the maximum radius of the neighbourhood. It defines the neighbourhood around a data point: if the distance between two points is lower than or equal to eps, they are considered neighbours. One way to find the eps value is the k-distance graph.
If the eps value is too small, a large part of the data will be considered outliers. If it is very large, clusters will merge and the majority of the data points will end up in the same cluster.
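The k-distance graph mentioned above can be built with a nearest-neighbour query: sort every point's distance to its k-th nearest neighbour and look for a sharp "elbow" in the curve, which is a common heuristic for eps. A sketch on made-up data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))  # hypothetical feature matrix

# Distance from each point to its k-th nearest neighbour, sorted ascending
k = 4  # often set equal to MinPts
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)   # neighbour 0 is the point itself
k_distances = np.sort(distances[:, -1])
print(k_distances[0], k_distances[-1])
```

Plotting `k_distances` (e.g. with `plt.plot`) reveals the elbow; the distance at the elbow is a reasonable starting value for eps.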
2. MinPts
MinPts is the minimum number of points in the eps-neighbourhood (within the eps radius) of a point. The larger the dataset, the larger the value of MinPts that should be chosen. A rule of thumb is to derive MinPts from the number of dimensions D in the dataset: MinPts >= D + 1. For 2D data, take MinPts = 4.
Core: a point that has at least MinPts points within distance eps of itself.
Border: a point that has at least one core point within distance eps, but fewer than MinPts points in its own eps-neighbourhood.
Noise: a point that is neither a core point nor a border point. It is essentially an outlier: it has fewer than MinPts points within distance eps and no core point nearby.
These are the three types of data points in this algorithm.
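The three point types can be recovered from scikit-learn's `DBSCAN`, which exposes core points via `core_sample_indices_` and marks noise with label -1. A small sketch on synthetic data (one dense blob plus one isolated point):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# One dense blob plus a single far-away point (which should become noise)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               [[5.0, 5.0]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_

core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True   # core points
noise = labels == -1                   # noise points
border = ~core & ~noise                # in a cluster, but not core
print(core.sum(), border.sum(), noise.sum())
```

Every point falls into exactly one of the three masks, which mirrors the definitions above.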
Implementation
1. Fit the data to the DBSCAN algorithm
#Use DBSCAN; a label of -1 means outlier
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps = 10, min_samples = 5)
y_pc_db = dbscan.fit_predict(X_pc)
y_pc_db
Result:
array([-1, 0, -1, 0, -1, 0, -1, -1, -1, 0, -1, -1, -1, 0, -1, 0, -1,
0, -1, -1, -1, 0, -1, 0, -1, 0, -1, -1, -1, 0, -1, 0, -1, -1,
-1, 0, -1, 0, -1, 0, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, -1, 2, -1, 2, 1, 2, -1, 2, -1, 2, -1, 2, -1, 2,
-1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, 3, 2, 3,
2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, 3, 2, 3, -1,
3, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -1,
-1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
You may notice that the clustering result contains values of -1. These are the noise points in the dataset.
2. Append the result of the clustering to a new dataset
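As with the hierarchical case, this step could be sketched as below. The `X_pc` here is a made-up stand-in for the article's PCA features, and `eps`/`min_samples` are tuned to the toy data, not the real customer dataset:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical principal-component features standing in for X_pc
rng = np.random.default_rng(3)
X_pc = pd.DataFrame(np.vstack([rng.normal((0, 0), 0.5, (25, 2)),
                               rng.normal((10, 10), 0.5, (25, 2))]),
                    columns=['pc1', 'pc2'])

y_pc_db = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_pc)
X_pc_db = X_pc.copy()
X_pc_db['cluster'] = y_pc_db  # -1 marks noise points
print(X_pc_db['cluster'].value_counts())
```

Once the labels are a column, the usual pandas tooling (`value_counts`, `groupby`, plotting by hue) works directly on the clustered dataset.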
Result:
Remember that we had -1 values before; the blue points are those -1 values (the noise). Here, we have four clusters. We can verify this using the code below.
target = X_pc_db['cluster']
print(target[target != -1].nunique()) # number of clusters, excluding the -1 noise label
Result:
4
Why DBSCAN?
It can automatically detect the number of clusters based on your input data and parameters. DBSCAN can also handle noise and outliers: all outliers are identified and marked without being assigned to any cluster. Therefore, DBSCAN can also be used for anomaly detection (outlier detection). However, this algorithm still struggles on heavily cluttered datasets.
I am planning to explain another clustering algorithm which combines these two methods, hierarchical clustering and DBSCAN. Can you imagine how powerful that algorithm will be?
Let me give you a hint: the name of the algorithm starts with H and contains the word DBSCAN. H stands for Hierarchical, and DBSCAN is for, you know, the DBSCAN algorithm. It will not be published very soon, as I need to get in the mood to write it. But hopefully you have something to take away after reading this explanation of Hierarchical Clustering and DBSCAN.