Data Science (Python) :: Hierarchical Clustering

Sunil Kumar SV
3 min read · Aug 9, 2017


The intention of this post is to give a quick refresher on "Hierarchical Clustering" (it's assumed that you are already familiar with the basics). You can also treat this as a set of FAQs or interview questions.

How many types of Hierarchical clustering are there?

There are 2 types :: Agglomerative (bottom-up approach) and Divisive (top-down approach).

*****************************************

What’s the working behind Hierarchical clustering?

First step is to make each data point a single-point cluster. Second step is to take the 2 closest clusters and merge them into one cluster. Third step is to repeat the second step until only one cluster remains.
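The merging steps above can be sketched with SciPy's `linkage` function, which records each pairwise merge. The toy points below are hypothetical, just for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: each row is one 2-D data point
X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.2], [9.0, 9.0]])

# Agglomerative clustering: start with 5 single-point clusters,
# then repeatedly merge the 2 closest clusters.
Z = linkage(X, method='single')  # 'single' = distance between closest members

# Each row of Z describes one merge: [cluster_i, cluster_j, distance, new size]
# n points need exactly n - 1 merges before a single cluster remains.
print(Z.shape)      # (4, 4)
print(Z[-1, 3])     # 5.0 -> the final cluster contains all 5 points
```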

*****************************************

How do dendrograms work?

A dendrogram is a plot of the Euclidean distance at which clusters are merged. As per the explanation in 'working behind hierarchical clustering', we plot the data points on the X-axis and the Euclidean (merge) distance on the Y-axis, recording each merge as it happens. Please read the Wikipedia article for a better understanding. https://en.wikipedia.org/wiki/Dendrogram

*****************************************

By looking at the dendrogram, how can you know how many clusters to have?

In a dendrogram, look for the longest vertical line that doesn't cross any horizontal line. Draw a horizontal line through it; the number of vertical lines this horizontal line crosses is the suggested number of clusters for the data.
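The longest uncrossed vertical line corresponds to the largest gap between successive merge distances, so this rule can be approximated numerically. A minimal sketch, using hypothetical toy data of three well-separated pairs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three well-separated pairs of points
X = np.array([[1.0, 1.0], [1.2, 0.9],
              [5.0, 5.0], [5.1, 5.2],
              [9.0, 1.0], [9.2, 1.1]])

Z = linkage(X, method='ward')
merge_dists = Z[:, 2]              # merge distances, in ascending order
gaps = np.diff(merge_dists)        # vertical gaps in the dendrogram
k = len(X) - (np.argmax(gaps) + 1) # cutting inside the largest gap leaves k clusters
labels = fcluster(Z, t=k, criterion='maxclust')
print(k)  # 3 clusters for these three well-separated pairs
```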

*****************************************

Sample code for plotting a dendrogram?

# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# X is the feature matrix of the dataset
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('<Title of the plot>')
plt.xlabel('<Label for X-axis>')
plt.ylabel('Euclidean distance')
plt.show()

*****************************************

Sample code for implementing Agglomerative Clustering?

# Fitting hierarchical clustering to the mall dataset
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Note: in scikit-learn >= 1.2 the 'affinity' parameter is renamed 'metric'
hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(X)

# Visualising the clusters
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1–100)')
plt.legend()
plt.show()

***************************************

K-Means clustering depends on WCSS (within-cluster sum of squares). Similarly, what does Hierarchical clustering depend on?

Within-cluster variance (this is the criterion minimised by Ward linkage).
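Ward linkage merges, at each step, the pair of clusters whose merge causes the smallest increase in total within-cluster variance. A minimal sketch of that criterion, on hypothetical toy clusters:

```python
import numpy as np

def within_cluster_ss(points):
    """Sum of squared distances of points to their centroid."""
    centroid = points.mean(axis=0)
    return ((points - centroid) ** 2).sum()

def merge_cost(p, q):
    """Ward's criterion: increase in within-cluster sum of squares after a merge."""
    return within_cluster_ss(np.vstack([p, q])) - within_cluster_ss(p) - within_cluster_ss(q)

# Hypothetical clusters: a and b are close together, c is far away
a = np.array([[1.0, 1.0], [1.2, 1.1]])
b = np.array([[1.1, 0.9], [0.9, 1.0]])
c = np.array([[8.0, 8.0], [8.1, 7.9]])

# Merging the two nearby clusters increases variance far less
print(merge_cost(a, b) < merge_cost(a, c))  # True
```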

*****************************************

Which clustering performs better for large datasets, Hierarchical or K-Means?

K-Means

****************************************

What can be used to find the ideal number of clusters in Hierarchical clustering?

Dendrograms

****************************************

Advantages and disadvantages of K-Means clustering and Hierarchical clustering.

K-Means Clustering
Advantages ::
> Simple to understand
> Easily adaptable
> Efficient & Performant
> Works well on both small and large datasets

Disadvantages ::
> Need to choose the number of clusters

Hierarchical Clustering
Advantages ::
> We can obtain the optimal number of clusters from the model itself
> Dendrograms help with visualisation, which is practical and easy to understand

Disadvantages ::
> Not suitable for large datasets

*****************************************

Prev :: Data Science (Python) :: K-Means Clustering

If you liked this article, please hit the ❤ icon below
