A Study of Hierarchical Clustering: Unsupervised Machine Learning

Elias Hossain
Published in Analytics Vidhya · 5 min read · Sep 25, 2020
Image source: https://unsplash.com/photos/OgvqXGL7XO4

Another popular method of clustering is hierarchical clustering. We saw in K-means clustering that the number of clusters needs to be stated in advance; hierarchical clustering does not require that. What appears before our eyes in hierarchical clustering is a set of long lines forming groups among themselves. Such a graph is called a dendrogram. We will see a little later what this dendrogram is.

This article discusses the pipeline of hierarchical clustering. Let's get started.

Fig.1: Types of Hierarchical clustering

Hierarchical clustering is of two types, Agglomerative and Divisive. A detailed explanation of each is given below.

Divisive: In this method, the complete dataset is initially assumed to be a single cluster. That cluster is then repeatedly broken down until each data point becomes a separate cluster. It follows a top-down approach.

Agglomerative: Agglomerative is the exact opposite of Divisive and is also called the bottom-up method. In this method, each data point is initially treated as a separate cluster. Then, based on the distances between these clusters, small clusters are formed, and these small clusters in turn merge into larger clusters. See Fig.2 to understand the difference between the top-down and bottom-up approaches.

Fig.2: Agglomerative and Divisive approach | Image source: TowardsDataScience

Agglomerative clustering can be done in several ways: complete linkage, single linkage, average linkage, centroid linkage, and Ward's method. Let's see an explanation of each approach:

Complete Linkage — Clusters are merged based on the maximum (farthest) distance between their data points.
Single Linkage — Clusters are merged based on the minimum (closest) distance between their data points.
Average Linkage — Clusters are merged based on the average distance between all pairs of data points in the two clusters.
Centroid Linkage — Clusters are merged based on the distance between their centroids (cluster centers).
Ward's Method — Clusters are merged so that the increase in within-cluster variance is minimal.
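The linkage criteria above map directly onto scipy's `linkage` function, which the article uses later. A minimal sketch on synthetic data (the array here is made up for illustration) shows how each criterion is selected; "ward" corresponds to Ward's method:

```python
# Sketch: each linkage criterion is just a method name passed to
# scipy.cluster.hierarchy.linkage. Synthetic data for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))  # 10 samples, 4 features

for method in ["complete", "single", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)
    # Each row of Z records one merge: (cluster_a, cluster_b, distance, size)
    print(method, Z.shape)  # always (n_samples - 1, 4)
```

Whatever the criterion, the result is the same shape: n − 1 merges for n samples, differing only in which clusters get merged at which distance.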

Real-life application of Hierarchical clustering:

  • Classify animals and plants based on DNA sequences.
  • Track the spread of epidemics caused by various viruses.

Let's implement hierarchical clustering on the Wholesale customers dataset, which can be found on Kaggle: https://www.kaggle.com/binovi/wholesale-customers-data-set

Let's import the essential libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import normalize          # row-wise scaling
import scipy.cluster.hierarchy as shc                # linkage and dendrograms
from sklearn.cluster import AgglomerativeClustering  # flat cluster labels

Load the dataset:

df = pd.read_csv("C:/Users/elias/Desktop/Data/Dataset/wholesale.csv")
df.head()

We will normalize the whole dataset for the convenience of clustering.

data_scaled = normalize(df)
data_scaled = pd.DataFrame(data_scaled, columns=df.columns)
data_scaled.head()
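As a quick sanity check on what `normalize` does: by default it rescales each row (each sample) to unit L2 norm. A small sketch with made-up numbers standing in for the wholesale columns:

```python
# Sketch of sklearn's normalize: each row is divided by its L2 norm,
# so every sample ends up with norm 1.0. Numbers are illustrative only.
import numpy as np
from sklearn.preprocessing import normalize

raw = np.array([[12669., 9656., 7561.],
                [ 7057., 9810., 9568.]])
scaled = normalize(raw)                # row-wise L2 normalization
print(np.linalg.norm(scaled, axis=1))  # prints [1. 1.]
```

This puts all samples on the same scale, so the distance computations in the dendrogram are not dominated by customers with large absolute spending.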

After scaling, the dataset will look like Fig.3:

Fig.3: Scaling the dataset

Creating a dendrogram of the normalized dataset produces a graph like Fig.4. We have created this dendrogram using the Ward linkage method; we can create dendrograms in other ways if we want. From this dendrogram it can be seen that data points first form small clusters, and these small clusters gradually merge into larger ones.

plt.figure(figsize=(10, 7))  
plt.title("Dendrograms using Ward")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.show()
Fig.4: Dendrogram using the Ward linkage approach
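A dendrogram is only half the story: to actually use the clusters, a cut height must be chosen and converted into flat labels. A minimal sketch with scipy's `fcluster` (on synthetic two-blob data, since the point here is just the mechanism):

```python
# Sketch: cutting a Ward linkage tree at a chosen height with fcluster
# turns the dendrogram into flat cluster labels. Synthetic data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # blob around (0, 0)
               rng.normal(5, 0.3, (20, 2))])  # blob around (5, 5)
Z = linkage(X, method="ward")
# Cut below the top merge but above all within-blob merges:
labels = fcluster(Z, t=4.0, criterion="distance")
print(len(set(labels)))  # prints 2 for this toy data
```

The threshold `t` plays the same role as the horizontal line drawn across a dendrogram: every merge above it is undone, and what remains are the flat clusters.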

Looking at the dendrogram in Fig.4, we can see the smaller clusters gradually forming larger ones. Data points are on the X-axis and cluster distance is on the Y-axis. The two largest clusters, joined by the blue line, merge at a distance of about 7 (above that height no new clusters form and the distance does not increase). We have drawn a line at this distance for the convenience of our understanding. Let's make the dendrogram using another approach, Complete linkage:

plt.figure(figsize=(10, 7))
plt.title("Dendrograms using Complete")
dend1 = shc.dendrogram(shc.linkage(data_scaled, method='complete'))
plt.show()
Fig.5: Dendrograms using complete

Let’s make the dendrograms by using a Single linkage:

plt.figure(figsize=(10, 7))
plt.title("Dendrograms using Single")
dend2 = shc.dendrogram(shc.linkage(data_scaled, method='single'))
plt.show()
Fig.6: Dendrograms using Single

Now we will make it by using Average:

plt.figure(figsize=(10, 7))
plt.title("Dendrograms using Average")
dend3 = shc.dendrogram(shc.linkage(data_scaled, method='average'))
plt.show()
Fig.7: Dendrograms using Average
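With four dendrograms drawn, a natural question is which linkage to trust. One hedged way to compare them quantitatively (sketched here on synthetic data, not the wholesale dataset) is the cophenetic correlation, which measures how faithfully each linkage preserves the original pairwise distances:

```python
# Sketch: cophenetic correlation as a way to compare linkage methods.
# Closer to 1 means the dendrogram better preserves pairwise distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))   # synthetic stand-in for scaled data
dists = pdist(X)               # condensed pairwise distance matrix

for method in ["complete", "single", "average", "ward"]:
    c, _ = cophenet(linkage(X, method=method), dists)
    print(f"{method}: cophenetic correlation = {c:.3f}")
```

On real data the same loop, run with `data_scaled` in place of `X`, gives one number per linkage method to set alongside the visual comparison of Figs. 4 through 7.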

We will now look at the mean values per cluster and channel, so that we understand what kinds of products are sold, on average, in each cluster. First we need to assign each customer a cluster label; we use two clusters, as suggested by the dendrogram in Fig.4.

cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
df['cluster_'] = cluster.fit_predict(data_scaled)
agg_wholesales = df.groupby(['cluster_','Channel'])[['Fresh','Milk','Grocery','Frozen','Detergents_Paper','Delicassen']].mean()
agg_wholesales
Fig.8: Report of Products sold

To conclude, this article illustrated the pipeline of hierarchical clustering and the different types of dendrograms. The usefulness of hierarchical clustering was shown by implementing it on the wholesale dataset, with dendrograms drawn using several linkage methods: Complete linkage, Single linkage, Average linkage, and Ward's method.

References:

  1. https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019
  2. https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/
  3. https://towardsdatascience.com/hierarchical-clustering-in-python-using-dendrogram-and-cophenetic-correlation-8d41a08f7eab

If you would like to see my recent publications, you can follow me on ResearchGate or LinkedIn.

Researchgate: https://www.researchgate.net/profile/Elias_Hossain7

LinkedIn: https://www.linkedin.com/in/elias-hossain-b70678160/


Elias Hossain

I am a Software Engineer. My research interests are diverse, including intelligent systems, and I am eager to learn more about them.