Everything to know about Hierarchical Clustering: Agglomerative Clustering & Divisive Clustering.

Chandra Prakash Bathula
16 min read · Jun 25, 2023

Hierarchical Clustering.

Hierarchical Clustering:

  • We have already learnt about “K-Means”, a popular clustering algorithm.
  • The other popular clustering algorithm is “Hierarchical Clustering”.

Before we go on to its applications & benefits, remember that there are two types of “Hierarchical Clustering”. They are:

1. Agglomerative Hierarchical Clustering (typically the most popular) and

2. Divisive Hierarchical Clustering.

Let’s get some understanding of what these two types actually mean, so that we get an idea of the “hierarchical” part of hierarchical clustering.

=> Let’s start with an example.

Suppose we have a bunch of points. The way Agglomerative clustering works is as follows:

  • First, it treats every point as a cluster by itself.
  • Then it takes the two nearest points and groups them into a two-point cluster.
  • Next, it keeps merging the nearest points (or clusters) in the same way.
  • Eventually it reaches just two clusters covering all the points.
  • In the limiting case, it ends with a single large cluster containing everything.
Agglomerative Hierarchical Clustering.

Each point is a cluster by itself at the beginning.

Next, it groups points and clusters together based on some notion of similarity or distance.

=> This is the core idea of Agglomerative clustering.

=> As we keep going, merging cluster after cluster, at the end we reach one single large cluster.

Grouping until single large cluster is formed.

=> This is called Agglomerative because we start with very small clusters, each being just a single point.

“Agglomerative” means to group together. We are grouping together points or clusters to build larger clusters.

Each successive cluster contains a larger number of points. This is how agglomerative clustering works: it starts with individual points and groups them together.

** On the other hand, Divisive clustering works in exactly the opposite way.

Let’s see how it works:

=> Divisive starts off by saying that everything belongs to one cluster containing all the points.

  • In the second iteration, it breaks the single cluster into two smaller clusters; Divisive tries to divide a larger cluster into smaller ones.
  • In the third iteration, it breaks these down into even smaller clusters.
  • It continues like this until every point becomes a cluster by itself.
Divisive Hierarchical Clustering
  • Divisive starts with one big cluster containing all the points and breaks it down into smaller and smaller clusters, continuing until every point becomes a cluster of its own (one cluster for each point).

=> Agglomerative is the other way around.

* Hence, Agglomerative is the most popular variant and is the one typically studied and implemented.

  • One problem with Divisive is how to divide the clusters.

How to divide the clusters is the big question.

Divisive is seldom used; people typically use Agglomerative, because grouping based on similarity or distance is easier than trying to break things apart. This is not to say that it is hard or impossible, but Agglomerative is simply more popular.

=> So, that covers both agglomerative and divisive hierarchical clustering.

Now, I hope, you understood why it is called “Hierarchical Clustering”.

Because, if we think about it, there is a hierarchy in this entire process.

Let’s see what this “hierarchy” looks like.

But remember that the key ingredient for “Agglomerative Clustering” to work is a similarity measure or a distance measure, not just between points but between clusters too.

  • If we are given two clusters, we should be able to quantitatively measure their similarity or their distance, because we are grouping similar or nearby clusters. We need a sense of similarity between clusters; once we have this, the algorithm is straightforward.

* The sequence of merges (or splits) can be recorded as a tree, and hence the name “hierarchy”. This tree, in data mining and in clustering, is often referred to as a “Dendrogram”.

A dendrogram is nothing but a tree that records the sequence of merges in the case of agglomerative clustering, or splits in the case of divisive clustering.

=> One thing we can notice from this structure is that when we do “Agglomerative Clustering” or “Divisive Clustering”, we never specified the number of clusters. Once we build the hierarchy, we can pick the clusters we want from the recorded merges and splits.

  • We can get any number of clusters we want; we do not give the number of clusters as a hyperparameter, and we can stop wherever we want.
  • Recall that in K-Means the number of clusters is a hyperparameter. In Agglomerative clustering, we do not give the number of clusters as a hyperparameter.
  • Once we have the agglomerative hierarchy, we can inspect the clusters visually and increase or decrease their number as needed.
  • This is not the case with K-Means.

Example:

In K-Means, if we have k=3 we get clusters C1, C2, C3, and now, if we want k=7, we need to re-train to get the new clusters. There is no way to get them organically or easily; we have to redo the computation to get the desired output. In the case of Agglomerative clustering it is simple to change the number of clusters, from 2 clusters to 5 clusters and deeper and deeper, very easily, without having to re-compute anything (a small sketch of this is shown right below).
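As a quick, hedged illustration (toy data generated with scikit-learn’s make_blobs; the data and parameters are illustrative), the sketch below builds the hierarchy once with SciPy and then reads off 2, 3 and 7 clusters from the same linkage matrix, without any re-training:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Toy data for illustration only
X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

# One pass builds the full merge hierarchy (here with Ward linkage)
Z = linkage(X, method='ward')

# Cutting the tree at different levels needs no re-training
for k in (2, 3, 7):
    labels = fcluster(Z, t=k, criterion='maxclust')   # at most k clusters
    print(k, "clusters -> sizes:", np.bincount(labels)[1:])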

***This is the most important property of “Hierarchical Clustering”; it makes it popular and a good alternative to K-Means.***

Agglomerative Clustering Algorithm:

  • The first step is to compute a proximity matrix over the given points P1, P2, P3, …, Pn.
  • We have to compute this matrix so that it contains the similarity (or distance) value between every pair of points.

=> Proximity Matrix = Similarity Matrix = Distance Matrix.

  • Proximity basically means how close things are, i.e., how similar they are. Without this matrix we can’t do anything; with it, the agglomerative clustering algorithm is very powerful.

Now, we run a loop as follows:

Merge the two closest clusters,

then update the proximity matrix and repeat until only a single cluster remains.

We ignore the diagonal values, since each of them is a point’s distance from itself; we only need to look at the off-diagonal values of the matrix.

  • If the distance between two points is the smallest, we merge (cluster) them. Once they are merged into one cluster, we need to update the proximity matrix.
  • Now, let’s see how to update the proximity matrix.
  • Initially we have the individual points P1, P2, P3, …, Pn.
  • After merging points, we get an updated proximity matrix (a rough sketch of this merge loop is shown right after this list).
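The following is a rough, unoptimized sketch of the loop described above, assuming Euclidean distances and using the MIN (single-link) rule, introduced later, as the cluster-to-cluster proximity; it is meant only to make the merge-and-update idea concrete:

import numpy as np

def naive_agglomerative(X):
    n = len(X)
    clusters = [[i] for i in range(n)]                           # each point starts as its own cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # proximity (distance) matrix
    np.fill_diagonal(D, np.inf)                                  # ignore the diagonal
    merges = []
    while len(clusters) > 1:
        i, j = np.unravel_index(np.argmin(D), D.shape)           # closest pair (off-diagonal)
        if i > j:
            i, j = j, i
        merges.append((clusters[i], clusters[j]))
        # Single-link update: distance to the merged cluster is the minimum
        # of the distances to its two parts
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)        # drop the merged row/column
        clusters[i] = clusters[i] + clusters[j]
        clusters.pop(j)
    return merges

X = np.random.RandomState(0).rand(6, 2)   # six illustrative points
for a, b in naive_agglomerative(X):
    print("merge", a, "+", b)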

*** The key operation here is computing the proximity of two clusters, i.e., the similarity or distance between two clusters.***

=> Once we figure this out, the whole algorithm falls into place. We also need to be careful when updating the proximity matrix.

=> First, we have the points and the corresponding proximity matrix.

Initial Points and Proximity Values.

After a few merge steps, we have clusters.

Initial clusters and inter cluster values.

After a few more iterations of merging the closest clusters (or points), we get some larger clusters.

Merge of Clusters.

Intermediate Situation:-

Now, suppose we have to merge two clusters, namely C3 & C4, and update the proximity matrix. Initially it looks as above.

Updating of Proximity Clusters after merging.

And the big question is: “How do we update the Proximity Matrix?”

So, whichever clusters you merge, the corresponding rows and columns of the proximity matrix must be updated.

=> And yet again, how do you define the similarity?

How do you define Inter-Cluster Similarity ?

=> There are multiple approaches to compute it. They are:

1. MIN.

2. MAX.

3. Group Average.

4. Distance between Centroids.

5. Ward’s Method.

The only key question is how you measure the similarity between two clusters. Once you figure that out, the algorithm is a cakewalk.

To put it simply, merging clusters is just taking the set union of their points.

Defining Inter-Cluster Similarity:

Suppose we are given a cluster “C1” with one or more points, and another cluster “C2”. How do we define the similarity (or proximity) between the two clusters?

=> There are multiple approaches, all based on simple ideas. They are:

Min:-

  • MIN basically says: if we have two clusters C1 & C2, with points P1 to P5 in C1 and P6 to P10 in C2, take the distance (or similarity) between the clusters as follows. It is calculated using the following equation:

d_min(C1, C2) = min [ dist(Pi, Pj) ]

such that Pi belongs to C1 & Pj belongs to C2

(i.e., d_min(C_i, C_j) = min of dist(a, b) over all a in C_i and b in C_j).

  • In other words, the minimum distance between two clusters is defined as the shortest distance between any two points belonging to the two clusters. This criterion measures the similarity between the closest points from different clusters.
Minimum computation.
MIN Dendrogram.

This is how we define similarity using the MIN approach.
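A minimal sketch of the MIN rule, assuming Euclidean distance and two small illustrative clusters:

import numpy as np
from scipy.spatial.distance import cdist

C1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # points in cluster C1
C2 = np.array([[4.0, 4.0], [5.0, 5.0]])               # points in cluster C2

pairwise = cdist(C1, C2)                              # all |C1| x |C2| pairwise distances
print("MIN / single-link distance:", pairwise.min())  # smallest cross-cluster distance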

Max is the exact opposite.

MAX:-

  • This metric calculates the maximum distance between any pair of data points belonging to different clusters. It represents the maximum dissimilarity between clusters and indicates the presence of well-separated clusters.

d_max(C1, C2) = max [ dist(Pi, Pj) ]

such that Pi belongs to C1 & Pj belongs to C2

(i.e., d_max(C_i, C_j) = max of dist(a, b) over all a in C_i and b in C_j).

Maximum Computation.
Max dendrogram.

This is how the MAX computation works in Agglomerative Hierarchical Clustering.
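As a hedged side-by-side illustration (toy data again), SciPy’s linkage function exposes both rules: method='single' corresponds to MIN and method='complete' to MAX, so the two dendrograms can be compared directly:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Toy data for illustration only
X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, method in zip(axes, ['single', 'complete']):   # MIN vs MAX linkage
    dendrogram(linkage(X, method=method), ax=ax)
    ax.set_title(f'{method} linkage')
plt.tight_layout()
plt.show()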

Group-Average:

  • This metric computes the average distance between all pairs of data points belonging to different clusters. It provides a measure of the average dissimilarity between clusters.

Average Intergroup Distance = (Sum of all pairwise distances between data points in different clusters) / (Total number of pairwise distances)

Mathematically, if cluster C_i contains n_i data points and cluster C_j contains n_j data points, the total number of pairwise distances between points in the two clusters is n_i * n_j, so:

d_avg(C_i, C_j) = ( Σ of dist(p, q) over all p in C_i and q in C_j ) / (n_i * n_j)

where:

  • dist(p, q) is the distance between a point p in C_i and a point q in C_j,
  • n_i and n_j are the numbers of data points in the two clusters.
Group Average.
Group Average Dendrogram.

This is how Group Average Computation works.
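A minimal sketch of the group-average rule, using the same illustrative clusters as before; it checks that the explicit sum divided by n_i * n_j equals the plain mean of the pairwise distance matrix:

import numpy as np
from scipy.spatial.distance import cdist

C1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
C2 = np.array([[4.0, 4.0], [5.0, 5.0]])

pairwise = cdist(C1, C2)                       # |C1| x |C2| cross-cluster distances
avg = pairwise.sum() / (len(C1) * len(C2))     # sum over all pairs / (n_i * n_j)
print("group-average distance:", avg, "==", pairwise.mean())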

Distance between Centroids:

From all the points in a cluster we can compute its centroid; we then take the distance (or similarity) between the two cluster centroids as the distance between the clusters. This approach is less popular. It is computed as follows:

Centroids distances.

***One important thing to note here is that, in every computation except the centroid-based one, all we need is the similarity between points!***

  • Whenever all we need is the similarity between points, we can trivially kernelize it.

=> Sim(Pi, Pj) → kernelize it.

Instead of the plain similarity matrix, we can choose any other kernel matrix. The centroid method is the exception, because a centroid is a new point altogether;

it is slightly harder to kernelize, but it is possible theoretically.

MIN, MAX and Group Average can be trivially kernelized, because all they need are similarity values.
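As a hedged sketch of this kernelization idea: a kernel (similarity) matrix can be converted into a dissimilarity matrix and fed to scikit-learn’s AgglomerativeClustering as a precomputed distance matrix (in recent scikit-learn versions the parameter is metric='precomputed'; older versions call it affinity; Ward linkage cannot be used this way). The data and kernel parameters below are illustrative:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

K = rbf_kernel(X, gamma=0.1)   # kernel (similarity) matrix; diagonal entries are 1
D = 1.0 - K                    # turn similarities in (0, 1] into dissimilarities

# 'precomputed' tells scikit-learn that D already holds pairwise distances
# (parameter name depends on the scikit-learn version, as noted above)
model = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='average')
labels = model.fit_predict(D)
print("cluster sizes:", np.bincount(labels))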

Comparison Of all the Computation Techniques:

Comparison of each computation technique.

Strength of MIN:

MIN Strength.
  • It can handle non-elliptical shapes.

Limitations:

Limitation Of Min.
  • It is highly sensitive to noise and outliers.

MIN has a tendency to chain noisy points together: because it uses the minimum distance between points, it readily merges outliers into clusters. This makes it extremely sensitive to noise and outliers.

Strength Of MAX:-

Strength Of Max.

• Less susceptible to noise and outliers.

Limitations:

Limitations Of Max.
  • It tends to break large clusters.
  • It is also biased towards globular clusters.

MIN is also called the “Single Linkage” method.

MAX is also called the “Complete Linkage” method.

Group Average:

It is a compromise between Min and Max.

Strength of Group Average:

  • It has the same strength as MAX: it is less susceptible to noise and outliers.

Limitations:

  • It is biased towards globular clusters, and it may or may not break larger clusters, depending on the density of the points.

And there is one more method for computing cluster similarity, called “Ward’s Method”.

Ward’s Method:

  • Basically, it is similar to Group Average, except that it uses squared distances.
  • Ward’s method aims to minimize the increase in the total within-cluster variance when merging clusters.
  • It calculates the distance between two clusters based on the increase in the total within-cluster sum of squares (variance) after merging.
  • The equation for calculating the distance between two clusters A and B using Ward’s method is:

=> d(A, B) = ((n_A * n_B) / (n_A + n_B)) * ||c_A - c_B||²

where n_A and n_B are the number of samples in clusters A and B, and c_A and c_B are the centroids of clusters A and B, respectively.

The clusters with the smallest increase in within-cluster variance are merged together.
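A small sketch of the merge cost from the formula above, with two illustrative clusters:

import numpy as np

def ward_merge_cost(A, B):
    """Increase in within-cluster sum of squares if clusters A and B are merged."""
    n_a, n_b = len(A), len(B)
    c_a, c_b = A.mean(axis=0), B.mean(axis=0)              # cluster centroids
    return (n_a * n_b) / (n_a + n_b) * np.sum((c_a - c_b) ** 2)

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 5.0]])
print("Ward merge cost d(A, B):", ward_merge_cost(A, B))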

Strength Of Ward’s Method:

  • It is also less susceptible to noise and outliers.

Limitations:

  • Biased towards globular clusters.

Ward’s method is sometimes even used to initialize K-Means.

=> All of them are used in Agglomerative Clustering.

Space & Time Complexity for Hierarchical Clustering:-

Space Complexity: -

It is roughly O(n²), where n is the number of data points.

=> This is because we need to store the similarity (or distance) matrix.

=> That is a lot of memory when “n” is large.

Time Complexity:

=> It is roughly O(n³).

At most n iterations.

  • Because at each iteration we merge one pair of clusters.
  • At most, we will have n iterations until we are left with one large cluster containing all the points.
  • At each iteration we need to update the similarity matrix.

Updating and searching the similarity matrix is roughly O(n²) per iteration.

=> O(n) iterations * O(n²) per iteration

=> ~ O(n³).

“This is not a tight bound, as we can optimize the update and search steps to improve the final time complexity.”

  • We can make it O(n² log n).

By using more sophisticated data structures, we can bring the complexity down this far.

Typically, Space = O(n²)

and

Time = O(n² log n), but a brute-force implementation takes O(n³).

One thing we need to remember is that

O(n²), O(n² log n) and O(n³) are all quite large complexities,

because “n” can be very large in real data.

Imagine we are given 1 million points: O(n²) = 10⁶ * 10⁶ => 10¹² entries in the proximity matrix.

Even at one byte per entry that is 10¹² bytes, roughly 1 TB; with 8-byte floats it is about 8 TB. We would need this in RAM, which is impractical (see the quick calculation below).
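A quick back-of-the-envelope check of that claim (assuming a dense matrix stored with 1-byte and 8-byte entries, respectively):

n = 1_000_000                                          # one million points
entries = n * n                                        # 10^12 entries in the proximity matrix
print("1 byte per entry :", entries / 1e12, "TB")      # ~1 TB
print("8 bytes per entry:", entries * 8 / 1e12, "TB")  # ~8 TB for float64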

One big problem with Agglomerative Clustering is that it is not very useful when the data is very large, because of its time and space complexity.

Limitations of Hierarchical clustering:

1. There is no mathematical objective function or optimization problem that we are directly solving.

Ex: If you look at K-Means,

we are solving a very clear mathematical objective.

And once you can represent things as a clear mathematical function or optimization problem, you can connect the technique with other algorithms.

Agglomerative clustering is a neat algorithmic solution, but it does not fit neatly into a mathematical formulation, and that is a limitation of its own.

=> K-Means can be connected to other known techniques in applied mathematics and optimization, like matrix factorization.

But we can’t connect hierarchical clustering to existing techniques very easily. There are ways to connect it, but they are not straightforward.

2. Take any linkage computation in Agglomerative clustering, be it MIN, MAX or Group Average: each has its own limitations.

* MIN: sensitive to outliers.

* MAX: breaks large clusters and cannot accommodate clusters of different sizes.

* Group Average: has some of these limitations too, as it is a compromise between MIN and MAX.

3. *** The most important limitation from a practical (applied) standpoint is the space & time complexity. ***

It has terrible space and time complexity:

  • Space => O(n²), quadratic.
  • Time => O(n³), cubic, for a brute-force implementation; O(n² log n), between quadratic and cubic, with some optimization.

So, it can’t be used when “n” is large, which rules it out for large-scale datasets. It may work for small datasets, but it does not fit large-scale data well.

For large-scale datasets, Agglomerative (or hierarchical clustering in general) goes out the door, and a very good alternative is K-Means, which has the following complexity:

=> K-Means: O(nd), i.e., linear in n, and roughly O(n) when “d” is small.

*** All the other limitations can be worked around, but nothing much can be done about the complexity.***

So, there has been a lot of investment in K-Means, as it is an old algorithm that researchers keep improving.

=> K-Means (originally from the 1960s) later got nice extensions: K-Means++ (early 2000s), Kernel K-Means, and K-Medoids.

=> Because of its huge time and space complexity, hierarchical clustering did not see the same level of improvement or development.

“Agglomerative Clustering is a powerful technique with a wide range of applications across various domains.”

Here are some applications and benefits of Agglomerative Clustering:

Applications Of Agglomerative Clustering in General:

1. Customer Segmentation: Agglomerative Clustering can be used to segment customers based on their behavior, preferences, or purchasing patterns. This segmentation enables businesses to better understand their customer base and tailor marketing strategies accordingly.

2. Image Segmentation: Agglomerative Clustering can partition an image into distinct regions based on similarity criteria. It is commonly used in computer vision applications, such as object detection, image recognition, and image compression.

3. Document Clustering: Agglomerative Clustering can group similar documents together based on their content. This is useful in text mining, information retrieval, and document organization tasks.

4. Anomaly Detection: Agglomerative Clustering can identify outliers or anomalies in a dataset by considering them as separate clusters. This is valuable in fraud detection, network intrusion detection, and quality control.

5. Biological Data Analysis: Agglomerative Clustering is widely used in bioinformatics for analyzing gene expression data, protein sequence analysis, and clustering biological samples based on their characteristics.

Benefits:

1. Flexibility: Agglomerative Clustering can handle various types of data, including numerical, categorical, and mixed data. It can also accommodate different distance metrics and linkage methods, providing flexibility in analyzing different datasets.

2. Hierarchy Exploration: Agglomerative Clustering creates a hierarchy of clusters, represented by a dendrogram. This allows for easy exploration of different levels of granularity in the clustering structure.

3. Interpretability: Agglomerative Clustering produces clusters that are easy to interpret and understand. The hierarchical structure and visualizations aid in identifying meaningful patterns and relationships within the data.

4. Scalability (on small data): Agglomerative Clustering handles small datasets efficiently. For such data the time complexity of the algorithm is manageable, especially when using optimized implementations.

5. No Prespecified Number of Clusters: Unlike some other clustering algorithms, Agglomerative Clustering does not require the number of clusters to be specified beforehand. It automatically determines the number of clusters based on the data and the chosen linkage method.

Agglomerative Clustering can be highly useful in scenarios where the underlying data structure is hierarchical, and the goal is to identify meaningful clusters at different levels. It is particularly beneficial when there is no prior knowledge about the number of clusters or when the number of clusters can vary depending on the granularity of analysis. Additionally, the ability to visually explore the clustering hierarchy aids in gaining insights and making informed decisions based on the data.

Here is an implementation of Agglomerative Clustering on a wine dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

# Load the dataset (the wine-clustering CSV contains only feature columns)
df = pd.read_csv('wine-clustering.csv')

# Extract all feature columns, so that X[:, 0] is Alcohol and X[:, 1] is Malic_Acid,
# matching the plots further below
X = df.values

# Perform Agglomerative Clustering with Ward linkage
# (Euclidean distance is the default and the only metric allowed with Ward linkage;
# the old `affinity` argument is deprecated in recent scikit-learn versions)
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg_clustering.fit_predict(X)
# Print the labels
print("Cluster Labels:")
print(labels)

# Create linkage matrix
linkage_matrix = linkage(X, method='ward')

# Plot the dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Dendrogram')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()
Cluster Labels for the Wine Data.
# Box plots
plt.figure(figsize=(10, 6))
sns.boxplot(x=labels, y=df['Alcohol'])
plt.title('Box Plot - Alcohol')
plt.xlabel('Cluster Labels')
plt.ylabel('Alcohol')
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x=labels, y=df['Malic_Acid'])
plt.title('Box Plot - Malic Acid')
plt.xlabel('Cluster Labels')
plt.ylabel('Malic Acid')
plt.show()
Box Plots For the features Alcohol & Malic Acid with Cluster Labels.
# Violin plots
plt.figure(figsize=(10, 6))
sns.violinplot(x=labels, y=df['Alcohol'])
plt.title('Violin Plot - Alcohol')
plt.xlabel('Cluster Labels')
plt.ylabel('Alcohol')
plt.show()

plt.figure(figsize=(10, 6))
sns.violinplot(x=labels, y=df['Malic_Acid'])
plt.title('Violin Plot - Malic Acid')
plt.xlabel('Cluster Labels')
plt.ylabel('Malic Acid')
plt.show()
Violin Plots for Alcohol & Malic Acid with Cluster Labels.
# Plot the clusters in 2D
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering - 2D')
plt.xlabel('Alcohol')
plt.ylabel('Malic Acid')
plt.show()
Agglomerative Clustering 2D Visualization.
# Visualize clusters in 3D using the first three feature columns
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels)
ax.set_xlabel(df.columns[0])
ax.set_ylabel(df.columns[1])
ax.set_zlabel(df.columns[2])
plt.title('Agglomerative Clustering (3D)')
plt.show()
Agglomerative Clustering 3D Visualization.

Key Words:

#HierarchicalClustering #AgglomerativeClustering #ClusteringAlgorithm #MachineLearning #ArtificialIntelligence #DataClustering #PatternRecognition #UnsupervisedLearning #DataMining #DataAnalysis #ClusterAnalysis #DataVisualization #BigData #FeatureExtraction #DataPreprocessing #DataScience #Algorithm #DataPoints #Dendrogram #SimilarityMeasure #DataExploration #DataClusteringTechniques #DataClassification #DataModelling #DataPatterns #ClusterValidation #IntraclusterDistance #InterclusterDistance #ScikitLearn #PythonProgramming

References:

Pictures of Strengths, limitations and individual dendrograms and values are taken from:

Ala Al-Fuqaha (“AL”), Ph.D.
Professor and Director, NEST Research Lab
College of Engineering & Applied Sciences, Computer Science Department
Editor, IEEE Communications Letters
https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
