Clustering Analysis: Hierarchical Clustering

Bhanu Kiran
8 min read · Mar 16, 2022


So you have data, you have gone ahead and performed your EDA, data cleansing, and so on, and then you realize your data does not have a target variable! It’s a bit late to notice at this point, but that’s not what this blog is about. This blog is about cluster analysis and, in particular, one method used to perform it.

Cluster analysis belongs to the family of unsupervised learning methods. Unlike supervised learning, where your data has a target variable and the goal is to identify the relationship between the features and that target, cluster analysis is about finding the hidden structures in your data. If I were to plot a simple scatter plot of my variables, it might not show any trend or make much sense; at first glance, it might look random. But after a cluster analysis, you’ll be able to see the difference.

A Scatter plot of the “Wine” dataset
A plot of the “Wine” dataset after cluster analysis

A cluster is a group of objects, so cluster analysis focuses on grouping similar data points together. How does this grouping of data points occur? Through similarity and dissimilarity measures!

Similarity and Dissimilarity measures

There might not be an absolute definition of what a similarity measure is and what a dissimilarity measure is, as this varies with the type of data at hand. But the literal definitions give us a good idea.

As the name suggests, similarity means two things that are related to each other and dissimilarity means two things that are not similar to each other.

In cluster analysis, we have something known as a similarity coefficient, which can be denoted by

s(x,y)

where x and y are two data points; it indicates the strength of the relationship between them. A dissimilarity coefficient is expressed as

d(x,y)

where, again, x and y are two data points.

In most cases, instead of measuring the similarity between two data points, we work with the dissimilarity d(x,y), since it can be expressed as a distance measure: the farther apart two data points are, the more different they are, and the more likely they are to belong to different clusters.

Keep in mind that a distance measure satisfies the following properties (illustrated in the short sketch after the list):

  1. non-negativity: d(x,y) ≥ 0
  2. reflexivity: d(x,y) = 0 ⇔ x = y
  3. commutativity: d(x,y) = d(y,x)
  4. triangle inequality: d(x,y) ≤ d(x,z) + d(z,y)
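
To make these properties concrete, here is a quick Python sketch that checks them for the Euclidean distance; the points and the helper d() are made up just for illustration.

import numpy as np

def d(x, y):
    # Euclidean distance between two points
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

x, y, z = [1.0, 2.0], [4.0, 6.0], [0.0, 0.0]

print(d(x, y) >= 0)                    # 1. non-negativity
print(d(x, x) == 0)                    # 2. reflexivity: d(x, x) = 0
print(np.isclose(d(x, y), d(y, x)))    # 3. commutativity
print(d(x, y) <= d(x, z) + d(z, y))    # 4. triangle inequality

All four checks print True.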

Now that you understand the term distance measure, let’s look at the various distance measures used in cluster analysis. The way a clustering method groups data points depends on the type of distance measure used.

Distance Measures

While clustering, any of these distance measures can be used. As discussed earlier, a distance measure calculates the distance between two points. And while the end result is always the same kind of output (you get clusters), the clusters you obtain can differ depending on the measure and method used.

Generally, distance measures are paired with something known as linkage methods. When performing clustering in practice, a user can specify the distance measure and linkage method they want to use. Data clustering is usually carried out as simply as

data clustering = distance + method

Since data can be numerical or categorical, below you can find the various distance measures (a short code sketch follows the two lists) and the linkage methods, after which we will jump into hierarchical clustering.

For numerical data

  1. Euclidean Distance
  2. Average Distance
  3. Manhattan Distance
  4. Maximum Distance
  5. Minkowski Distance

For categorical data

  1. Simple Matching Distance
  2. Jaccard Distance
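
To see a few of these in action, scipy’s distance module computes most of them directly; the vectors below are just toy values, and for the categorical measures the categories are assumed to be encoded as 0/1.

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(distance.euclidean(a, b))        # Euclidean distance
print(distance.cityblock(a, b))        # Manhattan (city block) distance
print(distance.chebyshev(a, b))        # Maximum (Chebyshev) distance
print(distance.minkowski(a, b, p=3))   # Minkowski distance with p = 3

u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])
print(distance.hamming(u, v))          # proportion of mismatches, i.e. simple matching distance
print(distance.jaccard(u, v))          # Jaccard distance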

Linkage Methods

The linkage methods calculate the distance between clusters, and the closest pair of clusters is combined into a new cluster (a quick sketch follows the list).

  1. Single Linkage - clusters with the closest minimum distance are merged
  2. Complete Linkage - clusters with the closest maximum distance are merged
  3. Centroid Linkage - clusters with the lowest centroid distance are merged
  4. Ward’s Linkage - clusters are merged based on the increase in the error sum of squares.
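
In practice, the linkage method is just a parameter you pass to your clustering routine. As a quick sketch, here is how scipy’s linkage function accepts these methods; the data is random and only there to make the call runnable.

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))   # 10 made-up points in 2D

for method in ["single", "complete", "centroid", "ward"]:
    Z = linkage(X, method=method, metric="euclidean")
    # each row of Z records one merge: the two clusters joined,
    # the distance between them, and the size of the new cluster
    print(method, Z[0])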

Now that all the math, formulas, and background jazz are out of the way, let’s look at the juicy stuff: hierarchical clustering. In summary, a distance measure is used to calculate the distance between two points, and a linkage method is used to combine the closest pair of clusters into a single cluster.

Hierarchical Clustering

Hierarchical clustering is one of the most widely used clustering methods. This cluster analysis can be visualized with the help of a tree-like structure referred to as a dendrogram.

A dendrogram

The main algorithm for hierarchical clustering is the agglomerative algorithm, and it’s quite simple!

Agglomerative Clustering

Take a moment and observe the diagram above. As you might have guessed, this is a bottom-up strategy: it starts with each data point as an atomic cluster. These atomic clusters are then merged into larger and larger clusters until we end up with one single cluster.

The atomic clusters are merged based on the dissimilarity between them, i.e., the distances between all the atomic clusters are calculated, the pair with the smallest distance between them is merged into a cluster, and this process is repeated over and over.

To put this into an algorithm:

  1. Create an initial set of clusters, with each cluster consisting of a single record, for all records in the data
  2. Compute the dissimilarity D(Ci,Cj) between all pairs of clusters i and j
  3. Merge the two clusters Ci and Cj that are least dissimilar as measured by D(Ci,Cj)
  4. If we have more than one cluster remaining, return to step 2. Otherwise, we are done (a minimal code sketch of these steps follows).
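
Here is a minimal from-scratch sketch of those four steps in Python, using Euclidean distance and single linkage on a handful of made-up points; in practice you would let a library do this, as we do below.

import numpy as np

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])

# Step 1: start with each record as its own cluster (stored as lists of row indices)
clusters = [[i] for i in range(len(points))]

def dissimilarity(ci, cj):
    # single linkage: smallest pairwise distance between the two clusters
    return min(np.linalg.norm(points[a] - points[b]) for a in ci for b in cj)

while len(clusters) > 1:
    # Step 2: compute D(Ci, Cj) for all pairs of clusters
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    # Step 3: merge the least dissimilar pair
    i, j = min(pairs, key=lambda p: dissimilarity(clusters[p[0]], clusters[p[1]]))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print(clusters)
    # Step 4: repeat until a single cluster remains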

Keep in mind that the formulas and algorithms above might seem daunting, but it’s really quite simple, because nowadays you have machines to do everything for you; you just have to tell them what to do. The reason you need to know what’s going on behind the scenes is so that you can apply it correctly when the time comes. Now let’s get into it.

Hierarchical Clustering using R and Python

Now that all the theory is aside, and hopefully you now understand how hierarchical clustering works, let’s put it to use. The dataset being used is the famous wine dataset, and you can find it here. Our aim is to perform a cluster analysis on this dataset, and we shall do this in both R and Python.

R

Let’s start by importing libraries and the dataset.

library(tidyverse)
library(cluster)
df <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")

Once you run the summary function, you should see the summary statistics of all the columns in the wine dataset.

summary(df)

Since the columns are all on different scales, we can normalize them using the scale() function.

df <- scale(df)

Now, as we have seen above, clustering data depends on what distance and what method we use.

data clustering = distance + method

Therefore, I shall use the Euclidean distance with the complete linkage method.

The hclust() function in R performs agglomerative hierarchical clustering. We first compute the dissimilarity values with the dist() function, by passing our data frame and the distance measure, and then feed these values into the hclust() function, after which we plot the dendrogram.

dc <- dist(df, method = "euclidean")
hc_df <- hclust(dc, method = "complete")
plot(hc_df, hang= -0.1)

Pretty easy right?! Now let’s do it in python.

Python

Let's start by importing libraries, and there are quite a few of them to import.

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram
from sklearn.decomposition import PCA

Now we import the dataset.

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)  # the file has no header row

Now we get summary statistics just like how we did in R.

df.describe()

Again, we scale our features to normalize them.

sc = StandardScaler()
df_scaled = sc.fit_transform(df)

Now, in Python, this goes a little differently from R. You can easily plot your dendrogram for hierarchical clustering using the following code.

plt.figure(figsize=(20, 10))
plt.title('Hierarchical Dendrogram')
shc.dendrogram(shc.linkage(df_scaled, method='complete'))
plt.show()
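
If you also want flat cluster labels and not just the picture, scipy’s fcluster function can cut the same linkage result into a chosen number of clusters; asking for 3 clusters here is just an illustrative choice (the sklearn model below uses 2).

from scipy.cluster.hierarchy import fcluster

Z = shc.linkage(df_scaled, method='complete')
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into at most 3 flat clusters
print(labels[:10])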

Now, if I want to use the agglomerative model from scikit-learn and visualize my clusters, that is usually done with scatterplots, which technically only have an X-axis and a Y-axis. Therefore, we reduce the dimensionality of our data frame to two columns using PCA.

pca = PCA(n_components = 2)
X_principal = pca.fit_transform(df_scaled)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']

Now I can use the AgglomerativeClustering() function to build a model and plot it.

agg_cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete')  # in newer scikit-learn versions the 'affinity' parameter is named 'metric'

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=agg_cluster.fit_predict(X_principal), cmap='rainbow')
plt.show()

And there you go! Hierarchical clustering. I hope you now have a better understanding of hierarchical clustering. If you would like to read more about it, you can follow the book here, which I have used for reference: Practical Statistics for Data Scientists.
