Unsupervised Clustering Algorithms

Smruti Ranjan Pradhan
AlmaBetter
Apr 25, 2021 · 5 min read

What is Clustering ?

Clustering is the process of grouping a set of observations into different groups based on some similarity criterion. Clustering is essential in machine learning because it helps us categorize a data set and understand it better without any labeling of the data. Clustering algorithms do this by identifying underlying patterns in the feature set, although different algorithms take different approaches to grouping the data points. The most common and widely used clustering algorithm is K-Means clustering. But before discussing the algorithms in detail, let me introduce the method by which we measure the similarity between observations.

Similarity criteria

The similarity between observations in a data set is generally measured with distance metrics. Three well-known and widely used distance metrics are the Manhattan, Euclidean and Minkowski distances.

If (x1, x2, …, xn) and (y1, y2, …, yn) are two given data points/ observations :

Manhattan Distance : d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|

Euclidean Distance : d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)

Minkowski Distance with order of norm = p : d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p)^(1/p)

The smaller the value of the distance metric, the more similar the two observations are. Note that the Minkowski distance reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2.
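As a quick illustration, here is a minimal Python/NumPy sketch of these three metrics (the function names and sample points are purely illustrative):

```python
import numpy as np

def manhattan(x, y):
    # Sum of absolute coordinate-wise differences
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    # Square root of the sum of squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def minkowski(x, y, p):
    # Generalizes both: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(manhattan(x, y), euclidean(x, y), minkowski(x, y, p=3))
```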

K-Means Clustering

The idea behind the algorithm is to assign every point to one of K clusters, each represented by a cluster centroid. The entire clustering algorithm is based on the concept of Expectation-Maximization.

Steps involved in the algorithm :

  1. Randomly initialize K cluster centers/centroids
  2. E-step : Calculate the distance of every observation from each of the cluster centroids and assign each observation to the cluster whose centroid is closest to it.
  3. M-step : Calculate the mean/median of all the observations in a cluster and move that cluster's centroid to the mean/median value.
  4. Repeat steps 2 and 3 until no observation is reassigned to a different cluster.
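Below is a minimal NumPy sketch of these E- and M-steps (Euclidean distance, starting centroids drawn randomly from the data; the function name, parameters and synthetic data are illustrative, and the empty-cluster edge case is ignored for brevity):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Two synthetic, well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = k_means(X, k=2)
```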

However, the most important downside of this algorithm is that it is prone to converge to a local optimum, depending on the initial positions of the cluster centroids. An effective solution to this issue is to run the algorithm multiple times and keep the result of the run with the minimum Sum of Squared Errors (SSE), where the SSE is defined as the sum of the squared distances of all the observations from their respective cluster centroids.
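In practice, scikit-learn's KMeans handles these restarts for you: its n_init parameter sets the number of random initializations, and the run with the lowest SSE (exposed as the inertia_ attribute) is kept. A short usage sketch on the synthetic X from the previous snippet:

```python
from sklearn.cluster import KMeans

# 10 random initializations; the fit with the lowest SSE (inertia) is kept
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)          # SSE of the best run
print(km.labels_[:10])      # cluster assignments
print(km.cluster_centers_)  # final centroids
```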

One other downside of this algorithm is that it tends to form globular (roughly spherical) clusters and is not suitable for grouping data with non-globular structure. To address these issues, we have another family of clustering algorithms known as hierarchical clustering algorithms.

Hierarchical Clustering

These clustering algorithms are based on the concept of hierarchy: they try to find a hierarchical pattern in the unlabeled data. There are two kinds of hierarchical clustering :

  1. Agglomerative Clustering : This algorithm takes a bottom-up approach; it starts with each observation in the data set belonging to its own cluster and then merges clusters based on some similarity criterion until the appropriate number of clusters (or a single cluster) remains.
  2. Divisive Clustering : This algorithm takes a top-down approach in which all the observations start out as one cluster, which is then repeatedly split by separating out the most dissimilar clusters/observations.

What is Proximity Matrix ?

A proximity matrix is a matrix that stores the distance between every pair of points; the distance can be based on any distance metric. If we have n observations in our data set, the proximity matrix is of size n x n, with diagonal elements equal to 0.
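A minimal SciPy sketch of building a proximity matrix (Euclidean distance on a small illustrative data set):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]])

# pdist returns the condensed pairwise distances; squareform expands
# them into the full n x n proximity matrix with zeros on the diagonal
proximity = squareform(pdist(X, metric='euclidean'))
print(proximity.shape)  # (4, 4)
```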

Steps involved in Agglomerative Hierarchical clustering :

  1. Each observation is assigned to a unique cluster
  2. We look for the smallest distance between two points/clusters in the proximity matrix, merge the pair with the minimum distance between them, and update the proximity matrix accordingly. The distance between clusters can be calculated using any one of several kinds of linkages, namely single, complete and average linkage.
  3. Repeat the above step until we are left with a single cluster.

Types of Linkages :

  • Single-linkage: In this linkage method, the distance between two clusters is defined as the shortest possible distance between any point in one cluster and any point in the other. Single linkage is best to use when we try to find non-globular clustering patterns in the data.
  • Complete-linkage: In this linkage method, the distance between two clusters is defined as the longest possible distance between any point in one cluster and any point in the other. Complete linkage is robust to possible noise present in the data between two clusters.
  • Average-linkage: In this linkage method, the distance between two clusters is defined as the average of all pairwise distances between each point in one cluster and every point in the other cluster. Average linkage is also robust to possible noise in the data.
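A short sketch comparing these linkage options with scikit-learn's AgglomerativeClustering (the two-moons data set is simply an illustrative example of a non-globular pattern):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

# Two interleaving half-circles: a classic non-globular pattern
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, "cluster sizes:", [int((labels == c).sum()) for c in (0, 1)])
```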

What is a Dendrogram ?

A dendrogram is a pictorial representation of the hierarchical tree formed as a result of hierarchical clustering. It aptly represents the clusters formed at different stages of the algorithm, and it also gives a rough idea about the point at which we need to cut the tree (terminate the algorithm) in order to get the right number of clusters.
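Here is a minimal SciPy sketch that builds the hierarchy, plots the dendrogram and then cuts the tree at a (purely illustrative) distance threshold to obtain flat cluster labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# A small illustrative data set: two loose groups of points
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.3],
              [7.0, 8.0], [8.0, 8.2], [7.5, 7.7]])

# Build the merge tree with average linkage (pairwise Euclidean distances)
Z = linkage(X, method='average')

# Plot the dendrogram; large vertical gaps suggest where to cut the tree
dendrogram(Z)
plt.show()

# Cut at a distance threshold to obtain flat cluster labels
labels = fcluster(Z, t=3.0, criterion='distance')
print(labels)  # e.g. [1 1 1 2 2 2]
```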

Advantages of Hierarchical Clustering

  1. Unlike K-means clustering, hierarchical clustering does not require the number of clusters as an input. The right number of clusters can be read off the dendrogram, for example by cutting the tree where the vertical gaps between merges are largest.
  2. Unlike K-means clustering, this algorithm can find non-globular clustering patterns in data in addition to globular ones.

However, the biggest disadvantage of this algorithm is that its performance degrades as the number of observations in the data set grows, since computing and storing the huge n x n proximity matrix becomes computationally expensive.
