A Comparative Study of Clustering Algorithms

Ishika Chatterjee
Published in Analytics Vidhya · Oct 13, 2019

Clustering is broadly defined as the division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. Let's compare different types of clustering algorithms. The algorithms under discussion are: the k-means algorithm, the hierarchical clustering algorithm, the self-organizing maps algorithm, and the expectation maximization clustering algorithm.

Comparison Metrics:

First, let me set out the factors on which I will compare the clustering algorithms:

  1. size of dataset
  2. number of clusters
  3. type of dataset and type of software used
  4. performance of the algorithm
  5. accuracy of the algorithm
  6. quality of the algorithm

How are algorithms implemented?

LNKnet Software: It is public domain software made available from MIT Lincoln Laboratory.

Cluster and TreeView Software: Cluster and TreeView are programs that provide a computational and graphical environment for analyzing data from different data sets.

The reason behind choosing these tools is that they are among the most popular software for implementing data clustering algorithms.

Comparison

Now I will discuss each clustering algorithm individually in detail and also show code for how to implement it.

1 — K-means:

Working of k-means

K-means is a well-known partitioning method. Objects are classified into one of k groups, with k chosen a priori. Cluster membership is determined by calculating the centroid of each group and assigning each object to the group with the closest centroid. This approach minimizes the overall within-cluster dispersion by iteratively reallocating cluster members.

In a general sense, a k-partitioning algorithm takes as input a set S of objects and an integer k, and outputs a partition of S into subsets S1, S2, …, Sk. It uses the sum of squares as the optimization criterion. Let xri be the rth element of Si, |Si| be the number of elements in Si, and d(xri, xsi) be the distance between xri and xsi.
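With these definitions, one common way to write the sum-of-squares criterion is the following (the exact form varies between authors; this version uses squared pairwise distances within each cluster):

E = \sum_{i=1}^{k} \frac{1}{2\,|S_i|} \sum_{r=1}^{|S_i|} \sum_{s=1}^{|S_i|} d(x_{ri}, x_{si})^2

For Euclidean d, minimizing this quantity is equivalent to minimizing the sum of squared distances from each object to the centroid of its cluster.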

Algorithm:

Step 1: Choose K as the number of clusters.

Step 2: Initialize the codebook vectors of the K clusters (randomly, for instance).

Step 3: For every new sample vector, compute the distance between the new vector and every cluster's codebook vector, then update the closest codebook vector with the new vector, using a learning rate that decreases over time.
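To make Step 3 concrete, here is a minimal NumPy sketch of the online codebook update described above. The array names, number of epochs, and learning-rate schedule are illustrative assumptions, not part of any particular library.

import numpy as np

# toy 2-D samples and K randomly initialized codebook vectors
rng = np.random.default_rng(0)
X = rng.random((30, 2))          # 30 sample vectors
K = 3
codebook = X[rng.choice(len(X), K, replace=False)].copy()  # Step 2: initialize

n_epochs = 20
for epoch in range(n_epochs):
    lr = 0.5 / (1 + epoch)        # learning rate that decreases over time
    for x in X:                   # Step 3: process every sample vector
        # distance from this sample to every codebook vector
        dists = np.linalg.norm(codebook - x, axis=1)
        closest = np.argmin(dists)
        # move the closest codebook vector towards the sample
        codebook[closest] += lr * (x - codebook[closest])

print(codebook)                   # final cluster "centroids"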

Time Complexity and Space Complexity:

Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Its space complexity is O(k+n); it requires additional space to store the data matrix. It is order-independent: for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

How to Implement K-means?

We can use sklearn to implement k-means.

Code Sample of K-Means:

from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# making a toy dataset
Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }

# converting into a dataframe
df = DataFrame(Data, columns=['x', 'y'])

# applying k-means on the dataset with number of clusters = 3
kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)

# plotting the points coloured by cluster label, with the centroids in red
plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()

In the above code snippet I am using the k-means algorithm to show how to use k-means in code. The application of k-means is pretty simple, and sklearn fully supports it. What I have done here is make a toy dataset and apply k-means to it.

Output:

2 — Hierarchical Clustering Algorithm:

Working of hierarchical clustering

Partitioning algorithms are based on specifying an initial number of groups and iteratively reallocating objects among groups until convergence. In contrast, hierarchical algorithms combine or divide existing groups, creating a hierarchical structure that reflects the order in which groups are merged or divided. In an agglomerative method, which builds the hierarchy by merging, the objects initially belong to a list of singleton sets S1, S2, …, Sn. Then a cost function is used to find the pair of sets {Si, Sj} from the list that is the "cheapest" to merge. Once merged, Si and Sj are removed from the list of sets and replaced with Si ∪ Sj. This process iterates until all objects are in a single group. Different variants of agglomerative hierarchical clustering algorithms may use different cost functions. Complete linkage, average linkage, and single linkage methods use the maximum, average, and minimum distances between the members of two clusters, respectively.

Algorithm:

Step 1 — Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.

Step 2 — Find the most similar pair of clusters using the proximity matrix and merge these two clusters into one cluster.

Step 3 — If all patterns are in one cluster, stop. Otherwise, go to step 2.
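As a quick illustration of these steps and of the different linkage cost functions, here is a short sketch using SciPy. It assumes the same toy DataFrame df built in the k-means section, and the choice of 'average' linkage is only an example (you could equally pass 'single' or 'complete').

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Steps 1-3: compute pairwise distances and repeatedly merge the "cheapest"
# pair of clusters; 'average' linkage uses the average distance between the
# members of two clusters
Z = linkage(df, method='average')

# the dendrogram records the order in which clusters were merged
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.show()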

Advantages:

• Embedded flexibility regarding a level of granularity.

• Ease of handling of any forms of similarity or distance.

• Consequently, applicability to any attribute types.

Hierarchical clustering algorithms are thus more versatile than partitioning algorithms.

Time Complexity and Space Complexity:

Time complexity = O(n³) where n is the number of data points.

Space complexity = O(n²) where n is the number of data points.

How to Implement Hierarchical Clustering Algorithm?

from sklearn.cluster import AgglomerativeClustering

# agglomerative clustering with 3 clusters, Euclidean distance and Ward linkage
cluster = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
cluster.fit_predict(df)

# plotting the toy dataset coloured by cluster label
plt.figure(figsize=(10, 7))
plt.scatter(df['x'], df['y'], c=cluster.labels_)
plt.show()

I have applied the hierarchical clustering algorithm on the same toy data set, with three clusters.

Output:

3 — Self-Organizing Map Algorithm

Inspired by neural networks in the brain, the Self-Organizing Map (SOM) uses a competition and cooperation mechanism to achieve unsupervised learning. In the classical SOM, a set of nodes is arranged in a geometric pattern, typically a 2-dimensional lattice. Each node is associated with a weight vector of the same dimension as the input space. The purpose of SOM is to find a good mapping from the high-dimensional input space to the 2-D representation of the nodes. One way to use SOM for clustering is to regard the objects in the input space represented by the same node as grouped into a cluster.

Algorithm:

Let's briefly discuss the algorithm.

Step 1: Initialize the weight vector of every node in the lattice (randomly, for instance).

Step 2: Pick an input vector and find its best matching unit (BMU), the node whose weight vector is closest to the input.

Step 3: Update the BMU and its lattice neighbours, pulling their weight vectors towards the input; both the learning rate and the neighbourhood radius decrease over time.

Step 4: Repeat steps 2 and 3 until the map stabilizes or a fixed number of iterations is reached.
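Here is a minimal NumPy sketch of this competition-and-cooperation update. The lattice size, learning rate, and neighbourhood schedule are illustrative assumptions, not the settings of any particular SOM library.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 2))                  # toy 2-D input vectors
rows, cols = 5, 5                        # small 2-D lattice of nodes
weights = rng.random((rows, cols, 2))    # Step 1: one weight vector per node
grid = np.indices((rows, cols)).transpose(1, 2, 0)  # lattice coordinates of each node

n_iter = 500
for t in range(n_iter):
    lr = 0.5 * np.exp(-t / n_iter)       # decreasing learning rate
    radius = 2.0 * np.exp(-t / n_iter)   # shrinking neighbourhood radius
    x = X[rng.integers(len(X))]          # Step 2: pick an input vector
    # competition: find the best matching unit (BMU)
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # cooperation (Step 3): the BMU and its lattice neighbours move towards x
    lattice_dist = np.linalg.norm(grid - np.array(bmu), axis=2)
    influence = np.exp(-(lattice_dist ** 2) / (2 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)

# each input can then be assigned to the cluster of its BMU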

Advantages:

  • SOMs simplify clustering and allow the user to identify homogeneous data groups visually. In Viscovery, several clustering algorithms (SOM Single Linkage, Ward, and SOM-Ward) are available for automatically building clusters. It is very easy to understand and it works pretty well.

Time Complexity:

T = O(NC) = O(S^2)

How to Implement SOM?

I am using SOM on the same toy data set.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import somoclu

# converting the toy dataframe into a float32 numpy array, as somoclu expects
data = np.float32(df)

# size of the 2-D lattice of nodes
n_rows, n_columns = 100, 160

# training a SOM on the data and visualizing the component planes
som = somoclu.Somoclu(n_columns, n_rows, data=data)
som.train()
som.view_component_planes()

Output:

4 — The Expectation Maximization Clustering Algorithm

Let's start discussing this algorithm. Expectation Maximization (EM) is a well-established clustering algorithm in the statistics community. EM is a model-based algorithm that assumes the data set can be modeled as a linear combination of multivariate normal distributions, and the algorithm finds the distribution parameters that maximize a model quality measure called log likelihood.

Algorithm:

Let's discuss the algorithm steps.

Step 1: Given a set of incomplete data, consider a set of starting parameters.

Step 2: Expectation step (E — step): Using the observed available data of the dataset, estimate (guess) the values of the missing data.

Step 3: Maximization step (M — step): Complete data generated after the expectation (E) step is used in order to update the parameters.

Step 4: Repeat step 2 and step 3 until convergence.
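To connect these steps to the Gaussian mixture view used below, here is a minimal NumPy sketch of EM for a one-dimensional, two-component Gaussian mixture. The toy data, starting parameters, and iteration count are illustrative assumptions; this is not the sklearn implementation, just the E-step and M-step written out.

import numpy as np

rng = np.random.default_rng(0)
# toy 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])

# Step 1: starting parameters (means, variances, mixing weights)
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

for _ in range(50):
    # E-step (Step 2): responsibility of each component for each point
    dens = weights * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step (Step 3): update parameters using the responsibilities
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    weights = Nk / len(x)

print(mu, var, weights)   # should approach the true means, variances and weights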

Disadvantages:

  • It has slow convergence.
  • It converges to a local optimum only.
  • It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).

Time Complexity:

O(m·n³), where m is the number of iterations and n is the number of parameters.

How to implement EM Algorithm?

Since the expectation maximization algorithm for clustering is commonly realized as a Gaussian mixture model, we can use sklearn's GaussianMixture.

import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# fitting a Gaussian mixture with 3 components on the same toy dataframe
gmm = GaussianMixture(n_components=3)
gmm.fit(df)

# plotting the points coloured by their predicted component
labels = gmm.predict(df)
plt.scatter(df['x'], df['y'], c=labels)
plt.show()

With this I would like to conclude. I hope I have given a clear comparative study of the algorithms. THANKS A LOT for reading!
