Why so many different clustering algorithms?

Shahram Akbarinasaji
SFU Professional Computer Science
13 min read · Feb 4, 2020

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

Authors: Mohammad Javad Eslamibidgoli, Olawunmi Oluwatoba Kolawole, Shahram Akbarinasaji, Ruihan Zhang, Ke Zhou

How many clusters do you see in this starling murmuration?

Cluster analysis is an unsupervised learning task that aims to divide objects into groups based on their similarity. Many different clustering algorithms have been developed to achieve this goal, but each of them captures only some aspects of the clusters: their size, shape, density, and hierarchy, whether they may overlap or must be disjoint, and whether outliers or noise are detected. Thus, there is no ‘true’ clustering algorithm. In this blog post, we briefly review the most important clustering methods, with an emphasis on their pros and cons.

1) Hierarchical clustering:

Let us assume we have a computer, and we look at our home folder. We can see that there are subfolders with different names like downloads, applications, movies, and music to name a few. Each of these subfolders has items and subfolders in them that relate to the parent folder and each other in some way. For example, our music folder might have subfolders of different genres in which there are folders of different bands and in those folders are albums grouped by year. We can consider this system of arranging folders and subfolders a form of hierarchical clustering.

The goal of hierarchical clustering is to construct a hierarchy of similar objects — a tree diagram where the whole list of objects is at the root and the child branches represent different groupings. This tree diagram is called a dendrogram. There are two main approaches to this: agglomerative or divisive.

Steps in the agglomerative (bottom-up) clustering algorithm:
1) Treat each object in the dataset as a separate cluster.
2) Identify the two most similar clusters.
3) Merge them into one cluster.
4) Repeat 2 and 3 until only a single cluster remains.

Figure 1) Hierarchical clustering (figure adapted from here)

A measure of similarity can be obtained by using a distance function, e.g. the Euclidean distance:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

The distance function takes two entities as input and returns a non-negative value; the smaller this value (the closer the two entities), the higher their similarity. There are a variety of other distance measures, such as the Manhattan distance and the Chebyshev distance. The choice of distance metric depends on the domain of study. For example, if we are trying to group blog posts, the distance could be a function of how many important words the posts share.

To merge clusters, we also need to know a distance measure between the clusters a.k.a linkage criteria. There are five main linkage measures:

  • Single-Linkage: the distance between the closest objects in two clusters.
  • Complete-Linkage: the distance between the farthest objects in two clusters.
  • Average-Linkage: an average of all pairwise distances between objects in two clusters.
  • Centroid-Linkage: the distance between the centroids of two clusters.
  • Ward’s distance: the increase in the total within-cluster variance (the sum of squared distances to the cluster centroid) that results from merging two clusters.

Just like choosing a distance metric, choosing linkage criteria should depend on the domain knowledge of the application.
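
To make the choice of distance metric and linkage concrete, here is a minimal sketch of agglomerative clustering using SciPy; the toy data, the average linkage, and the Euclidean metric are illustrative assumptions, not recommendations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D dataset: two loose groups of points (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(5, 0.5, size=(10, 2))])

# 'metric' selects the distance function (euclidean, cityblock, chebyshev, ...),
# 'method' selects the linkage criterion (single, complete, average, centroid, ward).
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram into a flat clustering with (at most) 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree if matplotlib is available.
```

Changing `method` and `metric` in this sketch corresponds exactly to the linkage and distance choices discussed above.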

The divisive (top-down) clustering algorithm is the opposite of the agglomerative algorithm; instead of merging similar objects from the leaf nodes, we start with all points in one cluster at the root and repeatedly split the points into two partitions so that the resulting partitions are as dissimilar as possible. In this approach, we choose the cluster with the highest variance to split at each iteration. To determine how to split a cluster, we compute the pairwise distances between all pairs of objects with an appropriate distance function and construct a dissimilarity matrix. This matrix is then searched for the most distant (most dissimilar) pairs of objects in order to find the optimal split. This naive approach, however, is expensive and limited to small datasets.

Even though hierarchical algorithms are easy to understand and implement, they have some drawbacks:

  • Hierarchical algorithms do not have rollback procedures. This means that a mistake cannot be undone without restarting the process from the beginning.
  • If we have a very large dataset, the dendrogram can become confusing and hard to read. That is, it will not be easy to determine the number of clusters visually.
  • Its time complexity is worse than that of other methods such as k-means; a naive agglomerative implementation takes O(n³) time and O(n²) memory.

2) DBSCAN:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that finds all dense clusters in a dataset. Intuitively, objects in ‘dense’ areas are ‘connected’ to each other, and dense areas are surrounded by sparse areas. DBSCAN identifies a set of core objects and border objects; the remaining objects are labeled as noise. In theory, density-based clustering needs to satisfy two conditions:

  • Maximality: for pairs (p,q), if p belongs to a cluster and q is density reachable from p, then q also belongs to that cluster. In other words, for each cluster object, the local density exceeds some threshold.
  • Connectivity: for every pair (p,q) in a cluster, they need to be density connected. There should be no gap or sparse area separating p and q. In other words, the set of all objects of one cluster must be spatially connected.

Theoretically, DBSCAN is built on a few definitions. Given a set of objects in a dataset, a distance function, a radius epsilon (Eps) defining the epsilon-neighborhood around each object, and a minimum number of objects required in that neighborhood (MinPts, the density threshold), we can define core objects, density reachable objects, border objects, density connected objects, and noise objects as follows:

  • Core object: a sphere of radius Eps centered at a core object contains at least MinPts objects. In Figure 2, with MinPts = 4, the core objects are shown in red.
  • Directly density reachable object: an object that lies within the epsilon-neighborhood of a core object; chains of directly density reachable objects define density reachability. In Figure 2, these points are shown in red and yellow.
  • Border object: an object that is not a core object but is density reachable from a core object. The yellow circles in Figure 2 represent border objects.
Figure 2) DBSCAN (figure adapted from here)
  • Density connectivity: if we restrict ourselves to core objects, direct density reachability is a symmetric relationship, as shown by the double-sided arrows in Figure 2. In general, however, density reachability is not symmetric, since no object is density reachable from a non-core object; for example, in Figure 2, one border object is not density reachable from the other border object. To obtain a symmetric relation, DBSCAN introduces the concept of density connectivity: two objects are density connected if both are density reachable from a third object, i.e. a core object.
  • Noise: objects that do not belong to any cluster. The blue circle in Figure 2 represents a noise object.

Advantages of density-based clustering algorithms include:

  • It detects clusters of arbitrary shapes.
  • It is robust to noise.
  • It is efficient.
  • The number of clusters is not an input parameter, unlike in representative-based methods (discussed next).
  • Starting from any core object and collecting all objects density reachable from it always yields the same cluster; thus, the order in which objects are processed has no impact on the clustering result.

However, DBSCAN cannot deal with hierarchical clusters, and it struggles when densities differ significantly across different areas of the data; in that situation, clusters and noise cannot be easily separated. To address this issue, hierarchical density-based clustering methods such as OPTICS and HDBSCAN were developed. Other improvements to DBSCAN include generalized DBSCAN (GDBSCAN) and Rough-DBSCAN.
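
As a quick illustration of how the Eps and MinPts parameters are used in practice, the sketch below runs scikit-learn’s DBSCAN implementation on made-up data; the parameter values are arbitrary assumptions and would normally be tuned for the dataset at hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points that should come out as noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(4, 0.3, size=(50, 2)),
               rng.uniform(-2, 6, size=(5, 2))])

# eps plays the role of Eps, min_samples the role of MinPts.
db = DBSCAN(eps=0.5, min_samples=4).fit(X)

labels = db.labels_                            # label -1 marks noise objects
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True      # core objects
border_mask = (labels != -1) & ~core_mask      # clustered but not core: border objects
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, int((labels == -1).sum()))
```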

3) Partitional clustering:

In partitional clustering, clusters are initialized by k representatives. Thus, the dataset is initially broken into k groups. Each object is then assigned to its most similar cluster representative while a cost function is minimized iteratively. The two main methods in partitional clustering are k-means (based on the construction of cluster centroids) and k-medoids (based on a selection of cluster medoids):

In k-means, the cost function is defined as the sum over clusters of the squared distances of the points from their centroids (the within-cluster variances):

J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

The steps involved in this algorithm are as follows:

1) Choose the number of clusters, k.
2) Randomly initialize k centroids.
3) Calculate the distance between every data point and the k centroids, then assign each data point to the cluster with the closest centroid.
4) Compute the mean of all the points in each cluster and use these means as the new k centroids.
5) Repeat steps 3 and 4 until convergence or until a predefined number of iterations is reached.

Figure 3) k-means algorithm (figure adapted from here)
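
The five steps above map almost line for line onto code; the following is a minimal NumPy sketch of the algorithm on made-up data (a library routine such as sklearn.cluster.KMeans would normally be used instead).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize k centroids by sampling k points from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy example with k = 2 (step 1: choose k).
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
centroids, labels = kmeans(X, k=2)
```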

Although efficient and simple, initialization and the choice of k are not straightforward in many cases. Moreover, k-means only finds spherical clusters. There are a variety of post-processing methods designed to improve clustering results based on k-means. ISODATA, for example, eliminates small clusters predicted by k-means or splits larger clusters into smaller ones; in this method, the minimum and maximum cluster sizes need to be specified as new parameters.

In k-medoids (or partitioning around medoids, PAM), the medoids serve as the representatives of the clusters. A medoid is an element of the cluster that minimizes the sum of distances from the other cluster objects to that chosen representative. The steps in the k-medoids algorithm are:

1) Choose the number of clusters, k.
2) Select k initial cluster representatives (medoids) from the dataset.
3) Assign each data point to the closest medoid.
4) For each medoid and each non-medoid data point, consider swapping them and compute the resulting cost. Perform the swap that results in the largest reduction of the cost function.
5) Repeat steps 3 and 4 until convergence or until a predefined number of iterations is reached.

K-medoids is more flexible than k-means in terms of the similarity measures it can use, so it can be applied to mixed data with numerical and categorical attributes. K-medoids is also more robust to noise and outliers than k-means. However, it is much more expensive to compute, because it needs to consider all possible swaps between medoids and non-medoid objects and their corresponding pairwise dissimilarities. Other algorithms, such as CLARANS, were developed to scale up k-medoids by randomly sampling candidate swaps.
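
To make the swap step concrete, here is a deliberately naive NumPy sketch of a PAM-style k-medoids on made-up data; the quadratic cost of evaluating every swap is exactly the scalability problem that CLARANS addresses, and dedicated packages (e.g. scikit-learn-extra) provide production implementations.

```python
import numpy as np

def pam(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)        # step 2: initial medoids

    def cost(meds):
        # Step 3 (implicitly): each point joins its closest medoid.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(n_iter):
        best_swap, best_cost = None, best
        # Step 4: consider swapping each medoid with each non-medoid point
        # and remember the swap that reduces the cost the most.
        for i in range(k):
            for h in range(len(X)):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h
                c = cost(candidate)
                if c < best_cost:
                    best_swap, best_cost = candidate, c
        if best_swap is None:            # step 5: stop when no swap improves the cost
            break
        medoids, best = best_swap, best_cost
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.vstack([np.random.randn(15, 2), np.random.randn(15, 2) + 4])
medoids, labels = pam(X, k=2)
```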

4) Expectation Maximization (EM) clustering:

EM clustering is a probabilistic, model-based clustering method and is essentially a generalization of k-means. In k-means, we use only the centroid as the representative of a cluster. In EM clustering, each cluster is described by a probability density function (typically a Gaussian distribution), so a cluster is represented by the mean and the covariance matrix of the points in it. The covariance matrix captures the spread of the cluster in different directions.

In EM clustering, every point is softly assigned to all clusters through a probability distribution, unlike in k-means, where each point is assigned to exactly one cluster (hard assignment).

In theory, EM clustering assumes that data points have been generated independently from identical distributions (the i.i.d. assumption). Under this assumption, EM uses maximum likelihood estimation to find the optimal parameters for assigning data points to clusters. The steps in the EM clustering algorithm (for a Gaussian mixture with K components) are:

1) Initialize the means \mu_k, covariances \Sigma_k, and mixing coefficients \pi_k of the K components.

2) E-step: Evaluate the responsibilities using the current parameters,

\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

and compute the effective number of points in each cluster, N_k = \sum_{n=1}^{N} \gamma(z_{nk}).

3) M-step: Maximize the expected log likelihood by re-estimating the parameters with the current responsibilities,

\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad
\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T, \qquad
\pi_k^{new} = \frac{N_k}{N}.

4) Evaluate the log likelihood,

\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),

and check for convergence; if it has not converged, go to step 2.

Intuitively, clusters that have a higher prior probability should have a higher probability of generating the sampled points. In comparison to k-means, EM clustering provides more information due to the soft assignment of data points to clusters, and it can model non-spherical clusters, unlike k-means. However, the initialization phase is harder because of the larger number of parameters, and the algorithm is significantly slower than k-means due to the many iterations required to optimize the objective function.
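
In practice, EM clustering with Gaussian components is usually run through a library; the sketch below fits scikit-learn’s GaussianMixture to made-up, elongated blobs to show the soft assignments (responsibilities) described above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two elongated, non-spherical blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], 100),
               rng.multivariate_normal([6, 0], [[1.0, -0.7], [-0.7, 1.0]], 100)])

# Fit a 2-component Gaussian mixture with EM.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

resp = gmm.predict_proba(X)   # soft assignments: P(cluster | point) for every point
hard = gmm.predict(X)         # hard labels, if a single assignment is needed
print(gmm.weights_)           # mixing coefficients (the cluster priors pi_k)
print(gmm.means_)             # cluster means; gmm.covariances_ holds the covariances
```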

5) Non-negative Matrix Factorization (NMF):

NMF is a dimensionality reduction method suited to approximating sparse, non-negative matrices. At first glance, this method may not appear related to clustering, but the connection becomes clear when we dig deeper. NMF has been used in applications ranging from image recognition to recommender systems. Images are an example of a non-negative matrix, where each element’s value ranges between 0 and 255. The utility matrix in a recommender system is another example of a non-negative sparse matrix. For instance, Figure 4 shows each user’s ratings for different movies:

Figure 4) Utility matrix showing each user’s rating for each movie

Given a data matrix X with m rows and n columns, where every element of X is greater than or equal to zero, NMF searches for two smaller matrices, W (with m rows and k columns) and H (with k rows and n columns), such that

X \approx W H,

where all elements of W and H are also greater than or equal to zero. In the above example, W represents the user matrix and H represents the item matrix.

Figure 5 shows this multiplication:

Figure 5) Approximation of X using W and H matrices

Matrix W is called the basis matrix, representing clusters of attributes, and matrix H is called the coefficient matrix. The intention here is to categorize the data in X into k factors. In practice, the factorization rank k is chosen such that

k \ll \min(m, n),

so that W and H together are much smaller than X.

NMF is essentially an optimization problem of the form

\min_{W \ge 0, \, H \ge 0} D(X, WH),

where D is the loss function and can take different forms. The most common choice is the squared Frobenius norm,

D(X, WH) = \lVert X - WH \rVert_F^2 = \sum_{i,j} \left( X_{ij} - (WH)_{ij} \right)^2.

In this case, we are searching for the values of W and H that minimize the difference between X and its approximation WH.

All methods to solve NMF problems are iterative. These include alternating least squares, multiplicative update, projected gradient descent, active set, optimal gradient, and block principal pivoting.

Generally, the steps to solve NMF are as follows:

1) Set random values for matrices W and H.
2) Update the values of W and H in each iteration so that the product WH gets as close as possible to X.

In Alternating Least Squares (ALS), these steps are as follows (a minimal sketch follows the list):

1) Initialize values of W and H.
2) Keeping W fixed, solve for the values of H that minimize \lVert X - WH \rVert_F^2.
3) Keeping H fixed, solve for the values of W in the same way.
4) Repeat steps 2 and 3 until convergence.
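
Below is a minimal NumPy sketch of this alternating least squares loop on made-up data; for simplicity it enforces nonnegativity by clipping negative entries after each unconstrained least-squares solve, which is a common shortcut rather than a full nonnegative least-squares solver.

```python
import numpy as np

def nmf_als(X, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Step 1: random nonnegative initialization of W and H.
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Step 2: with W fixed, solve the least-squares problem for H,
        # then clip negative entries to keep H nonnegative.
        H = np.linalg.lstsq(W, X, rcond=None)[0].clip(min=0)
        # Step 3: with H fixed, solve for W (via the transposed system), then clip.
        W = np.linalg.lstsq(H.T, X.T, rcond=None)[0].T.clip(min=0)
    return W, H

# Toy nonnegative "utility" matrix (rows: users, columns: items), rank k = 2.
X = np.abs(np.random.randn(6, 5))
W, H = nmf_als(X, k=2)
error = np.linalg.norm(X - W @ H, "fro")   # Frobenius reconstruction error
```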

Even though NMF has received attention because it automatically extracts meaningful clusters from data, there are some drawbacks:

  • Exact NMF is NP-hard, which means it is unlikely to be solvable in polynomial time.
  • Finding the global minimum is not guaranteed; iterative algorithms only converge to local minima (although this can still be useful in some applications).
  • Choosing the best factorization rank is non-trivial.

Conclusion

We reviewed a few clustering algorithms, but there are many more depending on the problem we wish to solve. For example, subspace clustering and projected clustering methods have been developed to treat clustering issues associated with high-dimensional data. Semi-supervised clustering is another active area of research to approach problems where some constraints are provided by supervision (or domain knowledge). Constrained k-means clustering or probabilistic models for semi-supervised clustering based on Hidden Markov Random Fields are examples of semi-supervised methods. Cluster ensembles (consensus clustering or multi-view clustering) is another approach to combine multiple clustering techniques to create a more comprehensive result. Model-based ensembles and data-based ensembles are the two main methods in this area.

Clustering validation is hard. If ground-truth class labels are available, clustering can be validated externally by constructing a confusion matrix, measuring purity or entropy, or calculating the Rand index. However, ground-truth labels are not always available. Internal validation methods, in contrast, include measuring the compactness of the clusters and their separation from each other, calculating Silhouette coefficients to measure how similar an object is to its own cluster compared to other clusters, or computing the data likelihood for probabilistic model-based methods.
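
As a small, self-contained example of these validation measures, the snippet below clusters made-up data with k-means and computes the Silhouette coefficient (internal) and the adjusted Rand index (external); the data and the ground-truth labels are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Toy data with known ground-truth labels (two blobs).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal validation: mean Silhouette coefficient (no ground truth needed).
print(silhouette_score(X, labels))

# External validation: (adjusted) Rand index, usable when true labels exist.
print(adjusted_rand_score(y_true, labels))
```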

Overall, each algorithm captures some aspects of the clusters, thus, different clustering algorithms can lead to substantially different results for the same dataset. It is generally hard to tell which clustering algorithm is the best and there is no ‘true’ clustering solution.

References

  • This blog post is influenced by CMPT 741 lectures, Data Mining course, taught by Prof. Martin Ester, SFU, Fall 2019.
  • Hierarchical Clustering: Jain, Anil K., and Richard C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.
  • DBSCAN: Ester, Martin, et al. “A density-based algorithm for discovering clusters in large spatial databases with noise.” Kdd. Vol. 96. №34. 1996.
  • K-means: MacQueen, James. “Some methods for classification and analysis of multivariate observations.” Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Vol. 1. №14. 1967.
  • Probabilistic modeling: Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. “Maximum likelihood from incomplete data via the EM algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39.1 (1977): 1–22.
  • Non-negative matrix factorization: Lee, Daniel D., and H. Sebastian Seung. “Algorithms for non-negative matrix factorization.” Advances in neural information processing systems. 2001.
  • A. C. Turkmen, “A review of nonnegative matrix factorization methods for clustering,” arXiv Prepr. arXiv1507.03194, 2015.
  • Semi-supervised clustering: Basu, Sugato, Mikhail Bilenko, and Raymond J. Mooney. “A probabilistic framework for semi-supervised clustering.” Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 2004.
  • Subspace clustering: Agrawal, Rakesh, et al. “Automatic subspace clustering of high dimensional data for data mining applications.” Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998.
