Clustering: two approaches

Margarita Arutiunova · CodeX · Apr 29, 2022

What is clustering? It is the procedure of finding distinct groups within the data, if such groups exist. These groups are called clusters, and finding them is a form of unsupervised modeling.

We use unsupervised modeling in Machine Learning (ML) when we need to find general patterns in the data but have no particular target — only a set of independent variables.

Note: it is crucial that clustering is an unsupervised ML method, because supervised ML methods aim to predict the value of a known target (class, outcome, label, or dependent variable) from the independent variables.

How to cluster? Two approaches exist:

1. Hierarchical clustering

This is the process in which we start with every data point as its own cluster and repeatedly merge the closest clusters until only a single one remains. The result is a cluster hierarchy.

Some experts state that the best way to represent hierarchical clustering is a dendrogram, a tree-like visualization. A dendrogram has branches whose lengths correspond to the distances at which clusters merge.

Image by author

This visualization has a practical advantage: to some extent it suggests how many clusters the data contains, because we can read the number off by choosing where to cut the tree.

We set the cut-off at the point where the distances between merged clusters begin to grow more rapidly. To build the hierarchy in the first place, we also have to define the distance between two clusters of data instances. We can choose different types of linkage:

  • complete linkage (the farthest pair of points);
  • single linkage (the closest points pair);
  • average linkage (average points distance);
  • Ward linkage (a measure based on intra-cluster variance).

Many scholars favor the last one, Ward linkage. They have some principled arguments for that, but we will not go deeply into them here because it is a separate topic for discussion. If you would like to know more about this detail, let me know in the comments section of this article.
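To make the procedure concrete, here is a minimal sketch of hierarchical clustering with SciPy on synthetic data; the toy blobs, the Ward linkage choice, and the cut into three clusters are illustrative assumptions, not prescriptions from this article.

```python
# A minimal sketch of hierarchical clustering with Ward linkage (SciPy).
# The data here is synthetic and purely illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Three toy blobs in 2D
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Build the hierarchy; "complete", "single" or "average" can be used
# instead of "ward" to try the other linkages listed above.
Z = linkage(X, method="ward")

# Dendrogram: branch lengths reflect the distances at which clusters merge.
dendrogram(Z)
plt.title("Dendrogram (Ward linkage)")
plt.show()

# Cut the tree into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels[:10])
```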

2. K-means clustering

K-means clustering is another approach to clustering. To start, we must set the number of clusters (k). The method places k centroids at random positions and assigns each data point to the closest centroid; these assignments form the initial clusters.

Image by author

The algorithm then moves each centroid to the center of its cluster, reassigns every data point according to the new centroid positions, and repeats these two steps until the assignments no longer change.
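As a concrete illustration, here is a minimal K-means sketch with scikit-learn; the synthetic data and the choice of k = 3 are assumptions for demonstration, not values from this article.

```python
# A minimal K-means sketch with scikit-learn on synthetic data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# n_init controls how many random restarts are performed; the best run
# (lowest within-cluster sum of squares) is kept.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])       # cluster assignment of the first points
print(km.cluster_centers_)   # final centroid positions
```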

Difference between methods

The two methods described above (hierarchical and K-means clustering) differ in several crucial aspects.

K-means clustering computes Euclidean distances between coordinates, while hierarchical clustering starts from a distance matrix that does not necessarily correspond to observable coordinates.

The advantage of hierarchical clustering is that it provides a representative visualization, which helps in estimating the quality of the clustering and in determining the number of clusters. K-means does not offer such a visualization.

The K-means method’s advantage is speed. It is usually faster, except when the randomly initialized centroids end up in suboptimal positions; to get the best result, we restart the procedure multiple times and keep the best run. Hierarchical clustering, by contrast, may be slow because it consumes a lot of memory.

Note: the result of hierarchical clustering is a hierarchy, so we choose the number of clusters after running the procedure. With K-means the number of clusters is also our choice, but we have to set it in advance.

In practice, analysts try several numbers of clusters and choose the one with the highest silhouette score, as sketched below. That is a separate topic to discuss; if you are interested in it, I can develop this into another article.
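A rough sketch of that selection loop might look like the following; the candidate range of 2 to 6 clusters and the make_blobs dataset are arbitrary illustrations.

```python
# Sketch: pick the number of clusters with the highest silhouette score.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"k={k}: silhouette={score:.3f}")
    if score > best_score:
        best_k, best_score = k, score

print("best k:", best_k)
```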

Thank you for reading! If you want to share your opinion or ideas, feel free to write in the comment section or reach out to me on LinkedIn with any suggestions or clarifications.

Have a nice day!

Margarita Arutiunova has experience in Marketing, Machine Learning & Data Science. She holds a Master’s degree in Management and Analytics for Business.