Intuition Behind K-Means Clustering

Rohan Roney · Published in Analytics Vidhya · Apr 29, 2021

Clustering, in general, is a technique people use throughout their lives, and you can find countless examples of it around the world. It could be a group of people sharing a table at a restaurant, or a food store that displays food items in different sections, be it meat, vegetables and so on.

Example of Clustering

Similarly, in the world of machine learning, clustering is a very common solution to a variety of unsupervised problems prevalent in the industry. It helps make sense of unstructured data by forming clusters in which similar data points sit in the same cluster and dissimilar data points sit in different clusters. A major real-life use of clustering comes from recommender systems, such as those used by companies like Amazon to show you what you might like to buy based on your purchase history, or OTT platforms like Netflix, which recommend content based on your preferences and watch history.

Popular Method of Clustering is K-Means

What is K-Means Clustering?

Let us first try to understand this using a day-to-day example. Suppose you recently moved into a hostel and are expected to do your laundry on your own. At the end of the month you realise that you don't have any good clothes left, and there's no other way but to wash the big pile that has accumulated by now. Having never done it before, you immediately call your mother and ask for her advice. She explains that each variety of clothes has its own specifications: some are washed in hot water, some in cold water, some need air-drying in particular, and so on.

A collection of clothes

In simple terms, K-Means is one of the simplest clustering algorithms; its primary goal is to group similar elements or data points into a cluster. Here, k is an input that represents the number of clusters you want, i.e., setting k=4 will fetch you exactly 4 clusters from the data.

To continue our analogy and understand how it translates to clustering, let's assume that your mother asks you to divide the pile into 3 clusters (here, k=3). You pick 3 different clothes at random, and these become the starting points for the creation of the clusters. Now you go through the massive pile of clothes and try to fit each item into one of the 3 clusters based on attributes like water temperature, drying temperature and colour. Do understand that the grouping you end up with depends on the starting points you take. Technically, each starting point is known as the centroid of its cluster. The items you try to fit will not be identical to these starting points, so pick the best match you can from amongst the 3 options.

Clusters of clothes each having a different set of attributes

On what basis is this similarity understood?

Simple. You place each clothing item you pick from the pile closer to or farther from a starting item based on how similar they are, i.e., the more similar the item, the closer it will be to that starting point. At the end of this activity you realise that you have 3 different clusters and can start your washing activity. But wait…
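In data terms, this "closeness" is measured with a distance function, most commonly the Euclidean distance. Here is a minimal sketch in Python; the garments' attribute values are purely hypothetical:

import numpy as np

# Hypothetical attributes for one garment:
# (wash temperature in degrees C, drying temperature in degrees C)
item = np.array([40.0, 60.0])

# The three randomly picked starting items (centroids).
centroids = np.array([
    [30.0, 50.0],   # cluster 0
    [60.0, 80.0],   # cluster 1
    [90.0, 30.0],   # cluster 2
])

# Euclidean distance from the item to each centroid;
# the item joins the cluster with the smallest distance.
distances = np.linalg.norm(centroids - item, axis=1)
print("assigned to cluster", distances.argmin())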

This is an iterative process. Since our starting points were chosen more or less at random, we need to make sure that they end up being the best they can be. For this, we recompute the centroid value for all 3 clusters again and again till such a point that

  1. there are no shifts of clothing items between the clusters, or
  2. you have done a considerable number of iterations to make the best clusters you could.

Finally, if you are a perfectionist, you will have to repeat the above experiment for k = 4, 5, … and so on until you get the best clusters possible.
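In practice, a standard way to compare different values of k is the elbow method: run K-Means for several values of k and watch how the total within-cluster squared distance (the inertia) falls. A minimal sketch using scikit-learn; the synthetic dataset and the range of k are illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D data whose number of natural groups we pretend not to know.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Record the inertia for each candidate k.
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1))

# Look for the "elbow": the k beyond which the inertia stops
# dropping sharply is usually a good choice.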

Algorithm behind K-Means

  1. Cluster the data into k groups, where the value of k is predefined.
  2. Select k points at random (they may or may not come from the dataset); these are known as the cluster centres or centroids.
  3. Assign each data point to its closest centroid based on a distance function. This step is called the Expectation step.
  4. Recalculate the centre point/mean of each cluster and make it the new centroid. This is the Maximisation step.
  5. Repeat the Expectation-Maximisation steps until either no data points move from their previous clusters or the centroids do not change from the previous iteration. A from-scratch sketch of these steps follows the figure below.

Process of K-Means Clustering
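These Expectation and Maximisation steps map directly to code. Below is a minimal from-scratch sketch in plain NumPy; the function name k_means and its defaults are illustrative:

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """A minimal from-scratch K-Means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points from the dataset as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Expectation step: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Maximisation step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: cluster 200 random 2-D points into 3 groups.
X = np.random.default_rng(1).normal(size=(200, 2))
labels, centroids = k_means(X, k=3)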

A practical implementation of the above algorithm for 2-D data in Python is given below. Let's have a look.
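What follows is a minimal sketch using scikit-learn; the 2-D dataset is synthetic, generated with make_blobs as a stand-in for real data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with four natural groupings.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means with k=4 and get a cluster label for every point.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Left: the raw dataset; right: the same points coloured by cluster,
# with the final centroids marked.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], s=15)
ax1.set_title("Raw dataset")
ax2.scatter(X[:, 0], X[:, 1], c=labels, s=15)
ax2.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="x", c="red", s=100)
ax2.set_title("K-Means clusters (k=4)")
plt.show()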

A raw dataset represented in the form of scatterplots
The dataset split into clusters using K-Means with k=4

Certain disadvantages observed while using this algorithm are as follows:

  • It requires the number of clusters (k) to be specified in advance.
  • It cannot handle noisy data and outliers well.
  • It is not suitable for identifying clusters with non-convex shapes.

Conclusion

K-Means clustering is perhaps the most popular clustering algorithm and is usually the first thing practitioners apply when tackling a clustering task, in order to learn about the structure of a dataset. The objective of K-Means is to group data into non-overlapping subgroups, and it does an excellent job when the clusters have roughly spherical shapes. To use it expertly, it is good to know the assumptions behind each algorithm or technique, so that you have a sound intuition about the strengths and weaknesses of every method. This will help you decide when to use each method and under what conditions.

This blog gave you an overall understanding of the concept of K-Means and the intuition behind its algorithm.


Rohan Roney

Avid Reader, Blog writer and a Data Science Trainee at Almabetter