K-Means Clustering

Amit Ranjan
Published in Analytics Vidhya
4 min read · Nov 28, 2020


In this article we will see what K-Means Clustering means, walk through the steps of the algorithm with a small mathematical example, and look at its applications.

What is Clustering?
Clustering is an unsupervised learning technique in Machine Learning that groups similar data points together into clusters.

Let’s understand this with an example:
Suppose you visit a supermarket. How do you think the items are arranged there?

Well, all the items that are similar in nature have been put together so that customers can find them easily.

But suppose you have a pile of notes on different topics in a room, and your mentor asks you to group all the notes by their content. How much time do you think it would take you to group them, given a pile this big?

Pile of Notes

This can easily be done with clustering techniques. They will group all the notes by topic, so that each shelf holds notes with similar content.

Each shelf has similar content

What is K-Means Clustering?
K-Means Clustering is an unsupervised learning technique that lets us discover categories, subgroups or clusters in our dataset. It works on data points that are not explicitly labelled: it helps us find patterns in them, group them by similarity and make better decisions.

Seems fine. There is another question though!

What is K in the K-Means?
K in K-Means is the number of clusters the algorithm groups the data into. We choose this number before running the algorithm.

That was the basic idea of what the algorithm means.

Steps of K-Means Algorithm:

  1. Choose the number of clusters, K.
  2. Select K random points, or K data points from the dataset, as the initial centroids of the clusters.
  3. Assign each data point to its closest centroid to form K clusters.
  4. Repeat the following until the centroids settle and the assignment of data points no longer changes:
    a. Compute the distance between each data point and every centroid.
    b. Assign each data point to the cluster whose centroid is closest.
    c. Recompute the centroid of each cluster by taking the average of all data points in that cluster.
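The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production implementation: the function name `kmeans` and the parameters `max_iters` and `seed` are assumptions made for this sketch, and it starts from K data points picked at random from the dataset.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Group `points` (a list of tuples) into k clusters.

    Returns (centroids, labels). Names and defaults are illustrative.
    """
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Steps 3, 4a and 4b: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4c: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

Calling `kmeans(points, 2)` on a list of `(height, weight)` tuples would return two centroids and one cluster label per point. Because the starting centroids are random, a different `seed` can lead to a different final grouping.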

I know you may be lost. But don’t worry, let’s understand each of the steps better with the help of a simple example:

Suppose you have Height and Weight as columns in your dataset, and you want to group the records into clusters, for example to tell different age groups apart by their height and weight.

Height in cm and Weight in Kgs

Step 1: Let’s choose K = 2 as number of clusters.

Step 2: In this step we can take either 2 random points or 2 data points from our dataset as the centroids of the clusters.

Let’s take the first two data points as the two centroids, C1(185, 72) and C2(170, 56).

Step 3 & 4: Let’s take point P3(168, 60). Which cluster will it go into?
We have to calculate the distance from P3 to each centroid and assign the point to the cluster whose centroid is nearest.

Distance between C1 and P3, D13
= sqrt( (C1.X - P3.X)² + (C1.Y - P3.Y)² )
= sqrt( (185 - 168)² + (72 - 60)² )
= sqrt( 289 + 144 )
= sqrt( 433 )
= 20.81

Distance between C2 and P3, D23
= sqrt( (C2.X - P3.X)² + (C2.Y - P3.Y)² )
= sqrt( (170 - 168)² + (56 - 60)² )
= sqrt( 4 + 16 )
= sqrt( 20 )
= 4.47
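The same arithmetic can be checked in Python. The sketch below recomputes both Euclidean distances with the standard library's `math.dist`; the variable names are just labels chosen for this example.

```python
import math

c1 = (185, 72)   # centroid C1 (height in cm, weight in kg)
c2 = (170, 56)   # centroid C2
p3 = (168, 60)   # point to assign

d13 = math.dist(c1, p3)  # sqrt(17**2 + 12**2) = sqrt(433)
d23 = math.dist(c2, p3)  # sqrt(2**2 + (-4)**2) = sqrt(20)

print(round(d13, 2), round(d23, 2))  # 20.81 4.47

# P3 joins the cluster whose centroid is closer.
nearest = "C1" if d13 < d23 else "C2"
print(nearest)  # C2
```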

So point P3 goes into cluster C2, because D23 is smaller than D13. Now we have to take the average of centroid C2 and the newly added point, and assign it back as the centroid. This will be our new C2.

C2 = ( (170 + 168)/2 , (56 + 60)/2 )
= ( 169 , 58 )
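As a quick check, the centroid update is just a coordinate-wise mean. The helper `mean_point` below is a hypothetical name used only for this sketch. Note that averaging the old centroid with one new point equals the mean of the whole cluster only while the cluster holds just these two points; with more members, the mean is taken over all of them.

```python
def mean_point(points):
    """Coordinate-wise average of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


# Cluster 2 currently holds the original C2 (170, 56) and P3 (168, 60).
new_c2 = mean_point([(170, 56), (168, 60)])
print(new_c2)  # (169.0, 58.0)
```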

So far we have two clusters with the following data points:
Cluster 1: C1
Cluster 2: C2, P3

In the same way, we have to group all the remaining data points into these two clusters, so that the final clusters are:
Cluster 1: C1, P4, P5, P6, P7, P8, P9, P10, P11, P12
Cluster 2: C2, P3

Now we have to check whether any data points get reassigned to a different cluster. This can be checked by repeating the steps above. In our case, these are the final clusters.
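The reassignment check can be sketched as follows. This is a partial illustration under stated assumptions: it uses only the three points whose coordinates appear in the text (the remaining points are not reproduced here), takes the centroids as (185, 72) and (169, 58), and the helper name `assign` is hypothetical.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


points = [(185, 72), (170, 56), (168, 60)]   # C1, C2 and P3 from the example
centroids = [(185, 72), (169, 58)]           # C1 and the updated C2
previous = [0, 1, 1]                         # cluster index of each point so far

current = assign(points, centroids)
changed = current != previous
print(current, changed)  # [0, 1, 1] False
```

If `changed` were `True`, the centroids would be recomputed and the assignment step repeated until the labels stop changing.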

Applications of K-Means Clustering

  • Market segmentation
  • Document Clustering
  • Image segmentation
  • Image compression
  • Customer segmentation
  • Analyzing trends in dynamic data
