K-Means Clustering Explained | Machine Learning

Ashwin Prasad
Published in Analytics Vidhya
5 min read · Sep 25, 2020

Before we begin with K-Means clustering, let us cover a few prerequisites:
1. What is clustering?
2. Euclidean distance
3. Finding the centre (mean) of multiple points
If you are already familiar with these, feel free to skip to the K-Means algorithm.

What is Clustering?

Figure 1.1

Clustering is nothing but grouping. Given some data, we have to find patterns in it and group similar data points together to form clusters. This is the basis of clustering, and it is done with the help of the Euclidean distance.

For example:
1. An athletic club might want to cluster its runners into 3 different clusters based on their speed (1 dimension).
2. A company might want to cluster its customers into 3 different clusters based on 2 factors: the number of items bought and the number of items returned (2 dimensions).

What is Euclidean Distance?

The distance between 2 points can be found with the help of the Euclidean distance formula: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2), where n is the number of dimensions. If the data is 2-dimensional and lies in a plane, n will be 2 and each point can be represented as (x, y). If it is 3-dimensional data, n will be 3 and each point can be represented as (x, y, z).
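
This formula translates directly into Python. A minimal sketch (the point values here are made up for illustration):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points of any dimension n."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# 2-dimensional points (x, y)
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
# 3-dimensional points (x, y, z)
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
```

The same function works for any number of dimensions because it simply sums the squared differences co-ordinate by co-ordinate.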

Finding Centre or Mean of Multiple Points

Figure 2.1

Now, consider the black points in figure 2.1. We need to find the centre of all the black points.

We know this is 2-dimensional data, as each point has an x and a y and is represented as (x, y). In order to find the centre, this is what we do:

1. Take the x co-ordinates of all the black points and compute their mean; let's call it x_mean.
2. Do the same for the y co-ordinates of all the black points and call that y_mean.
3. The centre point is then simply (x_mean, y_mean), which corresponds to the red polygon at the centre of the black points.
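
The steps above can be sketched in a few lines of Python (the points below are made-up stand-ins for the black points in figure 2.1):

```python
def centre(points):
    """Mean of a list of (x, y) points: (x_mean, y_mean)."""
    x_mean = sum(p[0] for p in points) / len(points)
    y_mean = sum(p[1] for p in points) / len(points)
    return (x_mean, y_mean)

# made-up stand-ins for the black points
black_points = [(1, 1), (3, 1), (1, 3), (3, 3)]
print(centre(black_points))  # (2.0, 2.0)
```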

With this, we have successfully completed the prerequisites for K-Means clustering.

K-Means Clustering

What is K-Means Clustering?

It is a clustering algorithm that groups data points with similar features together, with the help of the Euclidean distance.

How Does It Work?

Let’s take an example dataset for this purpose.

Figure 3.1

Figure 3.1 represents our data points. Now, let’s say we need to group this data into 2 categories. Just by looking at it, we can say that the data points with x < -2 can be grouped together, and those with x > -2 can be grouped together.

Similarly, if we want to group the data into 3 categories, we can say that the data on the left can be grouped together, the data in the middle can be grouped together, and the data on the right can be grouped together, as shown in figure 3.2. Initially, we don’t know how many clusters we should form from the given data; I’ll get to this later.

Figure 3.2

Sorry about the colour of the data points in the middle (it’s white, which almost matches the background).

Now, let’s implement K-Means on the given data:

1. Initialising the centroids: randomly initialise the centroids to some data points in the dataset (the number of centroids equals the number of clusters you want to make).
2. Mapping points to centroids: for every data point in the dataset, calculate which of the available centroids is closest to it, and consider that point part of that centroid’s cluster (you can give each centroid a different colour, and the data points belonging to a centroid take the same colour).
3. Moving centroids to the centre: for all the points that are part of a particular centroid, calculate their centre using the method shown above and move the centroid to that centre (do this for all the centroids).
4. Since the centroids have changed position, the data points need to be re-mapped to the centroids based on their new positions. So, repeat steps 2 and 3 until all the centroids stop moving.

Figure 3.4

If you still don’t understand, take a look at figure 3.4 and read the above 4 steps again.

It consists of 3 easy steps: initialise, map points to centroids, and move centroids to the mean of their points, repeating the last two until no changes occur.
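
The four steps above can be sketched as a minimal plain-Python implementation. The dataset below is a made-up stand-in for the points in figure 3.1, not the actual data:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_means(data, k, max_iters=100, seed=0):
    random.seed(seed)
    # Step 1: initialise the centroids randomly to k data points
    centroids = random.sample(data, k)
    clusters = []
    for _ in range(max_iters):
        # Step 2: map each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k), key=lambda i: euclidean(point, centroids[i]))
            clusters[nearest].append(point)
        # Step 3: move each centroid to the centre (mean) of its points
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty centroid where it is
        # Step 4: repeat until the centroids stop moving
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# made-up 2-D data with two obvious groups along the x-axis
data = [(-5, 0), (-4, 1), (-5, 2), (4, 0), (5, 1), (4, 2)]
centroids, clusters = k_means(data, k=2)
print(sorted(centroids))
```

Each iteration alternates between the mapping step and the moving step; the loop stops as soon as no centroid changes position.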

Now, if I perform this process on the data shown in figure 3.1, I end up with the following graph, where the black points are the centroids of each cluster.

Figure 3.5

Notice that this could be a little different from what we humans would anticipate. But that’s fine; this is how K-Means clustering works.

One More Note

As I have already said, we don’t know the exact number of clusters to choose in advance. But there is a technique called the Elbow Method with which we can approximate it. This will be covered in a separate blog post.

Figure 4.1

Figure 4.1 shows the resulting clusters if we chose the number of clusters to be 2 for the above data.

The initialisation of the centroids also has some say in the resulting clusters. Since we initialise the centroids randomly, it is good to run the algorithm multiple times, plot the results, and choose the majority-voted result.
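
Libraries such as scikit-learn do this for you (its KMeans estimator has an n_init parameter that re-runs the algorithm and keeps the best result). A plain-Python sketch of the same idea, keeping the run with the lowest inertia (the total squared distance from each point to its nearest centroid), might look like this; the dataset and function names are made up for illustration:

```python
import random

def kmeans_once(data, k, rng, iters=50):
    """One k-means run from a random initialisation; returns (centroids, inertia)."""
    centroids = rng.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged: centroids stopped moving
            break
        centroids = new
    # inertia: total squared distance from each point to its nearest centroid
    inertia = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
                  for p in data)
    return centroids, inertia

def kmeans_best_of(data, k, runs=10, seed=0):
    """Run k-means several times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    return min((kmeans_once(data, k, rng) for _ in range(runs)), key=lambda r: r[1])

# made-up 2-D data with two obvious groups
data = [(-5, 0), (-4, 1), (-5, 2), (4, 0), (5, 1), (4, 2)]
centroids, inertia = kmeans_best_of(data, k=2)
print(sorted(centroids), round(inertia, 2))
```

Picking the lowest-inertia run is the common automated stand-in for eyeballing multiple plots: a run that got stuck in a poor initialisation will have a visibly larger inertia.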

Conclusion

This might seem a little difficult to understand at the beginning, but it is one of the easiest algorithms out there, and cluster analysis is used in many areas, such as recommender systems, pattern recognition, and image processing.

Thank You
