Understanding K-means Clustering In Depth

Ipsita Shee
Published in Analytics Vidhya
4 min read · Dec 13, 2020
Photo by Pierre Bamin on Unsplash

In this article, we are going to see how K-means clustering actually works. I have taken a table of sample data points with their X and Y coordinates. We will assign these data points to a specified number of clusters and compute each cluster's centroid by doing the calculations manually.

Suppose we have the following 6 data points (as shown in the diagram).

Sample data points. (Image source: Author)

For K-means clustering, we first decide the value of “K”, the number of clusters. Suppose here we choose to assign the given data points to two clusters, that is, we choose K = 2.

Initially, we will take data point number 1 (185,72) and data point number 2 (170,56) as clusters “K1” and “K2” respectively.

“K1” and “K2” each contain only one point at the moment, so that single point is also the cluster's centroid. The values of the centroids are given in the table below.
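As a sketch of this starting state (the variable names are my own, not from the article, and I use (185, 72) for point 1, the value the later centroid calculations imply):

```python
# Sample data points from the article's table (point 1 .. point 6).
points = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]

# K = 2: the first two points seed the clusters, so each initial
# centroid is simply the single point its cluster contains.
clusters = {"K1": [points[0]], "K2": [points[1]]}
centroids = {"K1": points[0], "K2": points[1]}
```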

Image Source: Author

Now we will find the Euclidean distance between each data point and the two centroids. A data point will belong to the cluster whose centroid lies closer to it.

The formula for calculating Euclidean distance, d = √((x − a)² + (y − b)²), where (x,y) are the values of the initial centroids and (a,b) are the values of the data points. (Image source: Author)
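A direct translation of the formula into code (a hypothetical helper, not code from the article):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 2-D points p = (x, y) and q = (a, b)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Distance of point 3 (168, 60) from the two initial centroids:
d1 = euclidean((168, 60), (185, 72))  # from K1's centroid
d2 = euclidean((168, 60), (170, 56))  # from K2's centroid
```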

Let us start calculating!

Our first data point, point 1 (185,72), is the only point in K1 and therefore also the centroid of cluster 1 (K1). The same can be said for our second point (170,56) with respect to cluster 2 (K2).

Now we will calculate the distance of point 3 from K1 and K2, and tabulate the results.

Calculating Euclidean Distances. (Image source: Author)

Since the distance of data point number 3 from K2 is less than its distance from K1, it will belong to the second cluster, which is K2.

Image source: Author

The cluster K2 now has two data points: point 2 (170,56) and point 3 (168,60). Now we will recalculate the centroid for K2.

Image source: Author

The new centroid for K2 is (169,58).
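The new centroid is the coordinate-wise mean of the cluster's points. A minimal sketch (helper name is my own):

```python
def centroid(cluster):
    """Coordinate-wise mean of a list of 2-D points."""
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    return (sum(xs) / len(cluster), sum(ys) / len(cluster))

k2 = [(170, 56), (168, 60)]   # point 2 and point 3
new_centroid = centroid(k2)   # (170 + 168) / 2 = 169, (56 + 60) / 2 = 58
```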

Now we will find the distance of data point 4 from the centroids of K1 and K2 (using K2's new centroid).

Image source: Author

Since the distance of data point number 4 from K1 is less than its distance from K2, it will belong to the first cluster, which is K1.

Image source: Author

The cluster K1 now has two data points: point 1 (185,72) and point 4 (179,68). Now we will recalculate the centroid for K1.

Image source: Author

The new centroid for K1 is (182,70).

Now we will calculate the distances of data point 5 (182,72) from the two centroids.

Image source: Author

We can clearly see that data point 5 belongs to K1. Now we will recalculate the centroid for K1.

Image source: Author

The new centroid for K1 is (182,71). Now we will calculate the distance for our sixth and last point (188,77).

Image source: Author

We can see that data point 6 belongs to the first cluster, that is K1. Now we recalculate the centroid for the K1 cluster.

Image source: Author

The new centroid for K1 is (185,74).
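One way to reproduce the article's centroid values exactly is a running update that averages the previous centroid with the newly added point. Note that with three or more points this weights earlier points differently than a plain mean would; the sketch below (helper name is my own) simply mirrors the arithmetic in the tables above:

```python
def update(old_centroid, new_point):
    """Average the previous centroid with the incoming point."""
    return ((old_centroid[0] + new_point[0]) / 2,
            (old_centroid[1] + new_point[1]) / 2)

c = (182, 70)             # K1 centroid after points 1 and 4
c = update(c, (182, 72))  # point 5 joins K1 -> (182.0, 71.0)
c = update(c, (188, 77))  # point 6 joins K1 -> (185.0, 74.0)
```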

The final clusters and centroids are as shown in the table below.

Image source: Author

Let us summarize the steps.

Image source: Author
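For completeness, the standard batch version of these steps (often called Lloyd's algorithm) reassigns every point to its nearest centroid and then recomputes each centroid from all of its members, repeating until the clusters stop changing. Because it recomputes from all members rather than updating one point at a time, its final K1 centroid differs slightly from the sequential values worked out above. A minimal sketch, with function and variable names of my own:

```python
import math

def kmeans(points, centroids, iters=10):
    """Plain Lloyd's algorithm: assign each point to its nearest
    centroid, recompute centroids as cluster means, repeat."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            for cl in clusters if cl  # skip empty clusters
        ]
    return clusters, centroids

points = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
clusters, centroids = kmeans(points, [points[0], points[1]])
```

On this data the batch version converges to the same two groups as the manual walkthrough: points 1, 4, 5, and 6 in one cluster and points 2 and 3 in the other, with the second cluster's centroid at (169, 58) as before.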

I hope this helps you gain a better understanding of K-means clustering.

Before You Go

Thanks for reading! If you want to get in touch with me, feel free to reach me on ipsitashee4@gmail.com or my LinkedIn Profile.
