Understanding the K-Means Algorithm better

Saiyam Bhatnagar
Published in Analytics Vidhya
3 min read · Jul 21, 2020

Hello readers. K-Means clustering is undoubtedly one of the most popular unsupervised learning algorithms. It is used so frequently because of its strong yet simple statistical backbone. This story first explains the logic behind K-Means clustering, then brings up a practical drawback and a few suggestions for avoiding it.

K-Means is a non-deterministic algorithm: it relies on random choices, so it does not follow one fixed sequence of steps, and different runs can produce different results. Such algorithms usually have two steps: 1) a guessing step and 2) an assignment step. The K-Means algorithm follows the same pattern. It divides the data space into K clusters such that the total variance of all data points with respect to their cluster means is minimized.

Fig (1) Function to be minimized. S(i) are clusters.
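Since the figure itself is not reproduced here, the objective it shows can be written out in the standard formulation (using S(i) for the clusters and mu(i) for their means, as in the caption):

```latex
\operatorname*{arg\,min}_{S} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
```

That is, we seek the partition S = {S_1, ..., S_k} that minimizes the total squared distance of every point to the mean of its own cluster.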

However, the approach actually used does not involve multivariate calculus, as one might expect. Rather, it is iterative, with the two phases mentioned above. Guessing phase: randomly initialize k means (the Mu(k)s) in the data space. Then every data point X(i) (i from 1 to m) is assigned to the cluster whose mean it is closest to. Mathematically, this step reduces the within-cluster variance, and every point now belongs to a cluster. Next is the assignment step: each cluster mean Mu(k) is set to the mean of the data points currently in its cluster. These two steps are repeated until the means stop changing (or for a fixed number of iterations). Refer to the image below.

Fig(2) The K Means Algorithm
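The two phases described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name and seeding scheme are my own choices:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: random initialization (guessing phase),
    then alternating point-assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Guessing phase: pick k distinct data points as the initial means Mu(k)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point X(i) to the cluster whose mean is nearest
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each mean to the average of the points assigned to it
        # (keep the old mean if a cluster happens to be empty)
        new_means = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):  # converged: means stopped moving
            break
        means = new_means
    return means, labels
```

On two well-separated blobs of points, this loop converges in a handful of iterations, with each blob ending up in its own cluster.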

Now, like most non-deterministic algorithms, K-Means has a bad habit: every time you run it, it can give you different results. The situation gets even worse when you are unsure whether any modification to K-Means would actually improve the results. Refer to the image below to see how bad a K-Means result can sometimes get.

Fig(3) two different results of K-Means when K=3

We should understand that we cannot do much about this issue, since it is practically impossible to reason analytically about such a huge data space. Still, there are a couple of suggestions worth following.

1. How to choose the value of K?

One should rely on the problem statement for this. For example, in a tree-species classification problem, if one knows the number of possible species, and all of them appear in significant numbers in the data set, one can set K to the number of species.

2. How to be sure if the solution obtained is appropriate?

There is no universal approach to this issue. Rather, one should focus on initializing the cluster means with the best possible estimates; any reasonable statistical approach works here. For example, one could initialize a mean to the average of a few data points that one is sure belong to the same cluster. Alternatively, one can run the K-Means algorithm 10–100 times and keep the run that best minimizes the mathematical function given above (Fig 1).
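The second suggestion, running the algorithm several times and keeping the best run, can be sketched as follows. This is a self-contained illustration: each restart uses a different random initialization, and the run with the lowest objective value (the total within-cluster variance from Fig 1, often called "inertia") wins. The function name and restart count are my own choices:

```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10, n_iters=100):
    """Run K-Means n_restarts times with different random
    initializations; keep the run that minimizes the objective."""
    best_inertia, best_means, best_labels = np.inf, None, None
    for seed in range(n_restarts):
        rng = np.random.default_rng(seed)
        # Guessing phase with a fresh random initialization per restart
        means = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else means[j]
                            for j in range(k)])
            if np.allclose(new, means):
                break
            means = new
        # Objective from Fig 1: total squared distance to assigned means
        inertia = ((X - means[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best_means, best_labels = inertia, means, labels
    return best_means, best_labels, best_inertia
```

Even if a single unlucky initialization splits one blob in two and merges two others, one of the ten restarts will almost always land near the correct solution, and the inertia comparison picks it out automatically.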

These were a few insights into, and suggestions for, K-Means clustering.

Thank You. :-)
