K-means++: A Careful Seeding Technique

Sourav Dash
3 min read · Apr 12, 2020

What we call chaos is just patterns we haven’t recognized. What we call random is just patterns we can’t decipher.

Kmeans clustering

As we all know, the k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster.

It only gives an approximate solution; finding an exact solution to this problem is NP-hard.

Because the algorithm improves an arbitrary clustering only until it reaches a local optimum, the result can be arbitrarily worse than the optimal solution. This situation can arise from the random initialization of the centroids.

Solution by local optimum with random initialization of centroids

In the above example, you can see how a random initialization of the centroids leads to improper clustering.

k-means++ is an improvement for initial seeding

The intuition behind this approach is that spreading out the k initial cluster centers is a good thing.

The algorithm is as follows:

1. Choose one center uniformly at random from among the data points.

2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.

3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to the square of D(x).

4. Repeat steps 2 and 3 until k centers have been chosen.
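The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation; the function name `kmeans_pp_init` is introduced here for the example:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: pick k spread-out centers from the rows of X.

    X : (n_points, n_features) array, k : number of clusters.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: choose the first center uniformly at random.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its nearest chosen
        # center; this is exactly D(x)^2.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    # Step 4: stop once k centers have been chosen.
    return np.array(centers)
```

Note that a point coinciding with an existing center has D(x) = 0, so it can never be drawn again; the weighting naturally pushes new centers away from those already picked.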

Put simply: first we pick one centroid at random. Then we repeatedly pick the remaining k-1 points, each time favoring points whose distance from their nearest already-chosen centroid is large. The result is a set of initial centers that are spread as far apart from each other as possible.

centroid initialization technique

In the above, the data points and the number of clusters are denoted by X and n, respectively.

Center initialization with k-means++

In the above, the initialized centers are selected to be as far from each other as possible.

After applying k-mean clustering

Because k-means++ centers are distributed over the data, the final clustering is more likely to have a lower cost (WCSS, the within-cluster sum of squares) than with random initialization.

Because plain k-means uses random initialization, it gives different results depending on the initial set of centers.

Because k-means++ picks the initial centers smartly, Lloyd's algorithm takes fewer iterations to converge and gives a better result than random initialization.
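To make the cost comparison concrete, here is a minimal sketch of Lloyd's iterations in NumPy that reports the WCSS of the final clustering. The function name `lloyd` and the toy blob data are introduced here for illustration:

```python
import numpy as np

def lloyd(X, centers, iters=20):
    """Plain Lloyd's iterations from the given initial centers.

    Returns the final centers and the WCSS of the resulting clustering.
    """
    for _ in range(iters):
        # Assignment step: label each point with its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (keep the old center if a cluster went empty).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(len(centers))])
    # WCSS: total squared distance from every point to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return centers, d2.min(axis=1).sum()

# Toy data: three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])

# Random seeding: k rows drawn uniformly at random.
random_init = X[rng.choice(len(X), 3, replace=False)]
_, wcss_random = lloyd(X, random_init)
```

Feeding `lloyd` centers chosen by k-means++ seeding instead of `random_init` typically converges in fewer iterations and lands at an equal or lower WCSS, since no initial centers start inside the same blob.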

Reference

Arthur, D. and Vassilvitskii, S., "k-means++: The Advantages of Careful Seeding" — https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf
