How to Determine the Optimal K for K-Means?

Khyati Mahendru
Jun 17, 2019 · 5 min read
Image for post
Image for post

Introduction

The K-Means algorithm needs no introduction. It is simple and perhaps the most commonly used algorithm for clustering.

The basic idea behind k-means consists of defining k clusters such that total within-cluster variation (or error) is minimum.

I encourage you to check out the below articles for an in-depth explanation of different methods of clustering before proceeding further:

A cluster center is the representative of its cluster. The squared distance between each point and its cluster center is the required variation. The aim of k-means clustering is to find these k clusters and their centers while reducing the total error.

Quite an elegant algorithm. But there is a catch. How do you decide the number of clusters?

In this article, I will explain in detail two methods that can be useful to find this mysterious k in k-Means.

These methods are:

  1. The Elbow Method

We will use our own dataset generated by the code below for an illustration of the two methods:

This is how the data looks graphically:

Image for post
Image for post

Clearly, the dataset has 3 clusters. We will validate both our methods on this dataset.

The Elbow Method

This is probably the most well-known method for determining the optimal number of clusters. It is also a bit naive in its approach.

Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS becomes first starts to diminish. In the plot of WSS-versus-k, this is visible as an elbow.

Within-Cluster-Sum of Squared Errors sounds a bit complex. Let’s break it down:

  • The Squared Error for each point is the square of the distance of the point from its representation i.e. its predicted cluster center.

Let us implement this in Python using the sklearn library and our own function for calculating WSS for a range of values for k.

We obtain the following plot for WSS-vs-k for our dataset.

Image for post
Image for post

As expected, the plot looks like an arm with a clear elbow at k = 3.

Unfortunately, we do not always have such clearly clustered data. This means that the elbow may not be clear and sharp.

Image for post
Image for post
Source: bl.ocks.org/

For Dataset A, the elbow is clear at k = 3. However, this choice is ambiguous for Dataset B. We could choose k to be either 3 or 4.

In such an ambiguous case, we may use the Silhouette Method.

The Silhouette Method

The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).

Source: Wikipedia

The range of the Silhouette value is between +1 and -1. A high value is desirable and indicates that the point is placed in the correct cluster. If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.

The Silhouette Value s(i) for each data point i is defined as follows:

Image for post
Image for post
Source: Wikipedia

Note: s(i) is defined to be equal to zero if i is the only point in the cluster. This is to prevent the number of clusters from increasing significantly with many single-point clusters.

Here, a(i) is the measure of similarity of the point i to its own cluster. It is measured as the average distance of i from other points in the cluster.

Image for post
Image for post
Source: Wikipedia

Similarly, b(i) is the measure of dissimilarity of i from points in other clusters.

Image for post
Image for post
Source: Wikipedia

d(i, j) is the distance between points i and j. Generally, Euclidean Distance is used as the distance metric.

The Silhouette score can be easily calculated in Python using the metrics module of the sklearn library.

I mentioned before that a high Silhouette Score is desirable. The Silhouette Score reaches its global maximum at the optimal k. This should ideally appear as a peak in the Silhouette Value-versus-k plot.

Here is the plot for our own dataset:

Image for post
Image for post

There is a clear peak at k = 3. Hence, it is optimal.

Finally, the data can be optimally clustered into 3 clusters as shown below.

Image for post
Image for post

End Notes

The Elbow Method is more of a decision rule, while the Silhouette is a metric used for validation while clustering. Thus, it can be used in combination with the Elbow Method.

Therefore, the Elbow Method and the Silhouette Method are not alternatives to each other for finding the optimal K. Rather they are tools to be used together for a more confident decision.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…

Sign up for Analytics Vidhya News Bytes

By Analytics Vidhya

Latest news from Analytics Vidhya on our Hackathons and some of our best articles! Take a look

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

Khyati Mahendru

Written by

Mathematics and Computing | Data Science, ML, and AI | Always up for a good challenge https://www.linkedin.com/in/khyati-mahendru/

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Khyati Mahendru

Written by

Mathematics and Computing | Data Science, ML, and AI | Always up for a good challenge https://www.linkedin.com/in/khyati-mahendru/

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store