The K-Means algorithm needs no introduction. It is simple and perhaps the most commonly used algorithm for clustering.
The basic idea behind k-means consists of defining k clusters such that total within-cluster variation (or error) is minimum.
I encourage you to check out the below articles for an in-depth explanation of different methods of clustering before proceeding further:
- An Introduction to Clustering and different methods of Clustering
- A Beginner’s Guide to Hierarchical Clustering and how to perform it in Python
A cluster center is the representative of its cluster. The squared distance between each point and its cluster center is the required variation. The aim of k-means clustering is to find these k clusters and their centers while reducing the total error.
Quite an elegant algorithm. But there is a catch. How do you decide the number of clusters?
In this article, I will explain in detail two methods that can be useful to find this mysterious k in k-Means.
These methods are:
- The Elbow Method
- The Silhouette Method
We will use our own dataset generated by the code below for an illustration of the two methods:
This is how the data looks graphically:
Clearly, the dataset has 3 clusters. We will validate both our methods on this dataset.
The Elbow Method
This is probably the most well-known method for determining the optimal number of clusters. It is also a bit naive in its approach.
Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS becomes first starts to diminish. In the plot of WSS-versus-k, this is visible as an elbow.
Within-Cluster-Sum of Squared Errors sounds a bit complex. Let’s break it down:
- The Squared Error for each point is the square of the distance of the point from its representation i.e. its predicted cluster center.
- The WSS score is the sum of these Squared Errors for all the points.
- Any distance metric like the Euclidean Distance or the Manhattan Distance can be used.
Let us implement this in Python using the sklearn library and our own function for calculating WSS for a range of values for k.
We obtain the following plot for WSS-vs-k for our dataset.
As expected, the plot looks like an arm with a clear elbow at k = 3.
Unfortunately, we do not always have such clearly clustered data. This means that the elbow may not be clear and sharp.
For Dataset A, the elbow is clear at k = 3. However, this choice is ambiguous for Dataset B. We could choose k to be either 3 or 4.
In such an ambiguous case, we may use the Silhouette Method.
The Silhouette Method
The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).
The range of the Silhouette value is between +1 and -1. A high value is desirable and indicates that the point is placed in the correct cluster. If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.
The Silhouette Value s(i) for each data point i is defined as follows:
Note: s(i) is defined to be equal to zero if i is the only point in the cluster. This is to prevent the number of clusters from increasing significantly with many single-point clusters.
Here, a(i) is the measure of similarity of the point i to its own cluster. It is measured as the average distance of i from other points in the cluster.
Similarly, b(i) is the measure of dissimilarity of i from points in other clusters.
d(i, j) is the distance between points i and j. Generally, Euclidean Distance is used as the distance metric.
The Silhouette score can be easily calculated in Python using the metrics module of the sklearn library.
I mentioned before that a high Silhouette Score is desirable. The Silhouette Score reaches its global maximum at the optimal k. This should ideally appear as a peak in the Silhouette Value-versus-k plot.
Here is the plot for our own dataset:
There is a clear peak at k = 3. Hence, it is optimal.
Finally, the data can be optimally clustered into 3 clusters as shown below.
The Elbow Method is more of a decision rule, while the Silhouette is a metric used for validation while clustering. Thus, it can be used in combination with the Elbow Method.
Therefore, the Elbow Method and the Silhouette Method are not alternatives to each other for finding the optimal K. Rather they are tools to be used together for a more confident decision.