Stop using the Elbow Method
Silhouette Analysis: A more precise approach to finding the optimal number of clusters using K-Means
A common challenge we face when performing clustering with K-Means is to find the optimal number of clusters. Naturally, the celebrated and popular Elbow method is the technique that most data scientists use to solve this particular problem.
In this post, we are going to learn a more precise and less subjective approach to help us find the optimal number of clusters, the silhouette score analysis.
The flaw in the Elbow Method
In another post, I provide a thorough explanation of the K-Means algorithm, its subtleties (centroid initialization, data standardization, and the number of clusters), and some of its pros and cons. There, I also explain when and how to use the Elbow Method.
The K-Means algorithm aims to minimize the mean squared distance between each instance and its closest centroid, a quantity known as the inertia. However, a problem arises from this definition.
As long as we keep increasing the number of clusters k, the inertia will always decrease because the points will be closer to their centroids. Therefore, when choosing the right number of clusters for K-Means, we are looking for the minimum number of clusters that gives us reasonable inertia. That’s exactly what we try to achieve using the Elbow Method. Suppose we have the following data:
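If you want to follow along, a toy dataset in the same spirit can be generated with Scikit-Learn's make_blobs. The centers and spreads below are placeholder values I chose so that two blobs sit close together; they are not necessarily the exact data behind the figures.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Five Gaussian blobs, two of them close together (placeholder values).
blob_centers = np.array([
    [0.2, 2.3],
    [-1.5, 2.3],
    [-2.8, 1.8],
    [-2.8, 2.8],
    [-2.8, 1.3],
])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)
```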
For humans, it may be clear that the data come from 5 different clusters, but when dealing with high-dimensional data, we are unable to visualize it with ease.
Using the Elbow Method, we would probably choose k = 4, as indicated on the left plot.
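As a rough sketch, the elbow curve is obtained by fitting K-Means for several values of k and plotting the resulting inertia (here using the toy X generated above):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for k = 1..9 and record the inertia of each model.
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in ks]

# The "elbow" is the k after which the inertia stops dropping sharply.
plt.plot(ks, inertias, "o-")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()
```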
Note that, because two of the clusters are relatively close to one another, the Elbow Method leads us to treat them as a single cluster: if we place one centroid between the two, the distance from their points to that centroid is still fairly short, so merging them barely increases the inertia.
Thus, we need a more precise, rigorous, and reliable approach to define the optimal number of clusters for our clustering task. Enter the silhouette score.
Silhouette score
The silhouette score is the mean silhouette coefficient over all instances of the dataset. The silhouette coefficient measures how close a point in one cluster is to points in the neighboring clusters, and it ranges from -1 to 1.
Mathematically, the silhouette coefficient of an instance is given by (b - a) / max(a, b),
where a is the mean distance to the other instances in the same cluster (i.e., the mean intra-cluster distance), and b is the mean nearest-cluster distance (i.e., the mean distance to the instances in the nearest cluster, excluding the instance's own cluster).
The interpretation is that when b >> a, the silhouette coefficient is close to +1, which means that the instance sits well inside its own cluster. Meanwhile, if b = a, the silhouette coefficient is 0, signaling that the instance is on the decision boundary between two clusters. Finally, if a >> b, the coefficient approaches -1: the instance is closer, on average, to a neighboring cluster than to its own, meaning that it was probably assigned to the wrong cluster. Let's plot the silhouette score for the K-Means clusters while varying the number of clusters k.
This plot is much more informative than the plot used for the Elbow Method. It is clear that even though k = 4 is not a bad choice, the best number of clusters to segment the data is k = 5. The silhouette score can be computed using Scikit-Learn:
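A minimal sketch along those lines, sweeping k from 2 to 9 on the toy X from earlier (silhouette_score requires at least 2 clusters):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit K-Means for several values of k and compute the silhouette score
# (mean silhouette coefficient over all instances) for each model.
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette score: {best_k}")
```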
Silhouette Diagram
An even richer visualization is the silhouette diagram. It is obtained by plotting every instance's silhouette coefficient, sorted by cluster and, within each cluster, by coefficient value. The thickness of each cluster's shape tells us about the size of the cluster, and its width represents the sorted silhouette coefficients of the instances in the cluster (the wider, the better).
An additional piece of information is the black dashed line, which marks the mean silhouette coefficient (i.e., the silhouette score) for that number of clusters. We can see the silhouette diagrams varying k from 3 to 6 below.
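These diagrams can be reproduced with Scikit-Learn's silhouette_samples. The following is a simplified sketch for a single value of k; the styling of the post's figures is not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

k = 5
labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
coeffs = silhouette_samples(X, labels)      # one coefficient per instance

pos = 0
for cluster in range(k):
    # Sort the coefficients of the instances in this cluster and stack
    # the clusters on top of each other along the y-axis.
    c = np.sort(coeffs[labels == cluster])
    plt.fill_betweenx(np.arange(pos, pos + len(c)), 0, c)
    pos += len(c)

# Dashed line at the overall silhouette score for this k.
plt.axvline(silhouette_score(X, labels), color="k", linestyle="--")
plt.xlabel("Silhouette coefficient")
plt.ylabel("Instances (grouped by cluster)")
plt.show()
```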
First, take a closer look at the diagram with k = 3. In this setting, clusters 0 and 1 are twice the size of cluster 2. Moreover, almost all silhouette coefficients from cluster 0 fall to the left of the dashed silhouette-score line. This means that k = 3 is not a good choice: it tells us that some clusters probably got merged and that the instances from cluster 0 are too close to other clusters.
On the other hand, everything looks better with k = 4 and k = 5. For both choices of k, the silhouette coefficients extend beyond the dashed line and are closer to 1. However, when k = 4, cluster 1 is twice the size of the others, and the plot on the right shows why: two clusters got merged to form cluster 1, producing an unbalanced partition even though we know beforehand that the dataset is balanced. Therefore, k = 5 is indeed a better choice, because we get clusters of similar size.
Finally, when k = 6, several instances have a low silhouette coefficient. That's because one of the clusters got split in two.
Decision boundaries
Another insightful visualization is the plot of the decision boundaries of K-Means. Even though it is limited to 2-dimensional data, it can be very useful for understanding how K-Means segments the data. In the following figure, you can see that with k = 4, the algorithm is not capable of distinguishing the two clusters concentrated in the far-right part of the plot (all of their instances were assigned to cluster 1). Meanwhile, with k = 5, we get the desired solution.
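For 2-dimensional data, one common way to draw such boundaries is to predict the cluster of every point on a fine grid and color the resulting regions. The plotting details below are my own sketch, not necessarily how the post's figure was produced.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=42).fit(X)

# Evaluate the model on a fine grid covering the data.
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                     np.linspace(y_min, y_max, 500))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Each colored region is the set of points assigned to one centroid.
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], s=2)
plt.scatter(*kmeans.cluster_centers_.T, marker="x", color="k")
plt.show()
```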
And that’s it, I hope you find this post useful, and I would love to know your thoughts in the comment section. The source code of all plots in this post is here.