How to interpret a silhouette plot for k-means clustering.
In a previous post, I wrote about k-means clustering for unsupervised learning and gave a beginner’s guide to implementing it to segment the behavior of customers in a mall (here is the link). There, I determined the optimal number of clusters using the elbow method. Here, I will walk through another method for selecting the optimal number of clusters in an unlabeled dataset: the silhouette plot. I will be using the same mall customer segmentation dataset, also known as market basket analysis, to demonstrate the method in its simplest form. Let’s get to work, shall we?
The mall customer data has about two hundred customers and five features. We are interested in the customers’ annual income, and we will use this variable together with their spending score to group the customers into meaningful clusters. The dataset is already clean and contains no null values, so let’s jump right into using the silhouette plot to obtain the optimal n_clusters. Before that, however, I will run through some of the properties of this method.
Silhouette analysis studies the separation distance between neighboring clusters, while also giving information about how close together the points inside the same cluster are. The plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters. The measure ranges from -1 to 1. A value close to 1 shows that the point is far from the neighboring clusters, a value of 0 indicates that the point is on or very close to the decision boundary between two neighboring clusters, and a negative value suggests that the point may have been assigned to the wrong cluster.
The silhouette score for each data point can be calculated using the following formula.
S = (b - a) / max(a, b)
a = Mean distance between the observation and all other data points in the same cluster. It is also called the mean intra-cluster distance.
b = Mean distance between the observation and all data points in the nearest cluster that the observation does not belong to. It is also called the mean nearest-cluster distance.
S = Silhouette score.
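To make the formula concrete, here is a minimal sketch that computes S by hand for one observation. It uses only the Python standard library, and the function name silhouette_for_point and the toy points are made up for illustration; they are not part of the mall dataset or of sklearn.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def silhouette_for_point(point, own_cluster, other_clusters):
    """Return S = (b - a) / max(a, b) for a single observation.

    own_cluster: the other points in the observation's cluster (must be non-empty)
    other_clusters: a list of clusters, each a list of points
    """
    # a: mean distance to the other points in the same cluster
    a = sum(dist(point, p) for p in own_cluster) / len(own_cluster)
    # b: mean distance to the points of the nearest other cluster,
    # i.e. the smallest of the per-cluster mean distances
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)


# Hypothetical toy points in two dimensions
point = (1.0, 1.0)
own = [(1.0, 2.0), (2.0, 1.0)]      # both at distance 1, so a = 1
near = [(5.0, 1.0), (1.0, 5.0)]     # both at distance 4, so b = 4
far = [(11.0, 1.0), (1.0, 11.0)]    # both at distance 10, ignored by min()

s = silhouette_for_point(point, own, [near, far])
print(round(s, 2))  # (4 - 1) / 4 = 0.75

# Swapping the roles of the clusters mimics a misassigned point: S turns negative
s_wrong = silhouette_for_point(point, near, [own, far])
print(round(s_wrong, 2))  # (1 - 4) / 4 = -0.75
```

The second call shows why a negative value hints at a wrong assignment: the point is, on average, closer to a neighboring cluster than to its own.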
Using the sklearn library to compute the score.
Python’s sklearn package provides two functions for computing the silhouette score.
from sklearn.metrics import silhouette_samples, silhouette_score
silhouette_score returns the mean Silhouette Coefficient over all samples in the dataset, while silhouette_samples returns the Silhouette Coefficient of each individual sample.
score = silhouette_score(X, km.labels_, metric='euclidean')
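The call above assumes a feature matrix X and a fitted KMeans model km. A minimal sketch of the full loop might look like the following. Note that the data here is a synthetic stand-in generated with make_blobs, not the mall dataset, and the range of n_clusters, n_init, and random_state values are my own assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Assumption: synthetic two-feature data standing in for
# annual income and spending score (about two hundred rows)
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 9):
    # Fit k-means for this candidate number of clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Mean Silhouette Coefficient over all samples
    scores[k] = silhouette_score(X, km.labels_, metric="euclidean")
    print(f"n_clusters={k}: mean silhouette = {scores[k]:.3f}")

# Candidate with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print("highest mean silhouette at n_clusters =", best_k)

# Per-sample values for the last fitted model (k = 8);
# their mean equals silhouette_score for the same labels
sample_values = silhouette_samples(X, km.labels_, metric="euclidean")
```

The mean score alone gives a quick ranking of the candidates, but the silhouette plot adds the per-cluster picture that the criteria in the next section rely on.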
How to analyze the silhouette plot to select an optimal n_clusters.
- A sub-optimal n_clusters will show clusters whose silhouette scores all fall below the average silhouette score. For this data, none of the values of n_clusters shows a cluster below the average score. (The average score is indicated by the red dotted line.)
- A sub-optimal n_clusters will show wide fluctuations in the size of the silhouette plots. For this data, wide fluctuations are seen for almost all values of n_clusters, although they are smallest in the last plot, where n_clusters is five.
Thus, our preferred n_clusters is five, as it passes both criteria.
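The two criteria can also be checked numerically rather than by eye. The sketch below is one possible reading of them, not the code behind the plots in this post: it treats a cluster as "below average" when none of its samples reaches the average score, and it uses the ratio of the largest to the smallest cluster size as a rough proxy for fluctuation in the thickness of the silhouettes. The helper name silhouette_checks and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Assumption: synthetic stand-in for the two mall features
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=42)


def silhouette_checks(X, k, random_state=42):
    """Return (average score, criterion-1 flag, cluster size ratio) for k clusters."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    values = silhouette_samples(X, labels)
    average = float(values.mean())  # position of the red dotted line

    # Criterion 1: every cluster should have samples above the average line
    peaks = [values[labels == c].max() for c in range(k)]
    all_reach_average = all(p > average for p in peaks)

    # Criterion 2: cluster sizes set the thickness of each silhouette,
    # so a ratio near 1 means similar thickness across clusters
    sizes = np.bincount(labels, minlength=k)
    size_ratio = float(sizes.max() / sizes.min())
    return average, all_reach_average, size_ratio


for k in range(2, 7):
    average, ok, ratio = silhouette_checks(X, k)
    print(f"n_clusters={k}: average={average:.3f}, "
          f"all clusters reach average={ok}, size ratio={ratio:.2f}")
```

A candidate that passes the first check and has a size ratio close to 1 would match what the plot shows for a good n_clusters, though the plot remains the better tool for spotting individual misassigned points.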
Here is the Kaggle notebook.
References
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#
https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html