Silhouette Analysis in K-means Clustering

Mukesh Chaudhary
5 min read · Jun 5, 2020


Cluster Evaluation: silhouette coefficient

In this blog, I explain in a little more detail how silhouette analysis can play a more significant role than the elbow technique when evaluating k-means clustering. Before going into this topic, we should know what the k-means clustering algorithm is about; I recommend reading my previous blog on k-means clustering first. As we know, k-means is one of the simplest and most popular unsupervised machine learning algorithms. We can evaluate its results in two ways: the elbow technique and the silhouette method. Here I describe only the silhouette method and try to show why I consider it better than the elbow method, which I already explained in my previous blog. The elbow method is very simple: it plots the within-cluster sum of squared distances against the number of clusters, the curve looks like an elbow, and we guess the optimal number of clusters k from the point where it bends. That decision can become ambiguous when the plot is more complex, because sometimes the bend is vague. With the silhouette method, we calculate a silhouette coefficient for each candidate k, which gives a numeric score to compare instead of a shape to judge by eye. Let's see the pictures, which give a better idea:

Elbow method picture:

Silhouette method picture:
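To make the comparison concrete, here is a minimal sketch that computes both quantities for a range of k. It assumes synthetic data from `make_blobs` (the data set, the range of k, and the `random_state` and `n_init` values are choices made for illustration, not taken from the pictures above). The elbow method reads the `inertia_` column by eye, while the silhouette method compares the score column directly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 blobs (an assumption made for illustration only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

inertias = {}
sil_scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    # Elbow method: within-cluster sum of squared distances
    inertias[k] = km.inertia_
    # Silhouette method: mean silhouette coefficient over all points
    sil_scores[k] = silhouette_score(X, labels)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:9.1f}  silhouette={sil_scores[k]:.3f}")
```

Inertia generally keeps shrinking as k grows, so it has to be judged by the shape of the curve, whereas the silhouette score can go up or down and can be compared number against number.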

There are two things to consider here:

  • If we have the ground truth labels (class information) of the data points, we can use extrinsic methods such as the homogeneity score, the completeness score and so on.
  • If we do not have the ground truth labels, we have to use intrinsic methods such as the silhouette score, which is based on the silhouette coefficient. We now study this evaluation metric in a bit more detail.
Let's start with the equation for calculating the silhouette coefficient for a particular data point:

$$ s(o) = \frac{b(o) - a(o)}{\max\{a(o),\, b(o)\}} $$

where,

- s(o) is the silhouette coefficient of the data point o

- a(o) is the average distance between o and all the other data points in the cluster to which o belongs

- b(o) is the minimum, taken over all clusters to which o does not belong, of the average distance from o to the points of that cluster
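The definitions of a(o) and b(o) can be checked by hand on a tiny example. The sketch below assumes the standard form of the coefficient, s(o) = (b(o) - a(o)) / max(a(o), b(o)), and uses a hypothetical five-point data set on a line; the helper name `silhouette_of` is made up for this illustration. The result is compared with scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hypothetical data set: two clusters on a line
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 0, 1, 1])

def silhouette_of(idx, X, labels):
    """Silhouette coefficient s(o) of the point o = X[idx].

    Assumes every cluster has at least two points (a(o) is undefined
    for a cluster that contains only o).
    """
    dists = np.linalg.norm(X - X[idx], axis=1)  # distance from o to every point
    own = labels == labels[idx]
    own[idx] = False                            # leave o itself out of its own cluster
    a = dists[own].mean()                       # a(o): mean distance inside own cluster
    b = min(dists[labels == c].mean()           # b(o): smallest mean distance to another cluster
            for c in np.unique(labels) if c != labels[idx])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i, X, labels) for i in range(len(X))])
print(manual)
print(silhouette_samples(X, labels))
```

For the first point, a(o) = (1 + 2) / 2 = 1.5 and b(o) = (10 + 11) / 2 = 10.5, so s(o) = 9 / 10.5 ≈ 0.857, a value close to 1 because the point is much nearer to its own cluster than to the other one.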

There are a few main points to remember when calculating the silhouette coefficient. Its value lies in the range [-1, 1]. A score of 1 is the best: it means that the data point o is very compact within the cluster to which it belongs and far away from the other clusters. The worst value is -1, which means that o is, on average, closer to the points of another cluster than to those of its own. Values near 0 denote overlapping clusters.
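The effect of compactness and overlap on the score can be sketched with synthetic data. The example below assumes three blobs at fixed, hypothetical centers and changes only their spread; the helper name `mean_silhouette` and all parameter values are choices made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def mean_silhouette(cluster_std):
    # Three synthetic blobs at fixed, hypothetical centers; only the spread changes
    X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                      cluster_std=cluster_std, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

tight = mean_silhouette(0.3)  # compact, well-separated clusters
loose = mean_silhouette(3.0)  # heavily overlapping clusters

print(f"compact clusters:     {tight:.3f}")
print(f"overlapping clusters: {loose:.3f}")
```

The compact configuration should score close to 1, while the overlapping one should score noticeably lower, in line with the interpretation above.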

Let's work through some Python code to make this easier to understand.

# import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# load the iris data
data = datasets.load_iris()
dir(data)

# load into a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.shape)
df.head()

# keep only the two petal features
df1 = df.drop(['sepal length (cm)', 'sepal width (cm)'], axis='columns')
df1.head()

# scatter plot of the petal features
plt.scatter(df1['petal length (cm)'], df1['petal width (cm)'])
# Now check the silhouette coefficient for several values of k
for k in [2, 3, 4, 5]:
    fig, ax = plt.subplots(1, 2, figsize=(15, 5))

    # Run the k-means algorithm
    km = KMeans(n_clusters=k)
    y_predict = km.fit_predict(df1)
    centroids = km.cluster_centers_

    # Get the silhouette coefficient of every sample
    silhouette_vals = silhouette_samples(df1, y_predict)

    # Silhouette plot: one block of sorted horizontal bars per cluster
    y_lower = y_upper = 0
    for i, cluster in enumerate(np.unique(y_predict)):
        cluster_silhouette_vals = silhouette_vals[y_predict == cluster]
        cluster_silhouette_vals.sort()
        y_upper += len(cluster_silhouette_vals)

        ax[0].barh(range(y_lower, y_upper),
                   cluster_silhouette_vals, height=1)
        ax[0].text(-0.03, (y_lower + y_upper) / 2, str(i + 1))
        y_lower += len(cluster_silhouette_vals)

    # Get the average silhouette score and mark it with a dashed line
    avg_score = np.mean(silhouette_vals)
    ax[0].axvline(avg_score, linestyle='--',
                  linewidth=2, color='green')
    ax[0].set_yticks([])
    ax[0].set_xlim([-0.1, 1])
    ax[0].set_xlabel('Silhouette coefficient values')
    ax[0].set_ylabel('Cluster labels')
    ax[0].set_title('Silhouette plot for the various clusters')

    # Scatter plot of the data colored by cluster label, with centroids as stars
    ax[1].scatter(df1['petal length (cm)'],
                  df1['petal width (cm)'], c=y_predict)
    ax[1].scatter(centroids[:, 0], centroids[:, 1],
                  marker='*', c='r', s=250)
    ax[1].set_xlabel('petal length (cm)')
    ax[1].set_ylabel('petal width (cm)')
    ax[1].set_title('Visualization of clustered data', y=1.02)

    plt.tight_layout()
    plt.suptitle(f'Silhouette analysis using k = {k}', fontsize=16, fontweight='semibold')
    plt.savefig(f'Silhouette_analysis_{k}.jpg')

Output:

In the pictures above, we can clearly see how the silhouette plot and the average score change with n_clusters (k). So, with silhouette analysis we can choose the number of clusters k that gives a high score, instead of judging the bend of an elbow plot.
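The choice of k can also be made without looking at the plots, by comparing the mean silhouette score for each candidate. The sketch below is a minimal version of that idea on the same two petal features; the `n_init` and `random_state` values are assumptions added so the run is repeatable. `silhouette_score` returns the mean of the per-sample values from `silhouette_samples`, which is the same quantity as the dashed line in the plots.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Petal length and petal width only, as in the walkthrough above
X = datasets.load_iris().data[:, 2:4]

scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate k with the highest mean silhouette score
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: mean silhouette = {s:.3f}")
print("highest mean silhouette at k =", best_k)
```

The highest mean score is a heuristic rather than a guarantee: it does not have to coincide with the number of true classes in the data, so it is worth checking the per-cluster silhouette plots as well before settling on k.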

Conclusion:

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. We can evaluate its results in two ways: the elbow technique and the silhouette technique. We saw the differences between them above. I think the silhouette technique gives us a more precise score for choosing the number of clusters k for the k-means algorithm. However, we can also use the elbow technique for a quick answer and for intuition.

References:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.completeness_score.html#sklearn.metrics.completeness_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
https://en.wikipedia.org/wiki/Silhouette_(clustering)
https://en.wikipedia.org/wiki/K-means_clustering
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
