What is Silhouette Score?

3 min read · Sep 7, 2023

The Silhouette score is a metric for evaluating the quality of a clustering result. For each data point, it compares how close the point is to the other points in its own cluster with how close it is to the points of the nearest other cluster. The Silhouette score is commonly used to assess the output of clustering algorithms like K-Means.

When calculating the Silhouette score, the following steps are followed:

1. For each data point, the average distance (a_i) to the other data points within the same cluster is calculated. This value measures how well the data point fits its own cluster: the smaller a_i is, the closer the point is to its cluster mates.

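2. For each data point, the average distance to the data points of each cluster it doesn’t belong to is computed, and the smallest of these averages is taken as b_i. In other words, b_i is the average distance to the nearest other cluster, and it indicates how well separated the data point is from its closest neighboring cluster.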

3. The Silhouette score of each data point (s_i) is calculated using the formula:

s_i = (b_i - a_i) / max(a_i, b_i)

4. By taking the average of the Silhouette scores calculated for each data point, an overall Silhouette score is obtained, which measures the success of clustering results.
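
The four steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn computes the score internally: the helper name manual_silhouette and the tiny one-dimensional dataset are made up for this example, and Euclidean distance is assumed. The result is compared with scikit-learn's own functions as a sanity check.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score


def manual_silhouette(X, labels):
    """Per-point silhouette values computed from the definition (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # A point alone in its cluster gets a score of 0
            continue
        # Step 1: a_i, mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, so divide by n_same - 1)
        a_i = dists[i, same].sum() / (n_same - 1)
        # Step 2: b_i, smallest mean distance to the members of any other cluster
        b_i = min(
            dists[i, labels == c].mean()
            for c in np.unique(labels)
            if c != labels[i]
        )
        # Step 3: per-point score
        scores[i] = (b_i - a_i) / max(a_i, b_i)
    return scores


# Illustrative data: two tight groups on a line
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

s = manual_silhouette(X_toy, labels_toy)
print("Per-point scores:", s)
# Step 4: the overall score is the mean of the per-point scores
print("Manual mean:", s.mean())
print("scikit-learn:", silhouette_score(X_toy, labels_toy))
```

For the first point, a_i is 1 (distance to its only cluster mate) and b_i is 10.5 (mean of the distances 10 and 11), so its score is 9.5 / 10.5, which is close to 1 because the two groups are far apart.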

Key characteristics of the Silhouette score include:

It ranges from -1 to +1:

  • Values close to +1 indicate that data points are much closer to their own cluster than to the nearest other cluster, which suggests good clustering results.
  • A score close to zero suggests overlapping clusters or data points that are about equally close to two clusters.
  • Negative values indicate that data points are, on average, closer to a neighboring cluster than to their own, which suggests they may have been assigned to the wrong cluster.

A higher Silhouette score indicates better clustering results.

Therefore, the Silhouette score is a useful criterion for comparing the settings (such as the number of clusters) and outcomes of data clustering algorithms. A high Silhouette score indicates compact, well-separated clusters, while a low score may indicate that data points are assigned to incorrect clusters or that the clustering algorithm is not suitable for the data.
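
One common way to use the score for evaluating settings is to compare several candidate numbers of clusters and see which one scores highest. The sketch below assumes a synthetic dataset from make_blobs and a candidate range of 2 to 7 clusters, both chosen only for illustration; n_init=10 is set explicitly so the behavior does not depend on the scikit-learn version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Silhouette score for each candidate number of clusters
scores_by_k = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    scores_by_k[k] = silhouette_score(X_demo, km.fit_predict(X_demo))

for k, score in scores_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")

# The candidate with the highest mean silhouette score
best_k = max(scores_by_k, key=scores_by_k.get)
print("Best k by Silhouette score:", best_k)
```

The highest-scoring value is a suggestion rather than a guarantee; it is worth checking it against what is known about the data.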

Calculating Silhouette Score

Here we cluster a dataset using the scikit-learn library in Python and evaluate the result with the Silhouette score.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
# Perform KMeans clustering
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
labels = kmeans.fit_predict(X)
# Calculate silhouette score
silhouette_avg = silhouette_score(X, labels)
print("Silhouette Score:", silhouette_avg)

Here is a step-by-step explanation of this code:

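1. First, we import NumPy and the scikit-learn functions and classes we need. NumPy is imported by convention; this short script does not call it directly.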

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

2. We create a sample dataset.

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
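  • The make_blobs function creates a synthetic dataset with a given number of samples (n_samples) and cluster centers (centers). Here, a dataset is created with 300 samples and 4 centers. The function also returns the true cluster index of each sample, which is discarded here by assigning it to _.
  • The cluster_std parameter sets the standard deviation of the clusters, that is, how widely the points are spread around their center.
  • The random_state parameter seeds the random number generator so that the same dataset is generated on every run.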
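
To see what these parameters do, here is a small sketch that inspects the generated data. The variable names and the helper mean_spread are made up for this example, and the value cluster_std=3.0 is an arbitrary choice for comparison.

```python
import numpy as np
from sklearn.datasets import make_blobs

X_a, y_a = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
X_b, y_b = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# 300 rows; the number of columns comes from n_features, which defaults to 2
print("Shape:", X_a.shape)
# The second return value holds the index of the blob each sample came from
print("Blob indices:", np.unique(y_a))
# The same random_state gives the same data on every call
print("Reproducible:", np.array_equal(X_a, X_b))


def mean_spread(X, y):
    """Average per-blob standard deviation (hypothetical helper)."""
    return np.mean([X[y == c].std(axis=0).mean() for c in np.unique(y)])


# A larger cluster_std spreads the points further around their centers
X_wide, y_wide = make_blobs(n_samples=300, centers=4, cluster_std=3.0, random_state=42)
print("Spread with cluster_std=1.0:", mean_spread(X_a, y_a))
print("Spread with cluster_std=3.0:", mean_spread(X_wide, y_wide))
```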

3. Data clustering is performed using the K-Means clustering algorithm.

n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
labels = kmeans.fit_predict(X)
  • n_clusters specifies how many clusters K-Means will create.
  • The KMeans class implements the K-Means clustering algorithm. random_state fixes the random initialization of the cluster centers so that the result is reproducible.
  • The fit_predict method fits the model to the data and returns the index of the cluster assigned to each data point. The results are stored in an array called labels.
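
The short sketch below looks at what fit_predict returns and what the fitted model stores. The names X_demo, km and labels_demo are chosen for this example, and n_init=10 is set explicitly as an assumption so that the run does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels_demo = km.fit_predict(X_demo)

# One integer label per sample
print("Labels shape:", labels_demo.shape)
# fit_predict returns the same labels that the fitted model keeps in labels_
print("Same as km.labels_:", np.array_equal(labels_demo, km.labels_))
# One center per cluster, one coordinate per feature
print("Centers shape:", km.cluster_centers_.shape)
# Number of points assigned to each cluster
print("Cluster sizes:", np.bincount(labels_demo))
```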

4. The Silhouette score is calculated and printed.

silhouette_avg = silhouette_score(X, labels)
print("Silhouette Score:", silhouette_avg)
  • The Silhouette score measures how well the dataset is clustered by comparing within-cluster closeness with separation from the nearest other cluster.
  • The silhouette_score function takes the dataset (X) and the array of labels that says which cluster each data point belongs to, and returns the average Silhouette score over all data points.
  • The result is printed to the screen with the text “Silhouette Score:”.
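
A single average can hide weak clusters, so it can help to look at the per-point values as well. The sketch below uses scikit-learn's silhouette_samples function for that; the variable names are chosen for this example and n_init=10 is an explicit assumption, as it is not part of the original script.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
labels_demo = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)

# One silhouette value per data point
per_point = silhouette_samples(X_demo, labels_demo)

# Average silhouette value inside each cluster
for c in np.unique(labels_demo):
    print(f"Cluster {c}: mean silhouette = {per_point[labels_demo == c].mean():.3f}")

# The overall score is the mean of the per-point values
overall = silhouette_score(X_demo, labels_demo)
print("Matches silhouette_score:", np.isclose(per_point.mean(), overall))

# Points that are closer to a neighboring cluster than to their own
print("Points with a negative score:", int((per_point < 0).sum()))
```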

Conclusion

To summarize, this code sample creates a synthetic dataset, clusters it with the K-Means algorithm to assign each data point to a cluster, and then calculates and prints the Silhouette score to evaluate the clustering. The Silhouette score measures the quality of clustering results, and a higher score indicates a better clustering result.
