100 days of data science and AI Meditation (Day 26 - Evaluation metrics, Part 3)
This is part of my data science and AI marathon, and I will write about what I have studied and implemented in academia and work every single day.
Today we will have a look at Clustering Metrics:
Clustering metrics are used to evaluate the quality of clusters generated by clustering algorithms. Here are some common clustering metrics along with their formulas and explanations:
1. Silhouette Score:
- Formula for a single sample i: s(i) = (b(i) - a(i)) / max(a(i), b(i))
- Overall Silhouette Score: silhouette_score = mean(s(i) for all samples)
The silhouette score measures how close each sample in a cluster is to the samples in its neighbouring clusters. It ranges from -1 to 1, where higher values indicate better-defined clusters. Here, a(i) is the average distance of i to the other points in the same cluster, and b(i) is the smallest average distance of i to the points in any other cluster.
It can be calculated using scikit-learn in the following way:
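For example (a minimal sketch: the toy data from make_blobs and the K-Means setup are illustrative assumptions; silhouette_score is the actual scikit-learn function):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative toy data: 300 points drawn from 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means and get the predicted cluster labels
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Mean of s(i) over all samples; ranges from -1 to 1, higher is better
print("Silhouette Score:", silhouette_score(X, labels))
```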
2. Davies-Bouldin Index:
- Formula for a pair of clusters i and j: R(i, j) = (r(i) + r(j)) / d(Ci, Cj), where i and j are cluster indices, r(i) is the radius of cluster i (the average distance of its points to the cluster centroid), and d(Ci, Cj) is the distance between the cluster centroids.
- Davies-Bouldin Index: DB = (1/n) * Σ max(R(i, j) for j ≠ i), where n is the number of clusters.
The Davies-Bouldin index measures the average similarity between each cluster and its most similar cluster. Lower values indicate better clustering.
It can be calculated using scikit-learn in the following way:
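A minimal sketch under the same assumptions (illustrative make_blobs data and a K-Means fit); davies_bouldin_score is the actual scikit-learn function:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Illustrative toy data and a K-Means fit
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Lower values indicate more compact, better-separated clusters
print("Davies-Bouldin Index:", davies_bouldin_score(X, labels))
```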
3. Calinski-Harabasz Index (Variance Ratio Criterion):
- Formula: CH = (B(k) / (k - 1)) / (W(k) / (n - k)), where B(k) is the between-cluster dispersion, W(k) is the within-cluster dispersion, k is the number of clusters, and n is the number of samples.
The Calinski-Harabasz index measures the ratio of between-cluster variance to within-cluster variance. Higher values indicate better-defined clusters.
Unlike the Silhouette Score, however, this index has no upper bound, so there is no universal ‘acceptable’ or ‘good’ value; it is most useful for comparing different clusterings of the same data.
It can be calculated using scikit-learn in the following way:
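A minimal sketch under the same assumptions (illustrative make_blobs data and a K-Means fit); calinski_harabasz_score is the actual scikit-learn function:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Illustrative toy data and a K-Means fit
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Unbounded above; higher values indicate better-defined clusters
print("Calinski-Harabasz Index:", calinski_harabasz_score(X, labels))
```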
How to measure clustering performance:
Measuring clustering performance involves evaluating the quality of the clusters generated by clustering algorithms. Different metrics apply depending on whether or not you have ground truth labels: internal (unsupervised) metrics need only the data and the cluster assignments, while external (supervised) metrics compare the assignments against known labels. Here’s how to measure clustering performance:
Unsupervised Metrics (No Ground Truth):
- a. Silhouette Score: Measures how close each sample in a cluster is to the samples in its neighbouring clusters. Ranges from -1 to 1, where higher values indicate better-defined clusters.
- b. Davies-Bouldin Index: Measures the average similarity between each cluster and its most similar cluster. Lower values indicate better clustering.
- c. Calinski-Harabasz Index (Variance Ratio Criterion): Measures the ratio of between-cluster variance to within-cluster variance. Higher values indicate better-defined clusters.
- d. Inertia (Within-Cluster Sum of Squares): Sum of squared distances between data points and their cluster centroids. Lower inertia indicates denser, more compact clusters (see the sketch after this list).
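A minimal sketch for inertia (again with illustrative make_blobs data): scikit-learn’s KMeans exposes it directly through the inertia_ attribute after fitting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative toy data and a K-Means fit
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Sum of squared distances of samples to their closest centroid; lower is denser
print("Inertia:", kmeans.inertia_)
```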
Supervised Metrics (With Ground Truth):
- a. Homogeneity, Completeness, and V-Measure: These metrics evaluate the extent to which each cluster contains only members of a single class, the extent to which all members of a given class are assigned to the same cluster, and their harmonic mean.
- b. Adjusted Rand Index (ARI): Measures the similarity between predicted clusters and true class labels, adjusted for chance.
- c. Normalized Mutual Information (NMI): Measures the mutual information between predicted clusters and true class labels, adjusted for chance.
- d. Fowlkes-Mallows Index (FMI): Geometric mean of precision and recall between predicted clusters and true class labels. A sketch computing all of these external metrics follows below.
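A combined minimal sketch (the ground-truth and predicted labels below are hypothetical toy values; all six functions exist in sklearn.metrics):

```python
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             fowlkes_mallows_score, homogeneity_score,
                             normalized_mutual_info_score, v_measure_score)

# Hypothetical ground-truth classes and predicted cluster labels
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print("Homogeneity:", homogeneity_score(y_true, y_pred))
print("Completeness:", completeness_score(y_true, y_pred))
print("V-Measure:", v_measure_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("FMI:", fowlkes_mallows_score(y_true, y_pred))
```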
The choice of metric depends on the nature of the data, the problem’s context, and the available ground truth information. In some cases, using a combination of metrics can provide a more comprehensive view of clustering performance. Additionally, visualizations such as scatter plots, dendrograms, and silhouette plots can help interpret and validate clustering results.
When evaluating clustering performance, it’s also crucial to understand the limitations of each metric and consider the domain knowledge to make informed decisions about the quality of the clusters generated by the algorithm.
Below is an example of a project on Customer Segmentation using K-Means clustering.
We will create a Python project for customer segmentation using K-Means clustering. The goal is to group customers based on their purchasing behaviour. We will use clustering metrics to evaluate the quality of the clusters and visualize the results.
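A minimal sketch of such a project (the synthetic make_blobs data standing in for customer purchasing behaviour, and the choice of 4 clusters, are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for customer data: 300 "customers", 4 behavioural features
X, _ = make_blobs(n_samples=300, n_features=4, centers=4, random_state=42)

# Segment the customers with K-Means
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Evaluate cluster quality
print("Silhouette Score:", silhouette_score(X, labels))
print("Calinski-Harabasz Index:", calinski_harabasz_score(X, labels))

# Visualize the clusters using the first two features
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100, label="Centroids")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Customer Segments (K-Means)")
plt.legend()
plt.show()
```

In a real project you would load actual transaction features instead of synthetic blobs, and typically standardize them (for example with scikit-learn’s StandardScaler) before clustering, since K-Means is sensitive to feature scale.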
The outcome of the code includes printed values for the Silhouette Score and Calinski-Harabasz Index, providing insights into the quality and separation of the clusters. Additionally, a scatter plot is displayed, showing how the data points are distributed among the clusters based on the first two features.
The Silhouette Score measures the closeness of each data point to its own cluster relative to other clusters. A higher Silhouette Score indicates well-separated clusters. The Calinski-Harabasz Index measures the ratio of between-cluster variance to within-cluster variance. Higher values indicate better-defined clusters.
The scatter plot visually shows how well the K-Means algorithm has grouped the data points into clusters based on the first two features. Different colours represent different clusters. The plot helps you understand how distinct the clusters are and whether the chosen number of clusters is appropriate for the data.
These clustering metrics help assess the quality of clusters produced by different clustering algorithms and parameter settings. Keep in mind that selecting the appropriate metric depends on the problem, data distribution, and the goals of the clustering analysis. It’s also essential to combine these metrics with domain knowledge to make informed decisions about the clustering results.