How to measure clustering performance when there is no ground truth?

Haitian Wei
3 min read · Jan 2, 2020



Introduction

Clustering validation has long been recognized as essential to the success of clustering applications. In general, it falls into two classes: external clustering validation and internal clustering validation. External validation compares a clustering against externally supplied information such as class labels, while internal validation relies only on information in the data itself.

In this article, I will focus on internal clustering validation, which is the approach to use when the data has no ground-truth labels. The figure below lists 11 measures.

source: see reference two

These measures can also help us determine the best partition and the optimal number of clusters for a set of objects. The general procedure is as follows:

source: see reference two
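
One common way to apply such a measure for choosing the number of clusters is to cluster the data for several candidate values of k, score each partition, and keep the best-scoring one. The sketch below is only an illustration under stated assumptions: it uses a plain k-means with deterministic farthest-point seeding as the clustering algorithm and the mean Silhouette Coefficient (described in the next section) as the index. The helper names `kmeans` and `mean_silhouette`, the toy data, and the candidate range 2 to 4 are all hypothetical choices, not part of the referenced procedure.

```python
import math

def mean_silhouette(points, labels):
    """Mean Silhouette Coefficient over all samples (singletons count as 0)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # assumed convention: a single-member cluster contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # next seed: the point farthest from all seeds chosen so far
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

# Toy data (assumed for illustration): three small, well-separated groups.
points = [(0, 0), (1, 0), (0, 1),
          (10, 0), (11, 0), (10, 1),
          (0, 10), (1, 10), (0, 11)]

# Score one partition per candidate k and keep the best-scoring k.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the three-group partition scores highest. For indices where lower is better, such as Davies-Bouldin, the selection step would use `min` instead of `max`.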

Silhouette Coefficient

The Silhouette Coefficient is defined for each sample and is composed of two scores, a and b (defined below), combined as s = (b - a) / max(a, b). A higher Silhouette Coefficient indicates a model with better-defined clusters.

a: The mean distance between a sample and all other points in the same cluster. This score measures the closeness of points in the same cluster.

b: The mean distance between a sample and all the points in the next nearest cluster. This score measures the separation between points of different clusters.

  • Advantages
  1. The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
  2. The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
  • Drawbacks
  1. The Silhouette Coefficient is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.
  2. High computational complexity: O(n²) in the number of samples, since it requires the distances between all pairs of points.
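
To make the two scores concrete, here is a minimal from-scratch sketch that computes a, b and the resulting coefficient for every sample. It assumes Euclidean distance and assigns 0 to single-member clusters, which is a common convention rather than something stated above. The function name `silhouette_samples` and the toy points are illustrative.

```python
import math

def silhouette_samples(points, labels):
    """Per-sample Silhouette Coefficient s = (b - a) / max(a, b)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return scores

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

good = silhouette_samples(points, [0, 0, 0, 1, 1, 1])  # labels match the two groups
bad = silhouette_samples(points, [0, 0, 0, 1, 1, 0])   # last point put in the wrong cluster

print(sum(good) / len(good))  # mean score for the well-separated labelling
print(bad[-1])                # the misassigned point gets a negative score
```

Every sample is compared against every other sample, which is where the quadratic cost comes from. In practice, scikit-learn provides `sklearn.metrics.silhouette_samples` and `sklearn.metrics.silhouette_score` for the per-sample and mean values.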

Calinski-Harabasz Index

The Calinski-Harabasz index, also known as the Variance Ratio Criterion, is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters. The higher the score, the better the clustering.

  • Advantages
  1. The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
  2. The score is fast to compute.
  • Drawbacks
  1. The Calinski-Harabasz index is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.
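
The ratio can be sketched directly from its usual definition: the between-cluster dispersion (size-weighted squared distances from each cluster centroid to the overall centroid) divided by k - 1, over the within-cluster dispersion (squared distances from each point to its own centroid) divided by n - k. The function name `calinski_harabasz` and the toy points are illustrative, and the sketch assumes at least two clusters and a non-zero within-cluster dispersion.

```python
def calinski_harabasz(points, labels):
    """Variance Ratio Criterion: (between / (k - 1)) / (within / (n - k))."""
    n = len(points)
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    k = len(clusters)

    def centroid(ps):
        return tuple(sum(x) / len(ps) for x in zip(*ps))

    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    overall = centroid(points)
    # between-cluster dispersion: cluster size times squared centroid offset
    between = sum(len(ps) * sq_dist(centroid(ps), overall) for ps in clusters.values())
    # within-cluster dispersion: squared distance of each point to its own centroid
    within = sum(sq_dist(p, centroid(ps)) for ps in clusters.values() for p in ps)
    return (between / (k - 1)) / (within / (n - k))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])  # labels match the two groups
bad = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])   # labels mix the two groups
print(good, bad)
```

The labelling that matches the two groups scores far higher than the mixed one. Only centroids and one pass over the points are needed, which is why the index is fast to compute. scikit-learn exposes it as `sklearn.metrics.calinski_harabasz_score`.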

Davies-Bouldin Index

This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters themselves. A lower Davies-Bouldin index relates to a model with better separation between the clusters.

  • Advantages
  1. The computation of Davies-Bouldin is simpler than that of Silhouette scores.
  2. The index is computed only from quantities and features inherent to the dataset.
  • Drawbacks
  1. The use of centroid distance limits the distance metric to Euclidean space.
  2. The Davies-Bouldin index generally favors convex clusters, scoring them lower (that is, better) than other concepts of clusters, such as density-based clusters like those obtained from DBSCAN.
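
As a minimal sketch of the usual definition: for each cluster, take the worst-case ratio (s_i + s_j) / d_ij over all other clusters, where s_i is the mean distance of a cluster's points to its centroid and d_ij is the distance between two centroids, then average these ratios over the clusters. The function name `davies_bouldin` and the toy points are illustrative, and Euclidean distance is assumed.

```python
import math

def davies_bouldin(points, labels):
    """Average over clusters of the worst (s_i + s_j) / d_ij ratio."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    centroids = {l: tuple(sum(x) / len(ps) for x in zip(*ps))
                 for l, ps in clusters.items()}
    # s_i: mean distance of a cluster's points to its centroid (cluster "size")
    spread = {l: sum(math.dist(p, centroids[l]) for p in ps) / len(ps)
              for l, ps in clusters.items()}
    total = 0.0
    for i in clusters:
        # similarity to the most similar other cluster
        total += max((spread[i] + spread[j]) / math.dist(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

good = davies_bouldin(points, [0, 0, 0, 1, 1, 1])  # labels match the two groups
bad = davies_bouldin(points, [0, 1, 0, 1, 0, 1])   # labels mix the two groups
print(good, bad)
```

Here the labelling that matches the two groups yields the lower, and therefore better, value. scikit-learn provides the same index as `sklearn.metrics.davies_bouldin_score`.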

References
