Photo by Anne Nygård on Unsplash

Measuring clustering quality

How to measure the goodness of your clustering?

Mastafa Foufa
7 min read · Jan 23, 2023


How did this start for me?

I started with what I had learned at school, then checked whether I was missing any newer metrics for performance evaluation. I quickly landed in the sklearn documentation.

There we can read that evaluating clustering quality is not a trivial task, and is not as intuitive as evaluation in a supervised setting:

"Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm. In particular any evaluation metric should not take the absolute values of the cluster labels into account but rather if this clustering define separations of the data similar to some ground truth set of classes or satisfying some assumption such that members belong to the same class are more similar than members of different classes according to some similarity metric."
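To make the point about label values concrete, here is a minimal sketch in plain Python. The helper names `naive_accuracy` and `rand_index` and the toy labels are assumptions made for illustration; they are not taken from the sklearn documentation. The sketch compares a naive accuracy, which matches label ids directly, with a pair-counting score (the Rand index), which only asks whether two points end up together or apart.

```python
from itertools import combinations


def naive_accuracy(labels_true, labels_pred):
    """Fraction of points whose predicted label id equals the true label id."""
    matches = sum(t == p for t, p in zip(labels_true, labels_pred))
    return matches / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree.

    A pair "agrees" when both labelings put the two points together,
    or both put them apart. Label ids are never compared across labelings.
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical toy labels: the grouping is identical, only the ids are swapped.
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 1, 0, 0, 0]

print(naive_accuracy(labels_true, labels_pred))  # 0.0
print(rand_index(labels_true, labels_pred))      # 1.0
```

The naive accuracy scores a perfect grouping as a total failure because the ids are swapped, while the pair-counting score is unaffected. In practice you would likely reach for `sklearn.metrics.rand_score` or `sklearn.metrics.adjusted_rand_score` rather than a hand-written version.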

Let’s start with a reminder for everyone. Clustering consists of automatically grouping your data based on shared features. Your inputs are grouped (or rather, clustered) into meaningful clusters of similar elements.
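As a small illustration of that reminder, the sketch below clusters a handful of 2-D points with scikit-learn's `KMeans`. The choice of k-means, the toy data and the parameter values are assumptions made for this example, not something prescribed by the article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two visibly separated groups of 2-D points.
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # points near (1, 1)
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],   # points near (8, 8)
])

# Ask for two clusters; random_state is fixed so the run is repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One cluster id per input point. Which group gets id 0 or 1 is arbitrary.
print(labels)
```

The first three points should share one cluster id and the last three the other. Which id each group receives carries no meaning, which is why label values themselves should not enter an evaluation metric.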


Mastafa Foufa

Data Scientist @Microsoft | ex-Teacher @EPITA Paris | 8 patents in AI