Evaluation of Supervised Clustering (Purity) from scratch

2 min read · Nov 17, 2022

Hello! Let’s see how to evaluate a clustering against known class labels using purity.

Purity measures the extent to which each cluster contains a single class. It is calculated as follows: for each cluster, count the number of data points from the most common class in that cluster; then sum these counts over all clusters and divide by the total number of data points. Formally, given a set of clusters M and a set of classes D, both partitioning N data points, purity can be defined as:

\[ \mathrm{purity}(M, D) = \frac{1}{N} \sum_{m \in M} \max_{d \in D} |m \cap d| \]
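
To make the counting concrete, here is a small worked example using only the standard library. The six data points, their cluster ids, and their class names are made up for illustration; they are not from the iris data used later.

```python
from collections import Counter

# Toy data (made up for illustration): six points, two clusters, two classes.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_ids = ["a", "a", "b", "b", "b", "a"]

# Group the true class labels by the cluster each point was assigned to.
members = {}
for cluster, label in zip(cluster_ids, class_ids):
    members.setdefault(cluster, []).append(label)

# For each cluster, count how many points belong to its most common class.
majority_counts = {
    cluster: Counter(labels).most_common(1)[0][1]
    for cluster, labels in members.items()
}

# Sum the per-cluster majority counts and divide by the number of points.
toy_purity = sum(majority_counts.values()) / len(cluster_ids)
print(majority_counts)  # {0: 2, 1: 2}
print(toy_purity)       # 0.6666666666666666
```

Each cluster holds three points, two of which share a class, so the purity is (2 + 2) / 6.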

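Now let’s see how to implement it from scratch in Python, using only NumPy for the metric itself (scikit-learn is imported just to provide a dataset and a clustering model to test on). We import the libraries and define a function called purity that takes the predicted cluster labels (y_clust) and the true class labels (y_class).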

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

def purity(y_clust, y_class):

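Inside the function, we record the number of clusters and the number of data points, and create a list with one slot per cluster to hold the class labels of its members. This assumes the cluster labels are the integers 0, 1, ..., k-1, which is what AgglomerativeClustering returns.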

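    size_clust = np.max(y_clust) + 1  # number of clusters (labels run from 0 to k-1)
    len_clust = len(y_clust)  # total number of data points
    clusters_labels = [None] * size_clust  # one slot per cluster for its class labels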

We then loop over all data points and add each point’s true class to the slot of the cluster it was assigned to.

    for i in range(len_clust):
        index = y_clust[i]
        if clusters_labels[index] is None:
            clusters_labels[index] = y_class[i]
        else:
            clusters_labels[index] = np.hstack((clusters_labels[index], y_class[i]))

Finally, we calculate the purity as described above: for each cluster, we count the occurrences of its most frequent class, sum these counts over all clusters, divide by the total number of data points, and return the result.

    majority_total = 0
    for c in clusters_labels:
        if c is None:  # no points were assigned to this cluster label
            continue
        y = np.bincount(np.atleast_1d(c))  # occurrences of each class in the cluster
        maximum = np.max(y)  # size of the most frequent class
        majority_total += maximum

    return majority_total / len_clust
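
As a possible cross-check on the function above, the same quantity can be computed from a cluster-by-class count table. This is only a sketch and not part of the original walkthrough: the name purity_from_counts is hypothetical, and the relabelling with np.unique is my own addition so that labels do not have to be the integers 0..k-1.

```python
import numpy as np

def purity_from_counts(y_clust, y_class):
    """Purity via a cluster-by-class count table (illustrative helper)."""
    y_clust = np.asarray(y_clust)
    y_class = np.asarray(y_class)
    # Map arbitrary labels to 0..k-1 so they can be used as table indices.
    _, clust_idx = np.unique(y_clust, return_inverse=True)
    _, class_idx = np.unique(y_class, return_inverse=True)
    counts = np.zeros((clust_idx.max() + 1, class_idx.max() + 1), dtype=int)
    # counts[m, d] = number of points in cluster m that have class d.
    np.add.at(counts, (clust_idx, class_idx), 1)
    # Sum the largest entry of each row, then divide by the number of points.
    return counts.max(axis=1).sum() / len(y_clust)

# Small made-up example: two clusters of three points each.
print(purity_from_counts([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 0]))
```

On the same inputs, this should agree with purity(y_clust, y_class); a mismatch would point to a bug in one of the two versions.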

Now let’s try it all out. We take a clustering method from scikit-learn (AgglomerativeClustering), use the iris species as the true classes, and measure the purity of the resulting clustering.

dataset = load_iris()
x = dataset.data
y = dataset.target
n_sample = x.shape[0]
n_features = x.shape[1]

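# Euclidean distance is the default, so it is not passed explicitly
# (the `affinity` argument was deprecated and later removed in favor of `metric`).
agg_cluster = AgglomerativeClustering(n_clusters=2, compute_full_tree='auto', linkage='average') # linkage options: 'ward', 'single', 'average', 'complete'
y_clust = agg_cluster.fit_predict(x)
y_class = y
print("Purity: ", purity(y_clust, y_class))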

There you go! In a few lines, you have implemented a supervised evaluation measure for clustering. I’ll leave you some links below. See you next time!

Evaluation of clustering (nlp.stanford.edu)

Evaluation Metrics for Clustering Models: 3 different metrics for clustering explained (towardsdatascience.com)

Written by vincydesy