Evaluation of Supervised Clustering (Purity) from scratch

vincydesy
2 min read · Nov 17, 2022

Hello! Let’s see how to evaluate a clustering against known class labels using purity.

Purity measures the extent to which each cluster contains a single class. It is calculated as follows: for each cluster, count the data points that belong to the most common class in that cluster; then sum these counts over all clusters and divide by the total number of data points. Formally, given a set of clusters M and a set of classes D, both partitioning N data points, purity can be defined as:

$$\text{purity} = \frac{1}{N} \sum_{m \in M} \max_{d \in D} |m \cap d|$$
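
To make the definition above concrete, here is a tiny worked example in plain Python. The six labels are made up purely for illustration (they are not from the iris data used later), and the variable names are my own.

```python
from collections import Counter

# Hypothetical toy data: 6 points, 2 clusters, 2 classes.
clusters = [0, 0, 0, 1, 1, 1]   # cluster assigned to each point
classes = [0, 0, 1, 1, 1, 1]    # true class of each point

# Group the true classes by cluster.
members = {}
for cluster_id, class_id in zip(clusters, classes):
    members.setdefault(cluster_id, []).append(class_id)

# For each cluster, count the points in its most common class.
majority_counts = {
    cluster_id: Counter(labels).most_common(1)[0][1]
    for cluster_id, labels in members.items()
}

# Sum the majority counts and divide by the number of points.
toy_purity = sum(majority_counts.values()) / len(clusters)
print(majority_counts)  # {0: 2, 1: 3}
print(toy_purity)       # 5/6, about 0.833
```

Cluster 0 holds two points of class 0 and one of class 1, so it contributes 2; cluster 1 is entirely class 1, so it contributes 3. The purity is (2 + 3) / 6.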

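Now let’s see how to program it from scratch in Python, using only numpy for the metric itself (sklearn is imported just for the dataset and the clustering model used later). We load the libraries and define a function called purity that takes the cluster assignments (y_clust) and the true classes (y_class).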

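import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris


def purity(y_clust, y_class):
    # y_clust: cluster id of each point, y_class: true class of each point
    # (both are assumed to be non-negative integers)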

Inside the function, we compute the number of clusters and the number of data points, and create a list with one slot per cluster.

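    size_clust = np.max(y_clust) + 1       # number of clusters (ids assumed to be 0..k-1)
    len_clust = len(y_clust)               # total number of data points
    clusters_labels = [None] * size_clust  # true classes collected per cluster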

Then, for every data point, I add its true class to the slot of the cluster it was assigned to.

    for i in range(len_clust):
        index = y_clust[i]
        if clusters_labels[index] is None:
            # first point seen for this cluster: start a new list
            clusters_labels[index] = [y_class[i]]
        else:
            clusters_labels[index].append(y_class[i])

We then compute the purity as described above: for each cluster, count the occurrences of its most frequent class, add these counts up, divide by the total number of data points, and return the result.

    majority_total = 0
    for c in clusters_labels:
        if c is None:  # skip cluster ids that received no points
            continue
        counts = np.bincount(np.atleast_1d(c))  # occurrences of each class in this cluster
        majority_total += np.max(counts)        # size of the most frequent class

    return majority_total / len_clust
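
As a cross-check for the loop-based function, the same quantity can also be read off a cluster-by-class contingency table. The sketch below is an alternative I am adding for comparison, not the method used in this article; purity_from_table is a hypothetical name, and it assumes non-empty, one-dimensional label sequences.

```python
import numpy as np


def purity_from_table(y_clust, y_class):
    """Purity via a cluster-by-class contingency table (hypothetical helper)."""
    y_clust = np.asarray(y_clust)
    y_class = np.asarray(y_class)
    # Map arbitrary labels to consecutive indices 0..k-1.
    _, clust_idx = np.unique(y_clust, return_inverse=True)
    _, class_idx = np.unique(y_class, return_inverse=True)
    # table[i, j] = number of points in cluster i that have class j.
    table = np.zeros((clust_idx.max() + 1, class_idx.max() + 1), dtype=int)
    np.add.at(table, (clust_idx, class_idx), 1)
    # Sum the largest entry of each row, then divide by the number of points.
    return table.max(axis=1).sum() / y_clust.size


print(purity_from_table([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 5/6, about 0.833
```

Because the labels are re-indexed with np.unique, this version does not require cluster ids to run from 0 to k-1, and it accepts non-integer class labels such as strings.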

Now let’s try it all out. Let’s take a clustering method from sklearn (AgglomerativeClustering), load the iris data together with its true classes, and measure the purity of the resulting clustering.

dataset = load_iris()
x = dataset.data
y = dataset.target
n_sample = x.shape[0]
n_features = x.shape[1]

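# Euclidean distance is the default, so the distance argument is omitted
# (it is named `affinity` in older scikit-learn releases and `metric` in newer ones).
agg_cluster = AgglomerativeClustering(n_clusters=2, compute_full_tree='auto', linkage='average')  # other linkages: 'ward', 'single', 'complete'
y_clust = agg_cluster.fit_predict(x)  # cluster the feature matrix x loaded above
y_class = y
print("Purity: ", purity(y_clust, y_class))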
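
When reading the printed value, it helps to know what range to expect. The sketch below uses synthetic labels shaped like the iris targets (three classes of 50 points each), not the output of the clustering above, and a compact helper with a hypothetical name, simple_purity, written in plain Python so that it runs on its own.

```python
from collections import Counter


def simple_purity(clusters, classes):
    """Compact purity helper (hypothetical name) used only for this check."""
    members = {}
    for cluster_id, class_id in zip(clusters, classes):
        members.setdefault(cluster_id, []).append(class_id)
    majority = sum(Counter(v).most_common(1)[0][1] for v in members.values())
    return majority / len(classes)


# Three balanced classes of 50 points, like the iris targets.
classes = [0] * 50 + [1] * 50 + [2] * 50

# Two clusters: one holds class 0, the other merges classes 1 and 2.
two_clusters = [0] * 50 + [1] * 100
print(simple_purity(two_clusters, classes))  # 100/150, about 0.667

# One cluster per point: every cluster is trivially pure.
singletons = list(range(150))
print(simple_purity(singletons, classes))    # 1.0
```

With only two clusters and three balanced classes, each cluster can contribute at most 50 points, so purity cannot exceed 100/150. At the other extreme, giving every point its own cluster yields a purity of 1, so purity values are best compared between clusterings with the same number of clusters.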

Here you are! In a few lines you have created a supervised evaluation method. I’ll leave you some links at the bottom. See you next time!
