ML101: Unsupervised Machine Learning - K-Means Clustering

PTU AI CLUB
7 min read · Mar 11, 2024


Author: Yuvarani VD

Hey techies, PTU AI CLUB welcomes you with yet another article in the ongoing series ML 101. First, let’s brush up on some basics to follow along with today’s topic.

In supervised learning, we were given a training set with labels; that is, each example in the training data was tagged with its output, and our goal was to find the decision boundary that separates the different output labels.

Training set: { (x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), (x⁽³⁾, y⁽³⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾) }

where x⁽¹⁾ is the input of the first training example and y⁽¹⁾ is its corresponding output label. ‘m’ denotes the total number of training samples in the dataset.

Fig 1. Supervised Learning.

In contrast, in unsupervised learning we have a training set only with the input features. We don’t have an idea about the output labels.

Training set: { x⁽¹⁾, x⁽²⁾, x⁽³⁾, …, x⁽ᵐ⁾ }

Here, we have only the input features x and not the output labels y.

Fig2. Unsupervised Learning

We give this unlabelled training set to the learning algorithm and want it to find the underlying structure or pattern in the data. One such structure is a grouping of the data into separate clusters; an algorithm that groups data this way is called a clustering algorithm. Unsupervised learning algorithms can find many different types of structure in data, and clustering algorithms are just one family among them. Algorithms that find other types of patterns or structures will be dealt with in later sections.

Among these clustering algorithms, K-means clustering is the most popularly used algorithm.

This algorithm groups the input data into clusters, and the number of clusters is specified by the parameter ‘K’. For example, if K is 2, then the algorithm has to group the given input data into two clusters or two groups. That is the justification for calling this algorithm ‘K-means clustering’.
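To make the role of K concrete, here is a minimal usage sketch with scikit-learn’s KMeans (assuming scikit-learn is installed; the data points here are made up for illustration):

```python
# Clustering four 2-D points into K = 2 groups with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9],   # two points near (1, 1)
              [8.0, 8.2], [7.9, 8.1]])  # two points near (8, 8)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # the K centroid positions
```

With K = 2, the two points near (1, 1) end up in one cluster and the two near (8, 8) in the other.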

So how is this clustering done ?

First, we find a cluster centroid for each cluster, which is like the center spot of that particular cluster.

So for K clusters, we have K cluster centroids (note that it is uppercase K for the total number of clusters).

Hence, the input to the K-means algorithm is the parameter K (the number of clusters) and the training set { x⁽¹⁾, x⁽²⁾, x⁽³⁾, …, x⁽ᵐ⁾ }, where x⁽ⁱ⁾ ∈ ℝⁿ (an n-dimensional vector).

K-means is an iterative algorithm which works in two important steps.

Fig3. Steps in K-Means

Steps in K-Means:

Step 1: CLUSTER ASSIGNMENT STEP

This is the first step inside the loop. Here, the algorithm goes through each data point, measures the distance between that point and every initialized cluster centroid, and assigns the point to the cluster whose centroid is closest. For example, consider the data point shown in pink. In this step, the model calculates the distance between it and the cluster centroids shown in blue and purple, finds that the pink data point is closer to cluster centroid 2 (shown in blue), and so assigns the pink data point to cluster 2.

Fig4. Cluster Assignment

In this way, every data point is mapped to its nearest cluster centroid in Step 1. Note that the cluster centroids remain unmoved here; only the data points are assigned.
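The assignment step above can be sketched in a few lines of NumPy (a sketch, assuming X is an (m, n) array of data points and `centroids` is a (K, n) array of current centroids; the sample values are made up):

```python
# Step 1 (cluster assignment): map each point to its nearest centroid.
import numpy as np

def assign_clusters(X, centroids):
    # Broadcasting gives diffs of shape (m, K, n): every point minus every centroid.
    diffs = X[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=2)   # squared Euclidean distances, shape (m, K)
    return dists.argmin(axis=1)        # index of the closest centroid per point

X = np.array([[0.0, 0.0], [0.5, 0.2], [9.0, 9.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
print(assign_clusters(X, centroids))  # → [0 0 1]
```

The first two points are nearest to centroid 0 and the last point to centroid 1, so the returned labels are [0, 0, 1].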

STEP 2: MOVE CENTROID STEP

Here, each cluster centroid is moved to the average location of the data points in that particular cluster. The algorithm looks at the data points belonging to a cluster, finds the mean of their locations, and then moves the cluster centroid there.
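The move-centroid step can likewise be sketched as follows (assuming `labels` holds the cluster index assigned to each data point in Step 1; the sample values are made up):

```python
# Step 2 (move centroid): each centroid becomes the mean of its cluster's points.
import numpy as np

def move_centroids(X, labels, K):
    # For each cluster k, average the points currently assigned to it.
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])

X = np.array([[0.0, 0.0], [2.0, 0.0], [9.0, 9.0]])
labels = np.array([0, 0, 1])
print(move_centroids(X, labels, K=2))  # centroid 0 moves to [1., 0.]
```

Centroid 0 moves to the midpoint (1, 0) of its two points, while centroid 1 sits exactly on its single point (9, 9).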

Fig5. Cluster Centroid

Then Step 1 is repeated so that the cluster memberships of the data points are updated, and in Step 2 the cluster centroids are recomputed from those updated memberships.

After another iteration:

Hence, these two steps are repeated until the algorithm converges.

Okay, so now we know how the algorithm works, but are there any criteria to stop the clustering? Yes, absolutely. Let’s look at them:

Stopping Criteria for K-Means Clustering:

There are essentially three stopping criteria that can be adopted to stop the K-means algorithm:

● Centroids of newly formed clusters do not change

● Points remain in the same cluster

● Maximum number of iterations is reached

We can stop the algorithm if the centroids of newly formed clusters are not changing. Even after multiple iterations, if we are getting the same centroids for all the clusters, we can say that the algorithm is not learning any new pattern, and it is a sign to stop the training. Another clear sign that we should stop the training process is if the points remain in the same cluster even after training the algorithm for multiple iterations.
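The two steps and all three stopping criteria can be combined into one minimal loop, sketched below (a simplified implementation assuming distinct data points; the random initialization is discussed in the next section):

```python
# A minimal K-means loop with the three stopping criteria.
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at K distinct randomly chosen data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(max_iters):                       # criterion 3: iteration cap
        # Step 1: assign each point to its nearest centroid (squared Euclidean).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(labels, new_labels):
            break                                    # criterion 2: assignments stable
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break                                    # criterion 1: centroids stable
        centroids = new_centroids
    return centroids, labels
```

On well-separated data this loop typically stops after a handful of iterations because the assignments stop changing.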

THE ALGORITHM:

First, we randomly initialize K cluster centroids (we will discuss this initialization in a later part). The cluster centroids are denoted by µ₁, µ₂, µ₃, …, µ_K ∈ ℝⁿ (‘k’ denotes the index of a cluster centroid and K denotes the total number of cluster centroids). Then, as discussed earlier, the algorithm proceeds with the two steps mentioned above.

Fig 6 . K-Means

So, by now we are familiar with the cluster assignment step, which assigns each data point to the cluster whose centroid is closest to it. In total there are ‘m’ data points, and each of them is assigned to a particular cluster centroid by calculating the squared Euclidean distance. The index ‘k’ of that closest cluster centroid is the value assigned to C⁽ⁱ⁾, where i is the index of the data point.

For example, consider C⁽⁵⁾ = 2; C⁽⁷⁾ = 2; C⁽²⁾ = 2.

It means that the data points with indices 5, 7, and 2 (i.e., x⁽⁵⁾, x⁽⁷⁾, and x⁽²⁾ respectively) are closest to cluster centroid 2, and so they are assigned to cluster 2.

Now that we have the clusters with their assigned data points, we reposition each existing cluster centroid by finding the mean position of the data points belonging to that particular cluster. So µₖ takes the value of this average, which repositions the centroid of the cluster with index ‘k’.

For example, after Step 1, cluster 2 (k = 2) had the data points x⁽⁵⁾, x⁽⁷⁾, and x⁽²⁾ (each x ∈ ℝⁿ). In Step 2: µ₂ = (x⁽⁵⁾ + x⁽⁷⁾ + x⁽²⁾) / 3.
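A quick numerical sketch of that update, with made-up 2-D values for x⁽⁵⁾, x⁽⁷⁾, and x⁽²⁾ (the actual vectors are hypothetical):

```python
# Computing the new centroid µ2 as the mean of the points assigned to cluster 2.
import numpy as np

x5, x7, x2 = np.array([2.0, 4.0]), np.array([4.0, 6.0]), np.array([3.0, 5.0])
mu2 = (x5 + x7 + x2) / 3   # mean of the three points in cluster 2
print(mu2)  # → [3. 5.]
```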

Now that you have understood the algorithm, one question might arise in your mind: what if there exists a cluster centroid with no data points assigned to it after both steps? Well, you can simply eliminate that cluster centroid, leaving you with K−1 clusters instead of K. Alternatively, you can randomly reinitialize that cluster centroid. The first approach of simply eliminating the centroid is more common when no data point is assigned to it.
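Both options for an empty cluster can be sketched as a small helper (a hypothetical function; `labels` and `centroids` are assumed to come from the current iteration):

```python
# Handling centroids that received no points: drop them, or re-initialize them.
import numpy as np

def handle_empty_clusters(X, labels, centroids, drop=True, rng=None):
    # occupied[k] is True if at least one point is assigned to cluster k.
    occupied = np.isin(np.arange(len(centroids)), labels)
    if occupied.all():
        return centroids                   # no empty clusters, nothing to fix
    if drop:
        return centroids[occupied]         # option 1: eliminate -> K-1 clusters
    rng = rng or np.random.default_rng()
    centroids = centroids.copy()
    # Option 2: re-initialize each empty centroid at a randomly chosen data point.
    centroids[~occupied] = X[rng.choice(len(X), size=(~occupied).sum())]
    return centroids
```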

The most important requirement for a good output from this algorithm is that the data points within a cluster must be similar to each other, and the data points in two different clusters must be dissimilar.

In the next article, we will focus on the optimization aspect and on choosing the value of ‘K’, along with a few code snippets. Until then, happy Machine Learning!
