CLUSTERING

Akshita Guru
4 min read · Apr 5, 2024


Welcome back! We have learned about supervised learning thus far. Let’s now examine unsupervised learning. The fundamentals are available here.

What is Clustering?

  • A clustering algorithm looks at a set of data points and automatically groups together points that are related or similar to each other. It is a form of unsupervised learning.

Consider a dataset with two features, x_1 and x_2. In supervised learning, we had a training set with both the input features x and the labels y.

In contrast, in unsupervised learning you are given a dataset with just the inputs x and no target labels y. Because there are no labels, we cannot tell the algorithm what the “right answer” y is that we want it to predict.

Instead, we ask the algorithm to find some interesting structure in the data. The first unsupervised learning algorithm you will learn about is clustering, which looks for one particular type of structure: it examines the dataset and tries to see whether the points can be grouped into clusters, meaning groups of points that are similar to each other.

A popular clustering algorithm is K-means, whose two repeated steps are discussed below.

  • The first thing the K-means algorithm does is take a random guess at where the centers of the two clusters you asked it to find might be. That is, it randomly picks two points, shown here as a red cross and a blue cross, as initial guesses for the centers of two different clusters. These centers are called cluster centroids. After making this initial guess, it goes through all of the examples, x(1) through x(30) (here, 30 data points), and for each one checks whether it is closer to the red cluster centroid or to the blue cluster centroid, assigning the point to whichever is nearer, as sketched below.
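
To make this assignment step concrete, here is a minimal sketch in Python with NumPy. The article itself contains no code, so the function name assign_clusters and its details are illustrative assumptions, not the course's implementation.

```python
import numpy as np

def assign_clusters(X, centroids):
    """Assign each point in X to the index of its nearest cluster centroid.

    X         : (m, n) array of m data points with n features
    centroids : (K, n) array of K cluster centroids
    Returns an (m,) array of cluster indices in {0, ..., K-1}.
    """
    # Squared Euclidean distance from every point to every centroid,
    # computed by broadcasting: the result has shape (m, K).
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each point gets the index of the centroid it is closest to.
    return distances.argmin(axis=1)
```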

The second of the two steps is to look at all of the red points and take their average: the red cross, that is, the red cluster centroid, is moved to the average location of the red points. The same is then done for all the blue points, as sketched below.
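
The centroid-update step can be sketched the same way (again an illustrative assumption, reusing the NumPy setup from the snippet above): each centroid simply moves to the mean of the points currently assigned to it.

```python
def move_centroids(X, labels, K):
    """Move each centroid to the mean of the points assigned to it."""
    # For each cluster k, average the points whose label is k.
    # (Handling of empty clusters, e.g. re-initializing the centroid,
    # is omitted from this sketch.)
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])
```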

Now we check, for every point, whether it is closer to the red or the blue cluster centroid at its new location, and again associate each point (indicated by its color) with the closer centroid. If you do that, you will see that a few points change color.

Repeating the averaging step then moves the red cross and the blue cross to new locations once again.

If you keep repeating these two steps, that is, assigning each point to its nearest cluster centroid and then moving each cluster centroid to the mean of all the points with the same color, you eventually find that there are no more changes to the colors of the points or to the locations of the cluster centroids. At this point the K-means algorithm has converged, because applying the two steps over and over produces no further changes to either the assignment of points to centroids or the locations of the cluster centroids. In this example, K-means has done a pretty good job: it has found that the points in one region correspond to one cluster, and the points in the other region correspond to a second cluster. So now you have seen an illustration of how K-means works, and a minimal end-to-end sketch follows below.
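
Putting the two steps together gives a minimal end-to-end sketch. The initialization picks K existing data points at random as the starting centroids, as described above, and the loop stops once the assignments stop changing, which is exactly the convergence condition just discussed. As before, the helper names are assumptions for illustration.

```python
def k_means(X, K=2, max_iters=100, seed=0):
    """Minimal K-means: alternate assignments and centroid updates until convergence."""
    rng = np.random.default_rng(seed)
    # Initial random guess: pick K distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(max_iters):
        new_labels = assign_clusters(X, centroids)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed cluster, so K-means has converged
        labels = new_labels
        centroids = move_centroids(X, labels, K)
    return centroids, labels
```

For instance, with the 30 points from the walkthrough stored in an array X of shape (30, 2), k_means(X, K=2) would return the two final centroids and a cluster index of 0 or 1 for each point.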

I hope you found this summary of clustering to be interesting.

You can connect with me on the following:

Linkedin | GitHub | Medium | email : akshitaguru16@gmail.com
