The Complete Guide to K-Means Clustering: Part 1 — The Basics

Tushaar Batheja
6 min read · May 28, 2023


Image Segmentation using K-Means

In this installment of a three-part series, we will walk step by step through everything you need to know to understand K-Means clustering.

Part 1: The Basics

  1. Basics: Understanding clustering, Euclidean Distance, etc.
  2. Intuition: A Visual Walkthrough of K-Means in Action

Part 2: Coding the Algorithm from Scratch

  1. Algorithm: Formal Overview
  2. Code Implementation: Python Implementation from Scratch

Part 3: Real Life Implementation

  1. Conclusion: Using our K-Means implementation for Image Compression

Feel free to skip to a particular section. By the end, you will have a solid understanding of K-Means and be able to apply it to your own datasets.

In Part 1, let us explore K-Means through the lens of an absolute beginner.

Basics:

Before we formally begin with K-Means clustering, let us make sure we understand these concepts:

  1. What is Clustering
  2. Euclidean Distance: What and Why?
  3. Centroids

If you are already familiar with these things, feel free to skip to the next section.

What is Clustering?

Clustering, in its simplest form, is just grouping. Imagine you have a basket of fruits. You could group the fruits in the basket based on their features, like color, shape, or taste. In essence, you choose a 'feature' that seems important, and then group the fruits based on their similarity to each other.

Clustering Algorithms help us achieve the same effect by finding patterns in data and grouping similar data to form clusters.

Euclidean Distance: What and Why?

When data is represented as points in a multi-dimensional setting, a good measure of similarity in features is how close two data points are. Euclidean Distance is the straight-line distance between any two points. It is measured as the square root of the sum of squared differences between corresponding coordinates.
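For instance, here is a minimal sketch of that formula in Python with NumPy (the point coordinates are made up for illustration):

```python
import numpy as np

# Two example points in 2D (made-up values)
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((p - q) ** 2))
print(distance)  # 5.0
```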

Centroids

A centroid represents the center point of a cluster, and it is computed as the mean, or average, of all the data points in that cluster. In basic coordinate geometry, the centroid of a triangle is famously taught as the mean of the x-coordinates and the mean of the y-coordinates of its vertices.

We extend that same idea, only now our 'points' are all the points in the cluster in question.

The black dots represent the centroids of each cluster. No matter how many dimensions our data points are in, the centroid is still the mean along each respective dimension.

Figure showcasing Ideal clustering
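As a quick sketch (with made-up cluster points), computing a centroid is just a per-dimension mean:

```python
import numpy as np

# Example points belonging to one cluster (rows = points, columns = dimensions)
cluster = np.array([
    [1.0, 2.0],
    [2.0, 3.0],
    [3.0, 4.0],
])

# The centroid is the mean along each dimension
centroid = cluster.mean(axis=0)
print(centroid)  # [2. 3.]
```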

Intuition: A Visual Walkthrough of K-Means in Action

In this section, we will try to manually cluster the data points given below, building intuition about how we should approach the problem. For now, let us assume we want to make two clusters. Remember, the number of clusters might not be a given in K-Means; we shall come back to this later.

Your initial thought might be to group the left 4 dots and the right 4 dots into two big circle-like blobs. That would be correct, but we want to approach the question like a computer, not a human shaped by millions of years of evolution to identify patterns. However, the circle-like blob does lend us an idea: if the goal is in fact to make a circle-like blob, what is the one property every circle has? It has a center! (or a centroid, in our case)

Step 1: Initialize

Now the question is, where should these centroids be? The answer is: we don't know (not yet). Our best guess right now is as good as a random point. So we will do exactly that. Randomize!

The orange and green diamonds represent our two centroids.
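As a sketch of what this initialization might look like in code (the dataset here is a hypothetical stand-in, and this is just one possible strategy):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

k = 2  # number of clusters we want
points = rng.random((8, 2))  # hypothetical dataset: 8 points in 2D

# Place each centroid at random coordinates inside the data's bounding box.
# (A common alternative is to pick k random data points as the starting centroids.)
centroids = rng.uniform(points.min(axis=0), points.max(axis=0), size=(k, 2))
print(centroids)
```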

Step 2: Cluster Assignment

Now that we have our centroids, the next task is to make a blob around them. Since we only have two clusters, every point will belong to either orange or green. So how do we decide which one? Remember, simple is key. All we need is the Euclidean Distance!

We calculate the distance between each point and each cluster center, and assign the point to the closest one. This has been represented by the arrows above.
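Here is a minimal sketch of that assignment step (the `points` and `centroids` arrays are hypothetical stand-ins for the figure's data):

```python
import numpy as np

def assign_clusters(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    # Pairwise distances from every point to every centroid, shape (n_points, k)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the closest centroid for each point
    return np.argmin(distances, axis=1)

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(assign_clusters(points, centroids))  # [0 0 1]
```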

So, are we done? Uhm, not exactly. Notice the clusters aren't quite what we initially imagined them to be. You might also be thinking that the result is highly dependent on the initial centroid positions. You are correct, so let us think of a way to update our centroids.

Step 3: Update Centroids

Our initial idea was to make a circle-like cluster. But in a circle-like cluster, the 'center' is roughly equidistant from all points in the cluster. In our case, however, the centroid is just a random point that happened to be closest to some points. So, in this step, what if we calculate the actual centroid coordinates from all the points in a cluster, and update our centroid to match them?

The arrows have been left behind deliberately to give a sense of the change in position that has occurred. Notice how each centroid now sits much closer to the center of the points it is responsible for!
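A sketch of the update step (reusing the hypothetical `points` array and the labels produced by the assignment step above):

```python
import numpy as np

def update_centroids(points, labels, k):
    """Move each centroid to the mean of the points assigned to it.

    Assumes every cluster has at least one assigned point.
    """
    return np.array([points[labels == i].mean(axis=0) for i in range(k)])
```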

But uh oh, remember our first rule? Each point is assigned to the centroid closest to it. We just shifted our centroids, and in doing so, broke that rule! Some points in a cluster might get 'angry' now, finding themselves closer to another centroid. It seems we need to reassign clusters.

Step 4: Reassign Clusters

Notice how the data point at roughly (0.8, 1.2) has defected to the orange gang. I think you know what to do next. Since the points in each cluster have changed, their ideal centroids have changed too!

Step 5: Update Centroids (again)

Notice the drastic shift in the centroids this time. It seems like we are getting closer to our initial thought of two left and right blobs. Let's repeat steps 4 and 5 and see what happens, keeping an eye out for any points that change their allegiance.

Reassign Clusters

Update Centroids

Notice how we are approaching saturation, or convergence: no points need to change clusters anymore. Hence, our final answer becomes:

I hope that helped you build an intuition for how simple the idea behind K-Means is, before all the technical jargon comes in. In Part 2, we will explore K-Means more formally and address some of the technicalities we glossed over in our basic example.
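Putting the whole walkthrough together, the procedure amounts to one short loop. Here is a minimal sketch (the point values below are made up to mimic our two-blob figure; Part 2 builds a fuller implementation):

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: initialize, then alternate assign and update."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Stop once the centroids no longer move (the 'saturation' we saw above)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

points = np.array([[0.9, 1.1], [1.1, 0.9], [0.8, 1.2], [1.2, 0.8],
                   [4.9, 5.1], [5.1, 4.9], [4.8, 5.2], [5.2, 4.8]])
centroids, labels = k_means(points, k=2)
print(labels)  # e.g. [0 0 0 0 1 1 1 1]
```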

Another animation showcasing K-Means in action. The code can be found here on my GitHub.
