All about “K-means” Clustering

Praveen Pareek
Mar 29 · 8 min read
In this article, we’re going to talk about K-means clustering!

We’ll learn how to cluster samples that can be put on a line, on an X-Y graph, and even on a heat-map. Lastly, we’ll also talk about how to pick the best value for K.

Imagine you had some data that you could plot on a line, and you knew you needed to put it into 3 clusters. Maybe they are measurements from 3 different types of tumors or other cell types.

In this case, the data form three relatively obvious clusters. But rather than rely on our eyes, let’s see if we can get a computer to identify the same 3 clusters.

To do this, we’ll use K-means clustering.

We’ll start with raw data that we haven’t yet clustered.

Step 1: Select the number of clusters you want to identify in your data. This is the “K” in “K-means clustering”.

In this case, we’ll select K=3. That is to say, we want to identify 3 clusters.

There is a fancier way to select a value for “K”, but we’ll talk about that later.

Step 2: Randomly select 3 distinct data points. These are the initial clusters.
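As a quick sketch in Python, Step 2 amounts to sampling K distinct points. The data values here are made up purely for illustration:

```python
import random

# Made-up 1-D measurements, roughly forming three groups.
data = [1.0, 1.2, 1.5, 4.0, 4.3, 4.6, 8.0, 8.2, 8.5]

random.seed(0)  # fixed seed so the example is reproducible
# random.sample draws K *distinct* data points to serve as the
# initial clusters.
initial_clusters = random.sample(data, k=3)
print(initial_clusters)
```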

Step 3: Measure the distance between the 1st point and the three initial clusters.

Step 4: Assign the 1st point to the nearest cluster. In this case, the nearest cluster is the blue cluster.

Now do the same thing for the next point.

We measure the distances… Assign the point to the nearest cluster.

Now we figure out which cluster the 3rd point belongs to.

We measure the distances… And assign the point to the nearest cluster.

The rest of these points are closest to the orange cluster, so they’ll go in that one, too.

Now that all of the points are in clusters, we go on to…

Step 5: Calculate the mean of each cluster.

Then we repeat what we just did (measure and cluster) using the mean values.

Since the clustering did not change at all during the last iteration, we’re done…
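The whole loop we just walked through can be sketched in a few lines of Python. This is a minimal illustration on made-up 1-D data, not a production implementation:

```python
def kmeans_1d(data, clusters, max_iter=100):
    """Steps 3-5: assign each point to its nearest cluster, recompute
    the cluster means, and repeat until the assignments stop changing."""
    assignments = None
    for _ in range(max_iter):
        # Measure the distance from each point to each cluster and
        # assign the point to the nearest one (Steps 3 and 4).
        new = [min(range(len(clusters)), key=lambda i: abs(x - clusters[i]))
               for x in data]
        if new == assignments:   # nothing changed, so we're done
            break
        assignments = new
        # Step 5: each cluster moves to the mean of its points.
        for i in range(len(clusters)):
            members = [x for x, a in zip(data, assignments) if a == i]
            if members:          # guard against an empty cluster
                clusters[i] = sum(members) / len(members)
    return clusters, assignments

data = [1.0, 1.2, 1.5, 4.0, 4.3, 4.6, 8.0, 8.2, 8.5]
clusters, labels = kmeans_1d(data, clusters=[1.0, 4.0, 8.0])
print([round(c, 2) for c in clusters])  # → [1.23, 4.3, 8.23]
```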

This first K-means clustering is pretty terrible compared to what we did by eye.

We can assess the quality of the clustering by adding up the variation within each cluster.

Here is the total variation within the clusters.
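For a finished clustering, the total within-cluster variation is just the sum of squared distances from each point to its cluster’s mean. A quick sketch, again with invented values:

```python
# Three finished clusters (invented values for illustration).
clusters = {
    "blue":   [1.0, 1.2, 1.5],
    "orange": [4.0, 4.3, 4.6],
    "green":  [8.0, 8.2, 8.5],
}

def total_variation(clusters):
    total = 0.0
    for points in clusters.values():
        mean = sum(points) / len(points)
        # Squared distance of each point to its cluster's mean.
        total += sum((x - mean) ** 2 for x in points)
    return total

print(round(total_variation(clusters), 4))  # → 0.4333
```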

Since K-means clustering can’t “see” the best clustering, its only option is to keep track of these clusters, and their total variation, and do the whole thing over again with different starting points.

So, here we are again, back at the beginning.

K-means clustering picks 3 initial clusters… and then clusters all the remaining points, calculates the mean of each cluster, and then re-clusters based on the new means. It repeats until the clusters no longer change.
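One way to sketch this restart logic: run the same assign-and-re-average loop from several different starting points and keep the run with the lowest total variation. The data and starting points below are invented for illustration:

```python
def run_once(data, clusters, max_iter=100):
    """One full K-means run; returns (total variation, final clusters)."""
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(clusters)), key=lambda i: abs(x - clusters[i]))
               for x in data]
        if new == labels:
            break
        labels = new
        for i in range(len(clusters)):
            members = [x for x, a in zip(data, labels) if a == i]
            if members:
                clusters[i] = sum(members) / len(members)
    variation = sum((x - clusters[a]) ** 2 for x, a in zip(data, labels))
    return variation, clusters

data = [1.0, 1.2, 1.5, 4.0, 4.3, 4.6, 8.0, 8.2, 8.5]
# Three different starting points: one bad one and two decent ones.
starts = [[1.0, 1.2, 1.5], [1.0, 4.0, 8.0], [4.0, 4.3, 8.5]]
best_variation, best_clusters = min(run_once(data, list(s)) for s in starts)
print(round(best_variation, 4))  # → 0.4333
```

The bad start (all three initial clusters drawn from the same group) converges to a much worse clustering, which is exactly why the best-of-several-restarts bookkeeping matters.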

Now that the data are clustered, we sum the variation within each cluster.

And then do it all again…

At this point, K-means clustering knows that the 2nd clustering is the best clustering so far. But it doesn’t know if it’s the best overall, so it will try a few more clusterings (it does as many as you tell it to) and then come back and return that one if it is still the best.

Question: How do you figure out what value to use for “K”?

With this data, it’s obvious that we should set K to 3, but other times it is not so clear.

One way to decide is to just try different values for K.

Start with K = 1.

K = 1 is the worst case scenario. We can quantify its “badness” with the total variation.

Now try K = 2.

K = 2 is better, and we can quantify how much better by comparing the total variation within the 2 clusters to K = 1.

Now try K = 3.

K = 3 is even better! We can quantify how much better by comparing the total variation within the 3 clusters to K = 2.

Now try K = 4.

The total variation within each cluster is less than when K = 3.

Each time we add a new cluster, the total variation within the clusters is smaller than before. And when each point is its own cluster, the variation = 0.

However, if we plot the reduction in variance per value of K… there is a huge reduction in variation with K = 3, but after that, the variation doesn’t go down as quickly.

This is called an “elbow plot”, and you can pick “K” by finding the “elbow” in the plot.
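A sketch of that procedure in Python: for each K, keep the best of several restarts and record the total variation, then look for the elbow in the printed values. The data are invented for illustration:

```python
import random

def kmeans_variation(data, k, seed):
    """Run K-means once from a random start; return the total variation."""
    rng = random.Random(seed)
    clusters = rng.sample(data, k)
    labels = None
    for _ in range(100):
        new = [min(range(k), key=lambda i: abs(x - clusters[i])) for x in data]
        if new == labels:
            break
        labels = new
        for i in range(k):
            members = [x for x, a in zip(data, labels) if a == i]
            if members:
                clusters[i] = sum(members) / len(members)
    return sum((x - clusters[a]) ** 2 for x, a in zip(data, labels))

data = [1.0, 1.2, 1.5, 4.0, 4.3, 4.6, 8.0, 8.2, 8.5]
variations = []
for k in range(1, 6):
    # Best of 10 restarts for each K, as described above.
    variations.append(min(kmeans_variation(data, k, s) for s in range(10)))
    print(k, round(variations[-1], 3))
```

On this data, the variation drops sharply up to K = 3 and only creeps down after that, so the elbow sits at 3.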

Question: How is K-means clustering different from hierarchical clustering?

K-means clustering specifically tries to put the data into the number of clusters you tell it to.

Hierarchical clustering just tells you, pairwise, what two things are most similar.

Question: What if our data isn’t plotted on a number line?

Just like before, you pick three random points…

And we use the Euclidean distance. In 2 dimensions, the Euclidean distance is the same thing as the Pythagorean theorem.
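In code, the 2-D distance is exactly the Pythagorean theorem:

```python
import math

def euclidean_2d(p, q):
    # Pythagorean theorem: sqrt(dx^2 + dy^2).
    dx, dy = p[0] - q[0], p[1] - q[1]
    return math.sqrt(dx ** 2 + dy ** 2)

# The classic 3-4-5 right triangle.
print(euclidean_2d((0, 0), (3, 4)))  # → 5.0
```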

Then, just like before, we assign the point to the nearest cluster.

And, just like before, we then calculate the center of each cluster and re-cluster…

Although this looks good, the computer doesn’t know that until it does the clustering a few more times.

Question: What if my data is a heat-map?

Well, if we just have 2 samples, we can rename them “X” and “Y”. And then plot the data in an X/Y graph. Then we can cluster just like before!

Note: We don’t actually need to plot the data in order to cluster it. We just need to calculate the distances between things.

When we have 2 samples, or 2 axes, the Euclidean distance is: sqrt(x² + y²).

When we have 3 samples, or 3 axes, the Euclidean distance is: sqrt(x² + y² + z²).

When we have 4 samples, or 4 axes, the Euclidean distance is: sqrt(x² + y² + z² + p²).

etc. etc. etc.
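The same pattern extends to any number of samples/axes, so one small sketch covers every case:

```python
import math

def euclidean(p, q):
    # Sum the squared differences along every axis, then take the
    # square root -- works for 2, 3, 4, ... dimensions alike.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))        # 2-D: sqrt(9 + 16) = 5.0
print(euclidean((1, 2, 2), (0, 0, 0)))  # 3-D: sqrt(1 + 4 + 4) = 3.0
```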

Hey! We’ve made it to the end of this exciting blog-post. If you liked it and want to see more, please follow me on this platform.

Alright! Tune in another time for another exciting blog-post.

Data Driven Investor

from confusion to clarity not insanity

Praveen Pareek

Written by

Data Scientist and former Physics Faculty who found his true passion for data. I like to share a different perspective for many issues! “The Other Perspective!”
