K-means Clustering in datasets to find the characteristics of groups in Google Colab

Mar 11, 2021

K-means is a very popular clustering algorithm and that’s what we are going to look into today.

K-Means clustering is an unsupervised learning algorithm.

The ‘K’ in K-means is a free parameter: before you start the algorithm, you have to tell it how many clusters (‘K’) you are looking for.

In this article, we are going to work with a dataset that contains the age and income of different people. By clustering this data into groups, we will try to find some characteristics of those groups.

Now, let’s get started.

Importing Libraries

The first thing that you have to do is import all the necessary libraries.
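The import cell is not reproduced here, so this is a minimal sketch of what it could contain, assuming the steps below rely on pandas, matplotlib, and scikit-learn (all of which are available in Google Colab).

```python
# Assumed imports for the steps in this article (the original cell is not shown).
import pandas as pd                              # load and hold the dataset
import matplotlib.pyplot as plt                  # scatter plots and the elbow plot
from sklearn.cluster import KMeans               # the K-means algorithm
from sklearn.preprocessing import MinMaxScaler   # feature scaling

print("pandas version:", pd.__version__)
```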

Load Data

Now let us try to load the data from the CSV file. You must upload or save it in your local system or drive.

Here, read_csv is a pandas function that loads the data file into a data frame, and head() returns the first 5 rows of the data.

After I run this cell you can see the first 5 rows of the dataset, indexed from 0 to 4.
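Since the CSV itself is not included, here is a sketch of the loading step with made-up data. The column names (Name, Age, Income) and every value are assumptions for illustration only. The text is read through io.StringIO so the cell runs anywhere; with a real file you would pass its path to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content (made-up names and values, not the article's file).
csv_text = """Name,Age,Income
Ana,26,45000
Ben,27,60000
Cai,28,50000
Dev,29,65000
Eli,39,48000
Fay,40,62000
Gus,41,52000
Hua,42,66000
Ian,34,150000
Joy,35,155000
Kim,36,160000
Leo,37,152000
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default (index 0 to 4).
print(df.head())
```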

Plot it on the scatter plot

Since the dataset is simple enough, I will plot it on a scatter plot. Here, I will just plot the age against the income.
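A scatter plot of age against income could be drawn roughly like this. The data frame is the same made-up sample (hypothetical column names Name, Age, Income), so the picture will not match the article's plot exactly.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data (made-up values, not the article's CSV).
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cai", "Dev", "Eli", "Fay",
             "Gus", "Hua", "Ian", "Joy", "Kim", "Leo"],
    "Age": [26, 27, 28, 29, 39, 40, 41, 42, 34, 35, 36, 37],
    "Income": [45000, 60000, 50000, 65000, 48000, 62000,
               52000, 66000, 150000, 155000, 160000, 152000],
})

# Age on the x-axis, income on the y-axis.
fig, ax = plt.subplots()
ax.scatter(df["Age"], df["Income"])
ax.set_xlabel("Age")
ax.set_ylabel("Income")
# In Colab the figure is displayed when the cell finishes running.
```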

You can kind of see three clusters. So, in this particular case choosing ‘K’ is pretty straightforward.

Importing K-means algorithm & Fit and Predict

So, let’s specify our ‘K’, which is equal to 3 here.

Here, I am going to fit and predict on the data frame excluding the name column, because the name column is a string and is not useful in this numeric computation.
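A minimal sketch of this step, again on made-up data with assumed column names. The n_init and random_state arguments are my additions to keep the run repeatable; the article does not say which settings were used.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical sample data (made-up values, not the article's CSV).
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cai", "Dev", "Eli", "Fay",
             "Gus", "Hua", "Ian", "Joy", "Kim", "Leo"],
    "Age": [26, 27, 28, 29, 39, 40, 41, 42, 34, 35, 36, 37],
    "Income": [45000, 60000, 50000, 65000, 48000, 62000,
               52000, 66000, 150000, 155000, 160000, 152000],
})

# K = 3; n_init and random_state are assumptions for reproducibility.
km = KMeans(n_clusters=3, n_init=10, random_state=0)

# Fit and predict on the numeric columns only (the Name column is left out).
y_predicted = km.fit_predict(df[["Age", "Income"]])

# Store the cluster label of each row next to the data.
df["cluster"] = y_predicted
print(df.head())
```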

So, what this statement did is run the K-means algorithm on age and income and assign each row to one of the clusters we asked for.

You can see the 3 clusters, labeled 0, 1, and 2.

Separate 3 clusters into 3 different data frames

I have plotted these three data frames in three different colors: green, red, and black.
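The split and the three-color plot could look like the sketch below. It uses made-up data and assumed column names, and the mapping of colors to cluster numbers is arbitrary, because K-means does not guarantee which group gets which label.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical sample data (made-up values, not the article's CSV).
df = pd.DataFrame({
    "Age": [26, 27, 28, 29, 39, 40, 41, 42, 34, 35, 36, 37],
    "Income": [45000, 60000, 50000, 65000, 48000, 62000,
               52000, 66000, 150000, 155000, 160000, 152000],
})

# Cluster on the raw (unscaled) features, as at this point in the article.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(df[["Age", "Income"]])

# One data frame per cluster label.
df0 = df[df["cluster"] == 0]
df1 = df[df["cluster"] == 1]
df2 = df[df["cluster"] == 2]

# Plot each data frame in its own color.
fig, ax = plt.subplots()
ax.scatter(df0["Age"], df0["Income"], color="green", label="cluster 0")
ax.scatter(df1["Age"], df1["Income"], color="red", label="cluster 1")
ax.scatter(df2["Age"], df2["Income"], color="black", label="cluster 2")
ax.set_xlabel("Age")
ax.set_ylabel("Income")
ax.legend()
```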

The scatter plots look okay but I think there is a problem with the green and black clusters. You can see that they are not grouped correctly.

Do you know why this problem happened?

Because our features are not on the same scale. Age covers only a narrow range on the x-axis, while income on the y-axis runs from about 40,000 to 160,000. K-means groups points by distance, so the large differences in income swamp the small differences in age.

When you don’t scale your features properly you can run into this problem. That’s why we need to do some preprocessing with a min-max scaler, which maps each feature to the range 0 to 1, before we run the algorithm again.

Preprocessing using the min-max scaler

So, we have the age and income column features properly scaled now.
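A scaling cell along these lines would give that result. This is a sketch on made-up data with assumed column names. MinMaxScaler computes (x - min) / (max - min) for each column, and the last lines check one value by hand.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data (made-up values, not the article's CSV).
df = pd.DataFrame({
    "Age": [26, 27, 28, 29, 39, 40, 41, 42, 34, 35, 36, 37],
    "Income": [45000, 60000, 50000, 65000, 48000, 62000,
               52000, 66000, 150000, 155000, 160000, 152000],
})

# Scale both features to the range 0 to 1 (the MinMaxScaler default).
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["Age", "Income"]])
df["Age"] = scaled[:, 0]
df["Income"] = scaled[:, 1]
print(df.head())

# Manual check of the formula for the second row: age 27, min 26, max 42.
expected_age = (27 - 26) / (42 - 26)
print(df.loc[1, "Age"], expected_age)
```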

Now, if you plot these on the scatter plot, the points keep the same structure, but both axes run from 0 to 1.

KUDOS…!!! We made it.

Train the data

Now the next step is to use the K-means algorithm once again to train on our scaled dataset.
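Retraining on the scaled features could be sketched as follows. The data is made up so that it contains three well-separated groups of four rows each; n_init and random_state are assumptions for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data: rows 0-3, 4-7, and 8-11 form three groups.
df = pd.DataFrame({
    "Age": [26, 27, 28, 29, 39, 40, 41, 42, 34, 35, 36, 37],
    "Income": [45000, 60000, 50000, 65000, 48000, 62000,
               52000, 66000, 150000, 155000, 160000, 152000],
})

# Scale both features to the range 0 to 1.
scaled = MinMaxScaler().fit_transform(df[["Age", "Income"]])
df["Age"] = scaled[:, 0]
df["Income"] = scaled[:, 1]

# Train K-means again, this time on the scaled features.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(df[["Age", "Income"]])
df["cluster"] = labels
print(df)
```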

It’s gonna be fun now…!

Now, let’s plot these data to our scatter plot

Now, you can see that I have very pretty clusters. They look very nicely formed, don’t they?
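The plot described above could be drawn like this, once more on made-up data with assumed column names. Marking the centroids with cluster_centers_ is an extra that the article does not mention; I include it only because it makes the clusters easier to read.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data (made-up values, not the article's CSV).
df = pd.DataFrame({
    "Age": [26, 27, 28, 29, 39, 40, 41, 42, 34, 35, 36, 37],
    "Income": [45000, 60000, 50000, 65000, 48000, 62000,
               52000, 66000, 150000, 155000, 160000, 152000],
})

# Scale, then cluster on the scaled features.
scaled = MinMaxScaler().fit_transform(df[["Age", "Income"]])
df["Age"] = scaled[:, 0]
df["Income"] = scaled[:, 1]
km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(df[["Age", "Income"]])

# One color per cluster label (the color-to-label mapping is arbitrary).
fig, ax = plt.subplots()
for label, color in [(0, "green"), (1, "red"), (2, "black")]:
    part = df[df["cluster"] == label]
    ax.scatter(part["Age"], part["Income"], color=color, label=f"cluster {label}")

# Optional extra: mark the fitted cluster centers.
centers = km.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], color="purple", marker="*", label="centroid")
ax.set_xlabel("Age (scaled)")
ax.set_ylabel("Income (scaled)")
ax.legend()
```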

Elbow Plot

Let’s look into the Elbow plot method. The dataset that I have used here is simple, but when you are trying to solve real-life problems you will come across datasets with a lot of rows and features. It will be hard to plot them on a scatter plot, and the plot will just get messy.

And then you will be like what do I do now???

Don’t worry!

You can use your Elbow Plot Method.

Let’s define our K range. Let’s take it from 1 to 10.

For each k in the K range (in Python, range(1, 10) goes through 1 to 9), I create a new model with n_clusters = k. Then I call the fit function.

I fit the model on my data frame, but the data has a name column that I do not want to use, so I have left that column out here.

Here, sse means sum of squared errors. In this case I use the fitted model’s attribute called inertia_, which gives you the sum of squared errors.
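The loop described above can be sketched like this, on made-up data with assumed column names. The n_init and random_state arguments are assumptions to keep the numbers repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data (made-up values, not the article's CSV).
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cai", "Dev", "Eli", "Fay",
             "Gus", "Hua", "Ian", "Joy", "Kim", "Leo"],
    "Age": [26, 27, 28, 29, 39, 40, 41, 42, 34, 35, 36, 37],
    "Income": [45000, 60000, 50000, 65000, 48000, 62000,
               52000, 66000, 150000, 155000, 160000, 152000],
})
scaled = MinMaxScaler().fit_transform(df[["Age", "Income"]])
df["Age"] = scaled[:, 0]
df["Income"] = scaled[:, 1]

k_rng = range(1, 10)   # k = 1, 2, ..., 9
sse = []               # sum of squared errors for each k
for k in k_rng:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(df[["Age", "Income"]])   # the Name column is left out
    sse.append(km.inertia_)         # inertia_ holds the SSE of the fitted model

print(sse)
```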

Let’s plot this into a nice chart.
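A sketch of the chart, on the same made-up data. The "elbow" is the value of K after which the curve flattens out, and that is usually a good choice for K. The axis labels are my own choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data (made-up values, not the article's CSV).
df = pd.DataFrame({
    "Age": [26, 27, 28, 29, 39, 40, 41, 42, 34, 35, 36, 37],
    "Income": [45000, 60000, 50000, 65000, 48000, 62000,
               52000, 66000, 150000, 155000, 160000, 152000],
})
scaled = MinMaxScaler().fit_transform(df[["Age", "Income"]])
df["Age"] = scaled[:, 0]
df["Income"] = scaled[:, 1]

# Collect the sum of squared errors for k = 1 to 9.
k_rng = range(1, 10)
sse = []
for k in k_rng:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(df[["Age", "Income"]])
    sse.append(km.inertia_)

# Plot SSE against K and look for the bend in the curve.
fig, ax = plt.subplots()
ax.plot(list(k_rng), sse, marker="o")
ax.set_xlabel("K")
ax.set_ylabel("Sum of squared error")
```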