K Means Clustering in Python

Rohit Raj
Published in Thrive in AI
2 min read · Mar 2, 2022

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.
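To make the inertia criterion concrete, here is a minimal sketch that recomputes it by hand and compares it with the value scikit-learn reports. The toy data, the seed, and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumed for illustration): two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inertia: sum of squared distances from each sample to its assigned centroid.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = float(np.sum((X - assigned_centers) ** 2))

# The two numbers should agree up to floating-point error.
print(manual_inertia, kmeans.inertia_)
```

Lower inertia means tighter clusters, but it always decreases as more clusters are added, so it cannot by itself tell us how many clusters to use.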

We can use the scikit-learn (sklearn) library in Python for k-means clustering.

First, we create sample data for the k means clustering algorithm using the following code:
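One way to create such data is with scikit-learn's `make_blobs`. This is a sketch rather than the original listing: the sample count, cluster spread, and seed below are assumed values.

```python
from sklearn.datasets import make_blobs

# Assumed parameters: 300 two-dimensional points drawn around 4 centers,
# with a fixed seed so the data is reproducible.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

print(X.shape)  # (300, 2)
```

`y_true` holds the blob each point was drawn from. K-means never sees it; it is only useful for checking the result afterwards.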

For k-means clustering we have to provide the number of clusters to the algorithm. We can divide the points into 4 clusters using the k-means implementation in sklearn.
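A minimal sketch of this step might look as follows. It regenerates the sample data with assumed `make_blobs` parameters so that it runs on its own; `n_init` and `random_state` are likewise illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Sample data (assumed parameters, regenerated here so the snippet is self-contained).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit k-means with the number of clusters given explicitly.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster index (0..3) for every point
centers = kmeans.cluster_centers_    # one centroid per cluster

print(labels[:10])
print(centers)
```

The cluster indices are arbitrary: they identify groups but carry no ordering, and a different seed may number the same groups differently.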

We can plot the identified clusters using the following code:
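A plotting sketch with matplotlib is shown here. The data, colors, marker sizes, and output file name are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Sample data and clustering (assumed parameters, repeated so the snippet runs on its own).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

fig, ax = plt.subplots(figsize=(6, 5))
# Color each point by the cluster it was assigned to.
ax.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap="viridis")
# Mark the centroids on top of the points.
ax.scatter(centers[:, 0], centers[:, 1], c="red", s=200, marker="X")
ax.set_title("K-means clusters (k=4)")

# Hypothetical output file name; in a notebook, plt.show() works as well.
fig.savefig("kmeans_clusters.png")
```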

We can see that k-means has done a nice job of identifying the four clusters. But what should we do when we don't know the number of clusters? In that situation we can use the silhouette score. It ranges from -1 to 1 and is higher when points lie close to their own cluster and far from neighboring clusters, so a common heuristic is to pick the number of clusters that maximizes it.

I iterate over a range of candidate cluster counts (starting at 2, since the silhouette score is not defined for a single cluster) and choose the count with the highest silhouette score. In this way, we don't have to provide the number of clusters as an argument.
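The search can be sketched as below. The candidate range of 2 to 10 and the data parameters are assumptions; on well-separated blobs like these the highest score usually lands on the true number of groups, though this is a heuristic and not a guarantee.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Sample data (assumed parameters, regenerated so the snippet is self-contained).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 11):  # silhouette needs at least 2 clusters
    candidate_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, candidate_labels)

# Keep the k with the highest score and refit with it.
best_k = max(scores, key=scores.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

print(best_k, scores[best_k])
```

Each candidate requires a full k-means fit plus a pairwise-distance computation for the score, so on large datasets it may be worth passing `sample_size` to `silhouette_score` or narrowing the candidate range.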

We can plot the output in the same way as before.

We see that we get the same clusters as before, this time without specifying the number of clusters.

In this article, we learned how to apply k-means clustering without specifying the number of clusters. If you liked the article, please like and subscribe to my newsletter.

Rohit Raj
Thrive in AI

Studied at IIT Madras and IIM Indore. Love Data Science