K-Means Clustering in Python
The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares: the sum of squared distances from each sample to the center of its cluster. The algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a wide range of application areas.
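To make the inertia criterion concrete, here is a minimal NumPy sketch. The four points and two centers are made-up toy values chosen so the arithmetic is easy to check by hand; they are not taken from the article's data.

```python
import numpy as np

# Hypothetical toy data: two pairs of points, each pair centered on one centroid.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared Euclidean distance from every point to every center, shape (4, 2).
sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Inertia: each point contributes the squared distance to its nearest center.
inertia = sq_dist.min(axis=1).sum()
print(inertia)  # 4.0
```

Each point lies exactly one unit from its nearest center, so the four squared distances add up to 4.0. K-means searches for the centers that make this sum as small as possible.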
We can use the scikit-learn library (imported as sklearn) in Python for k-means clustering.
First, we create sample data for the k-means clustering algorithm using the following code:
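The original data-generation code is not reproduced here, so the following is only a sketch of one way to do it. It assumes scikit-learn's make_blobs with four centers, to match the four clusters discussed in this article; the sample count, spread, and seed are illustrative choices.

```python
from sklearn.datasets import make_blobs

# Assumption: 300 two-dimensional points scattered around 4 centers.
# n_samples, cluster_std and random_state are illustrative values.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

print(X.shape)  # (300, 2)
```

X holds the coordinates and y_true holds the blob each point was drawn from. K-means never sees y_true; it is only useful for checking the result afterwards.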
For k-means clustering, we have to provide the number of clusters to the algorithm. We can divide the points into 4 clusters using the k-means implementation in sklearn:
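A minimal sketch of this step, assuming the same make_blobs sample data as above (regenerated here so the snippet runs on its own). Setting n_init and random_state explicitly is a choice made for reproducibility, not something the article specifies.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed sample data: 300 points around 4 centers (illustrative parameters).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit k-means with 4 clusters and assign every point to a cluster.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One row per cluster center, one column per feature.
centers = kmeans.cluster_centers_
print(centers.shape)  # (4, 2)
```

fit_predict returns one integer label per point, and cluster_centers_ holds the fitted center of each cluster.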
We can plot the identified clusters using the following code:
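One possible way to draw the result with matplotlib, again assuming the make_blobs sample data. The colormap, marker sizes, and the red crosses for the centers are presentation choices of this sketch.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed sample data and model, repeated so the snippet is self-contained.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

fig, ax = plt.subplots()
# Color each point by the cluster it was assigned to.
ax.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap="viridis")
# Mark the fitted cluster centers with red crosses.
ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=200)
ax.set_title("k-means with 4 clusters")
plt.show()
```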
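We can see that k-means clustering has done a nice job of identifying the four clusters. But what can we do when we don't know the number of clusters? In that situation we can use the silhouette score, which ranges from -1 to 1 and is higher when points sit close to their own cluster and far from neighboring ones. A common heuristic is to pick the number of clusters that gives the highest mean silhouette score.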
We iterate over a range of candidate cluster counts and choose the one with the highest silhouette score. This way, we don't have to provide the number of clusters as an argument.
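The search described above can be sketched as follows. The candidate range of 2 to 10 clusters is an assumption of this example (the silhouette score needs at least 2 clusters), and the data is the same assumed make_blobs sample.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed sample data: 300 points around 4 centers (illustrative parameters).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the candidate with the highest score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

This is a heuristic rather than a guarantee: on data with overlapping or unevenly sized clusters, the highest silhouette score may point to a different number of clusters than you expect, so it is worth looking at the full scores dictionary and not only the maximum.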
We can plot the output as below:
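A compact sketch that combines the selection and the plot, under the same assumptions as before (make_blobs sample data, candidates from 2 to 10 clusters, illustrative plot styling).

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed sample data: 300 points around 4 centers (illustrative parameters).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit one model per candidate k, then keep the one with the best silhouette score.
models = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in range(2, 11)}
best_k = max(models, key=lambda k: silhouette_score(X, models[k].labels_))
best = models[best_k]

fig, ax = plt.subplots()
# Points colored by the cluster assigned by the selected model.
ax.scatter(X[:, 0], X[:, 1], c=best.labels_, s=30, cmap="viridis")
# Centers of the selected model as red crosses.
ax.scatter(best.cluster_centers_[:, 0], best.cluster_centers_[:, 1], c="red", marker="x", s=200)
ax.set_title(f"k-means with k={best_k} chosen by silhouette score")
plt.show()
```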
We get the same output as before, this time without specifying the number of clusters.
In this article, we learned how to apply k-means clustering without specifying the number of clusters. If you liked the article, please like and subscribe to my newsletter.