Let’s Deep Dive Into K-Means Clustering!

Unsupervised Clustering Approach

Kayathiri Mahendrakumaran
Analytics Vidhya
10 min read · Aug 28, 2021


Source: Unsplash (Nareeta Martin)

Data is the new gold — but, like gold, it must be processed to extract useful information. A massive amount of data is created every second, so it is no surprise that most of it is unlabeled. There are multiple algorithms that work well with unlabeled data; in fact, dealing with unlabeled data is an entire branch of machine learning termed “Unsupervised Learning.”

In this situation, we must analyze the distribution and organization of the data, which is where clustering techniques come in. Clustering generally works with unlabeled data, though it can work with labeled data as well. What is clustering? Simply put, it is grouping similar things together: points inside a cluster should be comparable to one another. Clustering is the process of splitting a population or set of data points into groups so that data points in the same group are more similar to each other than to data points in other groups. To put it another way, the goal is to separate points with similar characteristics and assign them to the same cluster.

In the following sections, we will take a deep look at the k-means clustering algorithm and its subcomponents. We’ll go over clustering, why it’s important, and how to use it, before diving into k-means clustering (including how to perform it in Python on a sample dataset). Finally, we will look at how to measure cluster quality and select the optimal value for K.

What is K-Means Clustering?

The most commonly used clustering method is K-Means, due to its simplicity.

The goal is to keep the distance between points within a cluster as small as possible. K-means is a centroid-based or distance-based algorithm in which the distances between points are calculated to allocate a point to a cluster. Each cluster in K-Means is associated with a centroid.

“The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.”
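
Formally, with clusters C_1, …, C_K and centroids μ_1, …, μ_K, this objective can be written as the within-cluster sum of squared distances (standard notation, not from the original post):

$$J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$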

Let’s kick things off with a simple example.

Fig — Sample data points

We have these 9 points and want to use k-means to organize them into clusters. Here’s how we should go about doing it.

The Algorithm

When working with an unlabeled dataset, K-means clustering is a good place to start. The K in K-Means denotes the number of clusters. The algorithm is guaranteed to converge after a finite number of iterations. It consists of five basic steps:

  1. Decide on the number of clusters (k)
  2. Initialize Cluster Centroids
  3. Assign data points to Clusters
  4. Recompute the centroids of new clusters
  5. Repeat steps 3–4 until the stopping condition is met.

You don’t have to fix the number of clusters up front; 2–3 is generally a good starting point, and you can refine the value later on.

Step 1: Decide on the number of clusters (k)

The number of clusters, k, is chosen first in k-means.

Step 2: Initialize cluster centroids

The centroid for each cluster is then chosen at random. Let’s assume we want two clusters, so k is equal to two. The centroids are then chosen at random:

The centroids of these clusters are shown by the blue and red circles.

Step 3: Assign data points to the Clusters

After the centroids have been initialized, we assign each point to the cluster centroid that is closest to it:

The points that are closest to the red point are assigned to the red cluster, and those that are closest to the blue point are assigned to the blue cluster.

Step 4: Recompute the centroids of new clusters

After we’ve allocated all of the points to one of the two clusters, we’ll compute the centroids of the newly created clusters:

The latest centroids are represented by the red and green crosses.

Step 5: Repeat steps 3 and 4

Steps 3 and 4 are then repeated:

A single iteration consists of assigning every point to its nearest centroid and then recomputing the centroids. But when can we put a stop to this? Surely it can’t go on forever?

Steps 3 and 4 are repeated until no more changes occur: that is, until no data point switches clusters between two consecutive iterations. At that point the algorithm has converged, although only to a local optimum, which may depend on the initial centroids.
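
To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The function name, defaults, and seed are my own illustration rather than a canonical implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=42):
    """Minimal k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```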

⚠️ Warning!

Before using K-Means, make sure the data has been pre-processed. If your dataset contains non-numerical values, you will need to convert them into numerical ones in order to perform distance calculations. Scaling the features and applying feature-reduction strategies will also speed up the process while improving the outcomes. These steps are important because K-Means, like other algorithms that rely on average/mean values, is sensitive to outliers and to features with very different scales.
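
For example, a typical preprocessing step with scikit-learn might look like this (a sketch; the dataset and column names are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset with one categorical and two numeric columns
df = pd.DataFrame({
    "city": ["A", "B", "A", "C"],
    "income": [42_000, 58_000, 39_000, 71_000],
    "age": [23, 45, 31, 52],
})

# Encode categorical values numerically (one-hot encoding)
X = pd.get_dummies(df, columns=["city"])

# Standardize features so no single scale dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)
```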

Termination Condition for K-Means Clustering

To stop the K-means algorithm, there are basically three conditions that can be used:

  1. The maximum number of iterations has been reached
  2. The points stay in the same clusters
  3. The centroids of the newly formed clusters stay the same.

If the maximum number of iterations is reached, we can stop the training. Assume the number of iterations is set to 100; the loop will then run at most 100 times before ending.

Another indication that we can stop the training phase is if the points stay in the same cluster after several iterations of the algorithm.

Finally, if the centroids of the newly formed clusters do not change, we may stop the algorithm. If we get the same centroids for all the clusters after multiple iterations, we can conclude that the algorithm is not finding any new pattern and stop training.
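
In scikit-learn, the first and third conditions correspond directly to the max_iter and tol parameters of KMeans; a sketch:

```python
from sklearn.cluster import KMeans

# Stop after at most 100 iterations, or earlier once the centroids
# move less than `tol` between two consecutive iterations.
kmeans = KMeans(n_clusters=4, max_iter=100, tol=1e-4, n_init=10)
```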

Implementing K-Means Clustering in Python

The following example of applying the K-Means clustering algorithm will assist in our understanding of how k-means functions. In this example, we will first construct a 2D dataset with four different blobs, then apply the k-means algorithm and inspect the results.

We’ll begin by importing all of the required packages:
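
A minimal set of imports for this walkthrough (assuming scikit-learn and matplotlib are installed):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
```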

The following code will create two-dimensional data with four blobs.
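
Something along these lines, where the sample count, spread, and seed are my own choices:

```python
# Generate 400 two-dimensional points grouped into 4 blobs
X, y_true = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.60, random_state=0)
```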

The following code will split the data into train and test sets.
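
A sketch using train_test_split (the 80/20 ratio is an assumption):

```python
# Hold out 20% of the points as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y_true, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```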

You will get the shapes of the training and test sets as output: (320, 2) and (80, 2), i.e., 320 training points and 80 test points.

Following that, the code below will assist us in visualizing the dataset.
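
A simple scatter plot will do, for example:

```python
# Plot the raw training data before clustering
plt.scatter(X_train[:, 0], X_train[:, 1], s=20)
plt.title("Sample data with four blobs")
plt.show()
```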

Next, create a KMeans object with the desired number of clusters, train the model, and make predictions as follows:
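
For instance, with four clusters to match the four blobs we generated:

```python
# Fit k-means with k=4 on the training data,
# then predict a cluster label for each test point
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
kmeans.fit(X_train)
y_kmeans = kmeans.predict(X_test)
print(y_kmeans[:10])
```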

The output is an array of predicted cluster labels (values 0 through 3), one for each test point.

Finally, the code below will help us visualize the resulting clusters along with their centroids.
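
One way to draw this:

```python
# Color each test point by its predicted cluster and mark the centroids
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_kmeans, s=20, cmap="viridis")
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="red", s=200, marker="x")
plt.title("Clusters and their centroids")
plt.show()
```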

I’m sure you’ve had one question on your mind since the beginning of this article: how many clusters should we make? What, in other words, is the optimal number of clusters to have when using K-Means? As an aside, K-Means++ is not a separate program but an initialization scheme that chooses well-spread initial cluster centroids for K-Means; it is the default initialization in scikit-learn.

In K-Means Clustering, how do you choose the right number of clusters?

When dealing with K-Means, one of the most common questions is how to choose the correct number of clusters.

So, let’s take a look at a method for determining the correct cluster value for the K-Means algorithm. Consider the following scenario.

As shown below, we can have two clusters that will divide the points:

We can also have 4 clusters:

To be honest, we can have as many clusters as we want. Can you guess the maximum possible number of clusters? One thing we could do is assign each point to its own cluster, in which case the number of clusters equals the number of points or observations. Therefore,

The number of clusters that can be formed is limited to the number of observations in the dataset.

But how do we determine the ideal number of clusters? Plotting a graph, also known as an elbow curve, with the x-axis representing the number of clusters and the y-axis representing an evaluation metric is one choice. For the time being, let’s call it inertia.

Next, we’ll start with a small cluster value, say 2, and work our way up. Train the model with two clusters, measure its inertia, and plot it on the graph above. Let’s assume we get an inertia value of 1,000. We’ll then increase the number of clusters, retrain the model, and plot the new inertia value. The plot is as follows:
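
A sketch of the code that produces such an elbow curve, using the inertia_ attribute that scikit-learn computes during fitting:

```python
# Compute inertia for a range of k values and plot the elbow curve
inertias = []
k_values = range(2, 11)
for k in k_values:
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(model.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()
```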

The inertia value dropped dramatically when the cluster value was increased from 2 to 4. As we increase the number of clusters further, the decrease in inertia shrinks and gradually levels off.

Therefore,

The cluster value for our data can be selected as the one where the decrease in inertia value remains constant.

We have the option of selecting any number of clusters between 6 and 10. We may have as many as seven, eight, or even nine clusters. When deciding on the number of clusters, you must also consider the cost of computation. The cost of computation will rise as the number of clusters increases. So, if you don’t have a lot of computing power, my recommendation is to use fewer clusters.

Evaluating the cluster quality

It’s not just about making clusters; it’s about making healthy, meaningful clusters. Quality clustering means the data points within a cluster are close together while the clusters themselves are well separated.
The following are two methods for assessing cluster quality:

  1. Inertia: Intuitively, inertia measures how far the points in a cluster are from their cluster centroid; it is the sum of squared distances of the points to their closest centroid. A small inertia value is therefore desired. Inertia is non-negative and has no fixed upper bound: the lower, the better.
  2. Silhouette score: The silhouette score indicates how close each data point is to points in its own cluster compared with points in other clusters. It ranges from -1 to 1, and a score close to 1 is preferable (see the sketch below).
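
A quick way to compute both metrics with scikit-learn, reusing the model fitted earlier:

```python
from sklearn.metrics import silhouette_score

# Inertia is stored on the fitted model; silhouette is computed
# from the data and the assigned labels
print("Inertia:", kmeans.inertia_)                            # lower is better
print("Silhouette:", silhouette_score(X_train, kmeans.labels_))  # closer to 1 is better
```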

In this post, we looked at K-Means, one of the most well-known clustering algorithms. We studied every step of the process and saw how to implement it in Python. We discussed the difficulties that can arise when using K-Means, as well as how K-Means++ may aid in the initialization of cluster centroids. Finally, we examined the elbow curve, which helps find the optimal number of clusters for the K-Means algorithm.
