Unsupervised Machine Learning (KMeans Clustering) with Scikit-Learn

Charles Rajendran
Ascentic Technology
6 min read · May 7, 2020

Machine learning can be divided into two main categories: supervised machine learning and unsupervised machine learning. In supervised machine learning, we train the model on data together with its corresponding labels; the trained model can then predict the label for new data.

But in unsupervised machine learning, we feed the data to the model without any labels, and the model finds patterns in the data on its own. We can’t predict the class of a data point, but we can group together data points that are similar; this process is known as clustering. There are a number of clustering algorithms; in this article I will talk about KMeans clustering.

How does KMeans clustering work?

Let’s understand this step by step, with the image below.

Step (a) — The initial, unlabelled data.

Step (b) — Choose random initial centroids (centroids are the centres of the clusters). In this example, we need to separate the data set into two clusters, red and blue, so we have two centroids.

Step (c) — Each of the other data points is then marked as either red or blue based on its closest centroid.

Step (d) — After marking the points as red or blue, the coordinates of the centroids are adjusted. The new coordinates of each centroid are calculated as the mean of all the points belonging to that centroid’s cluster.

Step (e) — After the centroid adjustment, the data points are again reassigned to their closest centroid (as you can see in image (e), some blue points have changed to red and vice versa).

Steps (d) and (e) are repeated until no point changes cluster. In other words, the centroids are adjusted until we reach a state where none of the data points moves from one cluster to another (see image (f)). If you are still confused, see the gif below.
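If you prefer code to pictures, below is a minimal NumPy sketch of that assign/update loop. This is purely illustrative, not scikit-learn’s implementation, and it ignores edge cases such as empty clusters.

import numpy as np

def kmeans_naive(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step (b): pick k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # steps (c)/(e): assign each point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step (d): move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centroids no longer move, i.e. no point changes cluster
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids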

Let’s do this practically. In this example, I am going to use a data set of a grocery shop’s daily transactions over one month, and I want to find patterns in product sales against product price.

Step 1 — Import the data and necessary libraries.

import pandas as pd
import matplotlib.pyplot as plt

# load the grocery transactions data set
data = pd.read_csv('Data.csv')

In the above code, I have imported the necessary libraries and loaded the data. If we look at our data variable, it will look something like this.
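A quick way to inspect it is data.head(); based on the code later in this article, the CSV is assumed to have a quantity column (Qty) and a price column (UnitPrice) in columns 1 and 2.

# peek at the first few rows and the column names
# (Qty and UnitPrice are assumed from the code later in this article)
print(data.head())
print(data.columns)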

Step 2 — Find the optimal number of clusters (K value) for the data. To find it, we use a method called the elbow method.

What is Elbow Method?

To find the optimal number of clusters, or the so-called K value, we run the KMeans algorithm for a number of different K values, calculate the error (known as the within-cluster sum of squares) for each, plot the error against the K value, and finally decide from the plot how many clusters to choose.

What is WCSS or within-cluster sum of squares?

It is the sum of the squared Euclidean distances between each centroid and all the points of that centroid’s cluster. In simple terms, for each cluster it calculates the distance between the centroid and each point belonging to that cluster, squares it, and adds everything together.
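Written as a formula, with \mu_k denoting the centroid of cluster C_k:

\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2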

X = data.iloc[:, 1:3]

# use the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans

# within-cluster sum of squares
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init="k-means++", max_iter=300, n_init=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title("Elbow Method")
plt.xlabel("Number of Clusters")
plt.ylabel("WCSS")
plt.show()

In the code above, we first choose the fields in the data set that matter: we only need the sold quantity and the unit price. Then we run the KMeans algorithm for each number of clusters from 1 to 10 (ignore the parameters in the function for now) and record the WCSS for each run (the WCSS value is obtained from kmeans.inertia_). Finally, we plot the WCSS values against the number of clusters to obtain the graph above.

So we have the graph; now we need to know how to choose the optimal K value. The WCSS value always decreases as the number of clusters increases (it reaches zero when the number of clusters equals the number of data points, because each centroid then sits exactly on a data point, so there is no error at all). We choose the point where the last big drop in WCSS happens; here, we could choose either 3 or 4. In this example, I will choose 4 clusters. Once you have found the optimal number of clusters, you no longer need this piece of code in your program, so you can comment it out.
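If you prefer numbers to eyeballing the plot, you can also print how much the WCSS drops at each step; the elbow is roughly where the drops stop being large. This is just a rough heuristic, not a scikit-learn feature.

import numpy as np

# drop in WCSS when going from k clusters to k + 1 clusters
drops = -np.diff(wcss)
for k, d in zip(range(1, 10), drops):
    print(f"{k} -> {k + 1} clusters: WCSS drops by {d:.1f}")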

Step 3 — Cluster the data with the chosen k-value.

kmeans = KMeans(n_clusters=4, init="k-means++", max_iter=1000, n_init=10)
y_pred = kmeans.fit_predict(X)

Let’s go through the arguments in KMeans:

1. n_clusters — Number of clusters

2. init — The KMeans algorithm has a major weakness when it comes to selecting the random initial centroid locations, as shown in the image below.

The first graph shows the correct clusters; the second graph shows the initialization trap I mentioned earlier. As you can see in the second graph, even though the centroids separate the data points cleanly, these are not the correct clusters. To overcome the initialization trap, the k-means++ algorithm can be used (read more about k-means++; a quick comparison is sketched after this list).

3. max_iter — This defines the maximum number of iterations used to adjust the centroids. If we don’t limit this, the centroid adjustment process could be performed over and over again, which might take a very long time.

4. n_init — The number of times the k-means algorithm is run with different centroid seeds. This means our KMeans model will try 10 different initial centroid locations and keep the best result (the one with the lowest WCSS).
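Here is the quick init comparison mentioned above: run the same K once with purely random initial centroids and once with k-means++, and compare the final WCSS. This is just a sketch; the exact numbers depend on the data and the seed.

from sklearn.cluster import KMeans

# same data and k, only the initialization strategy differs
km_random = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=1, random_state=0).fit(X)
print("random init WCSS: ", km_random.inertia_)
print("k-means++ init WCSS:", km_pp.inertia_)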

Then we store the cluster label of each record in the y_pred variable.
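To sanity-check the result, you can count how many records ended up in each cluster:

import numpy as np

# y_pred holds one label (0-3) per record; count the cluster sizes
labels, counts = np.unique(y_pred, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))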

Step 4 — Visualize the clusters.

# plot each cluster in a different colour
'''
X[y_pred == 0] - lists the records that belong to cluster 0
output:
      Qty  UnitPrice
1    1911       3.39
'''
plt.scatter(X[y_pred == 0].iloc[:, 0], X[y_pred == 0].iloc[:, 1], s=5, c="red")
plt.scatter(X[y_pred == 1].iloc[:, 0], X[y_pred == 1].iloc[:, 1], s=5, c="green")
plt.scatter(X[y_pred == 2].iloc[:, 0], X[y_pred == 2].iloc[:, 1], s=5, c="blue")
plt.scatter(X[y_pred == 3].iloc[:, 0], X[y_pred == 3].iloc[:, 1], s=5, c="purple")
# the centroids' x, y coordinates are available via kmeans.cluster_centers_
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c="black", marker="*")
# limit the y-axis to hide the outlier records
plt.ylim([0, 20])
plt.xlabel("Sold Quantity")
plt.ylabel("Unit Price")
plt.show()
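As a side note, the four scatter calls can be collapsed into a single one by passing the labels as colours; this is an equivalent alternative, not the code that produced the output below.

# one scatter call: matplotlib maps each cluster label to a colour
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], s=5, c=y_pred, cmap="viridis")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c="black", marker="*")
plt.ylim([0, 20])
plt.xlabel("Sold Quantity")
plt.ylabel("Unit Price")
plt.show()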

The final output looks like the one below. You can see how the model has grouped the data points into 4 clusters; the stars in the graph indicate the centroids.

That’s it 😉.

Note: All my code is available on GitHub, and feel free to follow me on GitHub 😉.
