K-means Clustering from Scratch in Python

In this article we cover the role of unsupervised learning algorithms, their applications, and the K-means clustering approach. Briefly, machine learning algorithms can be classified as supervised or unsupervised. In supervised learning, the dataset contains input features and a target variable; the aim of the algorithm is to learn from the dataset, find the hidden patterns in it, and predict the target variable. The target variable can be continuous, as in regression, or discrete, as in classification. Examples of regression problems include housing price prediction, stock market prediction, and air humidity and temperature prediction. Examples of classification problems include cancer prediction (benign or malignant), email spam classification, etc. This article demonstrates the other category of machine learning: unsupervised learning.

In unsupervised learning we still have data, but there is no target variable as in supervised learning. The goal is to observe the hidden patterns in the data and group the points into clusters, i.e. data points that share some properties fall into one cluster or group. So where is clustering used? Ever noticed how Google News has thousands of articles under each category such as sports, politics, movies, and science? That is clustering: the algorithm groups news articles that have common features into separate groups to form clusters. Other examples of clustering include social network analysis, market segmentation, recommendation systems, etc.

One of the most basic clustering algorithms is the K-means clustering algorithm, which we are going to discuss and implement from scratch in this article. Let's look at the end goal of clustering using two sample images and a practical example. The first image is a plot of a dataset with features x1 and x2. The data is unclustered, so we can't conclude anything just by looking at the plot. The second image is obtained after performing clustering; we can observe that data points which are close to each other are grouped into a cluster. In a practical setting, this could be customer data where we are asked to segment the market based on each customer's income and number of previous transactions. Suppose feature x1 is the annual income and x2 is the number of transactions. Based on these features we can cluster the data into three segments: customers with low annual income but a high number of transactions (the orange cluster), customers with medium income and a medium number of transactions (the green cluster), and customers with high income but a low number of transactions (the blue cluster). Based on these segments, the company's marketing team can redefine its strategies to drive more transactions by recommending products that each cluster's customers are likely to buy.

Before Clustering
After Clustering

K-Means Clustering Intuition:

So far we have discussed the goal of clustering and a practical application; now it's time to dive into the K-means clustering algorithm and its implementation. As the name suggests, this algorithm groups n data points into K clusters. So given a large amount of data, we need to partition it into K clusters.

But how do we choose the number of clusters? Sometimes we already know what clusters we need: in the Google News case, we know that, say, 4 types of clusters exist (sports, politics, science, movies) and news articles should be assigned to one of them. But in the market segmentation problem, where the goal is to cluster types of customers, we don't know how many types of customers are present in the dataset (rich, poor, loves shopping, doesn't love shopping, etc.), i.e. we don't know exactly how many clusters to choose, and we can't just assume a number based on ad hoc ground rules (such as people with annual income above some amount 'x' forming one cluster and those below 'x' forming another).

We shall discuss the solution to this problem later in the article, but for now let's assume that we are going to segment our data into 3 clusters. Two mathematical concepts involved in K-means clustering are centroids and Euclidean distance. Briefly, the centroid of a set of data points is their average or mean, and the Euclidean distance is the distance between two points in the coordinate plane. Given two points A(x1, y1) and B(x2, y2), the Euclidean distance between them is d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2).

While implementing the algorithm, we won't take the square root, since comparing squared distances leads to the same conclusions.
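For example, here is a quick illustration of the squared and plain Euclidean distance with NumPy (the points A and B are made up for illustration):

import numpy as np

A = np.array([1.0, 2.0])
B = np.array([4.0, 6.0])
squared_distance = np.sum((A - B)**2)   # (4-1)^2 + (6-2)^2 = 25
distance = np.sqrt(squared_distance)    # 5.0
print(squared_distance, distance)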

Algorithm:

Now let's walk through the implementation steps. We shall use the terms cluster centers and centroids interchangeably.

Step 1:

Randomly initialize the cluster center of each cluster from the data points. In fact, random initialization is not an efficient way to start, as it sometimes leads to an increased number of iterations before convergence, a greater overall runtime, and a less efficient algorithm overall. There are several techniques that address this problem, such as K-means++; we will discuss one such approach later in the article. For now, let's assume K = 3, so we randomly choose 3 data points and treat them as centroids.

Random Initialization of Centroids

Here three cluster centers or centroids with the green, orange, and blue triangle markers are chosen randomly. Again this is not an efficient method to choose the initial cluster centers.

Step 2:

2a. For each data point, compute the Euclidean distance to every centroid (3 in this case) and assign the point to the cluster whose centroid is closest. In our example, we take each black dot, compute its Euclidean distance to all the centroids (green, orange, and blue), and color the black dot with the color of the closest centroid.

2b. Adjust the centroid of each cluster by taking the average of all the data points assigned to that cluster in step 2a. In our example, since every data point is now assigned to one of the clusters, we compute the mean of each cluster and move its centroid to that mean.

Repeat this process till clusters are well separated or convergence is achieved.

2.a Assign each data point to its closest centroid based on the computed Euclidean distances
2.b Adjust the centroids by taking the average of all data points that belong to each cluster

Repeat these steps until convergence is achieved; the figures alternate between step 2.a and step 2.b until the centroids settle.

Implementation from scratch:

Now that we are familiar with the intuition, let's implement the algorithm in Python from scratch. We need the numpy, pandas, and matplotlib libraries for efficient computation and for visualizing the results. The dataset we are going to use is 'Mall_Customers.csv', and a link to it can be found on the GitHub page. (Note: the dataset was downloaded from the SuperDataScience website.)

Let's import the libraries, read the dataset, and look at a summary of the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random as rd   # used below for random centroid initialization

dataset=pd.read_csv('Mall_Customers.csv')
dataset.describe()

For ease of visualization, we are going to use Annual Income and Spending Score as our features.

X = dataset.iloc[:, [3, 4]].values

Now X is a matrix of shape (200, 2).

The next step is to choose a number of iterations that is likely to guarantee convergence. We may need to try several values to find how many iterations are required. There is no need to choose a very large number: if the centroids have already reached their best possible locations by, say, the 100th iteration, running 1000 extra iterations will not change the result. For convenience, let's start with 100 iterations.

m=X.shape[0] #number of training examples
n=X.shape[1] #number of features. Here n=2
n_iter=100

The next step is to choose the number of clusters K. Let's take K = 5; as mentioned earlier, we will see a method later in the article that finds the optimal number of clusters.

K=5 # number of clusters

We are ready to implement our K-means clustering steps. Let's proceed:

Step 1: Initialize the centroids randomly from the data points:

Centroids=np.array([]).reshape(n,0) 

Centroids is an n x K dimensional matrix, where each column is the centroid of one cluster.

for i in range(K):
    rand=rd.randint(0,m-1)
    Centroids=np.c_[Centroids,X[rand]]

Step 2.a: For each training example, compute the Euclidean distance to every centroid and assign the cluster based on the minimum distance.

The output of our algorithm should be a dictionary with cluster numbers as keys and the data points belonging to each cluster as values. So let's initialize the dictionary.

Output={}

We compute the Euclidean distance from each point to every centroid and store the results in an m x K matrix, so every row of the EuclidianDistance matrix holds the distances of one data point from all the centroids. Next, we find the minimum distance for each point and store the index of the corresponding column in a vector C.

EuclidianDistance=np.array([]).reshape(m,0)
for k in range(K):
    tempDist=np.sum((X-Centroids[:,k])**2,axis=1)
    EuclidianDistance=np.c_[EuclidianDistance,tempDist]
C=np.argmin(EuclidianDistance,axis=1)+1

Step 2.b: We regroup the data points based on the cluster index C, store them in the Output dictionary, compute the mean of each cluster, and assign it as the new centroid. Y is a temporary dictionary that stores the solution for one particular iteration.

Y={}
for k in range(K):
    Y[k+1]=np.array([]).reshape(2,0)
for i in range(m):
    Y[C[i]]=np.c_[Y[C[i]],X[i]]

for k in range(K):
    Y[k+1]=Y[k+1].T

for k in range(K):
    Centroids[:,k]=np.mean(Y[k+1],axis=0)

Now we need to repeat step 2 until convergence is achieved. In other words, we loop over n_iter iterations and repeat steps 2.a and 2.b as shown:

for it in range(n_iter):
    # step 2.a: assign each data point to its nearest centroid
    EuclidianDistance=np.array([]).reshape(m,0)
    for k in range(K):
        tempDist=np.sum((X-Centroids[:,k])**2,axis=1)
        EuclidianDistance=np.c_[EuclidianDistance,tempDist]
    C=np.argmin(EuclidianDistance,axis=1)+1
    # step 2.b: recompute each centroid as the mean of its assigned points
    Y={}
    for k in range(K):
        Y[k+1]=np.array([]).reshape(2,0)
    for i in range(m):
        Y[C[i]]=np.c_[Y[C[i]],X[i]]
    for k in range(K):
        Y[k+1]=Y[k+1].T
    for k in range(K):
        Centroids[:,k]=np.mean(Y[k+1],axis=0)
    Output=Y
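As an optional refinement (not part of the loop above), we could stop early once the centroids stop moving instead of always running all n_iter iterations. A minimal sketch, assuming we save a copy of the centroids (old_Centroids, a hypothetical name) before each iteration:

def has_converged(old_centroids, new_centroids, tol=1e-6):
    # True when the centroids have (numerically) stopped moving
    return np.allclose(old_centroids, new_centroids, atol=tol)

# inside the loop above: save old_Centroids = Centroids.copy() before step 2.a,
# then run `if has_converged(old_Centroids, Centroids): break` after step 2.b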

Now it's time to visualize the results and see how the original data has been clustered. Let's start by plotting the unclustered data.

plt.scatter(X[:,0],X[:,1],c='black',label='unclustered data')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.title('Plot of data points')
plt.show()

Now let’s plot the clustered data:

color=['red','blue','green','cyan','magenta']
labels=['cluster1','cluster2','cluster3','cluster4','cluster5']
for k in range(K):
    plt.scatter(Output[k+1][:,0],Output[k+1][:,1],c=color[k],label=labels[k])
plt.scatter(Centroids[0,:],Centroids[1,:],s=300,c='yellow',label='Centroids')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

The raw, unclustered data has now been organized into clusters, and the plot is far more informative. We can observe five categories of clusters:

  1. Red cluster: customers with low income but a high spending score (the company could recommend lower-priced products to this group).
  2. Cyan cluster: customers with low income and a low spending score (perhaps these customers are busy saving their money).
  3. Green cluster: customers with medium income and a medium spending score.
  4. Magenta cluster: customers with high income and a low spending score.
  5. Blue cluster: customers with high income and a high spending score.

So the company can divide its customers into 5 classes and design different strategies for each type of customer to increase sales. So far so good, but how did we arrive at the conclusion that there should be 5 clusters? To find out, let's visualize the clustered data with the number of clusters varying from 1 to 10.

We now have several plots showing the data clustered with different numbers of clusters. So what is the best value of K? Suppose we have n data points and choose n clusters, i.e. every data point is its own cluster. Is that a good model? Obviously not, and the converse, one cluster for all n data points, is not a good model either. So how do we find an appropriate K? The guiding idea is that every data point should be as close as possible to its cluster center, in other words, "the sum of squared distances of every data point from its corresponding cluster centroid should be as small as possible". Taken alone, this criterion would again favor treating every data point as its own cluster, since then the sum of squared distances is 0. To counter this, we use the Elbow method to find an appropriate number of clusters. The quantity considered is the sum of squared distances of every data point from its corresponding cluster centroid, called WCSS (Within-Cluster Sum of Squares).

Steps involved in the Elbow method are:

  1. Perform K-means clustering for different values of K, ranging from 1 up to some upper limit. Here we take the upper limit as 10.
  2. For each K, calculate the WCSS.
  3. Plot WCSS against the number of clusters K.
  4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters, i.e. the point after which WCSS stops decreasing rapidly is the appropriate value of K.

So let’s implement this and see how K=5 is the best.

Note: I have reorganized the algorithm in an object-oriented manner. Kmeans is the name of the class, the fit method performs the K-means clustering, and predict returns the Output dictionary and the centroid matrix. The actual code can be found at the GitHub link.
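The class itself is not listed in the article, so here is a minimal sketch of what it could look like, consistent with how it is used below (Kmeans(X, K), fit(n_iter), and predict() returning the Output dictionary and a K x n centroid matrix); the actual implementation is the one in the GitHub repository.

import numpy as np
import random as rd

class Kmeans:
    # Minimal K-means, mirroring the from-scratch steps above
    def __init__(self,X,K):
        self.X=X
        self.K=K
        self.m=X.shape[0]
        self.n=X.shape[1]

    def fit(self,n_iter):
        # Step 1: pick K random data points as the initial centroids (n x K matrix)
        self.Centroids=np.array([]).reshape(self.n,0)
        for _ in range(self.K):
            self.Centroids=np.c_[self.Centroids,self.X[rd.randint(0,self.m-1)]]
        for _ in range(n_iter):
            # Step 2.a: assign each point to its nearest centroid
            dist=np.array([]).reshape(self.m,0)
            for k in range(self.K):
                dist=np.c_[dist,np.sum((self.X-self.Centroids[:,k])**2,axis=1)]
            C=np.argmin(dist,axis=1)+1
            # Step 2.b: recompute each centroid as the mean of its assigned points
            Y={k+1:np.array([]).reshape(self.n,0) for k in range(self.K)}
            for i in range(self.m):
                Y[C[i]]=np.c_[Y[C[i]],self.X[i]]
            for k in range(self.K):
                Y[k+1]=Y[k+1].T
                if Y[k+1].shape[0]>0:   # guard against an empty cluster
                    self.Centroids[:,k]=np.mean(Y[k+1],axis=0)
            self.Output=Y

    def predict(self):
        # return the clusters and a K x n centroid matrix (rows are centroids),
        # matching how Centroids[k,:] is indexed in the WCSS loop below
        return self.Output,self.Centroids.T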

WCSS_array=np.array([])
for K in range(1,11):
    kmeans=Kmeans(X,K)
    kmeans.fit(n_iter)
    Output,Centroids=kmeans.predict()
    wcss=0
    for k in range(K):
        wcss+=np.sum((Output[k+1]-Centroids[k,:])**2)
    WCSS_array=np.append(WCSS_array,wcss)

Now let's plot WCSS_array against the number of clusters from 1 to 10.

K_array=np.arange(1,11,1)
plt.plot(K_array,WCSS_array)
plt.xlabel('Number of Clusters')
plt.ylabel('within-cluster sums of squares (WCSS)')
plt.title('Elbow method to determine optimum number of clusters')
plt.show()

If we observe the plot, the point after which there is no sharp drop in WCSS is K = 5, so we choose K = 5 as the appropriate number of clusters. For any given dataset, we should first find the appropriate number of clusters and only then make predictions and draw conclusions.

So are we done yet? No, we have one problem left: random initialization. This is a real problem because if two initial centroids are very close to each other, the algorithm can take many iterations to converge (or converge to a poor clustering), so something needs to be done to ensure that the initial centroids are far apart. This is addressed by the KMeans++ algorithm, in which only the initialization of the centroids changes; everything else is the same as in conventional KMeans.

The objective of KMeans++ initialization is that the chosen centroids should be far from one another. The first cluster center is chosen uniformly at random from the data points being clustered; each subsequent cluster center is then chosen from the remaining data points with probability proportional to its squared distance from its closest existing cluster center. This is easier to follow when written out as steps.

Steps involved in Kmeans++ initialization are:

  1. Randomly select the first cluster center from the data points and append it to the centroid matrix.
  2. Loop until K centroids have been chosen, repeating steps 3 to 6:
  3. For each data point, calculate the squared Euclidean distance to each already-chosen centroid and append the minimum of these distances to a Distance array.
  4. Compute the probability of choosing each data point as the next centroid by dividing each element of the Distance array by the sum of the Distance array. Let's call this probability distribution PD.
  5. Calculate the cumulative probability distribution of PD. We know that a cumulative probability distribution ranges from 0 to 1.
  6. Pick a random number between 0 and 1, find the first index (i) at which the cumulative probability exceeds that random number, and choose the data point at index (i) as the next centroid.
  7. Repeat the process until we have K cluster centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1 , be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).

For example, with probs=[0.1,0.2,0.3,0.4] we get cumprobs=[0.1,0.3,0.6,1.0]. Now choose a random number r between 0 and 1. If r<cumprobs[0], an event with probability 0.1, the data point at index 0 is chosen as the next cluster center; else if r<cumprobs[1], an event with probability cumprobs[1]-cumprobs[0]=0.2, the data point at index 1 is chosen, and so on.
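As a quick sanity check of the one-dimensional example above (a throwaway snippet, not part of the clustering code):

import numpy as np

points=np.array([0.,1.,2.,3.,4.])
centers=np.array([0.,4.])                              # c1=0, c2=4
D=np.array([np.min((x-centers)**2) for x in points])   # [0, 1, 4, 1, 0]
prob=D/D.sum()                                         # [0, 1/6, 4/6, 1/6, 0]
cumprob=np.cumsum(prob)                                # [0, 1/6, 5/6, 1, 1]
print(prob,cumprob)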

More about Kmeans++ can be found at Wikipedia-link and the above example has been referenced from StackOverFlow-question.

Now let’s implement this initialization and compare the results with random initialization.

  1. Randomly select the first cluster center from the data points and append it to the centroid matrix.

i=rd.randint(0,X.shape[0]-1)
Centroid=np.array([X[i]])

2. We will wrap steps 3 to 6 in a for loop at the end.

3. For each data point, calculate the squared Euclidean distance to each already-chosen centroid and append the minimum distance to the Distance array D.

D=np.array([])
for x in X:
    D=np.append(D,np.min(np.sum((x-Centroid)**2,axis=1)))

4. Compute the probability of choosing each data point as the next centroid by dividing the elements of the Distance array by their sum. Let's call this probability distribution PD.

prob=D/np.sum(D)

Note that we don't simply pick the data point with the highest probability (i.e. the farthest point) as the next centroid: sampling according to these probabilities makes the initialization less sensitive to outliers while still spreading the centroids apart, which is the behavior KMeans++ relies on.

5. Calculate the cumulative probability distribution of PD. We know that a cumulative probability distribution ranges from 0 to 1.

cummulative_prob=np.cumsum(prob)

6. Pick a random number between 0 and 1, find the first index (i) at which the cumulative probability exceeds that random number, and choose the data point at index (i) as the next centroid.

r=rd.random()
i=0
for j,p in enumerate(cummulative_prob):
    if r<p:
        i=j
        break
Centroid=np.append(Centroid,[X[i]],axis=0)
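As an aside, steps 5 and 6 together are equivalent to sampling an index directly with the probabilities from step 4, which NumPy can do in one call (an alternative sketch, not the approach used in this article's code):

# sample the index of the next centroid with probability proportional to D
i=np.random.choice(X.shape[0],p=prob)
Centroid=np.append(Centroid,[X[i]],axis=0)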

Putting everything together in a for loop for K clusters:

i=rd.randint(0,X.shape[0]-1)
Centroid=np.array([X[i]])
K=5
for k in range(1,K):
    D=np.array([])
    for x in X:
        D=np.append(D,np.min(np.sum((x-Centroid)**2,axis=1)))
    prob=D/np.sum(D)
    cummulative_prob=np.cumsum(prob)
    r=rd.random()
    i=0
    for j,p in enumerate(cummulative_prob):
        if r<p:
            i=j
            break
    Centroid=np.append(Centroid,[X[i]],axis=0)

The final part of the article is to visualize the results and see how Kmeans++ solves the problem of random initialization.

Centroids_rand holds the randomly chosen cluster centers and Centroid holds the ones obtained from Kmeans++.

Centroids_rand=np.array([]).reshape(n,0)
for i in range(K):
    rand=rd.randint(0,m-1)
    Centroids_rand=np.c_[Centroids_rand,X[rand]]
plt.scatter(X[:,0],X[:,1])
plt.scatter(Centroid[:,0],Centroid[:,1],s=200,color='yellow')
plt.scatter(Centroids_rand[0,:],Centroids_rand[1,:],s=300,color='black')
plt.show()

Yellow dots are the cluster centers chosen using Kmeans++ algorithm and Black dots are cluster centers chosen using Random initialization.

It is clear that the yellow dots are spread widely over the plot while the black dots are much closer together. This spread typically helps K-means converge in fewer iterations and reach a better clustering.

Finally, let’s implement all these steps using the sklearn library so that we can compare the results:

# Let's implement the same algorithm using the sklearn library
# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

We can observe that this curve is much smoother than the one from our own implementation.

Centroid initialization in sklearn is done by passing init = 'k-means++', as shown below.

# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

To conclude, in this article:

  1. We learned K-means clustering from scratch and implemented the algorithm in Python.
  2. We solved the problem of choosing the number of clusters using the Elbow method.
  3. We solved the problem of random initialization using the KMeans++ algorithm.

So what's next?

  1. You can try different numbers of iterations and see how convergence is affected.
  2. There are other ways to find the appropriate number of clusters besides the Elbow method.
  3. There are other ways to address random initialization besides Kmeans++.

We shall discuss Hierarchical clustering in the upcoming article.

All the code samples are present in this GitHub link.

Thank you for reading.
