Implementing a K-Means Clustering Algorithm From Scratch

Zack Murray
Published in The Startup · Jul 31, 2020

K-means clustering is a widely used, and relatively simple, unsupervised machine learning model. As the name implies, this algorithm works best when answering questions about how similar, or dissimilar, the objects in our dataset are. If good clustering exists in our data, it will usually be found efficiently. But if we have poorly clustered data, or data with high dimensionality, the model can suffer and produce unreliable predictions. Regardless, clustering is a very useful tool for data exploration, and for a budding data scientist, an integral stepping stone in my journey. With a basic run-down out of the way, let’s break down a K-means clustering algorithm from scratch and see what goes on behind the scenes, so to speak, to gain a better understanding of the algorithm in the process.

How K-Means clusters over time/iterations

The K-Means Clustering Process:

Looking at the image above, you may be wondering, “What are those three big markers moving around, altering the data with every iteration?” Those markers are known as centroids, and they’re imperative to K-means clustering, a centroid-based clustering algorithm. Think of centroids as imaginary team captains that gravitate toward the center of clustered data and assign the data closest to them to their team. “But how does this work?”, you might ask. Let’s go over the clustering process and then break it down piece by piece.

  1. ) Select n_clusters random data points to act as initial centroids. (One point for each cluster.)
  2. ) Find the clusters of data points surrounding those centroids.
  3. ) Calculate new centroids for the clusters.
  4. ) Repeat steps 2 and 3 until the model can no longer detect a significant change in its constantly updated centroids.

Setting up the algorithm:

Before we can get into how the algorithm operates, we must first initialize it.

The only imports you’ll need for this algorithm are numpy and scipy
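The original starter code appeared as a screenshot; below is a minimal sketch of what it might look like. The parameters n_clusters, seed, and max_iter come from the article, while the class name and default values are my own assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist


class KMeans:
    def __init__(self, n_clusters=3, seed=None, max_iter=300):
        self.n_clusters = n_clusters  # k: how many centroids/clusters to find
        self.seed = seed              # random seed, for reproducible centroid picks
        self.max_iter = max_iter      # hard stop if convergence is never reached
```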

This is the starter code for the algorithm. The __init__ function is your constructor, or initializer, and it effectively houses your model’s parameters. The k in K-means usually stands for the number of centroids, and in this algorithm, n_clusters is the parameter that takes care of it. There is no rule set in stone as to what the optimal number of centroids should be, but you can use the elbow method on your dataset to get a better idea. The elbow method runs K-means clustering on the dataset for a range of values (1–15 in this case) and, for each k (n_clusters), computes an average score across all the clusters.
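The elbow plot itself was shown as an image; as an illustration only, here is one way to produce a similar plot with scikit-learn’s KMeans, scoring each k by its inertia_ (within-cluster sum of squares). The iris dataset and the choice of inertia as the score are my assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans as SKLearnKMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Fit K-means for k = 1..15 and record the within-cluster sum of squares
ks = range(1, 16)
inertias = [SKLearnKMeans(n_clusters=k, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("n_clusters (k)")
plt.ylabel("within-cluster score (inertia)")
plt.show()
```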

It may look confusing at first, but a good rule of thumb is to set n_clusters to the value where the line in the plot bends like an elbow, as the name implies. In this case, we’d set n_clusters to 3, telling our algorithm to initialize three centroids. The next parameter is seed, a shortened form of random_seed, which helps us get consistent results despite the randomness inside the algorithm. You’ll see soon that our algorithm relies on random permutations to initialize the centroids, so a random seed is necessary to ensure reproducible results over multiple tests. The last parameter is max_iter, short for maximum iterations, a stopping point for the algorithm so that it won’t run indefinitely if it can’t reach convergence. This will be explained further when we discuss the fit function; for the sake of brevity, we’ll move on with the algorithm.

Initialize the Centroids

Now that you’ve set your parameters, mainly n_clusters, you can define your starting centroids. Let’s say we’ve run our elbow method and have decided on setting n_clusters to 3. The algorithm will start by picking 3 random data points and treating those as the starting centroids.
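A sketch of this step follows, written as a standalone method of the KMeans class above. The use of a random permutation matches the article’s description, but the method name and exact code are my reconstruction.

```python
# Method of the KMeans class sketched earlier
def initialize_centroids(self, X):
    # Shuffle the row indices and take the first n_clusters points as centroids
    rng = np.random.default_rng(self.seed)
    random_idx = rng.permutation(X.shape[0])
    return X[random_idx[: self.n_clusters]]
```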

Assign the Clusters

With our newly initialized centroids, we need to find the nearest centroid to each data point. Start by computing the distance between each point and each centroid; our algorithm accomplishes this using Euclidean distance. It then takes the minimum of these distances and assigns a centroid to each and every data point, effectively finding the cluster of points surrounding each centroid.
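A sketch of this step, again as methods of the class above (the Euclidean distance via scipy matches the article’s description; the method names are mine):

```python
# Methods of the KMeans class sketched earlier
def compute_distances(self, X, centroids):
    # Pairwise Euclidean distances: one row per point, one column per centroid
    return cdist(X, centroids, metric="euclidean")

def assign_clusters(self, X, centroids):
    # Each point is labeled with the index of its nearest centroid
    return np.argmin(self.compute_distances(X, centroids), axis=1)
```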

Updating the Centroids

This is where the ‘means’ part of K-means clustering comes into play. This part of the algorithm calculates the mean of each cluster and then reassigns the centroids in an attempt to minimize variance within the clusters. The process of assigning the clusters and updating the centroids repeats until the change is so small that it is negligible. At that point, your algorithm concludes that it has clustered the data as effectively as possible.
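A sketch of the update step, ignoring the empty-cluster edge case for brevity:

```python
# Method of the KMeans class sketched earlier
def update_centroids(self, X, labels):
    # Each new centroid is the mean of the points currently assigned to it
    return np.array([X[labels == k].mean(axis=0)
                     for k in range(self.n_clusters)])
```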

The Final Breakdown

I know we’ve covered a lot in a short period of time, so let’s recap with the fit function and glance over what’s happening. First, we initialize the starting centroids; this happens outside the for loop because self.update_centroids will take care of all centroid positioning after that. Once inside the loop, we store the current centroids (before any calculations are made) so we can use them for early stopping later in the function. Cluster labels are then assigned based on those centroids, and each centroid is updated to the mean of its cluster, shifting its position among the data. This continues until the change is minimal and the centroids no longer move from their previous positions between iterations. At that point we can run the predict function, which gives us an array of numbers mapping each data point to its respective centroid.
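Putting it together, a sketch of fit and predict under the same assumptions (the early-stopping check via np.allclose is my choice of convergence test):

```python
# Methods of the KMeans class sketched earlier
def fit(self, X):
    self.centroids = self.initialize_centroids(X)
    for _ in range(self.max_iter):
        old_centroids = self.centroids            # snapshot for early stopping
        labels = self.assign_clusters(X, old_centroids)
        self.centroids = self.update_centroids(X, labels)
        if np.allclose(old_centroids, self.centroids):
            break                                 # centroids stopped moving
    return self

def predict(self, X):
    # Map each point to the index of its nearest fitted centroid
    return self.assign_clusters(X, self.centroids)
```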

0 = 1st Centroid, 1 = 2nd Centroid, 2 = 3rd Centroid

Comparison to SK-Learn’s K-Means Algorithm

Let me start off by saying it’s hard to compare the two: SK-Learn has a very well polished K-means clustering model that comes equipped with more parameters and different built-in algorithms to help tackle almost any clustering problem. With that said, my algorithm totally smoked its runtime while working through the iris dataset! (/s, for those unaware, /s implies sarcasm.) All jokes aside, since my model has much less to bog it down, it was able to complete its iterations faster, but without the timer my brain couldn’t really tell the difference. What impressed me more was that my algorithm produced very similar results to SK-Learn’s. Then again, we’re calculating distance and tackling the problem the same way, so I guess it shouldn’t come as much of a surprise. Here’s a comparison of how my model fared against SK-Learn’s.
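For reference, a comparison along these lines might look like the sketch below, assuming the methods above have been collected into the KMeans class (timings and exact cluster numbering will vary from run to run; SKLearnKMeans is just an alias to avoid clashing with our class name):

```python
import time

import numpy as np
from sklearn.cluster import KMeans as SKLearnKMeans
from sklearn.datasets import load_iris

X = load_iris().data

start = time.perf_counter()
mine = KMeans(n_clusters=3, seed=42).fit(X)
print(f"from scratch: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
sk = SKLearnKMeans(n_clusters=3, random_state=42).fit(X)
print(f"sk-learn:     {time.perf_counter() - start:.4f}s")

# Cluster numbering is arbitrary, so compare the fitted centroid coordinates
print(np.round(np.sort(mine.centroids, axis=0), 2))
print(np.round(np.sort(sk.cluster_centers_, axis=0), 2))
```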

Top: My Algorithm, Bottom: SK-Learn Algorithm; Iteration 1: Top-Left, Iteration 2: Bottom-Left, Iteration 3: Top-Right, Iteration 4: Bottom-Right

