K-means clustering from scratch

Ryan Mecking
4 min read · Aug 25, 2020


I’ve been doing data science projects for a while, using the core libraries such as pandas, numpy, sklearn… the list goes on. While I still use their documentation for reference, I had yet to dive deep and look at how some of those imports or their pre-built algorithms actually work.

Below, I’ll go over:

  1. Why I chose to do K-means
  2. Links to how I built it and how it traditionally works
  3. How my version worked versus sklearn’s version

Why did I choose K-means clustering?

I love the versatility of unsupervised learning, though K-means can also be used in a semi-supervised way. Grouping data with as little bias as possible really speaks to my inner scientist. K-means does have its faults, though. It tends to try to equalize the number of data points per centroid, so it can “over-group” the data, as demonstrated below.

What is K-means clustering?

For those who don’t know what K-means clustering is, it looks for patterns in data and then groups the data based on those patterns. The number of groups is defined by K, and each data point is clustered based on its proximity to a group’s center. At the center of each cluster a centroid is placed, both to classify the current data and to help classify new data. The main benefit of this method is that it lets the data classify itself organically, rather than assigning labels ahead of time, which could introduce bias.
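To make that concrete, here is a quick sketch using sklearn’s built-in KMeans (not my version yet) on the iris measurements; the parameter choices are just for illustration:

```python
# Quick illustration of K-means with sklearn: group the iris measurements
# into K = 3 clusters without ever showing the model the species labels.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                           # 150 samples, 4 features, no labels used
model = KMeans(n_clusters=3, random_state=42)  # K = 3 groups
labels = model.fit_predict(X)                  # each sample is assigned to its nearest centroid

print(model.cluster_centers_)                  # the 3 learned centroids
print(labels[:10])                             # cluster assignments for the first few samples
```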

Implementing K-means clustering

For usability I created a Google Colab notebook. If you want to try it out yourself or use it on a different dataset, just follow the link.

K-means clustering was actually fairly simple to set up. The first thing I did was set up a class.

In it, I laid out:

  1. How many centroids I wanted, defined as K
  2. The tolerance (how much the centroids are allowed to move between iterations before the algorithm stops)
  3. The maximum number of iterations.

I also implemented a random state seed to help with reproducibility. All of the values I chose came from the sklearn library’s defaults, except for the K value, which I chose because of the iris dataset: I knew there were going to be 3 groupings. A rough sketch of that setup is shown below.
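I won’t reproduce the notebook verbatim here, but a minimal sketch of that setup looks something like this (names simplified for the post; tol=1e-4 and max_iter=300 mirror sklearn’s defaults):

```python
import numpy as np

class KMeansScratch:
    def __init__(self, k=3, tol=1e-4, max_iter=300, random_state=42):
        self.k = k                        # number of centroids (3 for the iris data)
        self.tol = tol                    # how far centroids may move before we call it converged
        self.max_iter = max_iter          # hard cap on the number of iterations
        self.random_state = random_state  # seed for reproducibility
```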

Choosing the right K (also notated as n_clusters in sklearn) is important. If you’re going to use your own data, there is an easy way to figure out a good value of K: the sum of squared errors (SSE). For the iris data the SSE graph bends at K = 3. This is also called an elbow graph and can be useful for visualizing the best K value to use.
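Sketched with sklearn, an elbow graph only takes a few lines; inertia_ is sklearn’s name for the SSE:

```python
# Sketch of an elbow graph: run KMeans for several values of K and record the SSE.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
k_values = range(1, 10)
sse = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    sse.append(km.inertia_)               # inertia_ = sum of squared distances to the nearest centroid

plt.plot(k_values, sse, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE")
plt.show()                                # the curve bends (the "elbow") at K = 3 for iris
```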

Setting up the fit method took quite a while, but after a bunch of trial and error I was finally able to get a working model. The fit method does all the heavy lifting: its main job is to calculate the distances between data points and centroids using a chosen metric such as Euclidean distance. While researching how to implement Euclidean distance, I found it was a bit more efficient to use numpy’s linalg.norm. To compare results, I used the library I mentioned earlier that has K-means clustering built in, sklearn.
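The notebook has the full version, but the core of the fit method looks roughly like this (a simplified sketch, continuing the KMeansScratch class from above):

```python
    # (method of the KMeansScratch class sketched above)
    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.centroids = X[:self.k].copy()     # start with the first k data points as centroids

        for _ in range(self.max_iter):
            # distance from every point to every centroid, via numpy's linalg.norm
            distances = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)  # assign each point to its nearest centroid

            # move each centroid to the average of the points assigned to it
            new_centroids = np.array([
                X[labels == i].mean(axis=0) if np.any(labels == i) else self.centroids[i]
                for i in range(self.k)
            ])

            # stop once no centroid moves more than the tolerance
            done = np.all(np.linalg.norm(new_centroids - self.centroids, axis=1) < self.tol)
            self.centroids = new_centroids
            if done:
                break

        self.labels_ = labels
        return self
```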

The sklearn library has a few ways to initialize the clusters, but I decided to simply place the initial centroids at the first data points and then move them based on the cluster averages. As shown below, the left plot is the sklearn library while the right is my implementation. Though the data was plotted similarly, the centroids were calculated differently, which will ultimately give us different classifications.
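A rough version of that comparison (sklearn on one side, the scratch class sketched above on the other) can be run like this:

```python
# Fit both versions on the same data and compare the centroids they land on.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

sk_model = KMeans(n_clusters=3, random_state=42).fit(X)
my_model = KMeansScratch(k=3).fit(X)            # the class sketched earlier in this post

print("sklearn centroids:\n", sk_model.cluster_centers_)
print("scratch centroids:\n", my_model.centroids)  # different initialization -> different centroids
```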

Finally, I needed a predict method. As long as new data has the same shape as the data the model was fit on, the predict method will classify it based on the centroids that were calculated in the fit method.
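Sketched out, predict is just the assignment step from fit, reusing the stored centroids:

```python
    # (method of the KMeansScratch class sketched above)
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # distance from each new point to every fitted centroid, then pick the closest one
        distances = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return distances.argmin(axis=1)
```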

This was an interesting endeavor to undertake, and I underestimated how complex some of the prebuilt libraries such as sklearn can get. It gave me a new perspective on how to better use learning and prediction algorithms. K-means clustering is a very powerful tool, and my understanding of its inner workings has truly grown.

