K-Means Clustering: A very simple overview

Thiago Ricieri
Jul 20, 2017 · 2 min read

This week I spent some time reviewing the K-Means Clustering algorithm for Unsupervised Learning.

Imagine you have a data set and you would like to know if there are clusters that could describe the relationship between the data points. If you plot them on a chart, you will most likely notice with your own eyes how they divide into groups. But how do you make the computer notice that?

That’s what K-Means Clustering does: it is an algorithm that lets you find those groups without knowing in advance what the groups actually are. You only choose how many groups to look for, and the machine decides the rest for you.

Easy way to get started

An easy way to get started with K-Means is to follow this process:

  1. Choose the number of clusters you want to extract from the data;
  2. Place one centroid per cluster at a random position;
  3. Assign each point to the centroid it is closest to;
  4. Move each centroid to the mean position of all the points in its cluster;
  5. Repeat steps 3 and 4 until the centroids stop moving, which leaves a clear boundary dividing the data set.
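The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a production implementation: the function name kmeans, its parameters and the synthetic two-blob data are all made up for the example, and the starting centroids are drawn from the data points themselves, which is one common way to "place them randomly".

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Synthetic (made-up) data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
])

centroids, labels = kmeans(data, k=2)
print(centroids)  # one centroid near (0, 0), the other near (10, 10)
```

With blobs this far apart, the labels of the first 50 points should all agree with each other and differ from the labels of the last 50.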

The problem with this approach is that if you set the centroids randomly, you might get different clusters each time you run the process. It’s a hill climbing problem: the result depends on the starting point, and the algorithm can settle on a local optimum. In practice you need to run the algorithm several times from different starting points and keep the clusters that come up most often or that fit the data best.
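To see the dependence on the starting point concretely, here is a small sketch in plain Python. The function name lloyd, the four sample points and both sets of starting centroids are made up for illustration, and the starting centroids are fixed by hand instead of drawn at random so that the outcome is reproducible.

```python
from math import dist

def lloyd(points, centroids, max_iter=100):
    """K-means iterations from given starting centroids (hypothetical helper)."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Total squared distance from each point to its nearest final centroid.
    inertia = sum(min(dist(p, c) ** 2 for c in centroids) for p in points)
    return clusters, inertia

# Four corners of a wide, flat rectangle (made-up data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start A: one centroid on the left edge, one on the right edge.
good_clusters, good_inertia = lloyd(points, [(0, 0.5), (4, 0.5)])

# Start B: one centroid on the bottom edge, one on the top edge.
bad_clusters, bad_inertia = lloyd(points, [(2, 0), (2, 1)])

print(good_clusters, good_inertia)
print(bad_clusters, bad_inertia)
```

Both runs converge immediately, but to different answers: start A splits the points into left and right pairs with a total squared distance of 1, while start B gets stuck with bottom and top pairs and a total squared distance of 16. Neither run can tell on its own that a better grouping exists, which is why several starts are compared.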

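Sklearn (scikit-learn) has a KMeans estimator that makes this process very easy. Strictly speaking it is a clustering estimator, not a classifier, since it needs no labels. You just have to set the main parameters, n_clusters, n_init and max_iter, and you are ready to go.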
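A minimal usage sketch with scikit-learn's KMeans follows. The synthetic data and the specific parameter values are assumptions chosen for illustration; random_state is added only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (made-up) data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
])

model = KMeans(
    n_clusters=2,    # how many clusters to look for
    n_init=10,       # how many times to restart with different centroids
    max_iter=300,    # upper bound on iterations within a single run
    random_state=0,  # fixed seed so the example is repeatable
)
labels = model.fit_predict(data)

print(model.cluster_centers_)  # final centroid positions
print(model.inertia_)          # total squared distance of points to their centroid

# Assign new, unseen points to the learned clusters.
new_labels = model.predict(np.array([[0.2, -0.1], [9.5, 10.3]]))
print(new_labels)
```

With n_init greater than 1, scikit-learn runs the algorithm that many times and keeps the run with the lowest inertia, which addresses the starting-point problem described above.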

Written by Thiago Ricieri, Lead Software Engineer at @PlutoTV
