K-means Clustering Clearly Explained

Intro to Unsupervised Algorithms | Data Series | Episode 8.1

Mazen Ahmed
Nov 22, 2020 · 5 min read

We have taken a look at linear and logistic regression and how to implement both algorithms in Python. These are examples of supervised machine learning algorithms, since they learn from data with a known output value for each observation. We will now move on to the first unsupervised machine learning algorithm of this series.

Please consider watching this video if any section of this article is unclear:

Video Link

What is K-means clustering?

K-means clustering is an unsupervised machine learning algorithm whose job is to find clusters within data. We can then use the clusters identified by the algorithm to predict which group, or cluster, a new observation belongs to.

Overview

Imagine we recorded some data and made a scatter plot:

[Figure: scatter plot of the recorded data]

The job of K-means clustering is to identify clusters or groups within the data:

[Figure: the same scatter plot with the clusters highlighted]

This is useful because, if we record a new observation, K-means clustering lets us identify which group that observation most likely belongs to.
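As a preview of what this looks like in code (we will implement it properly in the next episode), here is a minimal sketch using scikit-learn. The data array X and the new observation are made-up examples, not data from this article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: three loose groups of points
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

# Fit K-means, then ask which cluster a new observation belongs to
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
new_observation = np.array([[4.8, 5.2]])   # a new point we just recorded
print(kmeans.predict(new_observation))     # index of its most likely cluster
```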

The Algorithm

Step 1:

Choose the number of clusters or groups, k, you wish to put the data into; let us choose k = 3.
We will later discuss a method by which the computer can find the best number of clusters for us.

Step 2:

Randomly initialize k points; these are called centroids.

[Figure: three randomly initialized centroids on the scatter plot]

Step 3:

Assign each data point to its closest centroid:

[Figure: data points assigned to their closest centroids]

Step 4:

For each cluster, calculate the mean position of the points assigned to it, and move that cluster's centroid to this mean position.

[Figure: centroids moved to the mean position of their clusters]

Step 5:

Repeat steps 3 and 4 until the centroids stop moving, i.e. until the mean of the points in each cluster remains the same:

[Figure: final centroids with the three clusters identified]
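To make the five steps concrete, here is a minimal sketch of the algorithm in plain NumPy. The function name and iteration cap are illustrative choices, and edge cases such as empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Illustrative K-means: steps 2-5 in plain NumPy (not optimized)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly initialize k centroids (here: k distinct data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean position of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids (cluster means) no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```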
  • We have successfully identified the three clusters in our data. But this may not always be the case:
  • Sometimes a random initialization of the centroids (Step 2) results in clusters that seem wrong. For example:

[Figure: a poor clustering produced by an unlucky centroid initialization]

  • To prevent this from happening, we perform many random initializations.
  • To choose which of the resulting clusterings suits our data best, we look at the objective function.

The Objective Function

The objective function is given by the following formula:

J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2

where C_j is the set of data points assigned to cluster j and \mu_j is that cluster's centroid.

In this formula we are:

  1. Looping over all centroids
  2. Calculating the distance each centroid is away from the data points in its cluster
  3. Squaring and summing these distances
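A minimal sketch of this computation in plain NumPy, assuming X, centroids, and labels as in the earlier sketch:

```python
import numpy as np

def objective(X, centroids, labels):
    """Sum of squared distances from each point to its own cluster's centroid."""
    return sum(
        np.sum((X[labels == j] - centroids[j]) ** 2)  # squared distances in cluster j
        for j in range(len(centroids))                # loop over all centroids
    )
```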

From our many random initializations, we choose the final clustering with the lowest objective function value. So in the following case:

[Figure: two clusterings from different random initializations, with their objective function values]

We would choose the clusters produced by random initialization 1, since it gives the smaller objective function value. In general, for more data, we would do more random initializations. Other initialization methods, which improve the efficiency of the algorithm, can be read about here: Link.
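For reference, scikit-learn's KMeans handles these repeated random initializations for us via its n_init parameter, and exposes the objective function value of the best run as the inertia_ attribute. A minimal sketch, reusing the X array from the earlier example:

```python
from sklearn.cluster import KMeans

# n_init=10 runs ten random initializations and keeps the clustering
# with the lowest objective function value, exposed as .inertia_
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.inertia_)  # objective value of the best of the ten runs
```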

Choosing the number of clusters

From the above graphs, it was clear that 3 clusters were suitable for our data. Sometimes this is less obvious; for example, how many clusters should we choose for the following data?

[Figure: a scatter plot where the number of clusters is not obvious]

It is important to note that in general:

  • The more clusters or centroids we include, the lower the overall objective function value.
  • If our number of clusters or centroids equals our number of data points, the objective function value would be 0. This is because we would have enough centroids to place one on every data point.

The Elbow Method

To decide how many clusters k to include, we plot our objective function values, also known as inertia, against the number of clusters k. For some dataset, we may get a result that looks like this:

[Figure: inertia plotted against the number of clusters k, with an elbow at k = 3]

To find the optimal number of clusters, we select the value of k at the elbow of the graph. This is the point just before the inertia (objective function) starts decreasing roughly linearly. So from the above graph we would choose 3 clusters.
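A minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available and reusing the X array from the earlier example:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-means for a range of k values and record each fit's inertia
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (objective function value)")
plt.show()  # look for the elbow where the curve begins to flatten
```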

Considerations of K-means clustering

Advantages

  • Simple and quick to implement
  • Works well on large datasets
  • Adapts well to new examples (data points)
  • Easy to interpret

Disadvantages

  • Must select the number of clusters k manually, which is time consuming
  • Clusters are affected heavily by outliers
  • Different random centroid initializations may give different clusters, so many initializations are required

In the next episode we will apply K-means clustering to a real-life dataset using Python.

Summary

  • K-means clustering is a common unsupervised machine learning algorithm used to cluster data into groups
  • We perform many initializations of the centroids to ensure we select the clustering that minimizes our objective function
  • The elbow method is used to choose the number of clusters to put our data into

Prev Episode _______ Next Episode
