K-means Clustering Clearly explained
We have taken a look at linear and logistic regression and how to implement both algorithms in Python. These algorithms are examples of supervised machine learning algorithms since they take a final value output. We will now go on to look at our first unsupervised machine learning algorithm for this series.
Please consider watching this video if any section of this article is unclear:
What is K-means clustering?
K-means clustering is an unsupervised machine learning algorithm, where its job is to find clusters within data. We can then use these clusters identified by the algorithm to make predictions for which group or cluster a new observation belongs to.
Imagine we recorded some data and made a scatter plot:
The job of K-means clustering is to identify clusters or groups within the data:
This is useful since if we were to record a new observation, we can identify which group this observation most likely belongs to through k-means clustering.
Choose the number of clusters or groups (k) you wish to put the data into, let us choose 3.
We will later discuss a method in which the computer can find the best number of clusters for us.
Randomly initialise k points, the are called centroids.
Identify the points closest to each centroid:
Calculate the mean distance our centroid is away from each point in each cluster and move our centroid to that position.
Repeat steps 3 and 4 until the mean of the points in each cluster remains the same:
- We have successfully identified the three clusters in our algorithm. But this may not always be the case:
- Sometimes we have an initialisation of centroids (Step 2) that results in clusters that seem wrong. For example:
- To prevent this from happening we ensure we do many random initializations.
- To choose which resulting clusters from our many random initializations suit our data best we look at the objective function.
The Objective Function
To objective function is given by the following formula:
In this formula we are:
- Looping over all centroids
- Calculating the distance each centroid is away from the data points in its cluster
- Squaring and summing these distances
For each random initialization we choose the final clusters produced with the lowest objective function value. So in the following case:
We would choose the clusters produced from random initialization 1 since this produces the smaller objective function value. In general, for more data, we would do more random initializations. Other initialization methods, which improve the efficiency of the algorithm can be read here: Link.
Choosing the number of clusters
From the above graphs, it was clear to see that 3 clusters seemed suitable for our data. Sometimes this is less obvious, for example how many clusters should we choose for the following data?
It is important to note that in general:
- The more clusters or centroids we include, the less the overall objective function value.
- If our number of clusters or centroids is equal to our number of data points, then this would produce an objective function value of 0. This is because we would have enough centroids to place on every data point.
The Elbow Method
To decide how many clusters k to include we plot our objective function values also known as intertia, against number of clusters K. We may get a result that looks like this for some dataset:
To find the optimal number of clusters, we select the value of K at the elbow of the graph. This is the point just before the inertia (objective function) starts decreasing linearly. So from the above graph we would choose 3 clusters.
Considerations of K-means clustering
- Simple and quick to implement
- Works well on large datasets
- Adapts well to new examples (data points)
- Easy to interpret
- Must select number of clusters K manually which is time consuming
- Clusters are effected heavily by outliers.
- Different random centroid initializations may give different clusters — requires many initializations.
In the next episode we will be implementing K-means clustering to a real-life dataset using Python.
- K-means clustering is a common unsupervised machine learning algorithm that is used to cluster data into groups
- We do many initializations of centroids to ensure we select for clusters that minimise our objective function
- The elbow method is used to choose the number of clusters for our data to be put into