Introduction to K-means clustering

Little Dino
4 min read · Apr 29, 2022


Introduction

K-means clustering is a non-parametric and unsupervised algorithm that divides data into K non-overlapping groups.

  • Non-parametric: K-means clustering does NOT make assumptions about the data’s distribution or structure.
  • Unsupervised: The class labels of the training set are NOT provided. Namely, K-means clustering doesn’t know the true labels of the training set; it just clusters the data based on similarity.

Workflow

For example, say we want to group our customers into 3 groups based on their age and annual income. We can use K-means clustering where K equals 3, and here’s what our data looks like.

⚡ K needs to be pre-determined, and we often don’t know which K matches the real data best.
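Before walking through the steps, here’s what this example might look like end-to-end with scikit-learn. This is a minimal sketch; the customer numbers below are made up purely for illustration.

```python
# A minimal sketch of the customer example with scikit-learn.
# The data values are hypothetical, purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer: [age, annual income in $1000s]
X = np.array([
    [25, 30], [30, 28], [45, 32],   # lower income, varied ages
    [50, 90], [55, 85], [48, 95],   # higher income, older
    [26, 88], [29, 92], [31, 80],   # higher income, younger
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # cluster index (0–2) for each customer
print(labels)
print(kmeans.cluster_centers_)      # the 3 centroids
```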

1. Initialize algorithm by assigning centroids

Centroids are the centers of the groups, so we’ll have 3 centroids in this example. Specifically, the coordinates of a centroid are the average of the coordinates of all the data points within the same group.

To initialize the algorithm, we can randomly assign centroids, or we can use a more robust method. To put it simply, we randomly pick 3 customers as centroids and represent them with orange (C1), green (C2), and blue (C3) dots.

The initialization of centroids affects the clustering result! Sometimes K-means might not reach the global optimum due to improper initialization.
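The simple random strategy can be sketched in a few lines of numpy. The data here is randomly generated just to have something to pick from:

```python
# Randomly initialize centroids by picking K distinct data points.
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.uniform([18, 20], [65, 120], size=(30, 2))  # made-up (age, income) pairs

K = 3
centroids = X[rng.choice(len(X), size=K, replace=False)]
print(centroids)  # 3 randomly chosen customers serving as C1, C2, C3
```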

2. Assign data to its closest centroid

Next, K-means assigns each data point to the closest centroid using Euclidean distance (the straight-line distance between 2 points). For instance, the data close to the orange centroid will be assigned to the orange group. After doing so, we’ll have 3 groups as the graph shows.
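In code, this assignment step might look like the following numpy sketch. The points and centroids here are placeholders:

```python
# Assign each point to its closest centroid by Euclidean distance.
import numpy as np

X = np.array([[25, 30], [50, 90], [27, 88], [46, 33]], dtype=float)
centroids = np.array([[25, 30], [50, 90], [27, 88]], dtype=float)

# Pairwise distances between every point and every centroid: shape (n_samples, K)
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)   # index of the nearest centroid per point
print(labels)                       # [0 1 2 0]
```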

3. Recompute the centroids

A centroid represents the center of a cluster. However, after assigning data to the clusters, the original centroid might no longer lie at the center of its group. Thus, we have to recompute the centroids.
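Recomputing the centroids is just a per-cluster average. Continuing the sketch from the previous step:

```python
# Recompute each centroid as the mean of the points assigned to it.
import numpy as np

X = np.array([[25, 30], [50, 90], [27, 88], [46, 33]], dtype=float)
labels = np.array([0, 1, 2, 0])

K = 3
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
print(new_centroids)
# [[35.5 31.5]    <- mean of the two points in cluster 0
#  [50.  90. ]
#  [27.  88. ]]
```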

4. Reassign the data (Complete the algorithm)

The next step is reassigning the data to the closest centroids (step 2) and recomputing the centroids (step 3). But when does this loop end?

It ends when the clustering converges. When the recomputed centroids are the same as the ones we currently have, the reassignment will simply lead to the exact same grouping, and that’s when K-means stops.

In this example, the result above happens to be the converged clustering. Therefore, we successfully cluster our customers into 3 groups. From the scatter plot, we can tell the customers in the green group generally have low annual income, while their ages vary. On the other hand, the customers in both the orange and blue groups have higher annual income, but the customers in the blue group are overall younger.

To recap, we first decide K (the number of clusters), initialize the algorithm by assigning K centroids, and assign the data to its closest centroid. Then, we recompute the centroids and reassign the data until the clustering converges, so the reassignment and recomputation loop may have to be repeated several times.
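The whole recap fits in a short from-scratch sketch in numpy. The data is randomly generated, and a real implementation would also handle edge cases such as empty clusters:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # step 1: initialize
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute centroids as cluster means
        # (assumes no cluster ends up empty; real code should handle that)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 4: stop once the recomputed centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = rng.uniform([18, 20], [65, 120], size=(50, 2))   # hypothetical (age, income)
labels, centroids = kmeans(X, K=3)
print(centroids)
```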

Pros and Cons

Let’s start with the advantages of the K-means algorithm.

  • Interpretable — K-means groups data based on similarity, and its workflow is easy for humans to understand.
  • Guaranteed to converge — K-means will not run forever. It’ll eventually stop and give you a clustering.
  • Adaptable — K-means can adapt to new data regardless of its shape or distribution, since it’s a distance-based algorithm.

Nonetheless, we should also be careful when using it.

  • Not guaranteed to reach the global optimum — K-means is guaranteed to converge, but the converged result is not guaranteed to be the best. In fact, the quality of the result depends heavily on the initialization of the centroids, so smart initialization (e.g., k-means++, the default in scikit-learn) is preferred.
  • K needs to be pre-determined — in the real world, we wouldn’t know how many clusters work best, and it’s often hard to visualize the data. Thus, we have to gather more background information or try several values of K (see the elbow-method sketch after this list).
  • Sensitive to outliers — the coordinates of a centroid are the average of all the data within a cluster, so the centroid will lean towards outliers due to their extreme locations. In this article, we talked about how to detect self-defined outliers using K-means in Python. However, if the outliers are too extreme, K-means might generate an unexpected clustering result.
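As a concrete example of that trial-and-error, here’s a sketch of the elbow method: fit K-means for a range of K values, plot the inertia (the within-cluster sum of squared distances), and look for the “elbow” where the curve flattens out. The data here is randomly generated for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform([18, 20], [65, 120], size=(100, 2))   # made-up (age, income) data

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.show()
```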

