Clustering with K-Means: simple yet powerful

Alexandre Henrique
10 min read · Apr 1, 2019


Cluster Analysis is a multivariate statistical technique that groups observations based on the features or variables that describe them, such that:

  • Examples within a cluster are similar (in this case, we speak of high intraclass similarity).
  • Examples in different clusters are different (in this case, we speak of low interclass similarity).

By measuring the similarity/dissimilarity we can discover implicit patterns in the data in an unsupervised manner, even when no category label is provided.

You can easily find applications for clustering in market segmentation, astronomy, image compression, exploratory data analysis, and feature learning.

Goals

Cluster Analysis helps us describe, analyze, and gain insight into the data. At the same time, much of what we can learn from the data depends on our ability to interpret the results. That’s why it is so important to understand the technique and how it works.

Here we’ll work with one of the simplest yet most powerful clustering methods, the K-means method. Besides outlining what K-means is up to, we’re going to build a model in the Python programming language with the Scikit-Learn library. Having said that, from now on I’ll try to answer the following questions regarding clustering with K-means:

  1. How to perform Cluster Analysis using the K-means technique?
  2. How to find the optimal number of clusters?
  3. How to identify appropriate features?
  4. Why and when do we need to standardize the data?
  5. What are the pros and cons of using K-means?

The K-means method

The K-means method is based on two important mathematical concepts, Distance and Centroid.

The centroid of the blue data points

Commonly, we use the Euclidean distance as the metric to calculate the “distance” between two samples. The centroid, on the other hand, is the arithmetic mean position of a set of points. For instance, you can visualize the centroid of the blue points in the figure above.
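
To make both concepts concrete, here is a tiny NumPy check (the points are made up purely for illustration):

import numpy as np

# Two sample points
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean distance between them
print(np.linalg.norm(p - q))        # 5.0

# Centroid (arithmetic mean position) of a set of points
points = np.array([[0, 0], [2, 0], [1, 3]])
print(points.mean(axis=0))          # [1. 1.]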

The algorithm

The K-means algorithm divides a set of n samples X into k disjoint clusters cᵢ, i = 1, 2, …, k, each described by the mean (centroid) μᵢ of the samples in the cluster. K-means assumes that all k groups have equal variance.

Roughly speaking, the algorithm performs the following steps:

  1. Choose the number (k) of clusters;
  2. Specify the cluster seeds (define the initial positions of the centroids);
  3. Assign each point to a centroid based on proximity (if centroid α is the nearest centroid to point p, assign p to α); do this for all points;
  4. Adjust the centroids (recalculate the position of each centroid based on the points assigned to it);
  5. Repeat steps 3 and 4 until a stop criterion is reached;
  6. Once the stop criterion (commonly a pre-defined maximum number of iterations) is reached, end the execution.
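
As an illustration only (we will use Scikit-Learn’s implementation later), a minimal NumPy sketch of these steps could look like the code below; the function name, the random seed, and the fixed iteration cap are arbitrary choices for this sketch, and it assumes no cluster ever becomes empty:

import numpy as np

def kmeans_sketch(X, k, n_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random samples as the initial centroids (seeds)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iterations):                 # Step 6: maximum-iterations stop criterion
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids): # extra stop criterion: centroids stopped moving
            break
        centroids = new_centroids                 # Step 5: repeat
    return labels, centroids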

In order to visually understand how the algorithm works, have a look at the following animation:

K-means in action

The centroids are represented by the yellow points. The purple, red, and green points are data points and each color is related to one of the three clusters.

Note how initially, the centroids start at random positions. At every iteration, the distance from all the data points to each centroid is calculated. After that, each point is assigned to a centroid based on proximity, and finally, the centroids move towards the mean position of the points assigned to them. The process repeats until the algorithm stops.

What are the true intentions of K-means?

The K-means algorithm solves the following minimization problem:

$$\min_{c_1, \dots, c_k} \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert^2$$

The K-means objective function

where k is the number of clusters, cᵢ is the set of points that belong to cluster i, and μᵢ is the centroid of cluster cᵢ.

DON’T PANIC!

This equation tells us that K-means aims to minimize the distance between each point and the centroid representing its cluster. It can be shown that, for a fixed number of clusters, this is equivalent to maximizing the separation between clusters. You can learn more about this on the Wikipedia page.
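
One way to see this equivalence is the classical sum-of-squares decomposition: writing $\bar{x}$ for the overall mean of the data,

$$\sum_{x \in X} \lVert x - \bar{x} \rVert^2 \;=\; \underbrace{\sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert^2}_{\text{within-cluster}} \;+\; \underbrace{\sum_{i=1}^{k} \lvert c_i \rvert \, \lVert \mu_i - \bar{x} \rVert^2}_{\text{between-cluster}}$$

The left-hand side depends only on the data, not on the clustering, so minimizing the within-cluster term (the K-means objective) is the same as maximizing the between-cluster term.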

Drawback #1: Number of clusters

The K-means objective function uses the square of the Euclidean distance d(x, μᵢ). This quantity is also referred to as inertia or the within-cluster sum of squares. Based on what you read above, the K-means goal is to minimize inertia.

However, the definition of inertia hides a problem. Take a dataset with n samples and build a K-means model with n clusters as well. Inevitably, each of the n centroids will have exactly one observation assigned to it, so the distance between each point and its centroid will be zero, and therefore the inertia will be zero!

This result might look like cause for celebration: we reached the minimum of zero, so in a sense we solved the minimization problem!

But there is no point in setting the number of clusters equal to the number of data points. Imagine doing that for a thousand data points!

The purpose of using K-means is to identify patterns in the data and group similar observations together. So, it’s obvious that the number of clusters, k, must always be less than the number of samples.

Mathematically, if k represents the number of clusters and n the number of data points, then k < n. Similarly, if we use only one cluster, then all the data will be assigned to that single class, which is another useless solution. Therefore, 1 < k < n.
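
A quick sanity check with Scikit-Learn (on ten made-up points) confirms the degenerate case:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(10, 2)                            # 10 toy samples
kmeans = KMeans(n_clusters=10, n_init=10).fit(X)     # one cluster per sample
print(kmeans.inertia_)                               # 0.0 — each point is its own centroid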

Drawback #2: Centroid initialization

It is also important to note that the algorithm is not guaranteed to converge to the global minimum. It can be shown that K-means always converges to a local minimum of the inertia, although the proof is out of the scope of this article.

The fact is that K-means performance depends on the random initialization of the centroid seeds: some seeds can result in a poor convergence rate or in convergence to a suboptimal clustering.

Drawback #3: Not standardized data

Imagine you have a dataset that consists of house prices (in dollars), sizes (in square feet), and numbers of rooms, so each feature takes values in a very different range. Your task is to split this dataset into clusters in order to identify different classes of houses.

We already saw that the K-means algorithm depends on the concept of distance between the observations we’re trying to cluster. Given that house prices are on a much larger scale than the other variables, when K-means calculates the “distance” between points, this variable will dominate the computation.

Consequently, without some sort of standardization, the algorithm will effectively consider only the price variable. For example, take two houses, A and B.

  • House A costs 280000 US$, has 201 ft² and 4 rooms.
  • House B costs 300000 US$, has 250 ft² and 4 rooms.

The cost feature is on a much larger scale than the size feature, and the same happens between the size feature and the number of rooms.

Okay, now comes the problem. When the algorithm calculates the “distance” between A, B, and a centroid, the contributions of the size and number-of-rooms features are negligible compared to that of the cost feature.

This huge difference in scale will jeopardize the model, because the Euclidean distance ends up dominated by a single feature. You may be wondering whether your model will still produce an output. Yes, it will!

What will happen is that your model will show you clusters of houses based only on their prices. For instance, such clusters may be called: Cheap, Reasonable and Expensive.
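
To see the scale problem in numbers, here is a quick NumPy check comparing houses A and B with the values above:

import numpy as np

# Houses A and B from above: [price in US$, size in ft², number of rooms]
house_a = np.array([280_000, 201, 4])
house_b = np.array([300_000, 250, 4])

print(np.abs(house_a - house_b))          # [20000    49     0]
print(np.linalg.norm(house_a - house_b))  # ≈ 20000.06 — the price gap dominates the distance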

An implementation

Okay, now we got the idea of how K-means works. Let’s write some code!

Getting the data

First things first. Before coding K-means, we need the data. In this example, let’s create our own artificial samples just like the points you saw in the animation.

# Import the relevant packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

# Number of points in each cluster
MAXN = 40

# Draw three groups of points from normal distributions
data = np.concatenate([1.25 * np.random.randn(MAXN, 2),
                       5 + 1.5 * np.random.randn(MAXN, 2)])
data = np.concatenate([data, [8, 3] + 1.2 * np.random.randn(MAXN, 2)])

df = pd.DataFrame(data={'x': data[:, 0], 'y': data[:, 1]})
plt.scatter(df['x'], df['y'])
plt.show()
Data points

Centroid initialization

Now, let’s split the data points into three clusters (k = 3). But before that, we must initialize the centroids. We could initialize them manually by indicating their coordinates but this will bias our results. Let’s think a bit more about this issue.

So, how could we solve the initialization problem?

As said previously about the initialization drawback, we must choose the positions of the initial centroids intelligently so that we do not get stuck in a poor local minimum.

To alleviate the problem of local minima, the K-means computation implemented in the Scikit-Learn library is run several times with different centroid initializations (controlled by the n_init parameter), and the run with the lowest inertia is kept.

Additionally, one way to address this issue is the k-means++ initialization scheme, which is implemented in Scikit-Learn (use the init='k-means++' parameter, which is also the default).

This scheme initializes the centroids to be (generally) distant from each other, which typically leads to better results than purely random initialization. In practice, you only need to worry about initialization details if you are using a package other than Scikit-Learn.

Clustering

Finally, we can run K-means. It is as simple as this:

# Fit K-means with three clusters and k-means++ initialization
kmeans = KMeans(init='k-means++', n_clusters=3)
clusters = kmeans.fit_predict(df)

# Color each data point according to its assigned cluster
plt.scatter(df['x'], df['y'], c=clusters, cmap='rainbow')
plt.show()
Clustered data

Amazing! It looks like we got a good result.

Here the clusters have no particular meaning, but in real-life examples you’ll be able to tell what each cluster is about.

In this example, we assumed we had three clusters. In fact, when we created the data, we drew it from three sample distributions with different means and standard deviations. But when we don’t have this information, how can we possibly know the best number of clusters for a given problem?

The Elbow method

The best number of clusters is the smallest one that still yields a low inertia value. Let’s take the dataset from the last example and plot the inertia against the number of clusters.

# Work on a copy of df that contains only the features
new_df = df.copy()

# Empty list for the inertia values
inertia = []

for i in range(1, 10):
    # Instantiate a K-means model with i clusters
    # (init='k-means++' is the default)
    kmeans = KMeans(n_clusters=i)

    # Fit the model to the data
    kmeans.fit(new_df)

    # Append the inertia of the model to the list
    inertia.append(kmeans.inertia_)

    # Knowing from the data that we have three sample distributions,
    # save the inertia for the case k = 3
    if i == 3:
        elbow = kmeans.inertia_

# Plot the inertia versus the number of clusters
number_of_clusters = range(1, 10)
plt.plot(number_of_clusters, inertia)
plt.plot(3, elbow, 'ro', label='Elbow')
plt.legend()
plt.xlabel('# of clusters')
plt.ylabel('Inertia')
plt.show()
Inertia versus the number of clusters

From the graph, the inertia drops at an extremely high rate in the beginning. At some point, it reaches the elbow (number of clusters equal to 3), and from there on there is no significant change in the inertia value.

The elbow method consists of adopting the number of clusters at the elbow of the graph. Therefore, the ideal number of clusters for this problem is 3.

Talking about Standardization

Until now, we have discussed what centroid initialization is and how to handle it, and how to choose the right number of clusters for your problem. Although it was not an issue in the example from the implementation section, let’s now talk about how to deal with non-standardized data.

The most common method applied to standardize variables is subtracting their mean and dividing by their standard deviation. This implies that the variables will have zero mean and unit variance. In fact, when the distribution of the variables isn’t terribly skewed, this works fine.

However, when we have many outliers, it is a better approach to replace the standard deviation with the interquartile range, i.e. Q₃ − Q₁. By doing this, we prevent outliers from dramatically influencing the algorithm.

The same reasoning applies to the mean. When we have many outliers, the median, a measure of central tendency that is not influenced by them, is the safer choice.

Thus, you must judge whether or not it is necessary to standardize the variables. Assuming a roughly normal distribution and no outliers, it is safe to standardize the variables using their mean and standard deviation. Conversely, if the data distribution is highly skewed and/or presents many outliers, you can remove the outliers or standardize the variables using the median and the interquartile range.
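
As a minimal sketch of both recipes with Scikit-Learn (the house numbers below are made up for illustration), StandardScaler applies the mean/standard-deviation standardization, while RobustScaler centers on the median and scales by the interquartile range:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical house data: [price in US$, size in ft², number of rooms]
X = np.array([[280_000, 201, 4],
              [300_000, 250, 4],
              [150_000, 120, 2],
              [950_000, 410, 6]])   # an outlier mansion

# Mean and standard deviation: fine for roughly normal data without outliers
print(StandardScaler().fit_transform(X))

# Median and interquartile range: more robust to outliers
print(RobustScaler().fit_transform(X))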

K-means Pros and Cons

After everything we’ve been talking about so far, let’s summarise the pros and cons of using K-means. You have probably guessed what they are by now.

Pros

  1. Simple to understand
  2. Fast to cluster
  3. Widely available (there are several packages that implement K-means)
  4. Easy to implement
  5. Always yields a result (also a con, as it may be deceiving)

Cons

  1. We need to choose the number of clusters (remedy: the elbow method)
  2. It is sensitive to initialization (remedy: k-means++)
  3. It is sensitive to outliers (remedy: remove the outliers)
  4. It is sensitive to the scale of the features (remedy: standardization)

That’s all for now :( Thank you very much for reading. I hope you enjoyed our journey through the K-means clustering method. If you want to know more about the topic, I provided a Jupyter Notebook with some valuable extra information plus more examples on my GitHub page at https://github.com/alexandrehsd/Cluster-Analysis.

I would appreciate getting your feedback and tips for possible improvements in the stories to come. Please, if you have any, leave them in the comments section! Bye Bye and happy clustering! :)
