Learn K-Means and Hierarchical Clustering Algorithms in 15 minutes

c733 data scientists
SFU Professional Computer Science
15 min read · Feb 10, 2022

Authors: Haonan Wu, Jingyan Sun, Lingxiao Song, Ziyue Cheng

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/mpcs.

What are Clustering Algorithms?

Clustering is a machine learning technique that groups similar objects together and places dissimilar objects into different groups, and it is a very important tool in data analysis. Clustering is an unsupervised learning technique: it works on unlabeled data and automatically divides the data into clusters without being told ahead of time how the groups should look.

In this post, we will concentrate on two of the most commonly used approaches: K-Means Clustering and Hierarchical Clustering.

K-Means Clustering Algorithm

K-means is an iterative algorithm that aims to partition data into K clusters so that each observation belongs to exactly one of the K clusters. The objective is to minimize the “within-cluster variation”, i.e., the elements in the same cluster should be as similar as possible. K-means achieves this goal by minimizing the distance of the points in a cluster to their centroid (the arithmetic mean of all data within the cluster). In Euclidean geometry, we measure pair-wise squared Euclidean distances. A point is considered to belong to a particular cluster if it is closer to that cluster’s centroid than to any other centroid.

We can treat K-means as an optimization problem with the objective function J as follows:

J = Σₙ Σₖ rₙₖ ‖xₙ − μₖ‖²

where the subscripts n and k refer to the data index and the cluster index respectively, and μₖ denotes the centroid of the kth cluster. Note that rₙₖ = 1 for data point xₙ if it belongs to cluster k and rₙₖ = 0 otherwise.

The goal is to find the centroids that minimize the objective function J above, which we do by solving ∂J/∂μₖ = 0. The resulting centroid update is derived below.
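Setting the derivative of J with respect to μₖ to zero gives the familiar update: each centroid is the mean of the points currently assigned to it. Written out (a standard derivation, shown here in LaTeX notation):

\frac{\partial J}{\partial \mu_k} = -2 \sum_{n} r_{nk} (x_n - \mu_k) = 0
\;\;\Longrightarrow\;\;
\mu_k = \frac{\sum_{n} r_{nk}\, x_n}{\sum_{n} r_{nk}}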

The idea of the K-means algorithm can be summarized as follows:

  1. Choose the number of clusters (K) that we want to categorize into and randomly pick K data points as the centroids (center of the cluster) of each cluster.
  2. Measure the distance between every data point and every centroid, and assign each data point to the cluster of its closest centroid.
  3. Update the centroids by calculating the average of all data points within each cluster.
  4. Keep iterating steps 2 and 3 until either the centroids stop changing or the maximum number of iterations is reached (a minimal code sketch follows the flow chart below).
Flow Chart of K-means Algorithm
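The four steps above map almost directly onto a minimal NumPy sketch. This is an illustration only, under the assumptions that X is a 2-D NumPy array and that no cluster ever becomes empty; in practice you would simply use sklearn.cluster.KMeans.

import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (assumes no cluster ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids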

Parameters of K-Means in Scikit-Learn

The full list of parameters in sklearn.cluster.KMeans includes n_clusters, init, n_init, max_iter, tol, verbose, random_state, copy_x and algorithm. We need to pay special attention when choosing values for the parameters below:

  1. n_clusters: The number of clusters to form, as well as the number of centroids to generate (this is K; default=8).

Sometimes it is hard to figure out what value of K we should use. Luckily, there are several methods that help find a good K.

Method 1: The Elbow method

The Elbow method is one of the most popular ways to choose K. The main idea is to compute, for different values of K, the sum of squared distances between each data point and its cluster’s centroid, also called the Within-Cluster Sum of Squared errors (WCSS), and to pick the K at which WCSS stops dropping sharply and begins to decrease in a roughly linear fashion, which is the elbow point in the WCSS vs. K plot. In the sklearn library there are two ways to do this; one is through KMeans’ inertia_ attribute. However, if the dataset is loosely structured, the curve may be smooth and the elbow hard to identify. In the figure below, for example, any k between 3 and 7 could plausibly be the elbow point.

Alternatively, the yellowbrick library provides KElbowVisualizer (from yellowbrick.cluster import KElbowVisualizer), which can locate the elbow point automatically.
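A sketch of both routes on toy data (the make_blobs dataset and the candidate range of k are assumptions made for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Route 1: plot KMeans' inertia_ (the WCSS) for a range of k and look for the elbow.
wcss = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('k')
plt.ylabel('WCSS (inertia_)')
plt.show()

# Route 2: let yellowbrick locate the elbow automatically.
from yellowbrick.cluster import KElbowVisualizer
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(2, 11))
visualizer.fit(X)
visualizer.show()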

Method 2: Silhouette Coefficient

The Silhouette coefficient/score takes values between -1 and 1.

  • 1: Means clusters are well apart from each other and clearly distinguished.
  • 0: Means clusters are indifferent, or we can say that the distance between clusters is not significant.
  • -1: Means clusters are assigned in the wrong way.

So we need to find a k value whose Silhouette score is as close to 1 as possible.

To calculate the Silhouette score of a sample, we define a as the mean intra-cluster distance (the distance between the sample and the other data points within the same cluster) and b as the mean distance between the sample and the points of the next nearest cluster. The score for the sample is then s = (b − a) / max(a, b).

From the figure below we can see 3 is the best k value.
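A short sketch of computing this with sklearn.metrics.silhouette_score on toy data (the make_blobs dataset and the candidate range of k are assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Compute the mean Silhouette score for each candidate k and keep the best one.
scores = {}
for k in range(2, 11):  # the Silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])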

Method 3: Gap statistic

Without relying on an sklearn function, we can also find the k value statistically.

This method is based on the gap statistic from “Estimating the number of clusters in a data set via the gap statistic” (Tibshirani, Walther, and Hastie, 2001); we want the k that maximizes the gap value.

For example, in the figure below, 7 is the best k value according to this method.

See more details via https://github.com/milesgranger/gap_statistic
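The repository above provides its own interface; as a library-free illustration, the gap statistic can also be sketched by hand. For each k we compare log(Wₖ) on the data with the average log(Wₖ) over a few uniform reference datasets drawn from the data’s bounding box (the toy dataset, the number of reference sets, and the candidate range of k below are all assumptions):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k_max=10, n_refs=5, seed=0):
    # gap(k) = E[log(Wk) on uniform reference data] - log(Wk) on the real data
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_wk = np.log(KMeans(n_clusters=k, random_state=seed).fit(X).inertia_)
        ref_log_wk = [
            np.log(KMeans(n_clusters=k, random_state=seed)
                   .fit(rng.uniform(mins, maxs, size=X.shape)).inertia_)
            for _ in range(n_refs)
        ]
        gaps.append(np.mean(ref_log_wk) - log_wk)
    return np.argmax(gaps) + 1  # the k with the largest gap

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
print(gap_statistic(X))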

Note that none of the above methods is guaranteed to find the “correct” k for K-Means. For example, we used the same dataset for all three methods above, yet some suggest 3 as the best k while others suggest 7. These methods provide extra information to help you choose k, and even if you do not pick a perfect value, sklearn’s KMeans can still produce reasonable labels with the help of its other parameters.

2. init: how to initialize the cluster centers

The most commonly used values for init are ‘k-means++’ and ‘random’, with ‘k-means++’ as the default.

K-means++: an algorithm that selects initial cluster centers for K-means clustering in a smart way to speed up convergence. The idea is to pick centroids that are far away from one another, which increases the chances that the initial centroids lie in different clusters. Also, since the centroids are picked from the data points, each centroid has some data points associated with it at the end.

random: choose n_clusters observations (rows) at random from data for the initial centroids.

In most cases, k-means++ needs fewer iterations to converge than random initialization.
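One way to see this is to compare the n_iter_ attribute (the number of iterations the best run needed to converge) under both initializations; the toy dataset and the n_init value below are assumptions, and the exact numbers will vary from run to run:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

for init in ('k-means++', 'random'):
    km = KMeans(n_clusters=5, init=init, n_init=10, random_state=0).fit(X)
    # n_iter_ reports how many iterations the best of the n_init runs needed.
    print(init, km.n_iter_)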

Hierarchical Clustering Algorithm

The idea behind hierarchical clustering algorithms is quite different. Unlike the K-Means algorithm, a hierarchical algorithm does not require a representative for each cluster. Instead, it assumes that the clusters are hierarchically structured, that is, every cluster is made up of smaller clusters. It treats the whole dataset as one big cluster, and the sub-clusters contained in it are partitions of the parent cluster; the same assumption applies recursively to the sub-clusters. The smallest unit is a cluster that comprises only a single data point. This assumption allows us to interpret the resulting cluster structure as a tree, where the leaves represent the data points (clusters containing only one object) and each inner node represents the collection of all data points contained in its subtree. Therefore, we can think of a hierarchical clustering algorithm as constructing a tree.

There are two different approaches to achieve that:

  1. Build the tree from bottom to top — this is also known as the agglomerative approach
  2. Build the tree from top to bottom — this is also known as the divisive approach

Agglomerative Approach

The first approach is quite simple. Initially, the algorithm treats every data point as a separate cluster, which is analogous to constructing the leaves of the tree. At this level, each data point is a cluster of its own; this is already a valid partition of the dataset, but we want to do better. The algorithm then starts merging similar clusters iteratively. At each iteration, it finds the two most similar clusters and merges them into a new cluster. This newly created cluster becomes a bigger subtree with two child nodes, each representing one of the clusters that were merged, and the merged cluster is treated as a single object in the next iteration. If we continue the process and keep merging similar pairs of clusters one at a time, we eventually combine all data points into one big binary tree, at which point the algorithm stops.

Note that we are reducing the number of clusters one at a time. Therefore, we could potentially get any number of clusters (between 1 and the size of the dataset) by stopping at different levels of the tree. Below is the pseudocode for this agglomerative approach:

# Agglomerative (Bottom-up) algorithm
Form initial clusters (i.e., turn each data point into a cluster).
Compute the distance between each pair of clusters.
while number of clusters > 1:
    Merge the two clusters with minimum distance.
    Calculate the distance between the new cluster and all other clusters.
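SciPy’s hierarchy module implements exactly this bottom-up merging. The sketch below builds the full merge tree once and then cuts it at two different levels, illustrating that any number of clusters can be read off the same tree (the toy dataset, the Ward linkage, and the chosen cluster counts are assumptions):

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Build the binary merge tree bottom-up (Ward linkage here).
Z = linkage(X, method='ward')

# Cut the same tree at different levels to obtain different numbers of clusters.
labels_3 = fcluster(Z, t=3, criterion='maxclust')
labels_5 = fcluster(Z, t=5, criterion='maxclust')
print(labels_3[:10], labels_5[:10])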

Divisive Approach

As we mentioned, there is another type of hierarchical clustering algorithm called the divisive algorithm. Similar to the agglomerative one, it also constructs a binary tree (dendrogram). However, instead of building the tree bottom-up, the divisive algorithm starts from the root of the tree and constructs the child nodes iteratively. It initially treats the whole dataset as one cluster (i.e., the root node). At each iteration, it chooses a cluster and splits it into the two least similar clusters (i.e., child nodes). It stops when we obtain the desired number of clusters. Below is the pseudocode for the divisive algorithm:

# Divisive (Top-down) algorithm
Form initial cluster (i.e., turn the whole dataset into one big cluster).
while number of clusters < number of data points:   # or stop earlier at the desired number of clusters
    Choose a cluster and split it into 2 clusters.

Which cluster to choose to split and how?

As you may have noticed, the divisive approach seems more complex and harder to implement. There are two problems we need to think about:

  1. After the first iteration of the algorithm, we will have more than one cluster that we can choose to split. Which one should we choose?
  2. After choosing the cluster, how should we split it into two least similar sub-clusters?

For the first problem, one commonly used technique is to split the cluster with the largest Sum of Square Error (SSE) value. In case you don’t know what SSE is, here is the formula for a cluster C with mean μ_C:

SSE(C) = Σ_{x ∈ C} ‖x − μ_C‖²

It is the sum of squared differences between each data point in a cluster and the mean of the cluster, and it measures the variation within that cluster. In other words, if the SSE value of a cluster is large, the data points in that cluster tend to be more spread out, and therefore splitting that cluster is more likely to be the right choice.

Now that we have chosen a cluster to split, we come to the second question: how to split it into two smaller clusters. This step can be done with a flat clustering method such as the K-Means algorithm: we simply set k=2, and it produces two sub-clusters such that the within-cluster variance is minimized.
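Putting the two answers together gives a minimal divisive (bisecting) sketch: repeatedly pick the cluster with the largest SSE and split it with KMeans(n_clusters=2). This is an illustration only; the toy data and the target number of clusters are assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive_sketch(X, n_clusters=4):
    clusters = [X]  # start with the whole dataset as one cluster (the root)
    while len(clusters) < n_clusters:
        # 1. Choose the cluster with the largest SSE (the most spread out one).
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # 2. Split it into two sub-clusters with flat K-Means (k=2).
        labels = KMeans(n_clusters=2, random_state=0).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
print([len(part) for part in divisive_sketch(X)])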

Similarity functions

The sections above illustrated the steps of the two types of hierarchical clustering algorithms. However, one problem remains: how do we know whether two clusters are similar or not? The simplest situation is two clusters that each consist of only one data point. In this case, the Euclidean distance is one of the most commonly used ways to measure similarity: a small distance between two points means they are more likely to be similar to each other, and therefore more likely to end up in the same cluster.

When we are dealing with clusters consisting of more than one data point, there are several metrics to measure the similarity between clusters:

Let A and B be two clusters and d(x, y) a distance function for pairs of objects (a small numeric example follows the list below):

  • Single linkage: take the minimum distance d(a, b) over all pairs, i.e., the distance between the two closest points of the two clusters
  • Complete linkage: take the maximum distance d(a, b) over all pairs, i.e., the distance between the two farthest points of the two clusters
  • Average linkage: take the average distance over all pairs of points, one from each cluster
  • Ward linkage: minimize the increase in the Sum of Square Error (SSE): join the pair of clusters whose merge increases the SSE the least
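For two small clusters these criteria are easy to compute directly from the pairwise distance matrix. The sketch below (the toy points are assumptions) shows single, complete, and average linkage; Ward is left to the library since it depends on the SSE of the merged cluster:

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A
B = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster B

D = cdist(A, B)        # all pairwise Euclidean distances between A and B
print(D.min())         # single linkage: distance of the closest pair
print(D.max())         # complete linkage: distance of the farthest pair
print(D.mean())        # average linkage: mean distance over all pairs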

Parameters of Hierarchical Clustering Algorithms in Scikit-Learn

Since agglomerative clustering is much more widely used in the industry, we will focus on this kind of algorithm. There are four commonly used parameters in agglomerative clustering.

1. Linkage

The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pairs of clusters that minimize this criterion.

As introduced above, there are four linkage options in the Python Scikit-Learn library: ‘ward’, ‘average’, ‘complete’, and ‘single’. Agglomerative clustering has a “rich get richer” behavior that can lead to uneven cluster sizes. In this regard, single linkage is the worst strategy and Ward gives the most regular sizes. However, the affinity (i.e., the distance used in clustering) cannot be varied with Ward, so for non-Euclidean metrics average linkage is a good alternative. Single linkage, while not robust to noisy data, can be computed very efficiently and can therefore be useful for hierarchical clustering of larger datasets. Single linkage can also perform well on non-globular data.
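In scikit-learn the criterion is passed through the linkage parameter of AgglomerativeClustering. A quick comparison on toy data (the dataset and the number of clusters are assumptions):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for linkage in ('ward', 'average', 'complete', 'single'):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    # labels_ holds the cluster assignment produced by each linkage strategy.
    print(linkage, model.labels_[:10])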

2. Affinity

Affinity is the metric used to compute the linkage, in particular Euclidean distance (l2), Manhattan distance (l1), cosine distance, or any precomputed affinity matrix.

  • Euclidean distance is the most used. This is the default distance metric used to measure the distance between clusters and is simply the straight-line distance between two points. If linkage is “ward”, only “Euclidean” is accepted.
  • Manhattan distance is the sum of the absolute difference between points across all the dimensions. This works as if there was a grid-like path between the points. l1 distance is often good for sparse features, or sparse noise: i.e. many of the features are zero, as in text mining using occurrences of rare words.
  • Cosine distance measures the angle between two vectors. It is used when the magnitude of the points does not matter but the orientation does, and it is common in natural language processing. It is measured as: cosine distance = 1 − (A · B) / (‖A‖ ‖B‖).

The guideline for choosing a metric is to use one that maximizes the distance between samples in different classes and minimizes the distance within each class.
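A sketch of combining a non-Euclidean metric with average linkage (the toy data are an assumption; also note that recent scikit-learn releases rename this keyword from affinity to metric, so it may need adjusting for your version):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Manhattan (l1) distance with average linkage; 'ward' would only accept Euclidean.
model = AgglomerativeClustering(n_clusters=3, affinity='manhattan',
                                linkage='average').fit(X)
print(model.labels_[:10])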

3. Connectivity constraints

Connectivity constraints restrict which clusters can be joined. They specify which examples are considered connected, and only clusters containing connected examples (from one cluster to the other) can be merged into larger clusters. This helps with problems like the Swiss-roll example illustrated below. Without connectivity constraints (left figure), clusters extend across overlapping folds of the roll: because Ward linkage only tries to minimize the distances between points, the clusters end up containing examples from opposite sides of the gap separating different stretches of the Swiss-roll. In the right figure, we add a connectivity constraint that connects each example only to its 10 nearest neighbours, using the kneighbors_graph function from scikit-learn’s neighbors module. This creates a graph of connections that respects the structure of the data and forbids merging points that are not adjacent on the Swiss-roll.
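A sketch of the constrained setup described above, using kneighbors_graph to build the 10-nearest-neighbour connectivity matrix (the Swiss-roll generator, the noise level, and the cluster count are assumptions for illustration):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X, _ = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Restrict merges so that each example is connected only to its 10 nearest neighbours.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

model = AgglomerativeClustering(n_clusters=6, connectivity=connectivity,
                                linkage='ward').fit(X)
print(model.labels_[:10])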

4. N_clusters / distance_threshold

Even though a hierarchical clustering algorithm does not need the number of clusters to be defined in advance, we still need to decide on a number of clusters for data analysis. To do so, we make use of a dendrogram, a tree-like diagram that records the sequence of merges or splits. Whenever two clusters are merged, the dendrogram records the distance between them and represents it graphically (figure below).

To choose clusters, we draw a horizontal line across the dendrogram; we can form any number of clusters depending on where we place this breakpoint. There are no universally agreed criteria for where to cut the tree.
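A dendrogram can be drawn from the SciPy linkage matrix, and cutting it with a horizontal line is equivalent to passing distance_threshold (instead of n_clusters) to AgglomerativeClustering. A minimal sketch (the toy data and the threshold value are assumptions):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Draw the dendrogram and a horizontal cut line.
Z = linkage(X, method='ward')
dendrogram(Z)
plt.axhline(y=10, linestyle='--')   # the clusters are the branches this line crosses
plt.show()

# The scikit-learn equivalent: cut by distance instead of fixing n_clusters.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=10).fit(X)
print(model.n_clusters_)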

K-Means vs Hierarchical

We compare and contrast the K-Means and hierarchical clustering algorithms in terms of efficiency, hyperparameter tuning, variation & optimization, and robustness.

Efficiency

The efficiency of the two algorithms is quite different. The time complexity of the K-Means algorithm is O(n × k × t), where n is the size of the dataset, k is the number of clusters and t is the number of iterations needed for the algorithm to converge. Its space complexity is O(n(d + k)), where d is the number of attributes. Compared to the K-Means algorithm, the hierarchical algorithm is a lot less efficient: its time complexity is O(n³) and it requires O(n²) memory.

Hyperparameter Tuning

Even though there seems to be a huge difference between the two algorithms in terms of time complexity, it does not necessarily mean that the hierarchical algorithm is worse, especially when we come to the hyperparameter-tuning phase. One of the most important hyperparameters for clustering algorithms is the number of clusters (i.e., the value of k). Due to the special structure of the output for hierarchical algorithms, we can store the resulting dendrogram (tree) and reuse it whenever we need it. Even though it takes a long time to construct the tree in the first place, it will be very cheap to update the k value and get the corresponding partition afterwards. We may choose any k value without having to retrain the model, which makes the hyperparameter-tuning process very efficient. The K-Means algorithm, on the other hand, requires us to retrain the model each time we try to switch to a different k value, which makes it important to choose a k value more wisely in the first place as retraining can be very time-consuming when dealing with large datasets.

Variation and Optimization

To improve the adaptability of the K-means algorithm, several variations were developed, such as K-median and K-medoid. The key difference among K-means, K-median and K-medoid comes from the distance metrics. In general, the squared Euclidean distance works well with K-means. When the Taxicab metric is used, go with K-medians. Consider K-medoids when using any other distance metrics.

An optimization technique called K-means++ is usually used to reduce the running time of the K-means algorithm. One major drawback of the K-means algorithm is that badly chosen initial cluster centroids can greatly increase the number of iterations needed for the clusters to converge. The K-means++ algorithm addresses this problem by selecting the initial centroids in a smarter way.

For hierarchical algorithms, optimizations can be added when using the top-down approach. With proper data structures (e.g., a priority queue), the time complexity can be reduced to O(n²). Besides, early stopping can be applied once we reach the desired number of clusters.

Robustness

The performance of both algorithms varies when dealing with datasets of different shapes. K-means generally works well on convex (hyperspherical) cluster structures, while hierarchical algorithms are more robust to non-convex clusters. The K-means algorithm can get stuck in local optima and fail to find the best solution, so it is important to run the algorithm multiple times with random starting points to find a good solution. Besides, due to the randomness in the choice of initial cluster centers, K-means may yield different clustering results on different runs, whereas hierarchical clustering produces the same result when using the same parameter values.

Please see the table below for a summary.

Algorithm Implementation Example

To better understand all the concepts above, we use a customer segmentation example to illustrate how to implement hierarchical clustering in Python Scikit-learn. Customer segmentation is the practice of dividing a company’s customers into groups that reflect similarities among the customers in each group. For the example code, please refer to https://github.com/lsa108/733-blog/blob/8c54d7657d3ee920a3a6c6f0848db3b3c1dff273/kmean%20and%20hierachical%20clustering.ipynb
