Unlocking the Power of Clustering: A Beginner’s Guide

Clustering is an unsupervised machine learning technique that involves dividing a set of unlabeled samples into groups, or clusters, based on their similarity.

Dr Barak Or
MetaOr Artificial Intelligence
10 min read · Jan 16, 2023


Introduction

Clustering is a way to group together data points that are similar to each other. It can be used for exploring data, finding anomalies, and extracting features, although it can be challenging to know how many groups to create. There are two main approaches: hard clustering, in which each data point belongs entirely to one group, and soft clustering, in which each data point has a probability of belonging to each group. Clustering is a useful machine learning technique for organizing data and finding patterns in it.


There are many examples of clustering in different fields. Some examples include:

  1. Customer segmentation: In marketing, clustering can be used to group customers based on their characteristics or behavior. This can help companies to understand their customers better and create targeted marketing campaigns. Later on, we will see an example on a real dataset.
  2. Image segmentation: In computer vision, clustering can be used to divide an image into different segments, each of which corresponds to a different object or background. This can help with tasks such as object recognition or image labeling.
  3. Biology data: Clustering can be used to analyze large datasets in biology, such as gene expression data or protein interaction data. This can help to identify patterns in the data and make predictions about biological processes.
  4. Document organization: Clustering can be used to group documents based on tags, topics, or content. This can be useful for organizing a large collection of documents and making them easier to search through.

Overall, clustering is a powerful technique that is used in a variety of applications to group data points based on their similarity.

Clustering Metrics

In this section, we will present three ways to evaluate cluster quality: the Silhouette, Dunn, and Davies-Bouldin indices. Remember, this is unsupervised learning: the labels are unavailable, so we cannot calculate accuracy, precision, and similar supervised metrics.

Silhouette

Silhouette is a measure of how well a sample fits within its assigned cluster. It is calculated by comparing the average distance of a sample to all other points in its own cluster to the average distance of the sample to points in the nearest cluster. The silhouette score is a value between -1 and 1, where a score close to 1 indicates a good fit within the cluster, a score close to 0 indicates a borderline fit between two clusters, and a score close to -1 indicates a poor fit within the cluster. It is important to note that calculating the silhouette score for all samples in a dataset can be computationally expensive, so it is often calculated for a random subset of the data instead.

Image by author

In the image above, a(i) is the average distance from point i to all other points in its own cluster, and b(i) is the average distance from i to all points in the nearest neighboring cluster. The metric is formulated as s(i) = (b(i) - a(i)) / max(a(i), b(i)).
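
As a quick illustration (assuming X holds the feature matrix and pred the cluster labels produced by one of the models later in this post), scikit-learn provides silhouette_score, which can also be computed on a random subsample to save time:

from sklearn.metrics import silhouette_score

score = silhouette_score(X, pred)  # full dataset
score_subset = silhouette_score(X, pred, sample_size=1000, random_state=0)  # random subset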

Dunn Index

The Dunn Index is another measure of the quality of clusters in a dataset. It is calculated as the ratio of the distance between the nearest pair of clusters to the maximum distance of any point from its own cluster centroid. A higher Dunn Index value indicates that the clusters are dense and well-separated, while a lower value indicates that the clusters are less distinct. In general, a higher Dunn Index is desired, as it indicates a more effective clustering of the data.

Image by author
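
scikit-learn does not ship a Dunn Index, so here is a minimal NumPy sketch of the centroid-based variant described above (the function name is an assumption; other Dunn variants use pairwise point distances instead of centroids):

import numpy as np

def dunn_index(X, labels):
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Numerator: smallest distance between any pair of cluster centroids
    inter = min(np.linalg.norm(centroids[i] - centroids[j])
                for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
    # Denominator: largest distance from a point to its own cluster centroid
    intra = max(np.linalg.norm(X[labels == c] - centroids[k], axis=1).max()
                for k, c in enumerate(clusters))
    return inter / intra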

Davies-Bouldin Index

The last measure we cover in this post is the Davies-Bouldin Index. It is calculated by taking the average similarity of each cluster to its most similar cluster, where similarity is defined as the ratio of within-cluster distances to between-cluster distances. The minimum Davies-Bouldin Index score is zero, and lower values indicate better clustering. The Davies-Bouldin Index takes into account the dispersion and separation of all clusters, whereas the Dunn Index only considers the worst cases in the clustering. In general, a lower Davies-Bouldin Index is desired, as it indicates a more effective clustering of the data.
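
scikit-learn also implements this metric; assuming the same X and pred as in the silhouette example above:

from sklearn.metrics import davies_bouldin_score

score = davies_bouldin_score(X, pred)  # lower values indicate better clustering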

Clustering Models

We’ll start with the hard clustering algorithms k-means and k-medoids, move on to soft clustering with fuzzy C-means (FCM) and the Gaussian mixture model (GMM), and finish with a discussion of hierarchical clustering.

k-means

K-means clustering is a popular method for dividing a dataset into a specified number of distinct groups, or clusters, based on the patterns and relationships within the data. The number of clusters, k, is an important factor that must be determined before applying the k-means algorithm. The choice of k can significantly impact the resulting clusters and the ability of the algorithm to accurately identify patterns and relationships in the data, so it is important to consider the appropriate value of k carefully for a given dataset. It is also worth noting that k-means is sensitive to the random initialization of the centroids, which can affect the resulting clusters. As a result, it is often necessary to experiment with different values of k (and different initializations) in order to find the optimal clustering for a given dataset.

The algorithm works by first randomly selecting k initial centroids in the data space. Each data point is then assigned to the group with the closest centroid. The positions of the centroids are then recalculated based on the data points in their respective groups. This process is repeated until the centroids no longer move.

This algorithm is useful for identifying patterns and relationships in data and can be applied to a wide range of applications.

Here is a short code snippet (using the sklearn library) to implement k-means. As you can notice, we need to provide the number of clusters (k) on the second line. This is the key hyper-parameter: if we start playing with it, we will get different outcomes. Hence, prior knowledge about this parameter is key to solving many problems with k-means.

from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=k)  # k = chosen number of clusters
clustering.fit(X)                  # Train the model
pred = clustering.predict(X)       # Assign each sample to a cluster
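
Because the outcome depends so strongly on k, a common sanity check is to sweep a few values and compare a metric such as the silhouette score from the metrics section above (a minimal sketch, reusing X from the snippet above):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # pick the k with the best score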

What will happen in the presence of noise or deviations in the data? This is where k-medoids may be useful.

k-medoids

K-medoids is the second hard clustering algorithm we will review. It is similar to k-means, but rather than using centroids to represent the groups, k-medoids use actual data points as the centers of the clusters. This makes k-medoids more resistant to the influence of noise and outliers, as the cluster centers are less likely to be affected by these types of data points.

from sklearn_extra.cluster import KMedoids
clustering = KMedoids(n_clusters=k)  # k = chosen number of clusters
clustering.fit(X)                    # Train the model
pred = clustering.predict(X)         # Assign each sample to a cluster

Can an example be a part of multiple clusters?

Fuzzy C-means (FCM)

Fuzzy C-means is a method of soft clustering. It allows examples to be assigned to multiple clusters, with different levels of membership (probabilities). It can handle nonlinearly separable datasets as well as outliers and noise. We need to be aware that the FCM algorithm can be computationally more complex than k-means, as it needs to determine the level of membership of every data point to each cluster at every step of the iteration.

FCM algorithm starts by setting the initial cluster centers and determining the level of membership for each data point to each cluster. The algorithm then repeatedly updates the level of membership for each point based on the distance to each cluster center and updates the cluster centers based on the level of membership of each point until the cluster centers and the membership values reach a stable state. The final output is the set of cluster centers and the degree of membership of each point to each cluster.
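
FCM is not part of scikit-learn. As a rough illustration of the loop described above, here is a minimal NumPy sketch (the function name, defaults, and the fuzziness parameter m are assumptions):

import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial membership matrix; each row sums to 1
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Update cluster centers as membership-weighted means
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance from every sample to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)  # avoid division by zero
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.linalg.norm(U_new - U) < tol:
            U = U_new
            break
        U = U_new
    return centers, U  # cluster centers and degrees of membership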

In the picture below, we can see the level of membership for each intensity value.

Image from Wiki

Gaussian Mixture Model (GMM)

GMM is a soft clustering method, like FCM, but it takes a model-based approach: it assumes that the samples are generated by a mixture of a finite number of distributions, typically Gaussians with unknown parameters (means and variances). However, it is also possible to use other distributions in this framework.

The GMM algorithm starts by making an initial guess of the means and variances. Then, it calculates the probability for each sample based on the initial guess. The algorithm then updates the means and variances using the calculated probabilities. The process is repeated until the means and variances reach a stable state.

from sklearn import mixture
clustering = mixture.GaussianMixture(n_components=k)  # k Gaussian components
clustering.fit(X)             # Train the model
pred = clustering.predict(X)  # Use the model

Can we avoid specifying the number of clusters in advance?

Hierarchical Clustering

Hierarchical Clustering is a method of grouping data points based on their similarities. It creates a hierarchy of clusters by either merging or dividing the existing ones. There are two types of Hierarchical Clustering: Agglomerative (bottom-up) and Divisive (top-down). Merges and splits are performed in a greedy manner, and the final output is presented as a dendrogram.

The DIANA (DIvisive ANAlysis) algorithm is one popular divisive method, in which data points are grouped by repeatedly splitting one large cluster into smaller ones. The algorithm starts by placing all data in the same cluster and uses heuristics to determine the best way to split it. The object with the highest average dissimilarity to the rest is chosen as the seed of a new cluster, and the objects that are more similar to it are grouped with it. The process continues until each data point ends up in its own cluster.

In the figure below, a dendrogram of the well-known Iris dataset is shown. The three colors correspond to the three clusters (flower species).

Image from wiki

Hierarchical Clustering is a powerful method that has several advantages over k-means. It can handle any number of clusters, is robust to noise and outliers, helps determine the number of clusters, and doesn’t assume spherical clusters. It also provides a clear visual representation of the relationships between clusters in the form of a dendrogram.

from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering(n_clusters=3).fit(X)  # e.g., three clusters as in Iris
pred = clustering.labels_  # cluster assignments (AgglomerativeClustering has no predict method)
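
AgglomerativeClustering returns flat labels rather than a dendrogram; to draw one like the Iris figure above, SciPy's hierarchy module can be used (a minimal sketch, assuming the Iris dataset from scikit-learn):

from scipy.cluster import hierarchy
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

X_iris = load_iris().data
Z = hierarchy.linkage(X_iris, method="ward")  # bottom-up merge tree
hierarchy.dendrogram(Z)                       # plot the hierarchy
plt.show()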

Example of Customer Segmentation with K-means

We will use the Mall Customers dataset, available on Kaggle. It has 200 examples, and we will use the following three features, as presented in the image below: age, annual income, and spending score. (The original dataset also includes customer ID and gender features.)

Image by author

By fitting/training k-means with 4 clusters (k=4), each pair of features can be visualized as follows:

Spending score vs. Age for k=2,4, and 10 (Image by Author)
Spending score vs. Annual Income for k=2,4, and 10 (Image by Author)
Examples with all three features clustered into k=4 clusters

We can see that the number of clusters heavily affects the separation of the examples: k=2 seems too low and k=10 too high, while k=4 is reasonable (at least visually).
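
For reference, the fit behind these figures looks roughly like this (the file and column names are assumptions based on the Kaggle Mall Customers dataset):

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")
X = df[["Age", "Annual Income (k$)", "Spending Score (1-100)"]]  # the three features
df["cluster"] = KMeans(n_clusters=4, random_state=0).fit_predict(X)  # k=4 assignments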

Summary

We discussed the two main ways of clustering, hard and soft, and provided examples of clustering in different fields such as customer segmentation, where we explored the number of clusters in a numerical example. We reviewed three evaluation metrics for clustering: the Silhouette, Dunn, and Davies-Bouldin indices. We reviewed several clustering algorithms, including k-means, k-medoids, FCM, GMM, and hierarchical clustering. Finally, we concluded that clustering is a powerful technique that helps to organize data and find patterns in it.

Follow me on Medium for more posts.

About the Author

Dr. Barak Or is a professional in the field of artificial intelligence and sensor fusion. He is a researcher, lecturer, and entrepreneur who has published numerous patents and articles in professional journals. Dr. Or leads the MetaOr Artificial Intelligence firm. He founded ALMA Tech. LTD, which holds patents in the field of AI and navigation. He has worked with Qualcomm as a DSP and machine learning algorithms expert. He completed his Ph.D. in machine learning for sensor fusion at the University of Haifa, Israel. He holds M.Sc. (2018) and B.Sc. (2016) degrees in Aerospace Engineering and a B.A. in Economics and Management (2016, Cum Laude) from the Technion, Israel Institute of Technology. He has received several prizes and research grants from the Israel Innovation Authority, the Israeli Ministry of Defence, and the Israeli Ministry of Economy and Industry. In 2021, he was nominated by the Technion for “graduate achievements” in the field of High-tech.

Website www.metaor.ai Linkedin www.linkedin.com/in/barakor/ YouTube www.youtube.com/channel/UCYDidZ8GUzUy_tYtxvVjRiQ

More resources:

[1] A great notebook with 10 different clustering models, including GMM and DBSCAN, by Abdul Meral on Kaggle: https://www.kaggle.com/code/abdulmeral/10-models-for-clustering

[2] Nice post that explains Mean-Shift Clustering, DBSCAN, and more by George Seif: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68

[3] Great post on Spectral Clustering by William Fleshman: https://medium.com/towards-data-science/spectral-clustering-aba2640c0d5b

[4] Impressive post on How to Apply K-means Clustering to Time Series Data by Alexandra Amidon
