Clustering and how to harness the power of KMeans and GMM using sklearn

purici_marius
softplus-publication
10 min read · Sep 8, 2021

Stuff this article aims to cover

  • KMeans
  • Silhouette Score
  • Marketing Segmentation
  • GMM vs KMeans

Introduction

What is clustering? Clustering is a category of unsupervised machine learning models.

What is unsupervised learning then? Unsupervised learning is a class of algorithms that take a dataset of unlabeled examples and for each feature vector x as input either transforms it into another vector or into a value that can be used to solve a practical problem. For example, in clustering this useful value is the id of the cluster.

In other words, clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points. There are a lot of clustering algorithms already implemented in libraries such as scikit-learn, so there is no need to worry about that part. One of the simplest clustering algorithms to understand is KMeans.

KMeans

The KMeans algorithm searches for k number of clusters (the number k must be known in advance) within an unlabeled multidimensional dataset. For the KMeans algorithm the optimal clustering has the following properties:

  • The “cluster center” is the arithmetic mean of all the points belonging to the cluster
  • Each point is closer to its own cluster center than to other cluster centers

KMeans is implemented in sklearn.cluster.KMeans, so let’s generate a two dimensional sample dataset and observe the k-means results.
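To keep things concrete, here is a minimal sketch of how such a dataset could be generated. The specific parameters (300 samples, 3 blob-shaped clusters, cluster_std=0.60, random_state=0) are assumptions for illustration, not necessarily the values used in the original figures.

    # Generate a toy two-dimensional dataset with three blob-shaped clusters
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs

    X, y_true = make_blobs(n_samples=300, centers=3,
                           cluster_std=0.60, random_state=0)

    plt.scatter(X[:, 0], X[:, 1], s=30)
    plt.show()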

Now, let’s apply KMeans on this sample dataset. We will visualize the results by coloring each cluster using a different color. We will also plot the cluster centers.

As you can see the algorithm assigns the points to the clusters very similarly to how we might assign them by eye.

How does the algorithm work? KMeans uses an iterative approach known as expectation–maximization. In the context of the KMeans algorithm, the expectation–maximization consists of the following steps:

  1. Randomly guess some cluster centers
  2. Repeat steps 3 and 4 until converged
  3. E-Step: assign points to the nearest cluster center
  4. M-Step: set the cluster centers to the mean

The KMeans algorithm is simple enough for us to write a really basic implementation of it in just a few lines of code.
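Here is one possible bare-bones version, written only to illustrate the E-M loop above. The function name find_clusters and its defaults are our own choices; a production implementation would add smarter initialization, multiple restarts and empty-cluster handling.

    import numpy as np
    from sklearn.metrics import pairwise_distances_argmin

    def find_clusters(X, n_clusters, rseed=2):
        # 1. Randomly pick some points as the initial cluster centers
        rng = np.random.RandomState(rseed)
        i = rng.permutation(X.shape[0])[:n_clusters]
        centers = X[i]

        while True:
            # E-step: assign each point to the nearest center
            labels = pairwise_distances_argmin(X, centers)

            # M-step: move each center to the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    for j in range(n_clusters)])

            # Stop once the centers no longer move
            if np.all(centers == new_centers):
                break
            centers = new_centers

        return centers, labels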

Well-tested implementations, such as the one from scikit-learn, do a few more things under the hood, but this trivial implementation lets us observe one caveat of expectation-maximization.

Let’s test this implementation on our sample dataset.

Everything seems to be ok, right? Now, let’s try and change the random seed.
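Something along these lines; which particular seed produces a poor result depends on the data, so the value below is only an example.

    # Same data, same algorithm, different initial centers
    centers, labels = find_clusters(X, 3, rseed=0)
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis')
    plt.show()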

What happened??? Well, there are a few issues to be aware of when using the expectation-maximization approach. One of them is that the globally optimal result may not be achieved: although the E–M procedure is guaranteed to improve the result in each step, there is no assurance that it will lead to the globally optimal result. This is why better implementations, such as the one from Scikit-Learn, run the algorithm for multiple starting guesses by default (in Scikit-Learn this is controlled by the n_init parameter, which defaults to 10).

Another caveat of the KMeans algorithm is that it is limited to linear cluster boundaries. The fundamental assumption that points will be closer to their own cluster center than to others means that the algorithm will often be ineffective when dealing with clusters that have complicated geometries. Consider the following example, along with the results found by the typical KMeans approach.
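A classic way to build such a dataset is sklearn.datasets.make_moons; the parameters below are illustrative.

    # Two interleaved half-moons: clusters with non-linear boundaries
    from sklearn.datasets import make_moons

    X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_moons)
    plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels, s=30, cmap='viridis')
    plt.show()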

Fortunately, this issue is easily solved using one version of kernelized KMeans that is implemented in Scikit-Learn within the SpectralClustering estimator.
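A sketch of how that could look on the half-moon data (the affinity and assign_labels values are common choices, not the only possible ones):

    # Spectral clustering builds a nearest-neighbor graph, maps the data to a
    # higher-dimensional representation and then runs KMeans there
    from sklearn.cluster import SpectralClustering

    model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               assign_labels='kmeans')
    labels = model.fit_predict(X_moons)

    plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels, s=30, cmap='viridis')
    plt.show()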

Other problems that you might have to deal with when using the KMeans algorithm are:

  • KMeans might be slow for a large number of samples. Because each iteration of the algorithm must access every point in the dataset, it can become relatively slow as the number of samples grows. One way to mitigate this is to use only a subset of the data to update the cluster centers at each step, which is the idea behind sklearn.cluster.MiniBatchKMeans.
  • The number of clusters must be selected beforehand. This is a common challenge because KMeans cannot learn the number of clusters from the data. For example, we can ask the algorithm to identify 4 clusters in our previous sample dataset and it will proceed without a second thought, as in the snippet below.
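For instance, on the blob data generated earlier:

    # Ask for 4 clusters even though the data was generated with 3 blobs;
    # KMeans will happily split one of them
    labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis')
    plt.show()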

Whether this is a meaningful result or not is a question that might depend on the context. Two well-known approaches that can help when picking the number of clusters are the elbow method (a heuristic approach that we won't discuss in this article) and the Silhouette Score, which we discuss next.

Silhouette Score

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters, and thus provides a way to visually assess parameters like the number of clusters. This measure has a range of [-1, 1].

Silhouette coefficients (as these values are referred to) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1.

In scikit-learn, sklearn.metrics.silhouette_score computes the mean Silhouette Coefficient over all samples, and sklearn.metrics.silhouette_samples computes the Silhouette Coefficient for each individual sample.
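Their basic usage looks like this, continuing with the blob data and a KMeans fit:

    from sklearn.metrics import silhouette_score, silhouette_samples

    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))        # mean coefficient over all samples
    print(silhouette_samples(X, labels)[:5])  # one coefficient per sample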

Now let’s put the newly acquired knowledge in practice and write a function that shows the Silhouette Plot for a certain number of clusters.

As you can see, the best number of clusters judging by the average Silhouette Score is 3. Furthermore, these plots provide more valuable information than you might expect: they show the size of each cluster relative to the others, as well as the approximate Silhouette Coefficient of each sample within its cluster.

Marketing Segmentation

In the context of Marketing Segmentation, also known as Customer Segmentation, Cluster Analysis involves the use of a mathematical model to discover groups of similar customers based on finding the smallest variations among customers within each group. These homogeneous groups are known as customer archetypes or personas.

For the following example, we will use some data from the Kaggle website.

The CSV file can be downloaded from the dataset's page on Kaggle or by using the kaggle CLI.

First of all, let’s import the data using pandas.read_csv and do a little preprocessing.

Note that we standardize the data before applying the clustering algorithm. This is a good practice and although sometimes this step is not required, it rarely hurts to do it.

Now, let’s use the function that we defined previously in order to perform Silhouette Analysis and find the best number of clusters.

Based on these plots, the best number of clusters seems to be 4 because

  • it has the biggest Silhouette Score
  • the clusters are relatively of equal sizes
  • it does not have samples with negative Silhouette Coefficients

Now, let’s go a step further and see what the cluster centers look like for each of the clusters and see if we can divide the customers from this dataset into distinct customer archetypes.

So, based on the cluster centers we can define 4 customer archetypes

  • Younger Males with High Spending Score
  • Older Males with Low Spending Score
  • Younger Females with High Spending Score
  • Older Females with Low Spending Score

GMM vs KMeans

Before diving deeper into the differences between these 2 clustering algorithms, let’s generate some sample data and plot it.
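A sketch of that setup, with parameters chosen so that two of the blobs partially overlap (again, assumptions for illustration):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Four blobs, two of which sit close enough together to overlap slightly
    X, y_true = make_blobs(n_samples=400, centers=4,
                           cluster_std=0.60, random_state=0)

    kmeans = KMeans(n_clusters=4, random_state=0)
    labels = kmeans.fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis')
    plt.show()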

We generated our sample data and applied the KMeans algorithm. Looking at the way each point has been assigned to its cluster, we notice that there seems to be a slight overlap between the two top-right clusters. This observation leads us to expect that the cluster assignments of some points are more certain than those of others. However, KMeans has no measure of probability or uncertainty for its cluster assignments (and, as you probably guessed, GMM does).

We can visualize the way KMeans assigns clusters by placing a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster. Any point of the training set that falls outside this circle is not considered a member of the cluster.
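A possible sketch of that visualization (the plot_kmeans helper and its naming are our own):

    import numpy as np
    from scipy.spatial.distance import cdist

    def plot_kmeans(kmeans, X, ax=None):
        labels = kmeans.fit_predict(X)
        ax = ax or plt.gca()
        ax.axis('equal')
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis', zorder=2)

        # Radius of each cluster = distance from its center to its farthest member
        centers = kmeans.cluster_centers_
        radii = [cdist(X[labels == i], [center]).max()
                 for i, center in enumerate(centers)]
        for c, r in zip(centers, radii):
            ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', alpha=0.4, zorder=1))

    plot_kmeans(KMeans(n_clusters=4, random_state=0), X)
    plt.show()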

Another observation that has to be made is that these cluster models are always circular: KMeans has no built-in way to deal with clusters that have oblong or elliptical shapes.

In short, these are the 2 main disadvantages of KMeans:

  • lack of flexibility in cluster shape
  • lack of probabilistic cluster assignment

A Gaussian mixture model (GMM) attempts to find a mixture of multidimensional Gaussian probability distributions that best model any input dataset. In the simplest case, GMMs can be used for finding clusters in the same manner as KMeans does.

But because GMM contains a probabilistic model under the hood, it is also possible to find the probability that any point in the given dataset belongs to any given cluster.
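With scikit-learn this looks roughly like the following:

    from sklearn.mixture import GaussianMixture

    # Fit a Gaussian mixture with 4 components on the same blob data
    gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
    labels = gmm.predict(X)                # hard assignments, as in KMeans
    probabilities = gmm.predict_proba(X)   # soft, per-cluster probabilities

    print(probabilities[:5].round(3))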

The probabilities variable is a numpy array of shape (n_samples, n_clusters). Each row corresponds to a single data point, and the j-th column holds the probability that the point belongs to the j-th cluster.

Under the hood, the GMM algorithm has something in common with KMeans. They both use the expectation–maximization approach that qualitatively does the following:

  1. Choose starting guesses for the location and shape
  2. Repeat steps 3 and 4 until converged
  3. E-Step: for each point, find weights encoding the probability of membership in each cluster
  4. M-Step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights

As a result of this approach, each cluster is associated with a smooth Gaussian model rather than a hard-edged sphere as in KMeans. In practice, this algorithm also uses multiple random initializations because, just like KMeans, it might miss the globally optimal solution.

The following function will help us visualize the locations and shapes of the GMM clusters.
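The sketch below draws an ellipse for every Gaussian component at one, two and three standard deviations; the helper names draw_ellipse and plot_gmm are our own choices.

    from matplotlib.patches import Ellipse

    def draw_ellipse(position, covariance, ax=None, **kwargs):
        """Draw ellipses for a single Gaussian component."""
        ax = ax or plt.gca()

        # Convert the covariance into ellipse width, height and orientation
        if np.ndim(covariance) == 2:           # full covariance matrix
            U, s, Vt = np.linalg.svd(covariance)
            angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
            width, height = 2 * np.sqrt(s)
        elif np.ndim(covariance) == 1:         # diagonal covariance
            angle = 0
            width, height = 2 * np.sqrt(covariance)
        else:                                  # spherical covariance
            angle = 0
            width = height = 2 * np.sqrt(covariance)

        # Draw the ellipse at 1, 2 and 3 standard deviations
        for nsig in range(1, 4):
            ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                                 angle=angle, **kwargs))

    def plot_gmm(gmm, X, ax=None):
        ax = ax or plt.gca()
        labels = gmm.fit(X).predict(X)
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis', zorder=2)
        ax.axis('equal')

        # More heavily weighted components get more opaque ellipses
        w_factor = 0.2 / gmm.weights_.max()
        for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
            draw_ellipse(pos, covar, ax=ax, alpha=w * w_factor)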

Now let’s use the function to see how the clusters are shaped now on our data.

In order to make this advantage of GMM a little more obvious, we will create a new sample dataset that has a different shape.
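One simple way to get such a shape is to stretch the blob data with a random linear transformation; this particular transformation is an assumption about what the new dataset looks like.

    # Stretch the blobs so the clusters become oblong and no longer axis-aligned
    rng = np.random.RandomState(13)
    X_stretched = np.dot(X, rng.randn(2, 2))

    plt.scatter(X_stretched[:, 0], X_stretched[:, 1], s=30)
    plt.show()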

Now let’s apply both algorithms and see what the resulting clusters look like.

These results make it clear that GMM has a lot more flexibility in cluster shape.

Also, note that for the previous fit the covariance_type hyperparameter was set explicitly to "full". This hyperparameter controls the degrees of freedom in the shape of each cluster.

  • covariance_type="diag" means that the size of the cluster along each dimension can be set independently, with the resulting ellipse constrained to align with the axes.
  • A slightly simpler and faster model is covariance_type="spherical", which constrains the shape of the cluster such that all dimensions are equal. The resulting clustering will have similar characteristics to that of KMeans, though it is not entirely equivalent.
  • A more complicated and computationally expensive model (especially as the number of dimensions grows) is covariance_type="full" (the default for sklearn.mixture.GaussianMixture), which allows each cluster to be modeled as an ellipse with arbitrary orientation.
