Clustering: Extracting Patterns From Data & Metrics + Cluster Validation [Pt.2]
Customer segmentation based on credit card usage behavior
In part one, we covered the core concepts and outlined what we are going to do in the project. In part two, as promised, we will discuss metrics and cluster validation.
Source Code
Notebook on Kaggle:
Notebook on GitHub:
Table of contents
- Introduction
- Metrics
. . . Silhouette
. . . Davies-Bouldin
. . . Calinski-Harabasz
- Find the Ideal K Number of Clusters
. . . Elbow Method
. . . Alternative Methods
- Cluster Significance
- Conclusion
Introduction
Before continuing the project, we need to be able to answer questions like: do the clusters represent the dataset well? What is the ideal number of clusters K? How do we know whether the data is even suitable for clustering? Without answers to these questions, all the work would be in vain.
For that reason, several metrics are used to evaluate a clustering: through them we can measure the similarity within a cluster and its separation from the other clusters. Learning how to analyze these metrics is the main goal of this article.
Metrics
Silhouette
The first metric I want to show is the most commonly used, and it is very simple to interpret. The silhouette value measures how similar a point is to its own cluster (cohesion) compared to the other clusters (separation). To understand its equation, I will first define what each variable means.
First, I would like to explain a, which is the mean distance between the point and all the other points of the same cluster. It measures the compactness of the cluster: the smaller the value, the more compact the cluster.
Let i be a data point in the cluster CI; |CI| be the number of points belonging to that cluster; and d(i, j) be the distance between the points i and j. We sum the distances from i to all the other points of the same cluster, then divide by the number of points in that cluster minus one (we divide by |CI| − 1 because the distance d(i, i) is not included in the sum).
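In symbols, this is the standard definition:

$$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\; j \neq i} d(i, j)$$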
Now I will explain b, which is the mean distance between the point and all the points of the nearest neighboring cluster. It measures the separation of the clusters: the bigger the value, the more separated the clusters.
Let CJ be the nearest neighboring cluster of the point i (hence the min operator in the formula): for every cluster other than CI, we sum the distances from i to all of that cluster's points, divide by the number of points in it, and keep the smallest of these means.
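And the standard definition of b, in symbols:

$$b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$$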
We can now define the silhouette value of a single data point i:
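$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

(This is defined whenever the cluster of i has more than one point; for a cluster with a single point, s(i) is conventionally set to 0.)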
The silhouette ranges from −1 to +1: a value near 1 means the clusters are well apart from each other and clearly distinguished; a value near 0 means the clusters are indifferent, i.e., the distance between them is not significant; a value near −1 generally indicates that data points of one cluster lie inside a different cluster, which is the worst result we can have.
The logic behind max(a, b) is this: if the clusters are well apart from each other and clearly distinguished, a will be very small and b much bigger, so b − a will be a number close to b, and b will be selected as the denominator by max(a, b). Dividing a number close to b by b results in a number close to 1.
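As a quick sanity check with made-up numbers: if a point has a = 0.5 and b = 5, then

$$s = \frac{5 - 0.5}{\max\{0.5,\, 5\}} = \frac{4.5}{5} = 0.9,$$

which is close to 1, as expected for well-separated clusters.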
Davies-Bouldin
To understand the Davies-Bouldin metric, we first need to understand what Ri,j means, which is given by the equation:
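$$R_{i,j} = \frac{S_i + S_j}{D_{i,j}}$$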
Where Si is the mean distance between every point in cluster Ci and the centroid of cluster Ci. It measures the compactness of the cluster, so we want this variable to be near 0.
Di,j is the distance between the centroids of the clusters Ci and Cj. It measures the separation of the clusters, so we want this variable to be as large as possible.
Remember that i needs to be different from j; otherwise, if they were equal, the distance Di,j would be 0. The equation above measures the similarity between clusters Ci and Cj, and it is minimized when the clusters are more distant from each other and more compact; that is, the closer Ri,j is to 0, the better separated (less similar) the two clusters are.
The final Davies-Bouldin score is then given by:
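$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} R_{i,j}$$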
Where K is the number of clusters.
The logic behind max(Ri,j) is this: suppose we ran K-Means with 5 clusters and calculated Ri,j between every pair of clusters, storing all 10 values in a list (10 because we compute Ri,j for clusters 1 and 2, 1 and 3, 1 and 4, 1 and 5, 2 and 3, and so on for every pair). Since our objective is for the Ri,j values to be as close to 0 as possible, for each of the clusters 1 to 5 we keep only its largest Ri,j (its worst pairing) and average these worst cases: if the mean of the worst results is still good, we can assume the clustering as a whole is very good.
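To make the "mean of the worst pairs" idea concrete, here is a minimal sketch on a made-up blob dataset (not the credit card data) that computes the score by hand and checks it against scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Made-up blob data, purely for illustration
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

K = labels.max() + 1
centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
# S[k]: mean distance from the points of cluster k to its centroid (compactness)
S = np.array([np.linalg.norm(X[labels == k] - centroids[k], axis=1).mean() for k in range(K)])

worst_R = []
for i in range(K):
    # R_{i,j} for every other cluster j; keep only the largest (worst) value
    R = [(S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
         for j in range(K) if j != i]
    worst_R.append(max(R))

print(np.mean(worst_R))                 # Davies-Bouldin computed by hand
print(davies_bouldin_score(X, labels))  # should give the same value
```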
Therefore, we want the DB score to be as close to 0 as possible.
Calinski-Harabasz
Just by looking at its equation (the full formula appears a little further down), I can already imagine people thinking of skipping this metric; however, it is very simple to understand, we just need to take it piece by piece.
Let's start with the right part of the equation, the factor (N − K) / (K − 1), where N is the total number of data points in the dataset and K is the number of clusters.
Now we can go to the SSw:
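$$SS_w = \sum_{i=1}^{K} \sum_{X \in C_i} \lVert X - M_i \rVert^2$$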
This is the overall within-cluster variance: the sum of the squared distances from each data point X to the centroid Mi of the cluster it belongs to. It measures the dispersion within the clusters, so we want this value to be as close to 0 as possible.
Another required calculation is SSb, the overall between-cluster variance, which is the difference between the sum of the squared distances from each data point X to the dataset centroid and the SSw. The dataset centroid, or dataset center, is found by averaging each feature over all the data points. SSb measures the dispersion between clusters, so we want this value to be as large as possible.
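In symbols, with M denoting the dataset centroid:

$$SS_b = \sum_{X} \lVert X - M \rVert^2 - SS_w$$

(This is equivalent to summing, over the clusters, the cluster size times the squared distance from its centroid Mi to the dataset centroid M.)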
Ultimately, this is the equation for the Calinski-Harabasz value:
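$$CH = \frac{SS_b}{SS_w} \times \frac{N - K}{K - 1}$$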
Since N and K are constants, if SSb is larger and SSw smaller (which is the objective), the result will be large. Unlike the other metrics, the higher the value, the better the clustering; very small values indicate low-quality clusters according to this metric.
Now that we have learned the three most common metrics, let's see how to apply them in practice.
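All three scores are available in scikit-learn, so a return_metrics helper like the one used later in this article might look roughly like this (a minimal sketch; the exact implementation in the notebook may differ):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def return_metrics(data, n_clusters, random_state=42):
    """Fit K-Means with n_clusters and return the three cluster validation scores."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(data)
    return {
        "silhouette": silhouette_score(data, labels),               # closer to 1 is better
        "davies_bouldin": davies_bouldin_score(data, labels),       # closer to 0 is better
        "calinski_harabasz": calinski_harabasz_score(data, labels), # larger is better
    }
```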
Find the Ideal K Number of Clusters
The K-Means algorithm requires that you specify the number of clusters, K. Sometimes the number of clusters can be intuitive, or driven by the application. For instance, a company might want to cluster its customers according to "personas", so in this case, managerial considerations would dictate the number of desired customer segments. Two might not yield good differentiation of the customers, while nine might be too many to manage.
In the absence of an obvious choice of K, a statistical approach can be used. There are many methods to find the ideal number; a common one is the Elbow Method.
Elbow Method
The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.
When I say explained variation, I am referring to the WCSS (Within-Cluster Sum of Squares), which is the sum of the squared distances between the points in a cluster and the cluster centroid. In scikit-learn, we can get this value from the fitted model's inertia_ attribute.
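Computing and plotting the WCSS could look roughly like this (a sketch assuming the preprocessed credit card features are in a variable called data):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# data: the preprocessed credit card dataset (assumed)
k_range = range(2, 31)
wcss = []
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(list(k_range), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS (inertia_)")
plt.show()
```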
Let's look at this graph for our dataset, for K ranging from 2 to 30:
As most real-world datasets don't have a clear elbow point, we have to rely on other methods for our analysis. Still, the lack of a well-defined elbow does tell us something about the data: it does not have well-separated clusters.
We can even use libraries that automatically select the elbow point for us, although here the point is not very pronounced:
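One such library is kneed (this is just one possible choice, not necessarily the one used in the notebook); a sketch reusing the k_range and wcss values from the snippet above:

```python
from kneed import KneeLocator

# k_range and wcss come from the elbow sketch above
knee = KneeLocator(list(k_range), wcss, curve="convex", direction="decreasing")
print(knee.elbow)  # the K at the detected elbow point
```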
According to the library, the “best” K number is 5.
Alternative Methods
Since we learned the metrics earlier, we can use them to identify the ideal number of clusters. To do that, I will first compute the Silhouette, Davies-Bouldin, and Calinski-Harabasz scores for K from 2 to 30.
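A sketch of that loop, reusing the return_metrics helper sketched earlier and again assuming the preprocessed features are in data:

```python
import pandas as pd

# data: the preprocessed credit card dataset; return_metrics: the helper sketched earlier
rows = []
for k in range(2, 31):
    rows.append({"k": k, **return_metrics(data, k)})

scores = pd.DataFrame(rows).set_index("k")
print(scores)
```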
First of all, I will analyze the Silhouette score:
The objective for the Silhouette score is to be as close to 1 as possible, and the values closest to 1 occur at K = 5 and K = 6. But I'm still in doubt about which number to choose, so I will make another silhouette plot for K from 4 to 7.
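One way to generate this kind of per-cluster silhouette plot (not necessarily the exact code behind the figures here) is Yellowbrick's SilhouetteVisualizer, again assuming the preprocessed features are in data:

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer

# data: the preprocessed credit card dataset (assumed)
for k in range(4, 8):
    viz = SilhouetteVisualizer(KMeans(n_clusters=k, n_init=10, random_state=42))
    viz.fit(data)  # fits K-Means and computes the per-point silhouette values
    viz.show()     # one plot per K, with a dashed line at the average score
```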
Each color in each graph represents a cluster, the thickness of each colored band represents the size of that cluster, and the dashed red line marks the average Silhouette score for that K.
For an ideal K, every cluster should reach a Silhouette score above the average Silhouette score, and for K from 4 to 7 this condition is satisfied. Another thing to take into consideration is the fluctuation in cluster sizes, and I can see that the cluster sizes fluctuate more for K from 5 to 7.
Before deciding on the K number, let’s see what the other metrics can tell us.
According to the Davies-Bouldin metric, where the objective is to be as close to 0 as possible, K = 5 and K = 6 again seem to be the best options.
And according to the Calinski-Harabasz metric, where the objective is to be as large as possible, the conclusion is the same.
After this analysis, I decided to use K = 5 clusters for our clustering.
Cluster Significance
Now that we have decided how many clusters we want, let's see if the clusters are valid. One way of doing this is to generate a random dataset with the same size as the credit card dataset and then compare the metric values of the two datasets.
First I generated a random dataset with the same shape as the credit card dataset, then I computed its Silhouette, Davies-Bouldin, and Calinski-Harabasz scores using the return_metrics function that we created earlier.
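A sketch of how this comparison might be done, reusing the return_metrics helper sketched earlier (uniform noise is just one reasonable choice of random data):

```python
import numpy as np

X = np.asarray(data)  # data: the preprocessed credit card dataset (assumed)
rng = np.random.default_rng(42)
# Uniform noise with the same shape and per-feature range as the real (scaled) data
random_data = rng.uniform(low=X.min(axis=0), high=X.max(axis=0), size=X.shape)

print(return_metrics(X, 5))            # credit card dataset, K = 5
print(return_metrics(random_data, 5))  # random dataset with the same shape, K = 5
```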
Comparing the two datasets, we see that the credit card dataset achieved a better score on every metric. This is how we can tell whether our data has meaningful cluster structure or not.
Conclusion
In the second part, we learned three standard metrics for clustering evaluation, and understood how to find the ideal K number of clusters when the Elbow Method doesn’t provide a clear answer.
In the next part, we will analyze the clustering itself, identifying the peculiarities of each cluster and defining effective marketing suggestions. Unlike part two, the next part won't have many math equations; however, it will involve a lot of interpretation of descriptive statistics and analysis of visualizations.