A new approach to clustering interpretation

How to cluster population groups and interpret the results

Veronica Nigro
Bricklane Tech
6 min read · Mar 29, 2021


Hello! I’m Veronica from Bricklane’s data team.

In this article I will explain how to interpret clustering results using SHAP value analysis and how Bricklane used this to understand population groups.

Bricklane is a property investment platform that enables investors to efficiently and flexibly access the UK’s largest asset class: residential property. Our business proposition is to construct portfolios of properties from which investors can earn returns through rental income.

Using a data-driven approach, we’ve created a new way to diversify our portfolio and understand the characteristics of the areas we are investing in. This is achieved using clustering algorithms, an efficient and powerful way to understand relationships and groups within your data.

Clustering Algorithms

Clustering is a machine learning technique used to find structures within data without tying the data to a specific outcome. Some applications include the categorisation of customer segments based on their behaviour or, as in our case, the identification of population groups to better understand geographic areas, allowing for better portfolio diversification.

There are many types of clustering algorithms. Most of them use similarity or distance measures between data points in the feature space to discover dense regions of observations. Those distances are then used to group data points with similar traits, assigning them into clusters. I won’t go into further detail as there is a lot of material online on different clustering types and how to apply them using Python.

To address our business problem, I considered a host of features, including industry, age and income. For this exercise I used the K-means algorithm, a model that is efficient, easy to implement and commonly used to find implicit patterns in data when no category label is provided. It particularly suits large datasets whose clusters are roughly spherical in the feature space; for datasets that contain hierarchical relationships, hierarchical clustering works better.
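The sketch below shows what this step might look like with scikit-learn. The DataFrame `df` and its column names are hypothetical stand-ins for our area-level features, not the actual Bricklane dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical area-level features; the real dataset is much wider.
features = ["median_income", "pct_age_16_24", "pct_age_25_34", "pct_education_sector"]

# K-means is distance-based, so standardise the features first.
X = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=5, random_state=42)
df["cluster"] = kmeans.fit_predict(X)
```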

Understanding cluster results

Once the clusters are created, the model outputs a label for each row, representing the cluster to which it belongs. Unfortunately, the clusters are simply represented by a number and there’s no easy way to understand the characteristics of each one. In our case, we know the population has been divided into five groups, but we don’t know whether one group represents students on a minimum wage or high-earning executives.

Some articles on the internet suggest the following solutions to tackle this problem:

  1. Fit a decision tree on the dataset using the cluster labels as the target variable and then plot the tree to inspect the rules of the partitions (a minimal sketch of this approach follows the list). This method, however, can only be used for shallow trees and datasets with only 2 or 3 features, as a more complicated tree would have too many rules and become impossible to interpret.
  2. Fit a logistic regression model to predict the clusters and interpret the coefficients. This solution is suitable for datasets with multiple features and a binary target variable. For multi-class problems, we would have to fit the classification model for each cluster using a one-versus-all approach, making the process quite time-consuming. Moreover, the predictions of the classification model need to be almost perfectly accurate to ensure that the predicted labels actually represent the clusters identified by K-means, so a very simple model might not satisfy this condition.
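Here is what the decision-tree approach looks like in practice, reusing `X`, `features` and the cluster labels from the earlier sketch; it only stays readable while `max_depth` is kept small.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# A shallow tree on the cluster labels: the split rules describe the clusters.
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, df["cluster"])

plt.figure(figsize=(14, 6))
plot_tree(tree, feature_names=features, filled=True)
plt.show()
```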

Since in our case we have a big dataset with many features and multiple clusters, neither approach would work, so I had to come up with a different solution. I decided to fit a CatBoost classification model on the dataset using the cluster labels as the target variable and then performed SHAP value analysis to understand the characteristics of the clusters. Any other sufficiently accurate tree-based classification model would have worked, as long as it is supported by SHAP.
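Fitting the classifier is straightforward once the labels exist. Again, this is a sketch on the hypothetical variables from above, not the production code:

```python
from catboost import CatBoostClassifier

# Train a multi-class model to predict the K-means cluster of each area.
model = CatBoostClassifier(loss_function="MultiClass", verbose=False)
model.fit(X, df["cluster"])
```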

SHAP (SHapley Additive exPlanations) is a method to explain individual predictions by computing the contribution of each feature to the prediction. It ranks variables by feature importance and shows the relationship between the value of a feature and the impact on the prediction of the clusters.

The contribution of a feature is calculated by considering all possible orderings in which features can be added to the model, measuring the marginal contribution the feature makes in each ordering, and then averaging these contributions to get the feature’s final contribution. In other words, a feature’s contribution is the difference it makes, on average, to the final predicted value.
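For readers who want the formal definition, this is the standard Shapley value formula from cooperative game theory (nothing here is specific to our setup): the contribution of feature i is

```latex
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}
         \left[ f_{S \cup \{i\}}\!\left(x_{S \cup \{i\}}\right) - f_S\!\left(x_S\right) \right]
```

where F is the full feature set and f_S denotes the model evaluated using only the features in S.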

After fitting the classification model, the effects of the features on the prediction of each cluster can be visualised using a summary plot. A summary plot combines feature importance with feature effects. Each point on the summary plot is a Shapley value for a feature and an instance. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value, which, in our case, represents the effect on each predicted label. The colour represents the value of the feature from low to high. Overlapping points are jittered in the y-axis direction, so we get a sense of the distribution of the Shapley values per feature. The features are ordered according to their importance.
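With the `shap` library this takes a couple of lines, continuing from the classifier sketch above. Note that, depending on the shap version, the multi-class output is either a list of per-class arrays or a single 3D array, so the indexing may need adjusting:

```python
import shap

# TreeExplainer supports CatBoost and other tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one set of SHAP values per cluster

# Summary plot for a single cluster (here cluster 0), top 12 features.
shap.summary_plot(shap_values[0], X, feature_names=features, max_display=12)
```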

The characteristics of the clusters can be identified by looking at the top features that have a positive impact on each predicted label. The example below shows the summary plot of one of the clusters for its top 12 features. We can see that areas with a high number of people aged 25 to 34 and high percentages of people aged 16 to 24 who work in education, accommodation and food services have the biggest positive impact on defining the cluster. We can then deduce that these are the most prevalent characteristics of the areas belonging to this cluster.

The SHAP summary plot ranks variables by feature importance and shows their effect on the predicted variable (cluster). The colour represents the value of the feature from low (blue) to high (red). This cluster, for example, is defined by high percentages of young people working in education, accommodation and food services.

I then visualised the summary plots of the other clusters and assigned them more descriptive names based on their characteristics:

  • Pensioners — older population working in public administration and defence, manufacturing and construction.
  • High Earning Young Professionals — young professionals on a high annual income mostly working in information and communication, finance and insurance.
  • Wealthy Families — families on a high annual income working in information and communication, finance, insurance and education.
  • Urban Youth — younger adults working in education, accommodation and food services, generally on a lower income.
  • Lower Income Traditional Sectors — families on lower incomes working in transport, wholesale, retail and manufacturing.

Mapping results

With the clusters named, I plotted them on a map to better understand where the groups are distributed. The clustering model is able to accurately identify cities and area dynamics, like city centres, suburbs and pensioner getaways.
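A minimal sketch of the mapping step, assuming `gdf` is a GeoDataFrame of area boundaries that has already been joined with the cluster labels (geopandas is one option; any other mapping library would do):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Colour each area by its (categorical) cluster label.
ax = gdf.plot(column="cluster", categorical=True, legend=True, figsize=(10, 10))
ax.set_axis_off()
plt.show()
```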

The centre of London, unsurprisingly, is characterised by high-earning young professional households. In the outskirts we see a majority of wealthy families, in areas that offer larger housing for raising a family while remaining within commuting distance. By contrast, the centres of Liverpool, Manchester and Birmingham turn out to be a magnet for students and young professionals, drawn by a combination of access to jobs, leisure facilities and cultural pursuits.

Finally, the south-west of Manchester, with its mixture of excellent education facilities, low crime rates and pleasant surroundings, is more attractive to wealthy families, whereas the outskirts of Birmingham are populated by more traditional-sector workers, reflecting the large number of factories in the Midlands.

Clusters map of different cities in the UK. The clustering model is able to identify cities and area dynamics, like city centres, suburbs and pensioner getaways.

Conclusion

Clustering is an effective and efficient way to understand groups in your data. Coupled with modern machine learning interpretability methods, it becomes a powerful tool in a data scientist’s toolbox.

Enjoyed this article? Don’t forget to subscribe for future updates!
