
The Ultimate Guide for Clustering Mixed Data

Eoghan Keany
Analytics Vidhya
12 min read · Nov 1, 2021


Clustering is an unsupervised machine learning technique used to group unlabeled data into clusters. These clusters are constructed to contain data points that are ‘similar’ to each other and ‘dissimilar’ from data points assigned to other clusters. Every clustering algorithm organizes data points into clusters using some sort of ‘similarity’ measure. This similarity measure is typically calculated using Euclidean distance for numerical data, whereas the Jaccard distance is used for categorical data.

Unfortunately, the majority of clustering algorithms can only work with data that exclusively contains either numeric or categorical features. This is a huge problem, as most real world datasets contain multiple types of features. This article provides pythonic implementations of different methods that can be used for distance-based clustering with mixed data. The workflow for this article was inspired by the paper “Distance-based clustering of mixed data” by M. van de Velden et al., which can be found here. These methods are as follows:

  1. Gower Distance.
  2. Dimensionality Reduction Techniques.
  3. Recode Categorical Variables.
  4. Recode Continuous Variables.
  5. K-prototypes.

All of the code used to create this article can be found on GitHub.

Datasets

All of the methods below were applied to three open-source datasets. These datasets vary in the proportion of numeric and categorical variables they contain, which provides a more rigorous evaluation of the distance-based clustering methods.

The Cleveland Heart Disease dataset from the UCI machine learning repository contains 13 variables in total, 5 numeric and 8 categorical. These 13 medical variables were gathered on 303 patients in order to diagnose whether or not a patient has heart disease. For this analysis, any patient with missing records was removed from the dataset, leaving 297 patients for the actual analysis. The heart disease diagnosis was also used as our target variable to evaluate the performance of the clustering algorithms, so the number of clusters for this dataset was set to 2.

The Kaggle FIFA 19 Complete Player dataset is publicly available and contains information on approximately 7,000 players from 42 different soccer leagues. In order to speed up computation time, a random subset of 2,500 players was taken, along with 43 continuous variables and 4 categorical variables that describe an individual player’s style of play. The player’s position (goalkeeper, striker, midfielder or defender) was used as a target variable to validate the performance of the clustering algorithms.

The Banking-Marketing dataset includes information relating to the direct marketing campaigns of a Portuguese banking institution. The dataset contains approximately 50,000 rows and 18 columns; due to the class imbalance, it was randomly under-sampled, leaving 8,512 rows for the actual analysis. From a preliminary analysis, 3 redundant columns were dropped due to their lack of correlation with the target variable. The external variable used to validate the clustering solution was whether or not a client had subscribed to a term deposit. For this dataset the number of clusters was set to 2.

Evaluation Metrics

Evaluating any clustering algorithm is a complex task, as deciding on a definition of success is completely context- and problem-specific. This is because the goal of your unsupervised analysis will vary based on your problem requirements. Despite this ambiguity, the ‘Adjusted Rand Index’ was used to measure the similarity between the obtained clusters and our ground truth values. The similarity of the clusters produced by the different algorithms was also measured using the ‘Adjusted Rand Index’. For fairness of evaluation, all of the methods applied use a version of the K-Means++ algorithm to create their optimal clustering solutions, and the number of clusters (k) is prior knowledge that was held constant for each method.
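As a quick illustration of the metric (the article does not name a particular implementation; scikit-learn’s adjusted_rand_score is used here and the label lists are made up):

from sklearn.metrics import adjusted_rand_score

true_labels    = [0, 0, 1, 1, 1, 0]   # hypothetical ground-truth target values
cluster_labels = [1, 1, 0, 0, 0, 1]   # hypothetical labels produced by a clustering algorithm

# 1.0 means the partitions match perfectly (up to label permutation),
# values near 0 indicate an essentially random labelling
print(adjusted_rand_score(true_labels, cluster_labels))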

1. Gower Distance

Gower’s distance is a metric used to measure the similarity between two data points that contain both numeric and categorical variables. It works by applying a different measure of similarity for each data type. The similarity scores for each data type are then combined to create an overall similarity score. The measures used for each data type are as follows:

  • Numerical Variables: a normalized Manhattan distance
  • Categorical Variables: the variables are first one hot encoded and then the Jaccard distance is applied.
  • Ordinal Variables: the variables are first sorted, then the Manhattan distance is applied with an adjustment for ties.

The gist below includes a toy python function that will create a Gower distance matrix from a pandas dataframe. This code will hopefully help you understand what is happening under the hood.
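The original gist is not reproduced here, so the following is a minimal, unoptimised sketch of the same idea (the function name and the simple 0/1 mismatch used for categorical columns are my own simplifications):

import numpy as np
import pandas as pd

def gower_matrix_toy(df: pd.DataFrame) -> np.ndarray:
    """Average a per-feature distance over all columns:
    range-normalised absolute difference for numeric columns,
    simple mismatch (0/1) for categorical columns."""
    n = len(df)
    total = np.zeros((n, n))
    for col in df.columns:
        values = df[col].to_numpy()
        if pd.api.types.is_numeric_dtype(df[col]):
            col_range = values.max() - values.min()
            col_range = col_range if col_range > 0 else 1.0
            d = np.abs(values[:, None] - values[None, :]) / col_range
        else:
            d = (values[:, None] != values[None, :]).astype(float)
        total += d
    return total / df.shape[1]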

However, I would recommend using the “gower” python package if you actually intend to use this method on your own data. The package can simply be installed with pip, and then the following code can be executed to calculate your distance matrix.
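A minimal sketch of that usage, assuming a mixed-type pandas DataFrame (the file name here is purely illustrative):

import gower
import pandas as pd

# hypothetical mixed-type DataFrame with both numeric and categorical columns
df = pd.read_csv("mixed_data.csv")

# pairwise Gower distances as an n x n numpy array
distance_matrix = gower.gower_matrix(df)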

For all three datasets we computed their corresponding Gower distance matrices using the above function. From a geometric standpoint one should not compute Euclidean distances from the resultant Gower matrices. Therefore, this method cannot be combined with the K-Means algorithm. Instead, the KMedoids algorithm provided by the “sklearn_extra” package in Python was used to determine the optimal clustering solution.
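A sketch of that clustering step, assuming the Gower distance matrix computed above and a known number of clusters (the specific hyper-parameters are illustrative):

from sklearn_extra.cluster import KMedoids

k = 2  # number of clusters, assumed known in advance
kmedoids = KMedoids(n_clusters=k, metric="precomputed",
                    init="k-medoids++", random_state=42)

# distance_matrix is the precomputed Gower distance matrix
labels = kmedoids.fit_predict(distance_matrix)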

2D Visualization of Gower Clustering Solution

2. Dimensionality Reduction

Dimensionality reduction is a common technique used to cluster high dimensional data. This technique attempts to transform the data into a lower dimensional space while retaining as much information about the original data as possible. For our purposes we are not interested in converting our data into a lower dimensional space. Instead we want to use the technique to homogenize our mixed dataset, by converting all of our variables into normalized numeric values. For this article we leveraged two different dimensionality reduction techniques: Factor Analysis of Mixed Data (FAMD) and Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP).

a. Factor Analysis of Mixed Data (FAMD)

This algorithm generalizes the Principal Component Analysis (PCA) algorithm to mixed datasets. The method operates by first one hot encoding the categorical variables. Each numerical variable is then standardized by subtracting the mean and dividing by the standard deviation. The one hot encoded variables are treated slightly differently: they are divided by the square root of the proportion of objects in the column (the number of 1s over the number of observations in the column) and then centered (the mean is subtracted). Finally, the PCA algorithm is executed on the resulting matrix to obtain the final output. The following code snippet shows a pythonic implementation of the FAMD procedure described above.
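A rough sketch of that procedure might look as follows (this is not the article’s original gist; the function name, column handling and number of components are illustrative):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def famd_transform(df: pd.DataFrame, n_components: int = 2) -> np.ndarray:
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")

    # standardise the numeric variables
    z_num = (numeric - numeric.mean()) / numeric.std()

    # one hot encode, divide by the square root of each column's
    # proportion of ones, then centre the result
    dummies = pd.get_dummies(categorical).astype(float)
    z_cat = dummies / np.sqrt(dummies.mean())
    z_cat = z_cat - z_cat.mean()

    # run PCA on the combined matrix
    combined = pd.concat([z_num, z_cat], axis=1)
    return PCA(n_components=n_components).fit_transform(combined)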

The code above is more of a demonstration to visualize how the algorithm works. I would not advise using this code for your own analysis. Instead, I would use the Python implementation from the “prince” package, which can easily be installed with pip.
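A minimal sketch using prince (the exact API can differ slightly between prince versions, so treat this as indicative):

import prince

famd = prince.FAMD(n_components=2, random_state=42)
famd = famd.fit(df)                       # df: mixed-type pandas DataFrame
coordinates = famd.row_coordinates(df)    # numeric coordinates used for clustering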

2D Visualization of FAMD Clustering Solution

b. Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)

UMAP is a dimensionality reduction technique predicated upon manifold learning and ideas from topological data analysis. It provides a general framework for applying dimensionality reduction and is an incredibly powerful tool in any data scientist’s arsenal. In order to apply UMAP to mixed data we must create two separate manifolds, one for each data type, as UMAP still needs to compute distances between points. These two manifolds are then combined, either by computing the union, which preserves more of the categorical embedding, or by calculating the intersection, which causes the embedding to more closely resemble the numerical manifold.
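A sketch of this two-manifold approach, assuming a mixed-type DataFrame named df; recent versions of umap-learn let you combine fitted models with the * (intersection) and + (union) operators, and the metric choices here are illustrative:

import pandas as pd
import umap
from sklearn.preprocessing import StandardScaler

# split the DataFrame by type and prepare each part separately
numeric = StandardScaler().fit_transform(df.select_dtypes(include="number"))
categorical = pd.get_dummies(df.select_dtypes(exclude="number"))

# fit one manifold per data type
numeric_mapper = umap.UMAP(metric="euclidean", random_state=42).fit(numeric)
categorical_mapper = umap.UMAP(metric="dice", random_state=42).fit(categorical)

# combine the fitted models: * gives the intersection, + gives the union
intersection_embedding = (numeric_mapper * categorical_mapper).embedding_
union_embedding = (numeric_mapper + categorical_mapper).embedding_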

2D Visualization of the UMAP Clustering Solution

3. Recode Categorical Variables

The general pre-processing workflow for recoding categorical variables is to first one hot encode the variables. This means that for each unique category a new binary variable is appended to the dataset. Each one hot encoded variable is then standardized by subtracting the column mean and dividing by the standard deviation. The approach is implemented in the Python code snippet below.
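A minimal sketch of this recoding step, assuming a mixed-type pandas DataFrame named df (here every column, numeric and one hot alike, is standardized):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# get_dummies leaves numeric columns untouched and one hot encodes the categorical ones
encoded = pd.get_dummies(df)

# standardise every column: subtract the mean, divide by the standard deviation
recoded = pd.DataFrame(StandardScaler().fit_transform(encoded),
                       columns=encoded.columns, index=encoded.index)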

One thing to note is that distance-based clustering methods, such as K-Means, are not invariant to affine transformations, so how the one hot variables are standardized has an effect on clustering performance. However, an optimal choice that applies across different datasets or clustering techniques does not exist.

2D Visualization of the Clustering Solution from Recoding the Categorical Variables

4. Recode Numerical Variables

Numerical variables can be converted into categories by a process called discretization. Discretization splits a numerical variable into N intervals, and the categories are then labelled according to the interval that each value falls into. Apart from the obvious loss of information, another challenge with this approach is choosing the correct method of discretization. Some variables have natural intervals that lend themselves to a meaningful discretization, e.g. breaking age down into natural groups such as children, teenagers, adults and the elderly. Unfortunately, in many cases these natural groupings do not exist or cannot be applied without expert knowledge. This ambiguity in choosing the appropriate discretization method for each variable poses a problem, as the choice of discretization has a direct impact on the performance of the clustering algorithms.

For this article all of the numerical variables were discretized using the same method, by applying the KMeans algorithm (k=5) to each variable individually. This method is very simplistic and can definitely be improved upon; even simply selecting the optimal k for each variable individually would improve the results. The following function applies the simple method described above by discretizing every numerical feature in a dataframe.
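The original function is not reproduced here; as a stand-in, the sketch below uses scikit-learn’s KBinsDiscretizer with strategy="kmeans", which fits a one-dimensional K-Means to each column and so matches the per-variable KMeans (k=5) approach described above:

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

def discretize_numeric(df: pd.DataFrame, n_bins: int = 5) -> pd.DataFrame:
    """Replace every numeric column with a categorical label for the
    one-dimensional K-Means bin that each value falls into."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    binner = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="kmeans")
    binned = binner.fit_transform(df[numeric_cols])
    # store the bin indices as string labels so the columns become categorical
    df[numeric_cols] = pd.DataFrame(binned, columns=numeric_cols,
                                    index=df.index).astype(int).astype(str)
    return df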

The distribution of footballers’ age before and after discretization

After all the numerical data was converted into categories, three different clustering techniques were applied and their performance measured. These methods were:

  1. KModes
  2. Categorical Embedding + KMeans
  3. Graph Analysis Community detection

a. KModes

The K-Modes algorithm modifies the standard K-Means process for clustering categorical data by replacing the notion of distances with dissimilarities. That means instead of measuring distances, it counts the total number of mismatches between two observations; the smaller the count, the more similar the two observations are. Also, instead of using the mean to find the centroid of a cluster, it uses the mode. These changes allow the algorithm to converge to a local minimum for categorical datasets. For more information please check out this blog post.

Unfortunately, as of yet the KModes algorithm is not implemented as part of the sklearn.cluster module. Therefore, the “kmodes” package from PyPI is needed in order to cluster the categorical data. The code below demonstrates how to use the KModes function from “kmodes”.
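A minimal sketch of that usage, assuming an all-categorical DataFrame called categorical_df and a known number of clusters (the init and n_init settings are illustrative):

from kmodes.kmodes import KModes

k = 2  # number of clusters, assumed known in advance
km = KModes(n_clusters=k, init="Huang", n_init=5, random_state=42)

# categorical_df: DataFrame containing only categorical columns
labels = km.fit_predict(categorical_df)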

2D Visualization of KModes Clustering Solution

b. Categorical Embedding + KMeans

As our data is now homogeneous (containing only one data type), we can apply certain dimensionality reduction techniques that specialize in categorical data. The idea here is to apply dimensionality reduction in order to create a standardized numeric representation of the data for the KMeans algorithm to cluster. In this article we chose the UMAP algorithm with the “Dice” distance metric, although any dimensionality reduction technique that can handle categorical data could have been applied.
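A sketch of this embedding-plus-clustering step, again assuming an all-categorical DataFrame called categorical_df (the parameter choices are illustrative):

import pandas as pd
import umap
from sklearn.cluster import KMeans

# one hot encode the categorical data so UMAP can compute Dice distances
one_hot = pd.get_dummies(categorical_df)

# 2D UMAP embedding with the Dice metric, then K-Means on the embedding
embedding = umap.UMAP(n_components=2, metric="dice",
                      random_state=42).fit_transform(one_hot)
labels = KMeans(n_clusters=2, random_state=42).fit_predict(embedding)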

2D Visualization of the Clustering Solution from Categorical UMAP Embedding + KMeans

c. Graph Analysis Community Detection

We can transform our recoded (purely categorical) tabular data into a graph structure by creating an edge list. To create an edge list we consider each row to be a node in the graph; this row is then connected to its categorical values, or associated metadata, as these categories are also nodes in the graph.

Modelling the data in this way allows us to capture all of the different non-linear interactions between the categorical variables and our observations. Below is the code necessary to convert your tabular data into an edge list.
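The original snippet is not reproduced here; below is one possible sketch, assuming an all-categorical DataFrame with string column names (the row_/column=value naming scheme is my own):

import pandas as pd

def to_edge_list(categorical_df: pd.DataFrame) -> pd.DataFrame:
    """Each row index becomes a node connected to one node per
    (column, category) value it holds; every edge has weight 1."""
    edges = (categorical_df
             .reset_index()
             .melt(id_vars="index", var_name="column", value_name="category"))
    edges["source"] = "row_" + edges["index"].astype(str)
    edges["target"] = edges["column"] + "=" + edges["category"].astype(str)
    edges["weight"] = 1
    return edges[["source", "target", "weight"]]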

An example of the edge list created for the Heart dataset; each row is only connected to its categorical values. All edges have a weight of one.

Once we have a Graph, it is now time to analyze it. As this article is concerned with clustering we will perform community detection on the Graph. Community detection essentially finds subsets of nodes that are densely connected to each other and sparsely connected to the other nodes in the graph. This is pretty much the graph analytics equivalent of clustering. In this article we used the “community” python package to perform the Louvain method for community detection.
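A minimal sketch of that step, building the graph with networkx from the edge list above and running best_partition from the python-louvain package (imported as “community”):

import networkx as nx
import community as community_louvain   # the python-louvain package

edge_list = to_edge_list(categorical_df)
G = nx.from_pandas_edgelist(edge_list, source="source",
                            target="target", edge_attr="weight")

# dict mapping every node (rows and categories) to a community id
partition = community_louvain.best_partition(G)

# keep only the community labels of the row nodes, i.e. our observations
row_communities = {node: comm for node, comm in partition.items()
                   if str(node).startswith("row_")}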

Visualizations of the Resultant Graphs

The resultant graphs produced by the code above for each of our three datasets (left to right: Banking Dataset Marketing Targets, Kaggle FIFA 19 Complete Player and Cleveland Heart Disease); the colors in each graph highlight the communities that the nodes belong to. The visualizations were created using the Gephi software.

5. K-Prototypes

K-Prototypes is an adaptation of the KMeans algorithm that offers the ability to cluster mixed data. Just like KMeans, K-Prototypes measures the distance between numerical variables using ‘Euclidean’ distance. However, unlike the KMeans algorithm it measures the distance between categorical variables using the number of matching categories. The algorithm combines both of these distance metrics to compute the actual distance between samples.

Unfortunately, as of yet the K-Prototypes algorithm is not implemented as part of the sklearn.cluster module. Therefore, the “kmodes” package from PyPI is needed. The code below demonstrates how to use the KPrototypes function from “kmodes”.
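A minimal sketch of that usage, assuming a mixed-type DataFrame df whose categorical columns have the object dtype (the number of clusters and init method are illustrative):

from kmodes.kprototypes import KPrototypes

# positions of the categorical columns in the array passed to fit_predict
cat_columns = [i for i, dtype in enumerate(df.dtypes) if dtype == "object"]

kproto = KPrototypes(n_clusters=2, init="Cao", random_state=42)
labels = kproto.fit_predict(df.values, categorical=cat_columns)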

2D Visualization of the K-Prototypes Clustering Solution

Results

This article has demonstrated several pythonic implementations of distance-based clustering algorithms for mixed data. Despite the ubiquitous nature of mixed data in the wild, there is still no agreement on the best approach to cluster such data. The primary reason for this ambiguity lies with the issue of assessing the quality or success of the clusters produced.

As illustrated by the results below, the applied methods can produce very different outputs despite achieving similar ‘Adjusted Rand Index’ scores with the target label. This makes it a challenge to compare the performances of our algorithms. Moreover, in a real world setting the clustering analysis will primarily be unsupervised, so alternative methods must be applied to select the best algorithm. For this reason alone, some mockingly call unsupervised learning more of an art form than a science. Despite this, some performances stand out from the rest of our results.

First off, the UMAP 2D approach produces clusters that are almost completely different from those of the other algorithms, and the resultant clusters perform significantly worse when evaluated against the target variables. In contrast, the clusters produced by the Recode Categorical Variables method achieved the highest average Adjusted Rand Index value (0.26) with the target variables, followed closely by the solutions provided by the UMAP Recode Numerical Variables method, which achieved an average score of 0.25. Interestingly, the similarity between these two clustering solutions changes drastically from dataset to dataset. The novel community detection approach worked well; on real problems this method has the benefit of automatically choosing the number of clusters, but for this analysis that hindered its performance, as we could not pre-select the already known number of clusters. The performance of every clustering algorithm with respect to the target variable drops for the Banking dataset. These low scores can be attributed to the poor separation boundary between the classes; clearly the chosen variables are not able to predict whether or not a client had subscribed to a term deposit.

In conclusion, this article provides a variety of different distance-based clustering methods to choose from. Unfortunately, there is “No Free Lunch” when it comes to clustering mixed data, as each clustering solution depends entirely upon your own success criterion. However, combining a range of different clustering techniques that produce uncorrelated clustering solutions with various evaluation metrics, such as the “Silhouette Score”, the “Davies-Bouldin” index and human evaluation, to assess the quality of your clustering solutions will allow you to discover the optimal solution to your problem.

All of the code used to create this article can be found on GitHub.

This matrix shows the Adjusted Rand Index scores between each method and the label (when present) for our three datasets. The table also shows the Adjusted Rand Index scores between the methods themselves; the columns are labeled #1–#9, where each number represents a method, and the matching number can be seen in the Method column.
