Clustering Categorical and Numerical Data Using Gower Distance

Muhammad Muhaimin
3 min read · Aug 7, 2018


Data comes in many forms. Sometimes we have continuous numerical data, and sometimes we have discrete categorical data. In real-world scenarios we often have mixed data, with both numerical and categorical attributes. Cluster analysis, or clustering, is a very useful technique for getting meaningful insight from such data. In the following sections, we discuss one approach to clustering mixed data, where we have both numerical and categorical attributes.

Gower Distance

Gower distance is a distance measure that can be used to calculate the distance between two entities whose attributes are a mix of categorical and numerical values. The details of Gower distance are out of scope for this post, but if you are interested you can read "Introduction to the gower package" or the original paper: Gower (1971), A general coefficient of similarity and some of its properties, Biometrics 27, 857–874.
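For intuition only, the idea behind Gower distance can be sketched in a few lines of base R. This is a simplified sketch, assuming only numeric and unordered categorical columns, no missing values, and equal weights. The toy data frame and the gower_pair helper are made up for illustration and are not part of any package.

```r
# Hypothetical toy data: one numeric and two categorical attributes.
toy <- data.frame(
  age    = c(20, 30, 60),
  colour = factor(c("red", "blue", "red")),
  owner  = factor(c("yes", "yes", "no"))
)

# Gower dissimilarity between rows i and j (numeric and nominal columns only).
gower_pair <- function(df, i, j) {
  per_column <- sapply(names(df), function(col) {
    x <- df[[col]]
    if (is.numeric(x)) {
      # Numeric: absolute difference scaled by the column's range.
      abs(x[i] - x[j]) / (max(x) - min(x))
    } else {
      # Categorical: 0 if the categories match, 1 otherwise.
      as.numeric(x[i] != x[j])
    }
  })
  # Equal-weight average of the per-column dissimilarities.
  mean(per_column)
}

gower_pair(toy, 1, 2)
```

Here age contributes 10/40 = 0.25, colour contributes 1 (different), and owner contributes 0 (same), so the result is (0.25 + 1 + 0) / 3, roughly 0.417. The value always falls between 0 (identical rows) and 1.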

The first step is to analyze the dissimilarity between observations in the data set using Gower distance. One way to express this is with a dissimilarity matrix. Using the daisy function from the cluster package, we can easily calculate the dissimilarity matrix using Gower distance. Let's create a sample dataframe for our operation.

Created Dataframe
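The rows of the sample dataframe are not listed in the text. To run the steps that follow, any small dataframe with numeric columns and factor columns will do. The one below is a hypothetical stand-in (column names and values are invented), so it will not reproduce the exact dissimilarities shown later.

```r
# Hypothetical stand-in for sample.data: 8 rows, numeric and categorical columns.
sample.data <- data.frame(
  age     = c(23, 45, 31, 52, 27, 38, 61, 34),
  income  = c(28000, 61000, 45000, 83000, 30000, 52000, 47000, 39000),
  gender  = factor(c("F", "M", "F", "M", "F", "M", "M", "F")),
  segment = factor(c("basic", "premium", "basic", "premium",
                     "basic", "standard", "standard", "basic"))
)

# Check that each column has the intended type.
str(sample.data)
```

The categorical columns are declared with factor() explicitly, since R 4.0 and later no longer convert strings to factors by default, and daisy relies on the column types to decide how to compare values.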

In the second step, we create the dissimilarity matrix from the dataframe using the daisy function.

library(cluster)
gower.dissimilarity.mtrx <- daisy(sample.data, metric = c("gower"))

This gives us the following dissimilarity matrix:

Dissimilarities :
1 2 3 4 5 6 7
2 0.3590238
3 0.6707398 0.6964303
4 0.3178742 0.3138769 0.6552807
5 0.1687281 0.5236290 0.6728013 0.4824794
6 0.5262298 0.2006472 0.6969697 0.4810829 0.3575017
7 0.5969786 0.5472028 0.7404280 0.7481861 0.4323733 0.3478501
8 0.4777876 0.6539635 0.8151941 0.3433228 0.3121036 0.4878362 0.5747661
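A quick sanity check on a dissimilarity matrix is to look up the most similar and the most dissimilar pair of observations. The sketch below rebuilds the lower triangle from the values printed above, so it runs without sample.data, and finds both pairs. The variable names are my own.

```r
# Lower-triangle values copied from the printed output, column by column.
vals <- c(
  0.3590238, 0.6707398, 0.3178742, 0.1687281, 0.5262298, 0.5969786, 0.4777876,
  0.6964303, 0.3138769, 0.5236290, 0.2006472, 0.5472028, 0.6539635,
  0.6552807, 0.6728013, 0.6969697, 0.7404280, 0.8151941,
  0.4824794, 0.4810829, 0.7481861, 0.3433228,
  0.3575017, 0.4323733, 0.3121036,
  0.3478501, 0.4878362,
  0.5747661
)

# Fill only the lower triangle so each pair appears exactly once.
low <- matrix(NA_real_, nrow = 8, ncol = 8)
low[lower.tri(low)] <- vals

closest  <- which(low == min(low, na.rm = TRUE), arr.ind = TRUE)
farthest <- which(low == max(low, na.rm = TRUE), arr.ind = TRUE)

closest   # row 5, col 1: observations 1 and 5 (0.1687281)
farthest  # row 8, col 3: observations 3 and 8 (0.8151941)
```

Comparing those rows in the dataframe is an easy way to confirm that the distances agree with intuition before clustering.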

Let's save it in a CSV file for later use.

dissimilarity.mtrx.csv.content = as.matrix(gower.dissimilarity.mtrx)
write.table(dissimilarity.mtrx.csv.content,
'dissimilarity.mtrx.csv',
row.names=FALSE,
col.names=FALSE,
sep=",")
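To use the saved file later, it has to be read back and turned into a dist object again. Here is a minimal round-trip sketch, assuming the file was written without row or column names as above. The small matrix and the temporary file path are placeholders for illustration.

```r
# Placeholder symmetric dissimilarity matrix (3 observations).
m <- matrix(c(0,    0.36, 0.67,
              0.36, 0,    0.70,
              0.67, 0.70, 0), nrow = 3)

# Write it the same way as in the article, but to a temporary file.
path <- tempfile(fileext = ".csv")
write.table(m, path, row.names = FALSE, col.names = FALSE, sep = ",")

# Read it back: no header row, then convert to a plain matrix and to a dist.
restored <- as.dist(unname(as.matrix(read.csv(path, header = FALSE))))
restored
```

The resulting dist object can be passed to functions such as pam in place of the original daisy output.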

Finally, we have a dissimilarity matrix based on Gower distance that we can use in the next step.

Now that we have the dissimilarity matrix, let's cluster with it. For clustering we will use the PAM (Partitioning Around Medoids) algorithm, available as the pam function in R's cluster package, and we choose the number of clusters by silhouette analysis.

# Partition into 3 clusters and plot the silhouette
dist <- gower.dissimilarity.mtrx
pamx <- pam(dist, 3)
sil <- silhouette(pamx$clustering, dist)
plot(sil)
Silhouette Analysis
# Partition into 4 clusters and plot the silhouette for comparison
dist <- gower.dissimilarity.mtrx
pamx <- pam(dist, 4)
sil <- silhouette(pamx$clustering, dist)
plot(sil)

Based on these silhouette plots, we choose 3 as the number of clusters.
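Reading silhouette plots by eye works for a couple of candidates. The same comparison can be made numerically by computing the average silhouette width for a range of cluster counts and preferring the largest. The sketch below assumes the cluster package is installed and uses a small simulated dataframe (invented for illustration) in place of sample.data.

```r
library(cluster)

# Simulated mixed data (hypothetical): one numeric and one categorical column.
set.seed(42)
toy <- data.frame(
  x = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  g = factor(rep(c("a", "b"), each = 10))
)
d <- daisy(toy, metric = "gower")

# Average silhouette width for each candidate number of clusters.
ks <- 2:5
avg.width <- sapply(ks, function(k) pam(d, k)$silinfo$avg.width)

data.frame(k = ks, avg.width = avg.width)
ks[which.max(avg.width)]  # candidate with the highest average width
```

With the data from this post, pam(gower.dissimilarity.mtrx, k) would take the place of pam(d, k). A higher average width means observations sit more comfortably inside their assigned clusters.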

Inspired by:

https://medium.com/@anastasia.reusova/hierarchical-clustering-on-categorical-data-in-r-a27e578f2995
