How to calculate Gower’s Distance using Python
Often in our analysis we tend to group similar objects together and then apply different rules and validation on these groups instead of separately dealing with each single point.
In Machine Learning world this activity is called as clustering. There are many algorithms which are used for clustering K-Means, DBSCAN, Hierarchical Clustering etc. but none of them are efficient if you have both numerical and categorical data in your dataset. Gower’s Measure aims to solve this problem.
How do you measure similarity?
How will you differentiate between customer A who has a bank balance of 10000 $, average credit card spend is 300 $/ month, lives in LA and is a restaurant owner compared to another customer B who has a bank balance of 11230 $, average credit card spend is 1300 $/ month, lives in NJ and is employed in private firm as a Business Analyst?
And what about customer C who is also a Business Analyst, does not have a credit card lives in LA and has a 50000$ in his bank account. If you compare residence than C and A are similar but on the other hand if you compare their profession than C is clearly more similar to B.
Clearly all three of them are different as all three of their attributes are not same with one another. But the real question is how do we define how dissimilarity? How much similar/dissimilar are they? If we need to pair one and leave out one how do we decide? In analytics we are ought to quantify everything. We need numbers to back our observations. Grouping data points over only numerical features is easy, K-Means does that for us. Same can be said for the categorical data K-Mode can be used for that purpose. When we have a mix of both numerical and categorical features clustering fails to do a good job.
1. What is Gower’s Measure
2. Data preparation
3. Understanding the calculations
5. One line implementation
Gower’s disatance/measure/coefficient/similarity is a measure to find the similarity between two rows of a dataset consisting of mixed type attributes. It uses the concept of Manhattan distance for continuous variables and dice distance for measuring similarity between Binary variables.
Down below is the mathematical formula to calculate Gower’s Distance:
‘S’ can also be said as the similarity value that we are interested in calculating.
For this blog I will use 3 features (2 numeric and 1 categorical) to demonstrate the metric. We’ll start by importing necessary libraries first.
import pandas as pd
import numpy as np
from sklearn.neighbors import DistanceMetric
Building the dataframe
df = pd.DataFrame([[1,2.6,'A'],[12,5,'X'],[4,7,'A']])
df.columns = ['Num_1','Num_2','Cat_1']
So I now need to find similarities between 2 data points based on each of the feature. I will be finding out 3 different ‘S’ value as given in the formula.
Understanding the calculations
Going forward let I’ll call these three similarities as s1, s2 and s3. I will keep all the weights as 1.
s1 = similarity based on ‘Num_1’
This is a two step process
- Finding the Manhattan distance between each of the point
- Normalizing the matrix
s1 = DistanceMetric.get_metric('manhattan').pairwise(df[['Num_1']])
# normalizing the matrix by dividing with the highest values
s1 = s1/max(np.ptp(df['Num_1']),1)
Likewise, s2 can be calculated.
#performing both the step at the same time
s2 = DistanceMetric.get_metric('manhattan').pairwise(df[['Num_2']])/max(np.ptp(df['Num_2']),1)
Now calculating s3 is not going to be the same as in s1 and s2 we were dealing with numerical values. We first need to convert the categorical values in to dummy variables and then apply dice distance.
if you’re wondering what is dice coefficient then just remember it is again a method to calculate the similarity between categories.
To cut it short whenever the values are equal , DiceDistance = 0
And when they’re not equal this is how sklearn calculates DiceDistance.
For our problemNTF : number of dims in which the first value is True, second is False = 2.NFT : number of dims in which the first value is False, second is True = 1.
NNEQ : number of non-equal dimensions, NNEQ = NTF + NFT = 3.
NTT : number of dims in which both values are True = 0.NNZ : number of nonzero dimensions, NNZ = NTF + NFT + NTT = 3.DiceDistance = NNEQ / (NTT + NNZ) = 3/3 = 1
Let’s back our above manual calculation by python code. s3 value can be calculated as follows
s3 = DistanceMetric.get_metric('dice').pairwise(dummy_df)
As expected the matrix returns a value of 1 wherever if finds a non equal value.
And with we have all the values. s1,s2 and s3 as calculated and w1 = w2 = w3 = 1.
Let’s calculate Gower’s Measure now.
Gowers_Distance = (s1*w1 + s2*w2 + s3*w3)/(w1 + w2 + w3)
There you have it the matrix above represents the Similarity index between any two data points.
No calculation/algorithm is complete with out the interpretation of the numbers derived from it.
Distance between Row 1 and Row 2 is 0.84 and that of between Row 1 and Row 3 is 0.42.
This means Row 1 is more similar to Row 3 compared to Row 2. Intuitively this makes sense as if we take a look at the values. Both the rows have same values in the categorical col. And value in Num_1 is also closer compared to Row 2.
Bravo !!! If we’ve made it this far then you’ve done pretty well to learn this.
But wait. Don’t we have any package in Python that automatically does this instead of manually writing the formulas?
One Line Implementation
Well unlike R there was no package available until 2019 Dec. That probably might be the reason why we have very limited people talking about this topic.
Note the package does not comes pre-installed with Anaconda. You will have to install it and restart your kernel before you can start using it.
pip install gower
Fantastic part about this package is you dont even have to do the dirty work of converting the values into dummy variable.
Notice that the output received is exactly identical with the one when we carried out the calculations manually.
Grouping object based on similarity is one of the very talked about use cases in the industry. Personally, I applied this method in one of my projects where task was to group similar vendors based on 17 different KPI’s which were related to their financial activities. That was the first time I landed in to Gower’s Measure.
The analysis can be extended where the output matrix can be directly be fed into numerous algorithms like t-sne algorithm for dimesnionality reduction, hierarchical clustering etc.
Hope you had a good time reading the blog. If you like my blog please show some support by giving me some claps to this article.
Link to the entire code can be found here : GitHub
My LinkedIn Profile : LinkedIn
Gower's distance calculation in Python. Gower Distance is a distance measure that can be used to calculate distance…