Gower’s Distance

Published in

Analytics Vidhya

3 min readJun 17, 2020

The concept of distance is basic to human experience. In everyday life it usually means some degree of closeness of two physical objects or ideas, while the term metric is often used as a standard for a measurement. We can consider distances between observations or distances between quantitative or qualitative variables.

One of the most important task while clustering the data is to decide what metric to be used for calculating distance between each data point. In various real-life ﬁelds where cluster analysis is commonly used, such as biology, social sciences, or marketing surveys, datasets with both quantitative and categorical variables are often applied. This type of data is referred as mixed data. Many distance metrics exist, and one of them is, the Gower distance (1971) which is used when the data is of Mixed data.

What is Gower’s Distance?

Gower’s Distance can be used to measure how different two records are. The records may contain combination of logical, categorical, numerical or text data. The distance is always a number between 0 (identical) and 1 (maximally dissimilar). The metrics used for each data type are described below:

quantitative (interval): range-normalized Manhattan distance
ordinal: variable is first ranked, then Manhattan distance is used with a special adjustment for ties
nominal: variables of k categories are first converted into k binary columns and then the Dice coefficient is used

How Gower’s Distance Works?

Gower’s distance is computed as the average of partial dissimilarities across individuals. The general form of the coefficient is the following:

Gower’s Distance Formula with **sj(x1,x2)** as the partial similarity function computed separately for each descriptor

For quantitative descriptors,

Formula for **sj(x1,x2) when the data is numeric**

For qualitative descriptors, Dice distance is calculated. Whenever the values are equal , Dice Distance = 0 and when they’re not equal this is how sklearn calculates Dice Distance.

Implementing Gower’s Distance in Python

Interpretation

Distance between Row 1 and Row 3 is 0.02 and that of between Row 1 and Row 6 is 0.68.

This indicates that Row 1 is the most similar to Row 3. We can verify the same as Age and Gender of the rows are same and preTestScore, postTestScore as well as available_credit are closely related. In the same way, distinction between Row 1 and Row 6 can also be made.

Conclusion

In this post, we’ve gone through Gower’s distance and how it computes distances between pairs of variables over two data sets and then combines those distances to a single value per record-pair.and implementation of the same in python.

Lastly, I used this metric in clustering. Welcome any feedback.