Gower’s Distance

Divyanshu Anand
Analytics Vidhya
Published in
3 min readJun 17, 2020

The concept of distance is basic to human experience. In everyday life it usually means some degree of closeness of two physical objects or ideas, while the term metric is often used as a standard for a measurement. We can consider distances between observations or distances between quantitative or qualitative variables.

Distance

One of the most important task while clustering the data is to decide what metric to be used for calculating distance between each data point. In various real-life fields where cluster analysis is commonly used, such as biology, social sciences, or marketing surveys, datasets with both quantitative and categorical variables are often applied. This type of data is referred as mixed data. Many distance metrics exist, and one of them is, the Gower distance (1971) which is used when the data is of Mixed data.

What is Gower’s Distance?

What is Gower’s Distance ????

Gower’s Distance can be used to measure how different two records are. The records may contain combination of logical, categorical, numerical or text data. The distance is always a number between 0 (identical) and 1 (maximally dissimilar). The metrics used for each data type are described below:

  • quantitative (interval): range-normalized Manhattan distance
  • ordinal: variable is first ranked, then Manhattan distance is used with a special adjustment for ties
  • nominal: variables of k categories are first converted into k binary columns and then the Dice coefficient is used

How Gower’s Distance Works?

Gower’s distance is computed as the average of partial dissimilarities across individuals. The general form of the coefficient is the following:

Gower’s Distance Formula with sj(x1,x2) as the partial similarity function computed separately for each descriptor

For quantitative descriptors,

Formula for sj(x1,x2) when the data is numeric

For qualitative descriptors, Dice distance is calculated. Whenever the values are equal , Dice Distance = 0 and when they’re not equal this is how sklearn calculates Dice Distance.

Implementing Gower’s Distance in Python

Interpretation

Distance between Row 1 and Row 3 is 0.02 and that of between Row 1 and Row 6 is 0.68.

This indicates that Row 1 is the most similar to Row 3. We can verify the same as Age and Gender of the rows are same and preTestScore, postTestScore as well as available_credit are closely related. In the same way, distinction between Row 1 and Row 6 can also be made.

Conclusion

In this post, we’ve gone through Gower’s distance and how it computes distances between pairs of variables over two data sets and then combines those distances to a single value per record-pair.and implementation of the same in python.

Lastly, I used this metric in clustering. Welcome any feedback.

--

--

Divyanshu Anand
Analytics Vidhya

Execute analytical experiments to help solve various problems, making a true impact across various domains and industries.