7 Important Distance Metrics Every Data Scientist Should Know

Shashwat Tiwari · Published in Geek Culture · 6 min read · Jun 30, 2021

Hola,

Distance metrics play a vital role in most machine learning models. They are primarily used to boost the performance of similarity-based algorithms.

Distance metrics have been in use since the early days of machine learning. Basically, a distance provides a measure of similarity between two data points. One of the most popular examples of a distance-based method is the well-known nearest-neighbors rule for classification, where a new sample is labeled with the majority class among its nearest neighbors.

Nearest-neighbor classifiers are the main motivation behind distance-based learning. These algorithms rely on standard distance metrics, such as the Euclidean distance, to measure data similarity, and a distance that clearly separates similar from dissimilar data can significantly increase the quality of these algorithms.

An effective distance metric improves the performance of our machine learning model, whether that’s for classification tasks or clustering.

In this blog, we are going to walk through some of the most widely used distance metrics that every data scientist must know:

  • Euclidean Distance
  • Manhattan Distance
  • Chebyshev Distance
  • Minkowski Distance
  • Hamming Distance
  • Cosine Similarity
  • Jaccard Similarity

Euclidean Distance

Euclidean Distance is one of the most commonly used distance metrics. Mathematically, it is the square root of the sum of squared differences between the coordinates of two data points:

              d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

Many machine learning algorithms, including K-Means clustering, use Euclidean distance to calculate the similarity between two data points or observations.


Moreover, Euclidean distance is used when the two data points being compared have numerical features, i.e. integers or floats.

Keep in mind that if your data has features on different scales, you must normalize or standardize all columns before calculating the Euclidean distance. If you skip feature scaling, features with large values will dominate the metric.
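
To make this concrete, here is a minimal sketch in Python (NumPy, with hypothetical sample points; SciPy users can call scipy.spatial.distance.euclidean instead):

    import numpy as np

    # Two hypothetical observations with numerical features
    p = np.array([1.0, 2.0, 3.0])
    q = np.array([4.0, 6.0, 3.0])

    # Square root of the sum of squared differences
    print(np.sqrt(np.sum((p - q) ** 2)))  # 5.0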

Manhattan Distance

Manhattan Distance, also known as City Block Distance or Taxicab Distance, calculates the distance between two real-valued vectors. Mathematically, it is the sum of the absolute differences between the coordinates of two data points:

              d(p, q) = Σᵢ |pᵢ − qᵢ|

It is mostly used when the vectors lie on a uniform grid, like a chessboard or city blocks. The name taxicab builds the intuition: the metric measures the shortest path a taxicab would take between city blocks, which are coordinates on the grid.


It is often recommended to use Manhattan distance instead of Euclidean distance when the vectors live in a grid-like, integer feature space, and it also tends to hold up better in high dimensions.
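
A minimal sketch with the same hypothetical points as above (SciPy's equivalent is scipy.spatial.distance.cityblock):

    import numpy as np

    p = np.array([1.0, 2.0, 3.0])
    q = np.array([4.0, 6.0, 3.0])

    # Sum of absolute coordinate-wise differences
    print(np.sum(np.abs(p - q)))  # 7.0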

Chebyshev Distance

The Chebyshev distance, also called Chessboard Distance, L-infinity Distance, or Maximum Value Distance, is calculated as the maximum of the absolute differences between the coordinates of two vectors:

              d(p, q) = maxᵢ |pᵢ − qᵢ|

The best intuition I got for this distance is the movement of a king on a chessboard: it can move one square in any direction (up, down, left, right, or diagonally), so the Chebyshev distance between two squares is the number of moves the king needs.

The Chebyshev distance is sometimes used in warehouse logistics, as it effectively measures the time a crane takes to move an object. It is also widely used in computer-aided manufacturing applications for optimizing machines that operate in a plane.
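
A minimal sketch, again with hypothetical points (SciPy's equivalent is scipy.spatial.distance.chebyshev):

    import numpy as np

    p = np.array([1.0, 2.0, 3.0])
    q = np.array([4.0, 6.0, 3.0])

    # Maximum absolute coordinate-wise difference
    print(np.max(np.abs(p - q)))  # 4.0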

Minkowski Distance

Minkowski Distance generalizes the Euclidean and Manhattan distances. It is also called the p-norm distance, as it adds a parameter "p" that allows different distance measures to be calculated:

              d(x, y) = ( Σᵢ |xᵢ − yᵢ|ᵖ )^(1/p)

For different values of p, the Minkowski distance reduces to:

  • p = 1: the Manhattan distance.
  • p = 2: the Euclidean distance.
  • p = ∞: the Chebyshev distance.

When implementing a machine learning algorithm that uses Minkowski as its distance measure, we can tune "p" as a hyperparameter.
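
A minimal sketch of tuning p, using a hand-rolled Minkowski function with hypothetical vectors (SciPy also ships scipy.spatial.distance.minkowski):

    import numpy as np

    def minkowski(x, y, p):
        # p-th root of the sum of p-th powers of absolute differences
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 6.0, 3.0])

    print(minkowski(x, y, 1))   # 7.0 -> Manhattan
    print(minkowski(x, y, 2))   # 5.0 -> Euclidean
    print(minkowski(x, y, 50))  # ~4.0 -> approaches Chebyshev as p grows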

Hamming Distance

Ahh! Hamming Distance is used when we have categorical attributes in our data. It measures the difference between two strings, which must be of the same length.

Per position, Hamming distance basically quantifies whether two attributes are different or not: the contribution is 0 when they are equal and 1 otherwise.

In the below code snippet, we calculate the Hamming distance between the two strings "euclidean" and "manhattan". The distance between these two strings can be reported either as the sum or as the average of the number of character differences between them.
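
    # Hamming distance as the sum of position-wise character mismatches
    def hamming_distance(s1, s2):
        assert len(s1) == len(s2), "strings must be of the same length"
        return sum(c1 != c2 for c1, c2 in zip(s1, s2))

    print(hamming_distance("euclidean", "manhattan"))  # 7

(A plain-Python sketch of the sum form; SciPy's scipy.spatial.distance.hamming returns the averaged form instead, 7/9 ≈ 0.78 here.)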

Hence the Hamming distance between these two strings is 7. Also keep in mind that the Hamming distance only works when the two strings have the same length.

Cosine Similarity

Cosine similarity measures the similarity between two non-zero vectors as the cosine of the angle between them.

The most similar vectors have an angle of 0 degrees between them, and cos(0°) = 1. Vectors pointing in opposite directions have an angle of 180 degrees, and cos(180°) = -1.

It follows that cosine similarity ranges from -1 to 1. For two vectors p and q it is computed as:

              cos(θ) = (p · q) / (‖p‖ ‖q‖)

Cosine distance is used when we want to calculate the distance between two sparse vectors. For example, if 1,000 attributes are collected about cars and 200 of them are mutually exclusive (meaning that one car has them but the other doesn't), then only the remaining 800 dimensions need to be included in the calculation.

Cosine similarity only cares about the angle between two vectors, not their magnitudes.

              cosine distance = 1 − cosine similarity
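
A minimal sketch (NumPy, hypothetical vectors; SciPy's scipy.spatial.distance.cosine returns the cosine distance directly):

    import numpy as np

    def cosine_similarity(x, y):
        # cos(angle) = dot product over the product of the norms
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])     # same direction, larger magnitude
    c = np.array([-1.0, -2.0, -3.0])  # opposite direction

    print(cosine_similarity(a, b))      # 1.0  (most similar)
    print(cosine_similarity(a, c))      # -1.0 (opposite)
    print(1 - cosine_similarity(a, b))  # 0.0  (cosine distance)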

Jaccard Similarity

Jaccard similarity is used to understand the similarity between two sample sets. Unlike the metrics above, it compares two finite sets instead of vectors, and it is defined as the size of the intersection divided by the size of the union of the sets. Mathematically, it is written as:

              J(A, B) = |A ∩ B| / |A ∪ B|

As with cosine distance, the Jaccard distance is calculated as

           Jaccard distance = 1 − Jaccard similarity

Jaccard similarity is commonly used with convolutional neural networks in image detection applications, where (as intersection over union) it measures the accuracy of object detection algorithms.
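
A minimal sketch over two hypothetical sets of tags:

    def jaccard_similarity(a, b):
        # Size of the intersection divided by the size of the union
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    tags_a = {"red", "green", "blue"}
    tags_b = {"green", "blue", "yellow"}

    sim = jaccard_similarity(tags_a, tags_b)
    print(sim)      # 2 / 4 = 0.5
    print(1 - sim)  # Jaccard distance = 0.5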

Inference

In this blog, we discussed various distance metrics that data scientists should know. The choice of distance metric should be based on your data:

  • Euclidean distance works well when features are numerical and on similar scales, or when we simply want the straight-line distance between two data points.
  • In high-dimensional data, Manhattan distance is often preferred.
  • For categorical features, Hamming distance is the natural choice.
  • Cosine similarity is used when we care about the direction of vectors, not their magnitude.


If you like this blog, please hit 👏 and follow me. If you notice any mistakes in the reasoning, formulas, animations, or code, please let me know.

Cheers!
