A short introduction to distance measures in Machine Learning

Saikat Bhattacharya
6 min read · Feb 19, 2020


“Lord, I’m one, Lord I’m two
Lord, I’m three, Lord I’m four,
Lord, I’m five hundred miles away from home.” (Image source: https://steemit.com/quotes/@solomon123/distance)

Finding similarity among data points is an important task when we try to find clusters in a given dataset. Unsupervised learning algorithms like K-means rely on the assumption that the closer two points are, the more similar they are, since there is no explicit measure of similarity.

So clustering, the grouping of observations, needs some method for computing the distance or (dis)similarity between each pair of observations. The result of this computation, called the distance matrix, gives an idea of the dissimilarity among the observations (why dissimilarity instead of similarity? think about it…).

In the Euclidean world of mathematics we generally measure the distance between two points (remember that you can represent a point as a vector) along the straight line joining them. The length of this straight line is the shortest distance between the two points, also called the Euclidean distance. Apart from this minimum distance, there are multiple ways of measuring the distance between two data points, and in most high-dimensional applications the choice of distance metric is not obvious. In this article our goal is to understand different ways of measuring distance (similarity, really) and when to use them.

Norm

The length of a vector is a nonnegative number that describes the extent of the vector in space, and is sometimes referred to as the vector’s magnitude or the norm. — (No Bullshit Guide To Linear Algebra)

Norm is a generic representation of distance. Mathematically speaking —

A mapping x→‖x‖ from a vector space X over the field of real or complex numbers into the real numbers, subject to the conditions:

1. ‖x‖≥0 and ‖x‖=0 for x=0 only;

2. ‖λx‖=|λ|⋅‖x‖ for every scalar λ;

3. ‖x+y‖≤‖x‖+‖y‖ for all x,y∈X .

The number ‖x‖ is called the norm of the element x.

(Does the 3ʳᵈ condition remind you of the triangle inequality for vectors?)

Actually, the norm of a mathematical object is a quantity that in some (possibly abstract) sense describes the length, size, or extent of the object. The norms for vectors (or data points) are called vector norms. Vector norms generally belong to the family of Lₚ-norms, defined by —

Formula 1: ∣∣x∣∣ₚ = (∑ᵢ ∣xᵢ∣ᵖ)^(1/p)

It can be shown that for any p ≥ 1, ∣∣x∣∣ₚ defines a vector norm (for 0 < p < 1 the triangle inequality fails).
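Formula 1 translates almost directly into code. Here is a minimal sketch using only plain Python; the function name `lp_norm` is just for illustration:

```python
def lp_norm(x, p):
    """Compute the Lp norm of a vector x (Formula 1), for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

print(lp_norm([3, 4], 1))  # L1 norm: 7.0
print(lp_norm([3, 4], 2))  # L2 norm: 5.0
```

Note that the two special cases we discuss below, p = 1 and p = 2, both come out of this one definition.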

L₁ Norm

L₁ norm is commonly known as Manhattan Distance or Taxicab norm.

fig 1. Manhattan distance between two points (Image source: Trebuňa, Peter & Halčinová, Jana. (2012). Experimental Modelling of the Cluster Analysis Processes. Procedia Engineering. 48. 673–678. 10.1016/j.proeng.2012.09.569.)

Let us assume there are two data points, each having only two features (say, length and weight). We can then plot them in a 2-dimensional plane with length and weight as the two axes. Figure 1 shows two such points as red dots — (1,2) and (5,5). Now, if you have to walk from the first point to the second on the condition that roads run only parallel to the axes, you first walk 4 units along the length axis (from 1 to 5) and then 3 units along the weight axis (from 2 to 5). So, a total of 4 + 3 = 7 units of movement takes you from the first point to the second.

This is the Manhattan distance. To see it mathematically, put p = 1 in Formula 1: the L₁ norm becomes ∣∣x∣∣₁ = ∑ᵢ ∣xᵢ∣. The Manhattan distance between two points is the L₁ norm of their difference: ∣1 − 5∣ + ∣2 − 5∣ = 4 + 3 = 7. (Subtracting the norms of the two points, 10 − 3 = 7, happens to give the same answer here only because both coordinates of the second point exceed those of the first; in general you must take the norm of the difference.)
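A minimal sketch of this computation for the two example points (the function name `manhattan_distance` is illustrative):

```python
def manhattan_distance(p, q):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(manhattan_distance((1, 2), (5, 5)))  # 7
```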

L₂ Norm

fig 2. Euclidean distance between two points (Image source: http://rosalind.info/glossary/euclidean-distance/)

L₂ norm is commonly known as Euclidean distance, the shortest distance between two points. If you put p = 2 in Formula 1, you obtain the L₂ norm of a point, i.e. its Euclidean distance from the origin (0,0) (see Formula 2 below).

Formula 2: ∣∣x∣∣₂ = √(∑ᵢ xᵢ²)

So, if you need to measure the distance between two data points p and q, Formula 2 takes the following form —

Formula 3: d(p, q) = √(∑ᵢ (pᵢ − qᵢ)²)
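As a quick sketch, Formula 3 in standard-library Python (reusing the same example points as in the Manhattan case):

```python
import math

def euclidean_distance(p, q):
    """L2 distance (Formula 3): square root of summed squared differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2), (5, 5)))  # 5.0
```

Notice the Euclidean distance (5.0) is shorter than the Manhattan distance (7) between the same points, as the straight line is the shortest path.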

There are a few other p-norms, but for our discussion the L₁ and L₂ norms are sufficient.

Mahalanobis distance

The Mahalanobis distance (MD) is another distance measure between two points in multivariate space. In Euclidean space the axes are orthogonal (drawn at right angles to each other), and orthogonality implies that the variables (feature variables) are uncorrelated. But if the feature variables are correlated among themselves, they cannot form a Euclidean space. For this case, Professor P. C. Mahalanobis introduced a new method of distance calculation, known after him as the Mahalanobis distance.

Mahalanobis distance is the distance between a point and a distribution, not between two distinct points. It is a generic distance measure that reduces to the Euclidean distance when the variables are uncorrelated and standardised.

Formula 4: D = √((x − m)ᵀ C⁻¹ (x − m))

The formula given above represents Mahalanobis distance. Here,

  • D = Mahalanobis Distance
  • x = the vector of the observation (data-point)
  • m = the vector of mean of each column
  • C = covariance matrix of feature variables

Let us try to understand the formula. The term (x − m) represents the distance between the observation and the mean. We multiply it by the inverse of the covariance matrix, which is a multivariate equivalent of the regular standardisation or z-score (z = (x − μ)/σ). Effectively, this addresses both scaling and correlation: if the variables are highly correlated, multiplying by C⁻¹ scales down the distance between x and m, while if the variables are uncorrelated and standardised, C becomes the identity matrix and leaves (x − m) unchanged.
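As a sketch, assuming NumPy is available (the function name and the toy dataset below are purely illustrative), Formula 4 can be computed like this:

```python
import numpy as np

def mahalanobis_distance(x, data):
    """Mahalanobis distance of observation x from the distribution of
    `data` (rows = observations, columns = features), per Formula 4."""
    m = data.mean(axis=0)           # vector of column means
    C = np.cov(data, rowvar=False)  # covariance matrix of the features
    diff = x - m
    return float(np.sqrt(diff @ np.linalg.inv(C) @ diff))

# Toy dataset: two uncorrelated features centred at the origin
data = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(mahalanobis_distance(np.array([1.0, 0.0]), data))  # √1.5 ≈ 1.2247
```

Because the toy features are uncorrelated but have variance 2/3 rather than 1, the result (√1.5) differs from the plain Euclidean distance to the mean (1.0) — exactly the rescaling effect described above.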

Conclusion

In most high-dimensional applications the choice of distance metric is not obvious, and there is hardly any literature that guides you towards the correct distance measure. In this paper, the authors presented some ideas about the correct usage of different norms in high-dimensional spaces. They established that norms with lower values of p are more useful for high-dimensional problems than those with higher values of p — meaning the Manhattan distance is preferable to the Euclidean distance as dimensionality grows. I am providing some more links below for more ideas about the correct usage of distance measures.

References


Saikat Bhattacharya

Interested in Natural Language Processing | ML Architecture | Works @Standard Chartered