Role of Distance Metrics in Machine Learning

Writuparna Banerjee
Published in Analytics Vidhya
Jun 12, 2020


Distance metrics play an important role in machine learning. They provide a strong foundation for several machine learning algorithms, such as k-nearest neighbors for supervised learning and k-means clustering for unsupervised learning. Different distance metrics are chosen depending on the type of data, so it is important to know the various distance metrics and the intuitions behind them.

An effective distance metric improves the performance of our machine learning model, whether that’s for classification tasks or clustering.

Let’s say we want to use the k-nearest neighbors algorithm to solve a classification or regression problem. How can we say that two points are similar to one another?

This will happen if their features are similar. When we plot these points, they will be closer to each other in distance.

Hence, we can calculate the distance between points and then define the similarity between them. Now, the question arises: how do we calculate this distance, and what are the different distance metrics in machine learning?

That’s what we will discuss in this article. We will go through 5 types of distance metrics in machine learning.

Types of Distance Metrics in Machine Learning

1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
5. Cosine Distance

1. Euclidean Distance

Euclidean Distance represents the shortest distance between two points.

The Euclidean distance formula can be used to calculate the distance between two data points in a plane or, more generally, in an n-dimensional space.

Euclidean distance is generally used when calculating the distance between two rows of data that have numerical values, such as floating point or integer values.

If columns have values with differing scales, they should be normalized or standardized before calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.
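To illustrate the point about scale, here is a small sketch. The three rows (age in years, income in dollars) are made-up values, not data from this article, and z-score standardization is only one of several possible scaling choices.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical rows: [age in years, income in dollars]
data = np.array([
    [25.0, 50_000.0],
    [45.0, 52_000.0],
    [26.0, 90_000.0],
])

# Raw distance between the first two rows: the income column dominates
raw_distance = distance.euclidean(data[0], data[1])

# Standardize each column to zero mean and unit variance (z-scores)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_distance = distance.euclidean(standardized[0], standardized[1])

print(raw_distance, scaled_distance)
```

Before scaling, the 2,000-dollar income gap accounts for almost all of the distance. After scaling, each column is measured relative to its own spread, and the 20-year age gap becomes the main contributor.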

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.

d(p, q) = sqrt((p1 - q1)² + (p2 - q2)² + … + (pn - qn)²)

Where,

n = number of dimensions

pi, qi = the i-th coordinates of the data points p and q

This calculation is related to the L2 vector norm (discussed later).

Now, let’s stop and look carefully! Does this formula look familiar? Well yes, this formula comes from the “Pythagorean Theorem”.

Let’s write the code for Euclidean Distance in Python. We will first import the SciPy library, which contains ready-made implementations of most of the commonly used distance functions:
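A minimal sketch of that calculation follows. The two 3-dimensional points are illustrative values chosen here, not necessarily the ones used in the original code.

```python
from scipy.spatial import distance

# Two illustrative points in 3-dimensional space (assumed values)
point_1 = (1, 2, 3)
point_2 = (4, 5, 6)

# Square root of the sum of squared coordinate differences
euclidean_distance = distance.euclidean(point_1, point_2)
print('Euclidean Distance between', point_1, 'and', point_2, 'is:', euclidean_distance)
```

Each coordinate differs by 3, so the result is sqrt(3² + 3² + 3²) = sqrt(27), roughly 5.196.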

This is how we can calculate the Euclidean Distance between two points in Python.

2. Manhattan Distance

Manhattan Distance is the sum of absolute differences between points across all the dimensions.

We use Manhattan distance, also known as city block distance or taxicab geometry, when we need to calculate the distance between two data points along a grid-like path, such as a chessboard or city blocks.

The name taxicab refers to the intuition for what the measure calculates: the shortest path that a taxicab would take between city blocks (coordinates on the grid).

Let’s say we want to calculate the distance, d, between two data points A = (x1, y1) and B = (x2, y2).

The distance d is the sum of the absolute differences between their Cartesian coordinates:

d = |x1 - x2| + |y1 - y2|

And the generalized formula for an n-dimensional space is given as:

d(p, q) = |p1 - q1| + |p2 - q2| + … + |pn - qn|

Where,

n = number of dimensions

pi, qi = the i-th coordinates of the data points p and q

The Manhattan distance is related to the L1 vector norm (discussed later).

If you try to visualize the distance calculation, it will look something like below:

In the above picture, imagine each cell to be a building, and the grid lines to be roads. Now if we want to travel from Point P to Point Q marked in the image and follow the sky-blue and navy-blue paths, we see that the path is not straight and there are turns. In this case, we use the Manhattan distance metric to calculate the distance walked. The pink line joining the two points P and Q is the Manhattan distance. Now the distance d will be calculated as shown by the yellow line.

When is Manhattan distance metric preferred in ML?

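The Manhattan Distance is often preferred over the Euclidean distance metric as the dimensionality of the data increases. This is related to something known as the ‘curse of dimensionality’: in high-dimensional spaces, the Euclidean distances between different pairs of points tend to become nearly equal, which makes them less useful for telling near neighbors from far ones, while the Manhattan distance tends to retain more contrast. For further details, please visit this link.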

Now, we will calculate the Manhattan Distance between the two points. SciPy has a function called cityblock that returns the Manhattan Distance between two points.
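As a sketch, using the same kind of illustrative 3-dimensional points (assumed values, not necessarily the original ones):

```python
from scipy.spatial import distance

# Two illustrative points in 3-dimensional space (assumed values)
point_1 = (1, 2, 3)
point_2 = (4, 5, 6)

# Sum of the absolute coordinate differences
manhattan_distance = distance.cityblock(point_1, point_2)
print('Manhattan Distance between', point_1, 'and', point_2, 'is:', manhattan_distance)
```

Here the result is |1 - 4| + |2 - 5| + |3 - 6| = 9.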

3. Minkowski Distance

Minkowski Distance is the generalized form of Euclidean and Manhattan Distance.

It adds a parameter, called the “order” or “p”, that allows different distance measures between two points to be calculated.

The Minkowski distance measure is calculated as follows:

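D(x, y) = (|x1 - y1|^p + |x2 - y2|^p + … + |xn - yn|^p)^(1/p)

Where x and y are the two data points, n is the number of dimensions, and “p” is the order parameter.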

When p is set to 1, the calculation is the same as the Manhattan distance. When p is set to 2, it is the same as the Euclidean distance.

  • p=1: Manhattan distance.
  • p=2: Euclidean distance.

Intermediate values provide a controlled balance between the two measures.

The Minkowski distance is related to the Lp vector norm (discussed later).

It is common to use Minkowski distance when implementing a machine learning algorithm that uses distance measures as it gives control over the type of distance measure used for real-valued vectors via a hyperparameter “p” that can be tuned.

Let’s calculate the Minkowski Distance of the order 3:
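A minimal sketch using SciPy's minkowski function, with illustrative points (assumed values):

```python
from scipy.spatial import distance

# Two illustrative points in 3-dimensional space (assumed values)
point_1 = (1, 2, 3)
point_2 = (4, 5, 6)

# Minkowski distance of order p = 3
minkowski_distance = distance.minkowski(point_1, point_2, p=3)
print('Minkowski Distance of order 3:', minkowski_distance)
```

For these points the value is (3³ + 3³ + 3³)^(1/3) = 81^(1/3), roughly 4.327.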

When the order (p) is 1, the formula gives the Manhattan Distance, and when the order is 2, it gives the Euclidean Distance. Let’s verify that in Python:
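One way to check the order-1 case, sketched with illustrative points (assumed values):

```python
from scipy.spatial import distance

# Two illustrative points in 3-dimensional space (assumed values)
point_1 = (1, 2, 3)
point_2 = (4, 5, 6)

# Minkowski distance of order 1 versus Manhattan (city block) distance
minkowski_order_1 = distance.minkowski(point_1, point_2, p=1)
manhattan_distance = distance.cityblock(point_1, point_2)

print('Minkowski Distance of order 1:', minkowski_order_1)
print('Manhattan Distance:', manhattan_distance)
```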

Here, we can see that when the order is 1, both Minkowski and Manhattan Distance are the same.

Let’s verify the Euclidean Distance as well:
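The same kind of check for order 2, again with illustrative points (assumed values):

```python
from scipy.spatial import distance

# Two illustrative points in 3-dimensional space (assumed values)
point_1 = (1, 2, 3)
point_2 = (4, 5, 6)

# Minkowski distance of order 2 versus Euclidean distance
minkowski_order_2 = distance.minkowski(point_1, point_2, p=2)
euclidean_distance = distance.euclidean(point_1, point_2)

print('Minkowski Distance of order 2:', minkowski_order_2)
print('Euclidean Distance:', euclidean_distance)
```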

When the order is 2, we can see that Minkowski and Euclidean distances are the same.

While reading this article, you must have come across the terms L1 norm and L2 norm. So, let’s discuss them in detail.

Vector Norm

Calculating the size or length of a vector is often required either directly or as part of a broader vector or vector-matrix operation.

The length of a vector is a non-negative number that describes the extent of the vector in space, and is referred to as the vector’s norm or magnitude.

The length is always positive, except for a vector of all zero values, whose length is zero. It is calculated using a distance measure that summarizes the distance of the vector from the origin of the vector space. For example, the origin of a vector space for a vector with 3 elements is (0, 0, 0).

We will take a look at a few common vector norm calculations used in machine learning.

Vector L1 Norm

The length of a vector can be calculated using the L1 norm. The L1 norm, represented as ||v||1, is calculated as the sum of the absolute vector values, where the absolute value of a scalar a1 is written |a1|. In other words, the L1 norm is the Manhattan distance of the vector from the origin of the vector space.

||v||1 = |a1| + |a2| + |a3|

The L1 norm of a vector can be calculated in NumPy using the norm() function with a parameter to specify the norm order, in this case 1.
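For example, with an illustrative 3-element vector (assumed values):

```python
from numpy import array
from numpy.linalg import norm

# Illustrative vector (assumed values)
v = array([1, 2, 3])

# Second argument is the norm order: 1 gives the sum of absolute values
l1 = norm(v, 1)
print('L1 norm:', l1)
```

The result is |1| + |2| + |3| = 6.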

Vector L2 Norm

The L2 norm, represented as ||v||2, is calculated as the square root of the sum of the squared vector values. In other words, the L2 norm is the Euclidean distance of the vector from the origin of the vector space.

||v||2 = sqrt(a1² + a2² + a3²)

The L2 norm of a vector can be calculated in NumPy using the norm() function with default parameters.
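For example, with the same kind of illustrative vector (assumed values):

```python
from numpy import array
from numpy.linalg import norm

# Illustrative vector (assumed values)
v = array([1, 2, 3])

# With no order given, norm() returns the L2 norm of a 1-D vector
l2 = norm(v)
print('L2 norm:', l2)
```

The result is sqrt(1² + 2² + 3²) = sqrt(14), roughly 3.742.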

Vector Lp Norm

The Lp norm, represented as ||v||p, is calculated as the p-th root of the sum of the absolute vector values raised to the power p:

||v||p = (|a1|^p + |a2|^p + |a3|^p)^(1/p)

Clearly, the norm is a calculation of the Minkowski distance from the origin of the vector space.
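This relationship can be checked numerically. The sketch below uses an illustrative vector and order (assumed values) and compares NumPy's norm with SciPy's Minkowski distance measured from the origin.

```python
from numpy import array
from numpy.linalg import norm
from scipy.spatial import distance

# Illustrative vector and order (assumed values)
v = array([1, 2, 3])
p = 3

# Lp norm of v
lp_norm = norm(v, p)

# Minkowski distance of the same order between v and the origin
origin = array([0, 0, 0])
minkowski_from_origin = distance.minkowski(v, origin, p=p)

print('L3 norm:', lp_norm)
print('Minkowski distance from origin:', minkowski_from_origin)
```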

So far, we have covered the distance metrics that are used when we are dealing with continuous or numerical variables. But what if we have categorical variables? How can we decide the similarity between categorical variables? This is where we can make use of another distance metric called Hamming Distance.

4. Hamming Distance

Hamming Distance measures the similarity between two strings of the same length. The Hamming Distance between two strings of the same length is the number of positions at which the corresponding characters are different.

Let’s understand the concept using an example. Let’s say we have two strings:

“euclidean” and “manhattan”

Since the length of these strings is equal, we can calculate the Hamming Distance. We will go character by character and match the strings. The first characters of the two strings (e and m respectively) are different. Similarly, the second characters (u and a) are different, and so on.

Look carefully: seven characters are different, whereas two characters (the last two) are the same.

Hence, the Hamming Distance here will be 7. Note that the larger the Hamming Distance between two strings, the more dissimilar those strings are (and vice versa).

Let’s see how we can compute the Hamming Distance of two strings in Python.
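A minimal sketch with SciPy follows. One detail to be aware of: SciPy's hamming function returns the proportion of positions that differ, so the sketch multiplies by the string length to recover the count.

```python
from scipy.spatial import distance

string_1 = 'euclidean'
string_2 = 'manhattan'

# SciPy returns the fraction of mismatched positions (here 7 out of 9)
mismatch_fraction = distance.hamming(list(string_1), list(string_2))

# Scale by the length to get the number of differing characters
hamming_distance = round(mismatch_fraction * len(string_1))
print('Hamming Distance between', string_1, 'and', string_2, 'is:', hamming_distance)
```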

As we saw in the example above, the Hamming Distance between “euclidean” and “manhattan” is 7. We also saw that Hamming Distance only works when we have strings of the same length.

Let’s see what happens when we have strings of different lengths:
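A sketch of that experiment, using two strings of unequal length chosen here for illustration (assumed values). The call is wrapped in try/except so that the script keeps running and the message can be inspected.

```python
from scipy.spatial import distance

# Illustrative strings of different lengths (assumed values)
string_1 = 'data'
string_2 = 'science'

error_message = None
try:
    distance.hamming(list(string_1), list(string_2))
except ValueError as error:
    # SciPy rejects inputs whose lengths do not match
    error_message = str(error)

print('Error:', error_message)
```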

This throws an error saying that the lengths of the arrays must be the same. Hence, Hamming distance only works when we have strings or arrays of the same length.

These are some of the similarity measures or distance metrics that are generally used in Machine Learning.

5. Cosine Distance & Cosine Similarity

Cosine similarity is a metric used to measure how similar documents are, irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.

The cosine similarity formula can be derived from the equation of the dot product of two vectors A and B:

A · B = ||A|| ||B|| cos θ

cos θ = (A · B) / (||A|| ||B||)

So, cosine similarity is given by cos θ, and cosine distance is 1 - cos θ. Example:

θ = 90°

In the above image, there are two data points shown in blue. The angle between these points is 90 degrees, and cos 90° = 0. Therefore, the two points shown are not similar, and their cosine distance is 1 - cos 90° = 1.

θ = 0°

Now if the angle between the two points is 0 degrees in the above figure, then the cosine similarity is cos 0° = 1 and the cosine distance is 1 - cos 0° = 0. We can then interpret the two points as 100% similar to each other.

θ = 60°

In the above figure, imagine the value of θ to be 60 degrees. Then, by the cosine similarity formula, cos 60° = 0.5 and the cosine distance is 1 - 0.5 = 0.5. Therefore, the points are 50% similar to each other.
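The three cases can be reproduced numerically. The sketch below uses SciPy's cosine function, which returns the cosine distance (1 minus the cosine similarity). The 2-D vectors are illustrative choices that form angles of 90, 0, and 60 degrees with the reference vector.

```python
import math
from scipy.spatial import distance

# Reference vector along the x-axis (assumed values)
a = (1.0, 0.0)

# Vectors at 90, 0, and 60 degrees to a (assumed values)
b_90 = (0.0, 1.0)
b_0 = (2.0, 0.0)            # same direction, different length
b_60 = (1.0, math.sqrt(3))  # cos 60 degrees = 1 / 2

distance_90 = distance.cosine(a, b_90)
distance_0 = distance.cosine(a, b_0)
distance_60 = distance.cosine(a, b_60)

print('Cosine distances:', distance_90, distance_0, distance_60)
print('Cosine similarities:', 1 - distance_90, 1 - distance_0, 1 - distance_60)
```

The second case shows that only direction matters: doubling the length of a vector leaves its cosine distance unchanged.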

Let’s code for a better understanding. We use the cosine_similarity function, available in scikit-learn, for this purpose.
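The original code is not reproduced here, so the following is a sketch under assumptions: only the third document is quoted in this article, the other three documents are illustrative, and TF-IDF vectorization with scikit-learn's default settings is an assumed way of turning the text into vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Only the third document is quoted in the article; the rest are assumed
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)

# Convert each document into a TF-IDF weighted term vector
tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Similarity of the first document to every document, itself included
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(similarities)
```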

Note that the first value of the array is 1.0 because it is the Cosine Similarity of the first document with itself. Also note that, because the third document (“The sun in the sky is bright”) shares more words with the first document, it achieves a better score than the other documents.

Conclusion

In this article, we got to know a few popular distance/similarity metrics and how they can be used to solve complicated machine learning problems. We studied the Minkowski, Euclidean, Manhattan, Hamming, and Cosine distance metrics and their uses.

Manhattan distance is often preferred over the more common Euclidean distance when the data is high-dimensional. Hamming distance is used to measure the distance between categorical variables, and the Cosine distance metric is mainly used to find the amount of similarity between two data points.

Thanks for reading! 😊

Writuparna Banerjee

Data Science and Machine Learning enthusiast | Front-end Web Developer | Technical blogger