Types of Distances in Machine Learning

Ujjainee De
Published in Analytics Vidhya
Nov 10, 2019

Ever wondered how Machine Learning algorithms calculate distance? Or did it ever occur to you that there can be more than one way to calculate it?

When you read the word “distance”, the basic definition comes to mind: a numerical measurement of how far apart objects or points are. But when you hear the word alongside Machine Learning, you may suspect it means something more, and guess what? It does!

When we say distance here, we mean a distance metric. In basic mathematical terms, a distance metric is a distance function that quantifies the relationship between any two elements in a dataset.

Several Machine Learning algorithms, supervised or unsupervised, use distance metrics to learn the structure of the input data and make data-based decisions. A good distance metric can significantly improve the performance of classification, clustering, and information retrieval.

Distance Metrics

There are many distance metrics, but in this article we will only be discussing a few widely used ones. We will first try to understand the mathematics behind these metrics and then identify the machine learning algorithms that use them.

Below are the commonly used distance metrics -

Minkowski Distance:

When we think about distance, we usually imagine distances between cities. That is the most intuitive understanding of the distance concept. Fortunately, this example is perfect for explaining the constraints of Minkowski distances. We can calculate Minkowski distance only in a normed vector space, which is a fancy way of saying: “in a space where distances can be represented as a vector that has a length.”

Let’s start by checking that a map behaves like a normed vector space. On a map, we can draw a vector that connects two cities, and we can combine multiple vectors to create a route that connects more than two cities. Now for the adjective “normed”: it means that every vector has a length, and no vector has a negative length. That constraint is met too, because if we draw a line between cities on the map, we can measure its length.

Again, a normed vector space is a vector space on which a norm is defined. Suppose X is a vector space; then a norm on X is a real-valued function ||x|| that satisfies the conditions below -

  1. Zero Vector- The zero vector has zero length, and every other vector has a positive length. If we look at a map, it is obvious: the distance from a city to the same city is zero because we don’t need to travel at all, and the distance from a city to any other city is positive because we can’t travel -20 km.
  2. Scalar Factor- The direction of a vector doesn’t change when you multiply it by a positive number, though its length changes. Example: we traveled 50 km North. If we travel 50 km more in the same direction, we end up 100 km North. The direction does not change.
  3. Triangle Inequality- The direct path between two points is never longer than a detour through a third point; in other words, the norm of a sum of two vectors is at most the sum of their norms.

You might be wondering why we need a normed vector space; can we not just use a simple metric? A norm has the above properties, which make the norm-induced metric homogeneous and translation invariant.

The distance can be calculated using the formula below:

d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p)^(1/p)

Minkowski distance is the generalized distance metric. Here generalized means that we can manipulate the above formula to calculate the distance between two data points in different ways.

As mentioned above, we can manipulate the value of p and calculate the distance in three different ways-

p = 1, Manhattan Distance

p = 2, Euclidean Distance

p = ∞, Chebyshev Distance (won’t be discussed in this article)
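To make this concrete, here is a minimal sketch of the Minkowski distance in plain Python; minkowski_distance is a hypothetical helper written for illustration, and setting p to 1 or 2 reproduces the Manhattan and Euclidean distances discussed below.

# A minimal sketch: Minkowski distance between two equal-length vectors
def minkowski_distance(x, y, p):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

x, y = [0, 0], [3, 4]
print(minkowski_distance(x, y, 1))  # 7.0 -> Manhattan distance
print(minkowski_distance(x, y, 2))  # 5.0 -> Euclidean distance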

Manhattan Distance:

We use Manhattan Distance if we need to calculate the distance between two data points in a grid-like path. As mentioned above, we use the Minkowski distance formula to find Manhattan distance by setting p’s value as 1.

Let’s say we want to calculate the distance, d, between two data points, x and y.

Distance d is calculated as the sum of the absolute differences between their Cartesian coordinates, as below:

d = |x1 − y1| + |x2 − y2| + |x3 − y3| + … + |xn − yn|

where n is the number of variables, and xi and yi are the i-th components of the vectors x and y respectively in the n-dimensional vector space, i.e. x = (x1, x2, x3, …) and y = (y1, y2, y3, …).

If you try to visualize the distance calculation, it traces a stair-step path along the grid rather than a straight line.

Manhattan distance is also known as Taxicab Geometry, City Block Distance, etc.

When we use a map of a city, we can give directions by telling people to walk/drive two city blocks North, then turn left and travel another three city blocks. In total, they travel five city blocks, which is the Manhattan distance between the starting point and the destination.

L1 Norm:

Also known as the Manhattan Distance or Taxicab norm, the L1 norm of a vector is the sum of the magnitudes (absolute values) of its components. It is the most natural way of measuring the distance between vectors: the sum of the absolute differences of their components. In this norm, all the components of the vector are weighted equally.

Having, for example, the vector X = [3,4]:

The L1 norm is calculated by

||X||1 = |3| + |4| = 7

The L1 norm is thus the distance you have to travel from the origin (0,0) to the destination (3,4), in a way that resembles how a taxicab drives between city blocks to arrive at its destination.
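In NumPy, this is a one-liner; a minimal sketch, assuming the numpy package:

import numpy as np

X = np.array([3, 4])
print(np.linalg.norm(X, ord=1))  # 7.0 -> |3| + |4|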

Euclidean Distance:

Euclidean distance is one of the most used distance metrics. It is calculated using the Minkowski distance formula by setting p’s value to 2. This updates the distance formula d as below:

d = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)

Euclidean distance formula can be used to calculate the distance between two data points in a plane.

If we look again at the city block example used to explain the Manhattan distance, we see that the traveled path consists of two straight lines. When we draw another straight line that connects the starting point and the destination, we end up with a triangle. In this case, the distance between the points can be calculated using the Pythagorean theorem.

L2 norm:

It is the most popular norm, also known as the Euclidean norm. It is the shortest distance to go from one point to another.

Using the same example, the L2 norm is calculated by

||X||2 = √(3² + 4²) = √25 = 5

Unlike the stair-step route of the L1 norm, the L2 norm is the most direct route.

There is one consideration to keep in mind with the L2 norm: each component of the vector is squared, which means outliers carry more weight and can skew results.
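The same NumPy call with ord=2 (the default for vectors) returns the L2 norm; a minimal sketch:

import numpy as np

X = np.array([3, 4])
print(np.linalg.norm(X))  # 5.0 -> sqrt(3**2 + 4**2)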

Hamming Distance

A Hamming distance, in information technology, represents the number of positions at which two corresponding pieces of data differ. It is often used in various kinds of error correction and in the comparison of contrasting strings or pieces of data.

While it may seem complicated and obscure at first glance, the Hamming distance is a very practical metric for measuring data strings. It involves counting how many corresponding digits or places are different, and how many are the same. For example, take the text string “hello world” and contrast it with another text string, “herra poald.” There are five places along the corresponding strings where the letters are different.

Why is this important? One fundamental application of Hamming distance is to correct binary code toward one result or another. Professionals talk about one-bit errors or two-bit errors, the idea that corrupted data can be transformed back into the correct original result. The problem is, if there are two reference strings and one corrupted piece of data, one must ascertain which final result the corrupted data is closest to. That is where the Hamming distance comes in: for example, if the Hamming distance between the two references is four, and the corrupted string is a one-bit error away from one of them, that one is most likely the correct result. This is just one of the applications the Hamming distance has in code and data string evaluation.

Example:

Suppose there are two strings 1101 1001 and 1001 1101.

11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance, d(11011001, 10011101) = 2.
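This XOR-and-count idea translates directly into a few lines of Python; a minimal sketch, where hamming_distance is a hypothetical helper rather than a library call:

# Count the positions at which two equal-length strings differ
def hamming_distance(a, b):
    return sum(ca != cb for ca, cb in zip(a, b))

print(hamming_distance("11011001", "10011101"))  # 2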

Minimum Hamming Distance

In a set of strings of equal length, the minimum Hamming distance is the smallest Hamming distance over all possible pairs of distinct strings in that set.

Cosine Similarity and Cosine Distance:

There are two related terms: similarity and distance. They are inversely related, i.e. if one increases, the other decreases, and vice versa. The formula connecting them is:

cosine distance = 1 − cosine similarity

The cosine similarity formula can be derived from the equation for the dot product:

cos(θ) = (x · y) / (||x|| ||y||)

Now, you must be wondering which values of the cosine indicate similarity. The cosine ranges between −1 and 1, so we need to know what 1, 0 and −1 signify.

A cosine value of 1 corresponds to vectors pointing in the same direction, i.e. the documents/data points are similar. A value of 0 corresponds to orthogonal vectors, i.e. they are unrelated. A value of −1 corresponds to vectors pointing in opposite directions, i.e. they are dissimilar.

To find the cosine distance, we simply have to put the values in the formula and compute.
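As a minimal sketch with NumPy (this cosine_similarity helper is written for illustration, not the scikit-learn function of the same name):

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
sim = cosine_similarity(a, b)
print(sim)      # ~1.0 -> the vectors point in the same direction
print(1 - sim)  # ~0.0 -> cosine distance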

Machine Learning Modelling and distance metrics

This section illustrates the use of distance metrics in machine learning modeling, using examples.

1. Classification

K-Nearest Neighbors (KNN):

KNN is a non-probabilistic supervised learning algorithm, i.e. it doesn’t produce probabilities of class membership; instead, KNN classifies data by hard assignment, e.g. a data point will belong to either class 0 or class 1. KNN uses distance metrics to find similarities or dissimilarities.

Taking the example of the iris dataset, which has three classes, we will see how KNN identifies the classes for test data.

In image #2 above, the black square is a test data point. Now we need to find which class this test data point belongs to, with the help of the KNN algorithm.

To find the nearest neighbors, we use a distance metric. First, we calculate the distance between the test data point and each training data point, and then select the top k nearest ones (k is the number of nearest neighbors of a test data point; these k data points are then used to decide the class of the test data point).

Considering this code,

from sklearn.neighbors import KNeighborsClassifier

# Create a model
KNN_Classifier = KNeighborsClassifier(n_neighbors=6, p=2, metric='minkowski')

We are using the Minkowski distance metric with p set to 2, i.e. the KNN classifier is going to use the Euclidean distance formula.

As we move forward with machine learning modeling we can now train our model and start predicting the class for test data.

# Train the model
KNN_Classifier.fit(x_train, y_train)

# Let's predict the classes for test data
pred_test = KNN_Classifier.predict(x_test)

Once the top nearest neighbors are selected, we check which class received the most votes among them -

From the above image, it’s class 1, as it is the most voted class.

Through this small example, we saw how the distance metric was important for the KNN classifier. It helped us get the closest training data points, for which the classes were known.

2. Clustering

K-means:

In classification algorithms, probabilistic or non-probabilistic, we are provided with labeled data, so it is easier to predict the classes. In clustering algorithms, however, we have no information about which data point belongs to which class. Distance metrics are an important part of this kind of algorithm.

In K-means, we select several centroids that define the number of clusters. Each data point will then be assigned to its nearest centroid using a distance metric (Euclidean). We will be using iris data to understand the underlying process of K-means.

In image #1 above, as you can see, we randomly placed the centroids, and in image #2, we used the distance metric to find the closest cluster class for each point.

As we saw in the above example, without any knowledge of the labels, with the help of a distance metric, K-means clustered the data into 3 classes. Thus, again, the distance metric proved important.
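A minimal sketch of this process, assuming scikit-learn’s KMeans and its built-in iris dataset:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# Fit K-means with 3 clusters; each point is assigned to its
# nearest centroid using Euclidean distance
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(iris.data)

print(labels[:10])              # cluster index of the first ten samples
print(kmeans.cluster_centers_)  # final centroid positions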

3. Natural Language Processing

Information Retrieval

In information retrieval, we work with unstructured data. With the help of techniques used in NLP, we can create vector data in a manner that can be used to retrieve information when queried. Once the unstructured data is transformed into vector form, we can use the cosine similarity metric to filter out the irrelevant documents from the corpus.
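A minimal sketch of that pipeline, assuming scikit-learn’s TfidfVectorizer and cosine_similarity, with a toy corpus and query invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "dogs are great pets",
    "cats and dogs can be friends",
]
query = ["cat on a mat"]

# Turn the unstructured text into TF-IDF vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform(query)

# Rank documents by cosine similarity to the query; the highest
# score marks the most relevant document
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(scores)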

Conclusion

Once a person gets into the depths of Machine Learning, Data Analysis, or any other Data Science field, he/she will need an in-depth knowledge of the available distance metrics to use the correct metric in the correct place and achieve the best results. Hence, this article aims to provide knowledge about some popular distance/similarity metrics and how and where they can be used to solve complicated machine learning problems.
