Different types of Distances used in Machine Learning

Shubhobrata Das
11 min read · Jun 17, 2020


Role of Distance Measures

Distance measures play an important role in machine learning.

A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain.

Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).

Perhaps the most likely way you will encounter distance measures is when you are using a specific machine learning algorithm that uses distance measures at its core. The most famous algorithm of this type is the k-nearest neighbors algorithm, or KNN for short.

In the KNN algorithm, a classification or regression prediction is made for a new example by calculating the distance between the new example (row) and all examples (rows) in the training dataset. The k examples in the training dataset with the smallest distances are then selected, and a prediction is made by summarizing their outcomes (the mode of the class labels for classification or the mean of the real values for regression).

KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of which use distance measures in a similar manner. Another popular instance-based algorithm that uses distance measures is the learning vector quantization, or LVQ, algorithm that may also be considered a type of neural network.

Related is the self-organizing map algorithm, or SOM, that also uses distance measures and can be used for supervised or unsupervised learning. Another unsupervised learning algorithm that uses distance measures at its core is the K-means clustering algorithm.

In instance-based learning the training examples are stored verbatim, and a distance function is used to determine which member of the training set is closest to an unknown test instance. Once the nearest training instance has been located, its class is predicted for the test instance.
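
As a rough illustration of this idea, here is a minimal 1-nearest-neighbor sketch that predicts the label of the closest training row by Euclidean distance; the helper names and the toy data are made up for the example.

# minimal 1-nearest-neighbor sketch: predict the label of the closest training row
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train_rows, train_labels, test_row):
    # find the index of the training row with the smallest distance to the test row
    distances = [euclidean(row, test_row) for row in train_rows]
    nearest = distances.index(min(distances))
    return train_labels[nearest]

# toy data, invented for illustration
train_rows = [[1.0, 2.0], [8.0, 9.0], [1.5, 1.8]]
train_labels = ['red', 'blue', 'red']
print(predict_1nn(train_rows, train_labels, [1.2, 2.1]))  # prints 'red'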

A short list of some of the more popular machine learning algorithms that use distance measures at their core is as follows:

  • K-Nearest Neighbors
  • Learning Vector Quantization (LVQ)
  • Self-Organizing Map (SOM)
  • K-Means Clustering

Many kernel-based methods may also be considered distance-based algorithms. Perhaps the most widely known kernel method is the support vector machine algorithm, or SVM for short.

Euclidean Distance:

You are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating point or integer values.

If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.
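
As a quick sketch of how that preparation might look, the snippet below min-max scales each column to the range [0, 1] before computing the distance; the two-column data and the column ranges are invented for illustration.

# sketch: min-max scale columns to [0, 1] before computing Euclidean distance
import numpy as np

# the second column has a much larger scale than the first
data = np.array([[10.0, 100000.0],
                 [12.0, 250000.0],
                 [11.0, 120000.0]])

col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)  # each column now lies in [0, 1]

# distance between the first two rows, before and after scaling
print(np.linalg.norm(data[0] - data[1]))    # dominated by the large-valued column
print(np.linalg.norm(scaled[0] - scaled[1]))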

Although there are other possible choices, most instance-based learners use Euclidean distance.

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.

  • Euclidean_Distance = sqrt(sum for i to N (v1[i] - v2[i])²)

If the distance calculation is to be performed thousands or millions of times, it is common to remove the square root operation in an effort to speed up the calculation. The resulting scores will have the same relative proportions after this modification and can still be used effectively within a machine learning algorithm for finding the most similar examples.

  • Euclidean_Distance = sum for i to N (v1[i] - v2[i])²

This calculation is related to the L2 vector norm and is equivalent to the sum squared error (and to the root sum squared error if the square root is added back).
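
A minimal sketch of this shortcut, under the assumption that we only care about the ranking of neighbors and not the absolute scores (the helper name squared_euclidean is made up):

# sketch: squared Euclidean distance (square root removed) preserves the ranking of neighbors
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = [10, 20, 15]
candidates = [[12, 24, 18], [10, 21, 15], [30, 5, 2]]

# sorting by squared distance gives the same order as sorting by true Euclidean distance
ranked = sorted(candidates, key=lambda c: squared_euclidean(query, c))
print(ranked[0])  # the closest candidate: [10, 21, 15]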

We can demonstrate this with an example of calculating the Euclidean distance between two real-valued vectors, listed below.

# calculating euclidean distance between vectors
from math import sqrt

# calculate euclidean distance
def euclidean_distance(a, b):
    return sqrt(sum((e1 - e2)**2 for e1, e2 in zip(a, b)))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = euclidean_distance(row1, row2)
print(dist)

Running the example reports the Euclidean distance between the two vectors.

6.082762530298219
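
We can also perform the same calculation using the euclidean() function from SciPy; a sketch of the equivalent example is listed below.

# calculating euclidean distance between vectors with SciPy
from scipy.spatial.distance import euclidean

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = euclidean(row1, row2)
print(dist)

Running it should report the same distance of about 6.0828, confirming the manual implementation.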

Manhattan Distance (Taxicab or City Block Distance):

The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors.

It is perhaps more useful for vectors that describe objects on a uniform grid, like a chessboard or city blocks. The taxicab name for the measure refers to the intuition for what the measure calculates: the shortest path that a taxicab would take between city blocks (coordinates on the grid).

It might make sense to calculate Manhattan distance instead of Euclidean distance for two vectors in an integer feature space.

Manhattan distance is calculated as the sum of the absolute differences between the two vectors.

  • Manhattan_Distance = sum for i to N |v1[i] - v2[i]|

The Manhattan distance is related to the L1 vector norm and to the sum absolute error and mean absolute error metrics.

We can demonstrate this with an example of calculating the Manhattan distance between two integer vectors, listed below.

# calculating manhattan distance between vectors

# calculate manhattan distance
def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = manhattan_distance(row1, row2)
print(dist)

Running the example reports the Manhattan distance between the two vectors.

13

We can also perform the same calculation using the cityblock() function from SciPy. The complete example is listed below.

# calculating manhattan distance between vectors
from scipy.spatial.distance import cityblock

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = cityblock(row1, row2)
print(dist)

Minkowski Distance:

Minkowski distance calculates the distance between two real-valued vectors.

It is a generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the “order” or “p“, that allows different distance measures to be calculated.

The Minkowski distance measure is calculated as follows:

  • Minkowski_Distance = (sum for i to N (abs(v1[i] - v2[i]))^p)^(1/p)

Where “p” is the order parameter.

When p is set to 1, the calculation is the same as the Manhattan distance. When p is set to 2, it is the same as the Euclidean distance.

  • p=1: Manhattan distance.
  • p=2: Euclidean distance.

Intermediate values provide a controlled balance between the two measures.

It is common to use Minkowski distance when implementing a machine learning algorithm that uses distance measures as it gives control over the type of distance measure used for real-valued vectors via a hyperparameter “p” that can be tuned.

We can demonstrate this calculation with an example of calculating the Minkowski distance between two real vectors, listed below.

# calculating minkowski distance between vectors

# calculate minkowski distance
def minkowski_distance(a, b, p):
    return sum(abs(e1 - e2)**p for e1, e2 in zip(a, b))**(1/p)

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example first calculates and prints the Minkowski distance with p set to 1 to give the Manhattan distance, then with p set to 2 to give the Euclidean distance, matching the values calculated on the same data from the previous sections.

13.0

6.082762530298219

We can also perform the same calculation using the minkowski_distance() function from SciPy. The complete example is listed below.

# calculating minkowski distance between vectors
from scipy.spatial import minkowski_distance

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example, we can see we get the same results, confirming our manual implementation.

13.0

6.082762530298219

Minkowski Distance is the generalized form of Euclidean and Manhattan Distance.

Hamming Distance

Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings for short.

You are most likely going to encounter bitstrings when you one-hot encode categorical columns of data.

For example, if a column had the categories ‘red,’ ‘green,’ and ‘blue,’ you might one-hot encode each example as a bitstring with one bit for each category.

  • red = [1, 0, 0]
  • green = [0, 1, 0]
  • blue = [0, 0, 1]

The distance between red and green could be calculated as the sum or the average number of bit differences between the two bitstrings. This is the Hamming distance.

For a one-hot encoded string, it might make more sense to summarize the distance as the sum of the bit differences between the strings, where each bit difference is either 0 or 1.

  • Hamming_Distance = sum for i to N abs(v1[i] - v2[i])

For bitstrings that may have many 1 bits, it is more common to calculate the average number of bit differences to give a Hamming distance score between 0 (identical) and 1 (all different).

  • Hamming_Distance = (sum for i to N abs(v1[i] - v2[i])) / N

We can demonstrate this with an example of calculating the Hamming distance between two bitstrings, listed below.

# calculating hamming distance between bit strings

# calculate hamming distance
def hamming_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming_distance(row1, row2)
print(dist)

Running the example reports the Hamming distance between the two bitstrings.

We can see that there are two differences between the strings, or 2 out of 6 bit positions different, which averaged (2/6) is about 1/3 or 0.333.

0.3333333333333333

We can also perform the same calculation using the hamming() function from SciPy. The complete example is listed below.

# calculating hamming distance between bit strings
from scipy.spatial.distance import hamming

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

0.3333333333333333

L1 Norm:

Also known as the Manhattan distance or Taxicab norm, the L1 norm is the sum of the magnitudes (absolute values) of a vector's components. It is a very natural way of measuring the distance between vectors: the sum of the absolute differences of their components. In this norm, all the components of the vector are weighted equally.

Take, for example, the vector X = [3, 4].

The L1 norm is calculated by:

  • ||X||_1 = |3| + |4| = 7

Geometrically, the L1 norm is the distance you have to travel from the origin (0, 0) to the destination (3, 4), in a way that resembles how a taxicab drives between city blocks to arrive at its destination.
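
As a quick hedged sketch, the same L1 norm can be computed with NumPy's norm function using ord=1 (assuming NumPy is available):

# sketch: L1 norm of a vector with NumPy
import numpy as np

X = np.array([3, 4])
print(np.linalg.norm(X, ord=1))  # |3| + |4| = 7.0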

L2 Norm:

It is the most popular norm, also known as the Euclidean norm. It is the shortest distance to go from one point to another.

Using the same example, the L2 norm is calculated by:

  • ||X||_2 = sqrt(3² + 4²) = 5

Geometrically, the L2 norm is the most direct route, the straight line from the origin to the destination.

One consideration with the L2 norm is that each component of the vector is squared, which means that outliers carry more weight and can skew results.
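
For comparison, a minimal NumPy sketch of the L2 norm of the same vector (ord=2 is the default):

# sketch: L2 norm of a vector with NumPy (the default norm)
import numpy as np

X = np.array([3, 4])
print(np.linalg.norm(X))  # sqrt(3**2 + 4**2) = 5.0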

Cosine Similarity and Cosine Distance:

There are two related terms: cosine similarity and cosine distance. They are complementary: as one increases the other decreases, and they are connected by the formula:

  • Cosine_Distance = 1 - Cosine_Similarity

The cosine similarity formula can be derived from the equation for the dot product:

  • Cosine_Similarity = cos(θ) = (A · B) / (||A|| * ||B||)

Now, you might wonder which values of the cosine indicate similarity.

The cosine ranges from -1 to 1, so we need to know what 1, 0 and -1 signify.

A cosine value of 1 means the vectors point in the same direction, i.e. the documents/data points are similar. A value of 0 means the vectors are orthogonal, i.e. unrelated (no similarity). A value of -1 means the vectors point in opposite directions (completely dissimilar).

To find the cosine distance, we simply have to put the values in the formula and compute.
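
A minimal sketch of that computation is below, using SciPy's cosine() function, which returns the cosine distance (1 minus the cosine similarity); the example reuses the vectors from the earlier sections.

# sketch: cosine similarity and cosine distance between two vectors
from scipy.spatial.distance import cosine

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

cos_distance = cosine(row1, row2)   # SciPy returns 1 - cosine similarity
cos_similarity = 1 - cos_distance
print(cos_similarity)
print(cos_distance)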

Lp Norm:

In mathematics, the Lp spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces. They are sometimes called Lebesgue spaces, named after Henri Lebesgue (Dunford & Schwartz 1958, III.3), although according to the Bourbaki group (Bourbaki 1987) they were first introduced by Frigyes Riesz (Riesz 1910). Lp spaces form an important class of Banach spaces in functional analysis, and of topological vector spaces. Because of their key role in the mathematical analysis of measure and probability spaces, Lebesgue spaces are used also in the theoretical discussion of problems in physics, statistics, finance, engineering, and other disciplines.

The set of L^p-functions (where p >= 1) generalizes L2-space. Instead of being square integrable, the measurable function f must be p-integrable for f to be in L^p.

On a measure space X, the L^p norm of a function f is:

  • ||f||_p = (∫_X |f|^p dμ)^(1/p)

The L^p-functions are the functions for which this integral converges. For p!=2, the space of L^p-functions is a Banach space which is not a Hilbert space.

The L^p-space on R^n, and in most other cases, is the completion of the continuous functions with compact support using the L^p norm. As in the case of an L2-space, an L^p-function is really an equivalence class of functions which agree almost everywhere. It is possible for a sequence of functions f_n to converge in L^p but not in L^(p') for some other p', e.g., f_n = (1+x²)^(-1/2-1/n) converges in L²(R) but not in L¹(R). However, if a sequence converges in L^p and in L^(p'), then its limit must be the same in both spaces.

For p>1, the dual vector space to L^p is given by integrating against functions in L^q, where 1/p+1/q=1. This makes sense because of Hölder’s inequality for integrals. In particular, the only L^p-space which is self-dual is L².

While the use of L^p functions is not as common as L², they are very important in analysis and partial differential equations. For instance, some operators are only bounded in L^p for some p>2.
