Different Types of Distances Used in Machine Learning
What Are Distance Metrics?
A metric or distance function is a function d(x, y) that defines the distance between elements of a set as a non-negative real number. If the distance is zero, both elements are equivalent under that specific metric. Distance functions thus provide a way to measure how close two elements are, where the elements do not have to be numbers but can also be vectors, matrices or arbitrary objects. Distance functions are often used as error or cost functions to be minimized in an optimization problem. We often encounter distance metrics in supervised ML algorithms like K Nearest Neighbors and unsupervised ML algorithms like clustering.
The main idea behind distance metrics is that if two points p1 and p2 in n-dimensional space lie near each other according to the distance metric used, then the two points may be similar.
Classification of metrics:
1 Euclidean distance: When we talk about distance, we mostly think of distance in a more or less straight line.
If we think of the distance between two cities, we think about how many kilometres we have to drive on a highway.
These are examples of Euclidean distance. Essentially, it measures the length of the segment that connects two points. Let's see this in a graph:
For n-dimensional points, the general formula is as follows:
d(x, y) = √( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )
where x and y are two vectors.
Euclidean distance is the most commonly used distance in machine learning algorithms. It is very useful when our data is continuous. It is also called the L2 norm.
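As a quick illustration, here is a minimal sketch of Euclidean distance in Python with NumPy (the function name and sample points are illustrative, not from any particular library):

```python
import numpy as np

def euclidean_distance(x, y):
    """L2 distance: square root of the sum of squared component differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Example: two points in 3-dimensional space
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0, since sqrt(9 + 16 + 0) = 5
```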
2 Manhattan Distance: The Manhattan distance function calculates the distance travelled if a grid-like path is taken to get from one data point to another. The distance between two objects in the Manhattan formulation is the sum of the absolute differences between their respective components.
Below is the formula for Manhattan distance:
d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
where n is the number of variables, and xi and yi are the components of vectors x and y respectively in the n-dimensional vector space, i.e. x = (x1, x2, x3, …) and y = (y1, y2, y3, …).
If you try to visualize the distance calculation, it will look something like the figure below:
Manhattan distance is also known as taxicab geometry, city block distance, etc.
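Here is a minimal sketch of the same idea, assuming NumPy arrays as inputs (the function name is just for illustration):

```python
import numpy as np

def manhattan_distance(x, y):
    """L1 distance: sum of absolute component differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

# Example: a grid-like path of 3 blocks east and 4 blocks north
print(manhattan_distance([0, 0], [3, 4]))  # 7.0
```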
Now the question arises: why would we use Manhattan distance over Euclidean distance? The answer is simple. The choice of Manhattan distance depends a lot on the kind of coordinate system that your dataset uses. While Euclidean distance gives the shortest or minimum distance between two points, Manhattan distance has specific applications.
For example, if we were to use a chess dataset, Manhattan distance is more appropriate than Euclidean distance. Another use is when we are interested in the distance between houses that are a few blocks apart.
Also, you might want to consider Manhattan distance if the input variables are not similar in type (such as age, gender, height, etc.). Due to the curse of dimensionality, we know that Euclidean distance becomes a poorer choice as the number of dimensions increases. If you want to place less emphasis on outliers, Manhattan distance will try to reduce all errors equally, since its gradient has constant magnitude.
But one would still have the outlier problem with Manhattan distance, only slightly mitigated, because we do not square the differences as we do with Euclidean distance.
Typically, use the Euclidean metric; Manhattan may be appropriate if the different dimensions are not comparable.
3 Cosine Similarity and Cosine Distance: This metric is widely used in recommendation systems. With the cosine metric we measure the angle between two documents/vectors (e.g., the term frequencies of different documents collected as vectors). This particular metric is used when the magnitude of the vectors does not matter, only their orientation.
The cosine similarity formula can be derived from the dot product equation:
cosine_similarity(x, y) = cos(θ) = (x · y) / (‖x‖ ‖y‖)
The basic idea behind cosine similarity and cosine distance is that as cosine distance increases, cosine similarity decreases, and vice versa:
Cosine_distance = 1 - cosine_similarity
Let us understand the working principle of a recommendation system based on cosine similarity. Suppose there are two features along the x and y axes, and we need to provide a recommendation for the point v(d3). The recommendation system will calculate the cosine similarity with all the other points present in the plot, and the point with the maximum cosine similarity will be recommended (in this case v(d2)).
It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.
Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors.
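A minimal sketch of both quantities in NumPy might look like this (the function names are illustrative, not from any particular library):

```python
import numpy as np

def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (||x|| * ||y||)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

# Same orientation -> similarity 1, regardless of magnitude
print(cosine_similarity([1, 1], [3, 3]))  # ~1.0
# Orthogonal vectors -> similarity 0
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```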
4 Minkowski distance: First of all, we will define some mathematical terms in order to define Minkowski distance afterward.
- A vector space is a collection of objects called vectors that can be added together and multiplied by numbers (also called scalars).
- A norm is a function that assigns a strictly positive length to each vector in a vector space (the only exception is the zero vector, whose length is zero). It is usually represented as ∥x∥.
- A normed vector space is a vector space over the real or complex numbers on which a norm is defined.
What does this have to do with Minkowski distance?
Minkowski distance is defined as a distance metric between two points in a normed vector space (an N-dimensional real space).
It is also a generalized metric that includes Euclidean and Manhattan distance as special cases.
What does the formula look like?
d(x, y) = ( |x1 − y1|^λ + |x2 − y2|^λ + … + |xn − yn|^λ )^(1/λ)
Notice that when λ = 1, we have the Manhattan distance, and when λ = 2 we have the Euclidean distance. There is another distance, called Chebyshev distance, that arises when λ = ∞.
Overall, we can change the value of λ to calculate the distance between two points in many ways.
When do we use it? Minkowski distance is frequently used when the variables of interest are measured on ratio scales with an absolute zero value.
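A minimal sketch, assuming NumPy, that shows how λ = 1 and λ = 2 recover the Manhattan and Euclidean distances (the λ = ∞ case is computed separately as a max):

```python
import numpy as np

def minkowski_distance(x, y, lam):
    """Generalized distance: (sum |x_i - y_i|^lam)^(1/lam)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

p, q = [0, 0], [3, 4]
print(minkowski_distance(p, q, 1))  # 7.0 -> Manhattan distance
print(minkowski_distance(p, q, 2))  # 5.0 -> Euclidean distance
# Chebyshev distance is the limit as lam -> infinity: the largest difference
print(np.max(np.abs(np.subtract(p, q))))  # 4.0
```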
L1 Norm:
Also known as Manhattan distance or the taxicab norm (when λ = 1). The L1 norm is the sum of the magnitudes of a vector's components. It is the most natural way of measuring distance between vectors: the sum of the absolute differences of the components of the vectors. In this norm, all the components of the vector are weighted equally.
L2 Norm:
This is the most popular norm, also known as the Euclidean norm (when λ = 2). It is the shortest distance between two points. One consideration to keep in mind with the L2 norm is that each component of the vector is squared, which means that outliers carry more weight and can skew results.
L-infinity norm:
Gives the largest magnitude among the elements of a vector (when λ = ∞).
For the vector X = [-6, 4, 2], the L-infinity norm is 6.
In the L-infinity norm, only the largest element has any effect. So, for example, if your vector represents the cost of constructing several buildings, minimizing the L-infinity norm reduces the cost of the most expensive building.
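All three norms can be computed with NumPy's np.linalg.norm by varying its ord parameter; a short sketch checking the example above:

```python
import numpy as np

x = np.array([-6, 4, 2])
print(np.linalg.norm(x, ord=1))       # 12.0 -> L1: |-6| + |4| + |2|
print(np.linalg.norm(x, ord=2))       # ~7.48 -> L2: sqrt(36 + 16 + 4)
print(np.linalg.norm(x, ord=np.inf))  # 6.0 -> L-infinity: largest magnitude
```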
5 Hamming Distance: The Hamming distance compares every letter of two strings based purely on position: the first letter of word 1 is compared to the first letter of word 2, and so on.
To compute the Hamming distance between two strings, you compare the characters at each position in the string. The number of unequal characters is the Hamming distance.
An advantage of the Hamming distance is that this position-wise comparison is very fast and simple. On the other hand, one criticism is that it cannot handle two strings of unequal length. Another criticism is that it is too strict: for example, "abcdefg" and "bcdefgh" are considered totally different, even though 6 out of 7 characters are the same.
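A minimal pure-Python sketch of this position-wise comparison (the helper name is just for illustration):

```python
def hamming_distance(s1, s2):
    """Count the positions at which the corresponding characters differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("karolin", "kathrin"))  # 3
print(hamming_distance("abcdefg", "bcdefgh"))  # 7 -- every position differs
```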