Know Thy Distance
Distance measures (or metrics, if we want to be strict about it) are very important in the field of machine learning. They are utilised in both supervised and unsupervised algorithms, and mastery of them allows machine learning scientists to create better models and novel solutions. Take counterfactuals, for example, which are defined as ‘what-if’ instances that yield different outputs. By using distance measures, we can quantify how easy it is to produce counterfactuals, which gives an overview of the adversarial robustness of a machine learning model.
Unfortunately, the majority of people are only familiar with Euclidean distance, which is, by the way, very useful. However, Euclidean distance is just the tip of the iceberg. There are many more measures that are just as useful (and mindblowing!) depending on the use case. In this article, we will explore a few of them.
The all-powerful and useful Euclidean
The metric represents the length of a straight line between two points. In the field of machine learning, it can be used as a measure of similarity between continuous features in a space. Classic algorithms such as k-means clustering and k-nearest neighbours (KNN) use this measure by default to group observations together.
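As a quick illustration, Python's standard library can compute it directly (a minimal sketch):

```python
import math

# Euclidean distance between two points: the length of the straight
# line joining them (math.dist is available from Python 3.8 onwards).
p, q = (0.0, 0.0), (3.0, 4.0)
print(math.dist(p, q))  # → 5.0, the classic 3-4-5 right triangle
```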
Haversine distance
Have you ever wondered why the navigation map during flights shows a curved path instead of a linear one? Isn’t the shortest distance between two points a straight line? If the earth were flat, then the shortest distance would indeed be a straight line. But since the earth is spherical, the shortest path between two points on its surface is actually an arc of a great circle, whose length is given by the haversine distance.
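Written out, the haversine distance between two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ on a sphere of radius $r$ is:

```latex
d = 2r \arcsin\left( \sqrt{ \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right)
```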
To use the formula above, we need three variables:
- The radius of the earth
- Latitude and longitude of point A
- Latitude and longitude of point B
Given the information above, it is clear that the haversine distance works best when we are working with geospatial data.
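A minimal pure-Python sketch, assuming the commonly used mean earth radius of 6,371 km:

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon)
    points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# From a point on the equator to its antipode: half the earth's
# circumference, roughly 20,015 km.
print(haversine(0.0, 0.0, 0.0, 180.0))
```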
Matching Criterion and Jaccard distance
Measuring the distance between discrete variables requires a different approach compared to their continuous cousins. One way is to use the intersection and union of sample sets. To put it in layman's terms, we are counting the number of common values. An easy way to calculate the Jaccard distance is to first calculate the Jaccard index:
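For two sets of items $A$ and $B$, the index is:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
```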
Looking at the formula above, the Jaccard index is simply the proportion of common items. The Jaccard distance is then its complement:
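In the same notation:

```latex
d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
```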
Unlike the Euclidean and haversine distances, which are unbounded (though always non-negative), the Jaccard distance is bounded between 0 and 1. Additionally, while the Jaccard distance can also compare two data points with more than one categorical variable, a more efficient method is to utilise the following matching criterion¹:
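A common form of such a weighted matching similarity over $p$ categorical features (a standard simple-matching formulation, which I assume here rather than quote from the cited source) is:

```latex
S(x, y) = \frac{\sum_{k=1}^{p} w_k\, s_k}{\sum_{k=1}^{p} w_k},
\qquad
s_k = \begin{cases} 1 & \text{if } x_k = y_k \\ 0 & \text{otherwise} \end{cases}
```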
While w is usually 1, we can assign different w values to different features to emphasise their relative importance. The distance can then be calculated by replacing the Jaccard index with this similarity measure. The only caveat to using it is that each w must be bounded between 0 and 1, because otherwise the distance can be negative! When we are dealing with data that consists of purely categorical features, this distance measure is the right choice.
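Both discrete measures can be sketched in a few lines of Python; note that the weighted matching function below is my simple-matching interpretation of the criterion, not a quote of the cited source:

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of items."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def matching_distance(x, y, weights=None):
    """Weighted simple-matching distance over aligned categorical
    features; each weight is assumed to lie in [0, 1] (default 1)."""
    if weights is None:
        weights = [1.0] * len(x)
    sim = sum(w for xi, yi, w in zip(x, y, weights) if xi == yi)
    return 1 - sim / sum(weights)

print(jaccard_distance({'a', 'b'}, {'b', 'c'}))        # 2 of 3 items differ
print(matching_distance(['red', 'M'], ['red', 'L']))   # 1 of 2 features differ
```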
Cosine distance
The cosine distance measures how “different” two non-zero vectors are in an inner product space. One application of cosine distance is to measure whether a group of texts is likely to contain the same terms.
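Concretely, for two vectors $A$ and $B$, the cosine distance is one minus the cosine of the angle between them:

```latex
d_{\cos}(A, B) = 1 - \frac{A \cdot B}{\lVert A \rVert\, \lVert B \rVert}
```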
To find the distance, we first transform the texts into vector representations by counting the terms contained within them. For example, the vector representation of the sentence “Hi, I am a full-time nerd” will be 1 unit in the direction of the ‘Hi’ axis (x-axis), 1 unit in the direction of the ‘full-time’ axis (y-axis), and 1 unit in the direction of the ‘nerd’ axis (z-axis), assuming we exclude the pronouns and to-be verb.
After that, we calculate the cosine similarity (the second term in the equation above). The cosine similarity is bounded between -1 and 1: a score of -1 means the vectors have opposite orientations, a score of 1 means they have the same orientation, and a score of 0 means they are perpendicular. But when we are counting the terms contained within texts, the cosine similarity is bounded between 0 and 1, because term counts can never be negative. As a result, the cosine distance is bounded between 0 and 1 as well.
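A minimal sketch of the count-vector workflow described above, using only the standard library:

```python
from math import sqrt

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Perpendicular count vectors share no terms: distance 1.
print(cosine_distance([1, 0], [0, 1]))
# Proportional count vectors point the same way: distance ~0.
print(cosine_distance([1, 2, 3], [2, 4, 6]))
```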
Levenshtein distance
The Levenshtein distance measures the difference between two sequences: specifically, the minimum number of edits (insertions, deletions, or substitutions) required to change one word into another. It is particularly useful when we want to perform fuzzy text matching. One possible application is the auto-correction functionality that all of us are familiar with. To understand the equation, let’s use an example.
Suppose we want to find the Levenshtein distance between the words ‘glove’ and ‘cove’. The Levenshtein distance will be 2, because converting ‘glove’ to ‘cove’ involves two edits:
- remove ‘g’ from ‘glove’
- replace ‘l’ with ‘c’
This distance metric is symmetric, meaning that converting ‘cove’ to ‘glove’ also yields a Levenshtein distance of 2.
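The edit-counting logic can be sketched with the classic dynamic-programming recurrence:

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein('glove', 'cove'))  # → 2 (the example above)
print(levenshtein('cove', 'glove'))  # → 2 (symmetric)
```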
Wasserstein distance
The Wasserstein distance falls under the statistical distance family and measures the distance between two probability distributions. While there are countless other measures, such as the Kullback-Leibler divergence, the Wasserstein distance has a more intuitive interpretation: it can be read as the minimum ‘cost’ of converting one probability distribution into another. One useful application of the Wasserstein distance is creating synthetic data. Good synthetic data must be as “similar” as possible to the original, and that “similarity” can be measured by how close the distributions of the two are, because we can think of our data (no matter how many features we have) as a collection of multivariate distributions. As such, the Wasserstein distance can be used as a cost function to be optimised by algorithms such as generative adversarial networks (GANs)².
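For two equal-size one-dimensional samples, the first Wasserstein distance reduces to the average gap between the sorted values; a minimal sketch of that special case (for the general case, `scipy.stats.wasserstein_distance` is available):

```python
def wasserstein_1d(u, v):
    """First Wasserstein ('earth mover's') distance between two
    equal-size 1-D samples: how far, on average, the sorted mass
    of u must travel to become v."""
    if len(u) != len(v):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# Shifting every point of a sample by 5 costs exactly 5 per unit of mass.
print(wasserstein_1d([0, 1, 3], [5, 6, 8]))  # → 5.0
```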
How can I implement these measures?
To implement and apply these measures, we can use Python packages, unless you want to implement the algorithms yourself, which can be a good side project. Here is a list of available Python packages that I could find:
- scikit-learn pairwise metrics
Bonus content — understanding the meme
In case you are wondering, Pythagoras is our main character. His theorem is definitely something that all of us are familiar with. Using his theorem, the length of the hypotenuse of a right triangle with two unit-length sides is √2 ≈ 1.41. However, if we are considering the Chebyshev distance, the length of the hypotenuse is actually 1!
Introducing - the Chebyshev distance
The Chebyshev distance is also known as the chessboard distance. It measures the maximum difference between two vectors along any coordinate dimension. A good analogy is the king in the game of chess, where the distance can be seen as the number of moves it takes to go from one square to another. Since the king can move one square in any direction, moving to an adjacent square diagonally takes only 1 move, and hence the Chebyshev distance is 1 as well.
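The king-move intuition translates directly into code:

```python
def chebyshev(u, v):
    """Maximum coordinate-wise difference between two vectors,
    i.e. the number of king moves between two squares."""
    return max(abs(a - b) for a, b in zip(u, v))

# One diagonal step covers a whole unit in both x and y at once,
# so the 'hypotenuse' of the meme's unit triangle is 1.
print(chebyshev((0, 0), (1, 1)))  # → 1
print(chebyshev((0, 0), (3, 4)))  # → 4
```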