A brief introduction to Distance Measures

10 distance measures for machine learning you should have heard of

Jonte Dancker
9 min read · Oct 25, 2022
Figure: 10 often used distance measures (based on M. Grootendorst)

Distance measures are the foundation of supervised and unsupervised learning algorithms, including k-nearest neighbors, self-organizing maps, support vector machines, and k-means clustering.

The choice of distance measure affects our machine learning results, so we should think carefully about which measure suits the problem best.

But before we can make a decision, we need an understanding of how distance measures work and which we can choose from.

Hence, this article will give you a brief introduction to often used distance measures: how they work, how to calculate them in Python, and when they are used. With this you can deepen your knowledge and understanding, boosting your machine learning algorithms and results.

But before we dive deeper into the different distance measures, let me give you a general idea of how they work and how we pick the right one.

As their name implies, distance measures are used to calculate the difference between two objects in a given problem space, i.e., between features in a data set. This distance can then be used to determine the similarity between features: the smaller the distance, the more similar the features.

We can choose between geometric and statistical distance measures. Which distance measure we should choose depends on the type of data. Features might have different data types (e.g., real values, boolean, categorical), the data might be multi-dimensional, or it might consist of geospatial data.

Geometric distance measures

Euclidean distance

The Euclidean distance measures the shortest distance between two real-valued vectors. Because of its intuitive use, simple implementation, and good results for many use cases, it is the most common distance measure and the default distance measure of many applications.

The Euclidean distance is also referred to as the L2 norm and is calculated as:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

To calculate the distance between two vectors in Python we can use

from scipy.spatial import distance
distance.euclidean(vector_1, vector_2)

The Euclidean distance has two major disadvantages. First, the distance measure does not work well for data of higher dimensionality than 2D or 3D space. Second, if we do not normalize and/or standardize our features, the distance might be skewed due to different units.
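
To see the effect of different units, here is a minimal sketch with made-up data that standardizes the features before computing the distance:

import numpy as np
from scipy.spatial import distance

# Two toy points whose features live on very different scales,
# e.g., height in meters and weight in kilograms
data = np.array([[1.70, 70.0],
                 [1.80, 90.0]])

# Without scaling, the distance is dominated by the weight feature
print(distance.euclidean(data[0], data[1]))  # ~20.0

# Standardizing each feature (zero mean, unit variance) removes this skew
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
print(distance.euclidean(scaled[0], scaled[1]))  # ~2.83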

Manhattan distance

The Manhattan distance is also called the Taxicab or City-Block distance as the distance between two real-valued vectors is calculated as if one could only move at right angles. This distance measure is often used for discrete and binary attributes to get a realistic path.

The Manhattan distance is based on the L1 norm and is calculated by:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

and can be implemented in Python by:

from scipy.spatial import distance
distance.cityblock(vector_1, vector_2)

The Manhattan distance has two major disadvantages. First, it is less intuitive than the Euclidean distance in high-dimensional space, and second, it does not reflect the shortest possible path. Although this might not be problematic, we should be aware that the distance will be higher than the Euclidean distance, as the small example below shows.
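
A minimal sketch with made-up vectors:

from scipy.spatial import distance

vector_1 = [0, 0]
vector_2 = [3, 4]

# The Manhattan distance sums the detours along each axis ...
print(distance.cityblock(vector_1, vector_2))  # |3 - 0| + |4 - 0| = 7
# ... and is therefore never shorter than the straight-line Euclidean distance
print(distance.euclidean(vector_1, vector_2))  # 5.0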

Chebyshev distance

The Chebyshev distance is also referred to as the chessboard distance, as it is the greatest difference along any single dimension between two real-valued vectors. The distance measure is often used in warehouse logistics, in which the longest movement determines the time it takes to get from one point to the next.

The Chebyshev distance is calculated with the L-infinity norm:

d(x, y) = \max_{i} |x_i - y_i|

We can calculate the distance in Python by:

from scipy.spatial import distance
distance.chebyshev(vector_1, vector_2)
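
As a small sketch of the warehouse example (the shelf coordinates are made up), think of a crane that moves along both axes at the same time, so the travel time is governed by the longer movement:

from scipy.spatial import distance

shelf_a = [2, 9]
shelf_b = [7, 3]

# The longest movement along a single axis determines the distance
print(distance.chebyshev(shelf_a, shelf_b))  # max(|2 - 7|, |9 - 3|) = 6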

The Chebyshev distance has only very specific use cases and is thus seldom used.

Minkowski distance

The Minkowski distance is a generalized form of the above-mentioned distance measures. Hence, it can be used for the same use cases while providing high flexibility. We can choose the p-value to find the most suitable distance measure.

The Minkowski distance is calculated by:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

and we can determine the distance in Python by:

from scipy.spatial import distance
distance.minkowski(vector_1, vector_2, p)

As the Minkowski distance represents different distance measures, it shares their major disadvantages, such as problems in high-dimensional space and a dependency on the units of the features. Furthermore, the flexibility of the p-value can also be a disadvantage, as finding the right p-value might be computationally inefficient.
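
To illustrate how the p-value switches between the measures we have already seen, here is a quick sketch with made-up vectors:

from scipy.spatial import distance

vector_1 = [0, 0]
vector_2 = [3, 4]

print(distance.minkowski(vector_1, vector_2, 1))    # 7.0, the Manhattan distance
print(distance.minkowski(vector_1, vector_2, 2))    # 5.0, the Euclidean distance
print(distance.minkowski(vector_1, vector_2, 100))  # ~4.0, approaching Chebyshev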

Cosine similarity and distance

The Cosine Similarity is a measure of orientation, determined by the cosine of the angle between two vectors, which neglects the magnitude of the vectors. The Cosine Similarity is often used for high-dimensional data in which the magnitude of the data does not matter much, e.g., for recommendation systems or text analyses in which the data is represented by word counts.

The Cosine Similarity can lie between -1 (opposite orientation) and 1 (same orientation) and is calculated by:

\cos(\theta) = \frac{x \cdot y}{\|x\| \, \|y\|}

However, the Cosine Similarity is often used in positive space, in which the range lies between 0 and 1. The Cosine distance, which subtracts the cosine similarity from 1, then lies between 0 (similar vectors) and 1 (dissimilar vectors). The cosine distance can be determined in Python by:

from scipy.spatial import distance
distance.cosine(vector_1, vector_2)

The main disadvantage of the Cosine distance is that it does not consider the magnitude but only the direction of the vectors. Hence, the differences in values are not fully taken into account, as the following made-up example shows.
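
A vector and a scaled copy of it point in the same direction, so their cosine distance vanishes while their Euclidean distance does not:

import numpy as np
from scipy.spatial import distance

vector_1 = np.array([1.0, 2.0])
vector_2 = 10 * vector_1  # same direction, ten times the magnitude

# The cosine distance is 0 because only the orientation is compared
print(distance.cosine(vector_1, vector_2))     # ~0.0
# The Euclidean distance, in contrast, reflects the difference in magnitude
print(distance.euclidean(vector_1, vector_2))  # ~20.12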

Haversine distance

The Haversine distance measures the shortest distance between two points on a sphere. Hence, the distance is used for navigation, where points are given by longitudes and latitudes and the curvature of the Earth has an effect.

The Haversine distance can be determined by:

d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos(\varphi_1)\cos(\varphi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)

in which r is the radius of the sphere and φ and λ are the latitude and longitude. The Haversine distance can be determined in Python using:

from sklearn.metrics.pairwise import haversine_distances
haversine_distances([vector_1, vector_2])
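
One thing to keep in mind is that scikit-learn's haversine_distances expects the coordinates as (latitude, longitude) pairs in radians and returns the distance on a unit sphere. A usage sketch with illustrative coordinates for Berlin and Paris:

import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# Illustrative coordinates as (latitude, longitude) in degrees
berlin = [52.52, 13.41]
paris = [48.86, 2.35]

# Convert the inputs to radians and scale the output by the Earth radius
points_in_radians = np.radians([berlin, paris])
unit_sphere_distance = haversine_distances(points_in_radians)[0, 1]

earth_radius_km = 6371
print(unit_sphere_distance * earth_radius_km)  # roughly 880 km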

The main disadvantage of the Haversine distance is the assumption of a perfect sphere, which is seldom exactly the case; the Earth, for example, is slightly flattened at the poles.

Hamming distance

The Hamming distance measures the dissimilarity between two binary vectors or strings.

For this, the vectors are compared element-wise and the number of differing elements is averaged. The resulting distance lies between 0, if both vectors are identical, and 1, if both vectors are completely different.

We can determine the Hamming distance in Python by:

from scipy.spatial import distance
distance.hamming(vector_1, vector_2)
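
For example, two made-up binary vectors that differ in one of four positions have a distance of 0.25:

from scipy.spatial import distance

# The vectors differ only in the first position: 1 of 4 elements
print(distance.hamming([1, 0, 1, 1], [0, 0, 1, 1]))  # 1/4 = 0.25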

The Hamming distance has two major disadvantages. First, the distance measure can only compare vectors of the same length, and second, it does not give the magnitude of the difference. Hence, the Hamming distance is not recommended when the magnitude of the difference matters.

Statistical distance measures

Statistical distance measures can be used for hypothesis testing, goodness-of-fit tests, classification tasks, or outlier detection.

Jaccard Index and distance

The Jaccard Index is used to determine the similarity between two sample sets. It reflects how large the overlap is compared to the combined data set. The Jaccard Index is often used for binary data, e.g., to compare the prediction of a deep learning model for image recognition with the labeled data, or to compare text patterns in documents based on the overlap of words.

The Jaccard distance, the complement of the Jaccard Index, is calculated by:

d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}

and can be determined in Python by:

from scipy.spatial import distance
distance.jaccard(vector_1, vector_2)
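
A quick sketch with made-up binary vectors, e.g., a predicted and a labeled pixel mask:

from scipy.spatial import distance

prediction = [1, 1, 0, 1]
label = [1, 0, 0, 1]

# 1 of the 3 positions that are nonzero in either vector differs: 1/3
print(distance.jaccard(prediction, label))  # ~0.33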

The main disadvantage of the Jaccard Index and distance is that they are strongly influenced by the size of the data, i.e., each item is weighted inversely proportional to the size of the data set.

Sørensen-Dice Index

The Sørensen-Dice Index is similar to the Jaccard Index, as it measures the similarity and diversity of sample sets. However, the index is more intuitive, as it calculates the percentage of overlap. The Sørensen-Dice Index is often used for image segmentation and text similarity analysis.

The Sørensen-Dice distance, the complement of the index, is determined by:

d_{SD}(A, B) = 1 - \frac{2|A \cap B|}{|A| + |B|}

We can determine the distance in Python through:

from scipy.spatial import distance
distance.dice(vector_1, vector_2)
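
Note that SciPy returns the Dice dissimilarity, i.e., the distance; the index itself is one minus the returned value:

from scipy.spatial import distance

# The same made-up vectors as in the Jaccard example
prediction = [1, 1, 0, 1]
label = [1, 0, 0, 1]

dice_distance = distance.dice(prediction, label)
print(dice_distance)      # 0.2
print(1 - dice_distance)  # Sørensen-Dice index: 0.8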

The main disadvantage of the Sørensen-Dice Index is that it is strongly influenced by the size of the data set.

Dynamic Time Warping

Dynamic Time Warping is an important distance measure for comparing two time series, even of different lengths. Hence, Dynamic Time Warping can be used for many use cases involving time series data, such as speech recognition or anomaly detection.

But why do we need another distance measure just for time series? If the time series are not of the same length or are distorted, the above-described distance measures cannot determine their similarity well. For example, the Euclidean distance compares the two time series at each time step. But if two time series have the same shape and are only shifted in time, the Euclidean distance indicates a great dissimilarity although the time series are very similar.

Dynamic Time Warping avoids this issue by minimizing the overall distance between two time series using many-to-one or one-to-many mappings. This results in a more intuitive similarity measure, as the best alignment is searched for. The distance is minimized along a warping path, which is found through dynamic programming and which has to fulfill the following conditions:

  • boundary condition: the warping path begins and ends at the start and end points of both time series
  • monotonicity condition: the time order of points is preserved, avoiding going back in time
  • continuity condition: path transitions are limited to adjacent points in time, avoiding jumping in time
  • warping window condition (optional): allowable points fall into a warping window of given width
  • slope condition (optional): the slope of the warping path is restricted, avoiding extreme movements

To determine the distance between two time series we can use the fastdtw package in Python:

from scipy.spatial.distance import euclidean
from fastdtw import fastdtw
distance, path = fastdtw(timeseries_1, timeseries_2, dist=euclidean)
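
To see the difference to a point-wise comparison, here is a small sketch with two made-up sine waves that are identical but shifted in time:

import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

# Two identical, made-up sine waves that are only shifted in time;
# reshaping to (n, 1) lets fastdtw use SciPy's euclidean on single points
t = np.linspace(0, 2 * np.pi, 100)
timeseries_1 = np.sin(t).reshape(-1, 1)
timeseries_2 = np.sin(t + 0.5).reshape(-1, 1)

# Summing the point-wise distances penalizes the time shift heavily ...
print(np.abs(timeseries_1 - timeseries_2).sum())

# ... while the summed distances along the DTW warping path stay much
# smaller, because the shift is compensated by the alignment
dtw_distance, warping_path = fastdtw(timeseries_1, timeseries_2, dist=euclidean)
print(dtw_distance)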

A main disadvantage of Dynamic Time Warping is its higher computational effort compared to the other distance measures.

Conclusion

In this article I have given you a brief introduction to ten often used distance measures. I have shown you how they work, how you can implement them in Python, and for which problems they are often used.

If you think I missed an important distance measure, please let me know.

