Distance Metrics in Machine Learning

Nicolò Chen · Published in AI Odyssey · Apr 24, 2024

Understanding Distance Metrics in Machine Learning

Have you ever heard of Distance Metrics? Most of you have probably encountered the concept of “Euclidean Distance” at school, but let me introduce the notion step by step. In the realm of machine learning, and particularly in clustering, distance metrics play a fundamental role. Distance metrics measure the similarity or dissimilarity between data points in a dataset. In other words, they tell us whether data points are close to or far from each other, and therefore whether they can be clustered together. They are crucial in various algorithms, aiding pattern recognition, data analysis, and decision-making processes.

What are Distance Metrics in practice?

Recapping what we introduced previously, Distance Metrics, used in both supervised and unsupervised learning, quantify the similarity or dissimilarity between two data points, assessing how ‘close’ or ‘far apart’ those two points are. The choice of distance metric significantly influences the performance of machine learning algorithms. There are many Distance Metrics, but the most common ones are Euclidean Distance, Manhattan Distance, Hamming Distance, and Minkowski Distance. Let’s go through them one by one!

Euclidean Distance

Let’s start with Euclidean Distance, which is the most well-known distance metric. It represents the shortest straight-line distance between two points in a Euclidean space; mathematically, it is the square root of the sum of the squared differences between corresponding coordinates. The generalized formula for the Euclidean Distance between points p and q in an n-dimensional space is:

d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}

It is suitable for continuous data and widely used in clustering algorithms like K-means.
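
As a concrete illustration, here is a minimal sketch of the formula in Python with NumPy (the helper name and the sample points are just for this example):

```python
import numpy as np

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((q - p) ** 2))

# Two illustrative points in 3-dimensional space
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
print(euclidean_distance(p, q))  # sqrt(9 + 16 + 0) = 5.0
```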

Manhattan Distance

Manhattan distance, also known as taxicab distance or city block distance, is a distance metric that calculates the distance between two points by summing the absolute differences of their coordinates. Unlike Euclidean distance, which measures the shortest straight-line distance, Manhattan distance represents the distance traveled along axis-aligned paths. It is particularly suitable for feature spaces with discrete data or when movement is restricted to orthogonal paths. To visualize it, imagine a taxi driving through a rectangular grid of city blocks: every route that only moves along the grid covers the same total length, whatever turns it takes, and that length is the Manhattan distance.

The Manhattan distance between two points p and q in an n-dimensional space is given by:

d(p, q) = \sum_{i=1}^{n} |p_i - q_i|

where p_i and q_i represent the coordinates of the points p and q in each dimension i, and n denotes the total number of dimensions.

Manhattan distance finds applications in various domains such as route planning, image processing, cluster analysis, and feature engineering, offering a practical and interpretable measure of distance across machine learning and optimization tasks.
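
Again as a hedged sketch, the same NumPy style as before can express this formula (the sample points are illustrative only):

```python
import numpy as np

def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (city block distance)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(p - q))

# Same illustrative points as in the Euclidean example
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
print(manhattan_distance(p, q))  # |1-4| + |2-6| + |3-3| = 7.0
```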

Hamming Distance

Hamming Distance, by contrast, is a metric used to measure the dissimilarity between two strings of equal length. It counts the minimum number of substitutions required to transform one string into the other, i.e., the number of positions where the corresponding symbols differ.

Let’s understand this Distance Metric with a simple example: let’s consider two strings:

“Mouse” and “House”

Since they have the same length (5), we can calculate the Hamming Distance, which is 1 (they differ only in the first character).

Let’s consider another example:

“Crash” and “Brave”

They have the same length (5), but this time their Hamming Distance is 3, since they differ in 3 positions (the 1st, 4th, and 5th characters).

Note that the greater the Hamming Distance between two strings, the greater the dissimilarity between them (and vice-versa).

Mathematically, the Hamming distance between two strings A and B of length n is:

d(A, B) = \sum_{i=1}^{n} \delta(a_i, b_i)

where a_i and b_i represent the symbols at position i in strings A and B respectively, and \delta(a_i, b_i) is an indicator that equals 1 if a_i is different from b_i, and 0 otherwise (the complement of the Kronecker delta, which would equal 1 when the symbols are equal).

It is particularly useful for comparing categorical data, such as DNA sequences, binary strings, or error detection and correction codes.
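
A minimal Python sketch of this metric, reusing the two word pairs above (the helper name is just for this example):

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)

print(hamming_distance("Mouse", "House"))  # 1 (only the first character differs)
print(hamming_distance("Crash", "Brave"))  # 3 (positions 1, 4, and 5 differ)
```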

Minkowski Distance

Minkowski distance is a generalization of several distance metrics, including the Euclidean and Manhattan distances. It represents the distance between two points in an n-dimensional space and can be tuned by a parameter p, which controls how strongly large coordinate differences are weighted. Minkowski distance includes both Euclidean and Manhattan distances as special cases, when p=2 and p=1 respectively.

The Minkowski distance between two points x and y in an n-dimensional space (writing the points as x and y here so that p can keep its usual role as the parameter) is calculated as:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

where p controls the behaviour of the metric as mentioned previously: p=1 corresponds to Manhattan distance, p=2 corresponds to Euclidean distance, and other values of p place different degrees of emphasis on large coordinate differences.

Overall, Minkowski distance is a flexible and versatile distance metric that can be tailored to specific applications by adjusting the parameter p. It is used in various domains, ranging from machine learning to optimization and network analysis.
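
The following sketch, in the same NumPy style, shows the parameter in action; with the sample points used earlier, p=1 and p=2 recover the Manhattan and Euclidean results respectively:

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """(Sum of |x_i - y_i|^p) raised to the power 1/p."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski_distance(x, y, p=1))  # 7.0 -> Manhattan distance
print(minkowski_distance(x, y, p=2))  # 5.0 -> Euclidean distance
```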

Choosing the Right Distance Metric

Selecting an appropriate distance metric is crucial, and it depends on the nature of the data and the specific requirements of the task. Several factors must be considered:

  • Data Type: Different distance metrics are suitable for different data types, which can be numerical, categorical, or binary (as seen in the various examples).
  • Feature Space: The dimension and structure of the feature space influence the choice of distance metric. For example, Euclidean distance is effective for continuous data in a Euclidean space, while Hamming distance is suitable for categorical data.
  • Algorithm Requirements: Certain machine learning algorithms have inherent assumptions about the data and may perform better with specific distance metrics (see the sketch after this list).
  • Domain Knowledge: Understanding the underlying domain can help in selecting a meaningful distance metric that captures relevant similarities or dissimilarities between data points.
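
To make the algorithm-requirements point concrete, here is a hedged sketch of how the metric choice is typically passed to an algorithm, using scikit-learn’s k-nearest-neighbours classifier, whose metric argument accepts names such as "euclidean" and "manhattan" (the toy dataset is invented purely for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D dataset with two well-separated groups (illustrative values)
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]

# The same algorithm, configured with two different distance metrics
knn_euclidean = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric="manhattan").fit(X, y)

# On this toy data both metrics agree; on real data the choice can change results
print(knn_euclidean.predict([[4.5, 4.5]]))  # [1]
print(knn_manhattan.predict([[4.5, 4.5]]))  # [1]
```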

Conclusion

To conclude, Distance Metrics are fundamental components of many machine learning algorithms, aiding in data analysis, clustering, classification, and recommendation tasks. Understanding the different distance metrics is crucial for selecting appropriate algorithms, achieving good results in machine learning tasks, and deriving valuable insights from the data based on the similarity and dissimilarity between data points.
