Distance Metrics: Concepts and Uses in Machine Learning Models
When I started my journey in Machine Learning, I was only aware of the simple distance formula between point A and point B in a 2D plane.
But as I learned more, I found out that it is called the Euclidean distance. To my surprise, there were many other distance metrics, not just this one, so I started exploring them. Here are all the distance metrics I have learned throughout my machine learning journey.
Computational Distances
1. Euclidean Distance:
The ordinary straight-line distance between 2 points in N-dimensional space.
Used in most of the basic machine learning algorithms to compute the distance between points, typically when geography is not taken into consideration.
Eg: Problems involving flight data between city A and city B.
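A minimal NumPy sketch (the points here are just made-up example values):

```python
import numpy as np

# Two points in N-dimensional space (made-up values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = np.linalg.norm(a - b)
print(euclidean)  # 5.0
```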
2. Manhattan Distance:
Also called rectilinear distance, city block distance, or the taxicab metric, it is defined as the sum of the lengths of the projections of the line segment between the points onto the coordinate axes.
Imagine you are allowed to move only along the edges, and not diagonally, between two opposite vertices of a rectangle.
Used in most problems where traversal has constraints, such as graphs where traversal is restricted to edges, or driving problems where roads run around blocks of houses.
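A quick NumPy sketch using the same made-up points as before:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))
print(manhattan)  # 7.0
```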
3. Levenshtein Distance:
Common in text analytics and NLP problems, it is the minimum number of steps required to change one string into another using insertion, deletion, or substitution.
Mostly used in syntax-based similarity matching algorithms.
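A plain-Python sketch of the classic dynamic-programming computation (the example words are arbitrary):

```python
def levenshtein(s: str, t: str) -> int:
    # prev[j] holds the edit distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i] + [0] * len(t)
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```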
4. Hamming Distance:
Hamming distance or signal distance is a metric for comparing two binary data strings. While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different.
It is like an XOR operation followed by counting the mismatches.
Eg: "John" and "Jane" have a Hamming distance of 3.
I have never used it myself, but I have heard it is used in telecommunications and DNA sequencing.
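A tiny sketch for equal-length strings (reusing the names from the example above):

```python
def hamming(s: str, t: str) -> int:
    if len(s) != len(t):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    # Count the positions where the two strings differ
    return sum(c1 != c2 for c1, c2 in zip(s, t))

print(hamming("John", "Jane"))  # 3
```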
5. Chebyshev Distance:
If you have played chess, then you have already used it as the King's move: the number of moves a king needs to travel between two squares, since it can move one step in any direction. It is the maximum of the absolute differences along any single coordinate.
If you have experience with image processing, convolution kernels work on a similar principle: a kernel centered on a pixel covers all the pixels within a fixed Chebyshev distance of it over the image matrix.
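A one-line NumPy sketch with the same made-up points as before:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Chebyshev distance: the largest absolute coordinate difference
chebyshev = np.max(np.abs(a - b))
print(chebyshev)  # 4.0
```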
Geometrical Distances:
1. Minkowski Distance:
We already know the triangle inequality: the length of any one side of a triangle is always less than (or equal to) the lengths of the other two sides added up. In terms of norms:
||x + y|| ≤ ||x|| + ||y||
where ||x|| is the norm (the Euclidean length) of the vector x and ||y|| is the norm of the vector y.
There are infinitely many distance functions that satisfy the triangle inequality, and the Minkowski distance of order p covers a whole family of them: p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and as p → ∞ it approaches the Chebyshev distance.
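A small NumPy sketch showing how the order p slides between the metrics above (the points are the same made-up examples as earlier):

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance of order p: (sum of |a_i - b_i|^p) ^ (1/p)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(minkowski(a, b, 1))   # 7.0  -> Manhattan
print(minkowski(a, b, 2))   # 5.0  -> Euclidean
print(minkowski(a, b, 50))  # ~4.0 -> approaches Chebyshev
```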
2. Cosine Distance:
This is one of my favorites. I have used it in most of my NLP and text analytics problems. The beauty of this distance is that it works on the cosine of the angle between 2 points (vectors) in high-dimensional space to capture their similarity. The smaller the angle, the smaller the distance (since the cosine value increases towards 1).
Cosine Distance = 1 − cos(θ) = 1 − (x · y) / (||x|| ||y||)
Used in most text analytics based algorithms for similarity matching, for example deciding whether 2 sentences are similar or not.
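A short NumPy sketch (the vectors could be, say, word-count vectors of two sentences; the numbers are made up):

```python
import numpy as np

def cosine_distance(x, y):
    # 1 - cosine similarity = 1 - (x . y) / (||x|| * ||y||)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0, 2.0, 1.0])
y = np.array([2.0, 0.0, 4.0, 2.0])  # same direction as x
z = np.array([0.0, 3.0, 0.0, 0.0])  # orthogonal to x

print(cosine_distance(x, y))  # ~0.0 -> identical direction
print(cosine_distance(x, z))  # 1.0  -> 90 degrees apart
```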
Statistical Distances:
When we work with groups of data points and try to measure volatility or variance of one group against another, we use these kinds of distances.
1. Jaccard Distance:
Based on the overlapping features between 2 groups of data points, it can be used to identify the similarity (and, in turn, the distance) between the groups:
Jaccard Distance = 1 − |X ∩ Y| / |X ∪ Y|
where X and Y are the groups (sets) of data points.
It can be used on groups, eg. to find how similar 2 paragraphs or documents are to each other based on the sets of words they contain.
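A small sketch with Python sets (the example sentences are arbitrary):

```python
def jaccard_distance(x: set, y: set) -> float:
    # 1 - |intersection| / |union|
    return 1.0 - len(x & y) / len(x | y)

doc_a = set("the cat sat on the mat".split())
doc_b = set("the dog sat on the log".split())

print(jaccard_distance(doc_a, doc_b))  # ~0.571 (they share 3 of 7 distinct words)
```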
2. Mahalanobis Distance:
The Mahalanobis distance measures distance relative to the centroid, a base or central point which can be thought of as an overall mean for multivariate data. The centroid is a point in multivariate space where the means of all variables intersect. The larger the Mahalanobis distance, the further away from the centroid the data point is.
Use cases include identifying multivariate outliers, cluster analysis, and classification techniques.
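A minimal NumPy sketch computing the Mahalanobis distance of a point from the centroid of a small made-up dataset (all numbers here are purely illustrative):

```python
import numpy as np

# A small made-up multivariate dataset: rows = samples, columns = variables
data = np.array([[2.0, 3.0],
                 [2.5, 2.0],
                 [3.0, 4.0],
                 [3.5, 2.5],
                 [4.0, 3.5]])

centroid = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance matrix

def mahalanobis(point, centroid, cov_inv):
    # sqrt((x - mu)^T * S^-1 * (x - mu))
    diff = point - centroid
    return np.sqrt(diff @ cov_inv @ diff)

print(mahalanobis(np.array([3.2, 3.1]), centroid, cov_inv))  # small -> close to the centroid
print(mahalanobis(np.array([5.0, 1.0]), centroid, cov_inv))  # large -> a likely multivariate outlier
```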
If you liked this post, don't forget to cheer me up with a clap. The next post will be on something different.