Distances in Machine Learning
There are many ways to measure distance in machine learning. Here we are going to discuss some of the most common ones.
- Euclidean distance
- Manhattan distance
- Minkowski distance
- Hamming distance
- Cosine distance & Cosine Similarity
Euclidean Distance
It is the straight-line distance between two points x and y in n dimensions. Here, we calculate the distance d between two data points p1 and p2: d(p1, p2) = sqrt((p1_1 - p2_1)^2 + ... + (p1_n - p2_n)^2).
code:
from sklearn.metrics.pairwise import euclidean_distances
X = [[0, 1], [1, 1]]
# distance between rows of X
euclidean_distances(X, X)
# get distance to origin
euclidean_distances(X, [[0, 0]])
output:
array([[1.        ],
       [1.41421356]])
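As a cross-check, the same numbers can be reproduced by hand. The following is a minimal sketch using only the standard library; the helper name `euclidean` is illustrative and not part of scikit-learn.

```python
import math

def euclidean(p1, p2):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

# Same points as in the scikit-learn example
print(euclidean([0, 1], [1, 1]))  # 1.0
print(euclidean([1, 1], [0, 0]))  # 1.4142135623730951
```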
Manhattan Distance
It is the sum of the absolute differences of all coordinates. Suppose we have to tell someone how to get from A to B in a city laid out as a grid. We would say "go 3 blocks straight and then 3 more to the left", so the distance is 6 blocks.
Note that we can’t move diagonally here.
Equation: d(A, B) = |a_1 - b_1| + |a_2 - b_2| + ... + |a_n - b_n|
code:
p1 = [4, 0]
p2 = [6, 6]
# sum of the absolute coordinate differences (no square root involved)
distance = abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])
print(distance)
output:
8
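For points with more than two coordinates, the sum of absolute differences generalizes directly. Below is a small sketch; the function name `manhattan` is an assumption made for illustration.

```python
def manhattan(p1, p2):
    """Sum of absolute coordinate differences, for any number of dimensions."""
    if len(p1) != len(p2):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p1, p2))

print(manhattan([4, 0], [6, 6]))        # 8  (|4-6| + |0-6|)
print(manhattan([1, 2, 3], [4, 0, 3]))  # 5  (3 + 2 + 0)
```

SciPy offers the same metric as `scipy.spatial.distance.cityblock`, which may be preferable when working with NumPy arrays.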
Minkowski distance
It measures the distance between two points A and B in a normed vector space. Two terms need a brief explanation here: vector space and normed vector space.
- Vector space: a collection of vectors that can be added together and multiplied by numbers (scalars).
- Normed vector space: a vector space over the real or complex numbers on which a norm is defined, that is, a space in which every vector has a length.
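The Minkowski distance of order p is D(X, Y) = (|x_1 - y_1|^p + ... + |x_n - y_n|^p)^(1/p). The formula has two notable special cases:
- if p = 1, it becomes the Manhattan distance
- if p = 2, it becomes the Euclidean distance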
X1 = [0,1,1,0,1,0,1,1,1]
X2 = [1,1,1,0,1,0,0,1,0]
code:
from scipy.spatial import distance
distance.minkowski([0,1,1,0,1,0,1,1,1], [1,1,1,0,1,0,0,1,0], 1)
output:
3.0
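To see how the order p changes the result, here is a hand-written sketch of the Minkowski formula in plain Python. The helper name `minkowski` is illustrative and is not SciPy's implementation.

```python
def minkowski(x, y, p):
    """(sum of |x_i - y_i|^p) raised to the power 1/p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

X1 = [0, 1, 1, 0, 1, 0, 1, 1, 1]
X2 = [1, 1, 1, 0, 1, 0, 0, 1, 0]

print(minkowski(X1, X2, 1))  # 3.0, same as the Manhattan distance
print(minkowski(X1, X2, 2))  # square root of 3, same as the Euclidean distance
```

The two vectors differ in three positions, each by 1, so p = 1 gives 3 and p = 2 gives the square root of 3.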
Hamming Distance
It is often used to compare texts or binary strings of equal length. Here we use boolean vectors to illustrate it. Let’s take two boolean vectors, X1 and X2, such as the ones used in the Minkowski example.
Hamming distance(X1, X2) = number of positions where the binary values differ
code:
from scipy.spatial import distance
# SciPy returns the fraction of positions that differ, not the count
distance.hamming(['a','b','c','d'], ['d', 'b','c', 'd'])
output:
0.25
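The definition above counts differing positions, while SciPy reports them as a fraction of the vector length. A short sketch of both, with the hypothetical helper `hamming_count`:

```python
def hamming_count(x1, x2):
    """Number of positions at which two equal-length sequences differ."""
    if len(x1) != len(x2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x1, x2))

X1 = [0, 1, 1, 0, 1, 0, 1, 1, 1]
X2 = [1, 1, 1, 0, 1, 0, 0, 1, 0]

count = hamming_count(X1, X2)
print(count)            # 3 positions differ
print(count / len(X1))  # 3 of 9 positions, about 0.33
```

The same helper works on strings or lists of characters: `hamming_count("abcd", "dbcd")` gives 1, which corresponds to the 0.25 (1 out of 4) returned by SciPy.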
Cosine distance & Cosine Similarity
Cosine similarity measures how similar two or more documents are, irrespective of their size. Cosine distance is derived from it.
The cosine similarity of two vectors A and B is defined as
Cosine Similarity = cos(θ) = (A · B) / (||A|| ||B||)
and
Cosine Distance = 1 − Cosine Similarity
Mathematically, cosine similarity is the cosine of the angle θ between two vectors in a multi-dimensional space.
So, how should these values be interpreted?
cos(0°) = 1, cos(360°) = 1 (the vectors point in the same direction: maximum similarity)
cos(90°) = 0, cos(270°) = 0 (the vectors are orthogonal: negligible similarity)
cos(180°) = -1 (the vectors point in opposite directions: no similarity at all)
code:
from scipy.spatial import distance
distance.cosine([1, 0, 0], [0, 1, 0])
output:
1.0
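To make the link between similarity and distance concrete, the two quantities can be sketched by hand in plain Python. The helper names are illustrative, and the sketch assumes neither vector is all zeros (a zero vector would cause a division by zero).

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes non-zero vectors

def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)

print(cosine_distance([1, 0, 0], [0, 1, 0]))  # 1.0, orthogonal vectors
print(cosine_similarity([1, 0], [-1, 0]))     # -1.0, opposite directions
print(cosine_similarity([1, 2], [2, 4]))      # approximately 1, same direction
```

The last call shows why the measure ignores size: [2, 4] is twice as long as [1, 2], yet the similarity is still (up to floating-point rounding) 1.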
Thanks for reading, suggestions are welcome!!!