Distances in Machine Learning

Namratesh Shrivastav · Published in Analytics Vidhya · Jan 5, 2020 · 4 min read

There are many ways to measure distance in machine learning. Here we will discuss some of the most common ones.

  • Euclidean distance
  • Manhattan distance
  • Minkowski distance
  • Hamming distance
  • Cosine distance & Cosine Similarity

Euclidean Distance

It is the straight-line distance between two points x and y in n-dimensional space. Here, we calculate the distance d between two data points p1 and p2.

code:

from sklearn.metrics.pairwise import euclidean_distances
X = [[0, 1], [1, 1]]
#distance between rows of X
euclidean_distances(X, X)
#get distance to origin
euclidean_distances(X, [[0, 0]])
output:
array([[1.        ],
       [1.41421356]])
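The same result can be checked by hand with NumPy; this is a minimal sketch of the formula d = sqrt(Σ(xᵢ − yᵢ)²) applied to each row of X against the origin:

```python
import numpy as np

X = np.array([[0, 1], [1, 1]])
origin = np.array([0, 0])

# square the coordinate differences, sum along each row, take the square root
d = np.sqrt(((X - origin) ** 2).sum(axis=1))
print(d)  # distances 1.0 and sqrt(2) ≈ 1.41421356
```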

Manhattan distance

It is the sum of the absolute differences of the coordinates. Suppose we have to tell someone the distance from A to B: go 3 blocks straight and then 3 more blocks to the left, so the distance is 6 blocks.

One thing worth mentioning is that we cannot move diagonally here.

Equation: d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|

code:

p1 = [4, 0]
p2 = [6, 6]
# Manhattan distance: sum of the absolute coordinate differences
distance = abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])
print(distance)
output:
8
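As a cross-check, SciPy ships the same metric under the name cityblock:

```python
from scipy.spatial import distance

# Manhattan (city-block) distance between p1 and p2
print(distance.cityblock([4, 0], [6, 6]))  # 8
```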

Minkowski distance

It is a generalized metric that measures the similarity between points A and B in a normed vector space. Two terms appear here, vector space and normed vector space, so let's look at them briefly.

  • Vector space- a collection of vectors that can be added together and multiplied by scalars.
  • Normed vector space- a vector space over the real or complex numbers on which a norm is defined (a norm gives each vector a length, so distances can be measured).

Looking at the formula, d(x, y) = (Σ |xi − yi|^p)^(1/p), there are two special cases:

  • if p = 1 it becomes the Manhattan distance
  • if p = 2 it becomes the Euclidean distance
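A quick sketch of the two special cases: with p = 1 Minkowski matches the Manhattan (city-block) distance, and with p = 2 it matches the Euclidean distance:

```python
from scipy.spatial import distance

a, b = [4, 0], [6, 6]

# p = 1 reduces to the Manhattan distance
print(distance.minkowski(a, b, 1), distance.cityblock(a, b))
# p = 2 reduces to the Euclidean distance
print(distance.minkowski(a, b, 2), distance.euclidean(a, b))
```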

X1 = [0,1,1,0,1,0,1,1,1]

X2 = [1,1,1,0,1,0,0,1,0]

code:

from scipy.spatial import distance
distance.minkowski([0,1,1,0,1,0,1,1,1], [1,1,1,0,1,0,0,1,0], 1)
output:
3.0

Hamming Distance

It is used to measure the distance between strings or binary vectors of equal length. Here we take boolean vectors to learn more about the Hamming distance. Let's say we have two boolean vectors, X1 and X2.

Hamming distance(X1, X2) = number of positions where the values differ

code:

from scipy.spatial import distance
distance.hamming(['a','b','c','d'], ['d', 'b','c', 'd'])
output:
0.25
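Note that SciPy returns the fraction of differing positions (1 out of 4 above), not the raw count. Applying this to the boolean vectors X1 and X2 from the Minkowski section, multiplying the fraction by the vector length recovers the count of differing positions:

```python
from scipy.spatial import distance

X1 = [0, 1, 1, 0, 1, 0, 1, 1, 1]
X2 = [1, 1, 1, 0, 1, 0, 0, 1, 0]

frac = distance.hamming(X1, X2)  # fraction of positions that differ (3 of 9)
count = frac * len(X1)           # raw number of differing positions
print(frac, count)               # 0.333... 3.0
```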

Cosine distance & Cosine Similarity

Cosine similarity measures the similarity between two or more documents irrespective of their size, and cosine distance is derived directly from it.

The cosine similarity is defined as

cosine similarity = (A · B) / (||A|| ||B||)

and

Cosine Distance = 1 − Cosine Similarity

Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.

So, which value is useful to define what?

cos(0°) = 1, cos(360°) = 1 (the vectors point the same way: maximum similarity)

cos(90°) = 0, cos(270°) = 0 (the vectors are orthogonal: no similarity)

cos(180°) = -1 (the vectors point in opposite directions: completely dissimilar)
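The definition above can be sketched directly with NumPy; cosine similarity is the dot product divided by the product of the norms, and cosine distance is 1 minus that:

```python
import numpy as np

a = np.array([1, 0, 0])
b = np.array([0, 1, 0])

# cosine similarity = (a . b) / (||a|| * ||b||)
cos_sim = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1 - cos_sim

print(cos_sim, cos_dist)  # 0.0 1.0 — orthogonal vectors: no similarity
```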

code:

from scipy.spatial import distance
distance.cosine([1, 0, 0], [0, 1, 0])
output:
1.0

notebook attached here.

Thanks for reading, suggestions are welcome!!!

