
Distance Techniques in Machine Learning

Distance in machine learning is generally used to measure the similarity between two data points.

In both supervised machine learning algorithms, like k-nearest neighbours, and unsupervised machine learning algorithms, like clustering, distances are calculated to group related data points together.

There are five commonly used distance measuring techniques in data science:

  1. Manhattan Distance
  2. Euclidean Distance
  3. Minkowski Distance
  4. Hamming Distance
  5. Cosine Similarity and Cosine Distance

Mr. X, a software engineer, took a cab to travel to his office from his flat. His home and office are situated at location_1 and location_2 respectively. The distance he covers along the streets to reach the office (and vice versa) is an example of Manhattan distance, while the straight-line distance between location_1 and location_2 is an example of Euclidean distance.

fig1: Manhattan Distance

Manhattan Distance :

Manhattan distance, the backbone of taxicab geometry, is calculated by summing the absolute differences between the coordinates of two points (movement is always parallel to the axes, at right angles). Because it is always measured parallel to the axes, it is also known as rectilinear distance. It is mainly used in regression analysis, and for higher-dimensional data Manhattan distance is often preferred for finding similar data points.
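
As a minimal sketch, Manhattan distance can be computed by summing absolute coordinate differences; the points below are made-up stand-ins for location_1 and location_2:

```python
import numpy as np

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 norm of a - b)."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sum(np.abs(a - b))

# Hypothetical 2-D points standing in for location_1 and location_2
location_1 = [1, 2]
location_2 = [4, 6]
print(manhattan_distance(location_1, location_2))  # |1-4| + |2-6| = 7
```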

Euclidean Distance :

Manhattan distance is not the shortest distance between two points. The straight-line distance between two points is the Euclidean distance.
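
A quick sketch of Euclidean distance on the same hypothetical points:

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance (L2 norm of a - b)."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance([1, 2], [4, 6]))  # sqrt(3^2 + 4^2) = 5.0
```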

fig3: Minkowski Distance

Minkowski Distance :

This is the generalized metric distance, defined as D(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p). When p = 1 it becomes Manhattan distance (city block distance), and when p = 2 it becomes Euclidean distance.

Manhattan, Euclidean, and Minkowski distances are also known as the L1, L2, and Lp norms respectively: in the Minkowski distance (Lp), p = 1 gives the L1 (Manhattan) norm and p = 2 gives the L2 (Euclidean) norm.

Manhattan, Euclidean, and Minkowski distances are mainly used to calculate distances between continuous variables.
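
A short sketch using SciPy's distance functions to verify that Minkowski distance with p = 1 and p = 2 matches Manhattan and Euclidean distance (the points are again made up):

```python
from scipy.spatial.distance import minkowski, cityblock, euclidean

u, v = [1, 2], [4, 6]

# p = 1 reduces to Manhattan (city block) distance
print(minkowski(u, v, p=1), cityblock(u, v))   # 7.0 7
# p = 2 reduces to Euclidean distance
print(minkowski(u, v, p=2), euclidean(u, v))   # 5.0 5.0
```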

fig4: Hamming Distance

Hamming Distance :

Hamming distance is used to calculate the distance between categorical variables (sometimes called nominal variables). It counts the number of positions at which two strings differ. Categorical variables have no inherent order, and this property suggests encoding categorical values as numeric codes and counting the changes between them.

fig5: Hamming Distance

Let’s discuss fig5. Here the columns Gender, Student, and Nationality are converted to numeric values, which together form a code for every user. These codes help find the distance between two data points: the distance is simply the number of positions in which the codes differ. For u_1 and u_2, all three positions change (111 versus 002), so the distance is 3. For u_2 and u_3 the distance is also 3 (002 versus 113, 3 positions changed), but for u_3 and u_1 the distance is 1 (113 versus 111, only 1 position changed).
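
A minimal sketch that reproduces the fig5 calculation, treating each user's code as a string and counting differing positions:

```python
def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    assert len(code_a) == len(code_b)
    return sum(x != y for x, y in zip(code_a, code_b))

# Codes from fig5: Gender, Student, Nationality encoded as digits
u_1, u_2, u_3 = "111", "002", "113"
print(hamming_distance(u_1, u_2))  # 3
print(hamming_distance(u_2, u_3))  # 3
print(hamming_distance(u_3, u_1))  # 1
```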

fig6: Cosine Distance

Cosine Similarity and Cosine Distance :

Cosine similarity and cosine distance depend on the relationship between distance and similarity: as the distance between two data points increases, they become more dissimilar (distance is inversely proportional to similarity here). The angle between two data points measures their similarity. A larger angle means the points are more dissimilar, while a smaller angle means they are more similar. Since the cosine lies between -1 and 1, the similarity values are always in the range -1 to 1, and cosine distance is defined as 1 minus the cosine similarity.
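
As a minimal sketch, cosine similarity can be computed from the dot product and vector norms (the vectors below are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 2 for opposite."""
    return 1.0 - cosine_similarity(a, b)

# Parallel vectors are maximally similar; opposite vectors are maximally distant
print(cosine_similarity([1, 2], [2, 4]))   # 1.0
print(cosine_distance([1, 2], [-1, -2]))   # 2.0
```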

Conclusion :

The Manhattan distance technique is used in higher-dimensional data sets, the Euclidean distance technique in lower-dimensional data sets, the Hamming distance technique in categorical data sets, and cosine similarity and cosine distance in recommendation systems.

Suggestion :

Distance measurement techniques face problems (do not work well) when the variables lie on different scales, such as salary and age, because the relative distances between points change in the presence of different scaling. So the variables should be scaled before applying any distance-based algorithm.
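
As a minimal sketch, variables on different scales can be standardized before computing distances; the salary/age values below are made up, and scikit-learn's StandardScaler is one common choice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: [salary, age] -- salary dominates raw distances
X = np.array([[50000, 25],
              [52000, 45],
              [90000, 30]])

# Standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Before scaling, the salary difference swamps the age difference;
# after scaling, both features contribute comparably to the distance
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_scaled[0] - X_scaled[1]))
```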