Distance-based algorithms in data mining

Nadeem · Published in Analytics Vidhya · 12 min read · Sep 12, 2021


These algorithms measure the distance between pairs of data points (for example, documents represented as vectors) and use it to compute a similarity or dissimilarity score.

Distance measures play an important role in machine learning.

They provide the foundations for many popular and effective machine learning algorithms like KNN (K-Nearest Neighbours) for supervised learning and K-Means clustering for unsupervised learning.

Different distance measures must be chosen depending on the type of data. As such, it is important to know how to implement and calculate a range of popular distance measures and to understand the intuition behind the resulting scores.

In this blog, we’ll discover distance measures in machine learning.

Overview:

  1. Role of Distance Measures
  2. Hamming Distance
  3. Euclidean Distance
  4. Manhattan Distance (Taxicab or City Block)
  5. Minkowski Distance
  6. Mahalanobis Distance
  7. Cosine Similarity

Role of Distance Measures

Distance measures play an important role in machine learning.

A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain.

Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house) or an event (such as a purchase, a claim, or a diagnosis).

Perhaps the most likely way to encounter distance measures is when using a machine learning algorithm that relies on them at its core. The most famous such algorithm is KNN, the K-Nearest Neighbours algorithm.

KNN

A classification or regression prediction is made for a new example by calculating the distance between the new example and all existing examples in the training dataset.

The K training examples with the smallest distance are then selected, and a prediction is made by aggregating their outcomes (the mode of the class labels for classification, or the mean of the real values for regression), as in the sketch below.
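As a minimal sketch (not a production implementation), a distance-based KNN classifier can be written in a few lines. Euclidean distance, numeric features, and majority voting are assumed here, and the data is made up:

# minimal KNN classification sketch: Euclidean distance plus majority vote
from collections import Counter
from math import sqrt

def knn_predict(train_X, train_y, new_point, k=3):
    # distance from the new point to every training example, paired with its label
    distances = [(sqrt(sum((a - b) ** 2 for a, b in zip(row, new_point))), label)
                 for row, label in zip(train_X, train_y)]
    # keep the k nearest neighbours and return the most common class label
    k_nearest = sorted(distances)[:k]
    return Counter(label for _, label in k_nearest).most_common(1)[0][0]

train_X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
train_y = ['A', 'A', 'B', 'B']
print(knn_predict(train_X, train_y, [1.2, 1.9], k=3))  # expected: 'A'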

KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of which use distance measures in a similar manner. Another popular instance-based algorithm that uses distance measures is Learning Vector Quantization, or LVQ, an algorithm that may also be considered a type of neural network.

Next, we have the Self-Organizing Map algorithm, or SOM, which also uses distance measures and can be applied to both supervised and unsupervised learning. Another unsupervised learning algorithm with distance measures at its core is the K-Means clustering algorithm.

In instance-based learning, the training examples are stored verbatim and a distance function is used to determine which member of the training set is closest to an unknown test instance. Once the nearest training instance has been located, its class is predicted for the test instance.

A few of the more popular machine learning algorithms that use distance measures at their core are:

  1. K-Nearest Neighbors (KNN)
  2. Learning Vector Quantization (LVQ)
  3. Self-Organizing Map (SOM)
  4. K-Means Clustering

There are many kernel-based methods that may also be considered distance-based algorithms.

Perhaps the most widely known kernel method is the Support Vector Machine algorithm (SVM).

When calculating the distance between two examples or rows of data, it is possible that different data types are used for different columns of the example set.

The example set might contain real-valued, boolean, categorical, and ordinal values.

Different distance measures may be required for each, which are then summed together into a single distance score.

Numerical values may have different scales. This can greatly impact the calculation of the distance measure, so it is often good practice to normalize or standardize numerical values prior to calculating the distance, as in the sketch below.
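For instance, here is a minimal min-max normalization sketch (the data values are made up) showing how rescaling keeps a large-valued column from dominating the Euclidean distance:

# minimal sketch: min-max normalize columns before computing a distance
from math import sqrt

def min_max_normalize(rows):
    # per-column min and max, then rescale every value into [0, 1]
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

rows = [[1.0, 50000.0], [2.0, 52000.0], [3.0, 49000.0]]  # second column has a much larger scale
norm_rows = min_max_normalize(rows)
dist = sqrt(sum((a - b) ** 2 for a, b in zip(norm_rows[0], norm_rows[1])))
print(dist)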

Numerical error in regression problems may also be considered a distance. For example, the error between the expected value and the predicted value is a one-dimensional distance measure that can be summed or averaged over all examples in a test set to give a total distance between the expected and predicted outcomes in the dataset.

The calculation of the errors, such as the mean squared error or mean absolute error, may resemble a standard distance measure.
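As a quick illustration (the values are made up), the mean absolute error is just the averaged Manhattan distance between the expected and predicted vectors, and the mean squared error is the averaged squared Euclidean distance:

# minimal sketch: regression errors computed as distances between expected and predicted values
expected  = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5,  0.0, 2.0, 8.0]

mae = sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)    # averaged Manhattan distance
mse = sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)  # averaged squared Euclidean distance
print(mae, mse)  # 0.5 0.375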

As we can see, distance measures play an important role in machine learning.

The most commonly used distance measures in machine learning are:

  1. Hamming Distance
  2. Euclidean Distance
  3. Manhattan Distance
  4. Minkowski Distance
  5. Mahalanobis Distance
  6. Cosine Similarity

The most important thing is to know how to calculate each of these distance measures when implementing algorithms from scratch, and to understand the intuition for what is being calculated when using algorithms that rely on them.

Hamming Distance

Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings.

We are most likely going to encounter binary strings when we one-hot encode categorical columns of data.

For example, a categorical 'color' column with the values red, green, and blue can be one-hot encoded into three binary columns, so that each row becomes a bitstring:

red = [1, 0, 0]
green = [0, 1, 0]
blue = [0, 0, 1]
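If you want to produce such bitstrings yourself, one common option is pandas' get_dummies(); the 'color' column name below is just illustrative:

# minimal sketch: one-hot encode a categorical column with pandas
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
one_hot = pd.get_dummies(df['color'])
print(one_hot)  # one binary column per category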

The distance between red and green could then be calculated as the sum or the average number of bit differences between the two bitstrings. This is the Hamming distance.

For a one-hot encoded string, it might make more sense to summarize the distance as the sum of the bit differences between the strings, where each bit difference is always 0 or 1.

  • Hamming Distance = sum for i to N abs(v1[i] - v2[i])

For bitstrings that may have many 1 bits, it is more common to calculate the average number of bit differences to give a Hamming distance score between 0 (identical) and 1 (all different).

  • Hamming Distance = (sum for i to N abs(v1[i] - v2[i])) / N

We can demonstrate this with an example of calculating the Hamming Distance between two bitstrings, listed below.

# calculating hamming distance between bit strings
# calculate hamming distance as the average number of differing bits
def hamming_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming_distance(row1, row2)
print(dist)

We can see that there are two differences between the strings, or 2 of the 6 bit positions differ, which averaged (2/6) is about 1/3 or 0.33.

0.3333333333333333

We can also perform the same calculation using the hamming() function from SciPy. The complete example is listed below.

# calculating hamming distance between bit strings
from scipy.spatial.distance import hamming
# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

0.3333333333333333

Euclidean Distance

Euclidean distance calculates the distance between two real-valued vectors.

(Image: the distance between two points, taken from Khan Academy's Distance Formula Tutorial.)

To calculate the distance between two data points A and B, the Pythagorean theorem considers the lengths along the x- and y-axes.

You are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating-point or integer values.

If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.

  • EuclideanDistance = sqrt(sum for i to N (v1[i] - v2[i])²)

If the distance calculation is to be performed thousands or millions of times, it is common to remove the square root operation in an effort to speed up the calculation. The resulting scores will have the same relative proportions after this modification and can still be used effectively within a machine learning algorithm for finding the most similar examples.

  • EuclideanDistance = sum for i to N (v1[i] - v2[i])²

This calculation is related to the L2 vector norm and is equivalent to the sum squared error (and to the root sum squared error if the square root is added back).
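A minimal sketch of the sqrt-free variant, using the same example vectors as in the next listing (useful when only the ranking of distances matters, e.g., when finding nearest neighbours):

# minimal sketch: squared Euclidean distance, skipping the square root for speed
def squared_euclidean_distance(a, b):
    return sum((e1 - e2) ** 2 for e1, e2 in zip(a, b))

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
print(squared_euclidean_distance(row1, row2))  # 37, whose square root is about 6.08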

We can demonstrate this with an example of calculating the Euclidean distance between two real-valued vectors, listed below.

# calculating euclidean distance between vectors
from math import sqrt

# calculate euclidean distance
def euclidean_distance(a, b):
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = euclidean_distance(row1, row2)
print(dist)

Running the example reports the Euclidean distance between the two vectors.

6.082762530298219

We can also perform the same calculation using the euclidean() function from SciPy. The complete example is listed below.

# calculating euclidean distance between vectors
from scipy.spatial.distance import euclidean
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = euclidean(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

6.082762530298219

Manhattan Distance (Taxicab or City Block Distance)

The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors.

It is perhaps more useful for vectors that describe objects on a uniform grid, like a chessboard or city blocks. The taxicab name refers to the intuition for what the measure calculates: the shortest path a taxicab would take between city blocks (coordinates on the grid).

It might make sense to calculate Manhattan distance instead of Euclidean distance for two vectors in an integer feature space.

Manhattan distance is calculated as the sum of the absolute differences between the two vectors.

  • ManhattanDistance = sum for i to N |v1[i] - v2[i]|

The Manhattan distance is related to the L1 vector norm and the sum absolute error and mean absolute error metric.

We can demonstrate this with an example of calculating the Manhattan distance between two integer vectors, listed below.

# calculating manhattan distance between vectors
# calculate manhattan distance
def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = manhattan_distance(row1, row2)
print(dist)

Running the example reports the Manhattan distance between the two vectors.

13

We can also perform the same calculation using the cityblock() function from SciPy. The complete example is listed below.

# calculating manhattan distance between vectors
from scipy.spatial.distance import cityblock
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = cityblock(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

13

Minkowski Distance

Minkowski distance calculates the distance between two real-valued vectors.

It is a generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the “order” or “p”, that allows different distance measures to be calculated.

The Minkowski distance measure is calculated as follows:

  • MinkowskiDistance = (sum for i to N (abs(v1[i] - v2[i]))^p)^(1/p)

Where “p” is the order parameter.

When p is set to 1, the calculation is the same as the Manhattan distance. When p is set to 2, it is the same as the Euclidean distance.

  • p=1: Manhattan distance.
  • p=2: Euclidean distance.

Intermediate values provide a controlled balance between the two measures.

It is common to use Minkowski distance when implementing a machine learning algorithm that uses distance measures as it gives control over the type of distance measure used for real-valued vectors via a hyperparameter “p” that can be tuned.
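For example, in a minimal scikit-learn sketch (with made-up data), the p parameter of KNeighborsClassifier selects the Minkowski order, so switching between Manhattan and Euclidean distance is a one-argument change:

# minimal sketch: tuning the Minkowski order "p" in a KNN classifier
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = ['A', 'A', 'B', 'B']

# p=1 uses Manhattan distance, p=2 uses Euclidean distance (metric defaults to 'minkowski')
for p in (1, 2):
    model = KNeighborsClassifier(n_neighbors=3, p=p)
    model.fit(X, y)
    print(p, model.predict([[1.2, 1.9]]))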

We can demonstrate this calculation with an example of calculating the Minkowski distance between two real vectors, listed below.

# calculating minkowski distance between vectors
# calculate minkowski distance
def minkowski_distance(a, b, p):
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example first calculates and prints the Minkowski distance with p set to 1 to give the Manhattan distance, then with p set to 2 to give the Euclidean distance, matching the values calculated on the same data from the previous sections.

13.0
6.082762530298219

We can also perform the same calculation using the minkowski_distance() function from SciPy. The complete example is listed below.

# calculating minkowski distance between vectors
from scipy.spatial import minkowski_distance
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example, we can see we get the same results, confirming our manual implementation.

13.0
6.082762530298219

Mahalanobis Distance

Mahalanobis distance is an effective multivariate distance metric that measures the distance between a point (vector) and a distribution. It has excellent applications in multivariate anomaly detection, classification on highly imbalanced datasets, one-class classification, and more untapped use cases.

Despite these useful applications, this metric is seldom discussed or used in statistics or ML workflows. Below, we explain why and when to use the Mahalanobis distance, followed by the intuition and the math.

Mahalanobis distance is the distance between a point and a distribution, not between two distinct points. It is effectively a multivariate equivalent of the Euclidean distance, and it works in three steps:

  1. It transforms the columns into uncorrelated variables.
  2. It scales the columns to make their variance equal to 1.
  3. Finally, it calculates the Euclidean distance in this transformed space.
The distance is calculated as:

  • D^2 = (x - m)^T · C^(-1) · (x - m)

where,
- D^2 is the square of the Mahalanobis distance,
- x is the vector of the observation (a row in the dataset),
- m is the vector of mean values of the independent variables (the mean of each column),
- C^(-1) is the inverse covariance matrix of the independent variables.

Compute Mahalanobis Distance

import pandas as pd
import numpy as np
from scipy import linalg

# read the dataset and keep three numeric columns (the file path and column positions are illustrative)
filepath = 'local/input.csv'
df = pd.read_csv(filepath).iloc[:, [0, 4, 6]]
df.head()

Let’s write the function to calculate Mahalanobis Distance:

def mahalanobis(x=None, data=None, cov=None):
    """Compute the Mahalanobis distance between each row of x and the distribution in data.

    x    : vector or matrix of data with, say, p columns.
    data : ndarray or DataFrame of the distribution from which the Mahalanobis distance
           of each observation of x is to be computed.
    cov  : covariance matrix (p x p) of the distribution. If None, it is computed from data.
    """
    x_minus_mu = x - np.mean(data)
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = linalg.inv(cov)
    left_term = np.dot(x_minus_mu, inv_covmat)
    mahal = np.dot(left_term, x_minus_mu.T)
    return mahal.diagonal()

# the column names 'A', 'B', 'C' stand in for the three columns selected above
df_x = df[['A', 'B', 'C']].head(500)
df_x['maha_dist'] = mahalanobis(x=df_x, data=df[['A', 'B', 'C']])
df_x.head()

Cosine Similarity

The cosine distance metric is mostly used to find similarities between documents. With the cosine metric, we measure the angle between two document vectors (for example, term-frequency vectors). This metric is used when the magnitude of the vectors does not matter, only their orientation.

The cosine similarity formula can be derived from the equation for the dot product:

  • cos(θ) = (A · B) / (||A|| × ||B||)

Now, you might be wondering which values of the cosine are helpful in finding similarities.

The cosine ranges between -1 and 1, so we need to know what 1, 0, and -1 signify.

A cosine value of 1 means the vectors point in the same direction, i.e., the documents/data points are similar. A value of 0 means the vectors are orthogonal, i.e., unrelated (no similarity). A value of -1 means the vectors point in opposite directions (no similarity, dissimilar).

scikit-learn also provides a pairwise helper for this:

sklearn.metrics.pairwise.cosine_similarity(X, Y=None, dense_output=True)
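A short usage sketch of that helper (it expects 2-D, row-per-sample inputs; the vectors reuse the example data below):

# minimal sketch: pairwise cosine similarity with scikit-learn
from sklearn.metrics.pairwise import cosine_similarity

X = [[3, 45, 7, 2]]
Y = [[2, 54, 13, 15]]
print(cosine_similarity(X, Y))  # a 1x1 matrix, approximately [[0.9723]]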

Example for Cosine Similarity

from scipy import spatial

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
# cosine() returns the cosine distance, so subtract it from 1 to get the similarity
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
print(result)

Another version, based on NumPy:

from numpy import dot
from numpy.linalg import norm

# a and b are the two vectors being compared (same example data as above)
a, b = [3, 45, 7, 2], [2, 54, 13, 15]
cos_sim = dot(a, b) / (norm(a) * norm(b))
print(cos_sim)

Defining the Cosine Similarity function

import math

def cosine_similarity(v1, v2):
    """Compute cosine similarity of v1 to v2: (v1 dot v2) / (||v1|| * ||v2||)."""
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x, y = v1[i], v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

v1, v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(cosine_similarity(v1, v2))

Running the example, we can see we get the same results, confirming our manual implementation.

0.972284251712


Summary

In this blog post, you discovered distance measures in machine learning.

Specifically, you learned:

  • The role and importance of distance measures in machine learning algorithms.
  • How to implement and calculate the Hamming, Euclidean, Manhattan, and Mahalanobis distance measures, as well as cosine similarity.
  • How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Happy Learning!
