Top 5 Distance Similarity Measures implementation in Machine Learning

Shriya Gupta
4 min read · Sep 30, 2019

Introduction

The term similarity distance measure has a wide variety of definitions among math and data mining practitioners. Similarity is a basic building block of unsupervised learning methods such as clustering, and of distance-based classifiers such as k-nearest neighbours.

Similarity

Similarity measures how much two objects are alike. In a data mining context, a similarity measure is usually expressed as a distance whose dimensions represent features of the objects. If the distance is small, the two objects are very similar, whereas a large distance indicates a low degree of similarity.

There are many similarity and distance measures. Here we will look at the 5 most important ones.

1) Cosine similarity:

Cosine similarity is a metric used to measure how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors in a multi-dimensional space. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented in nearly the same direction. The smaller the angle, the higher the cosine similarity.

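Formula: $\cos(\theta) = \dfrac{A \cdot B}{\|A\| \, \|B\|} = \dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$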

Let's see how we can do this in SciPy:
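A minimal sketch, assuming SciPy and NumPy are installed; the term-count vectors `doc_a` and `doc_b` are made up for illustration. Note that `scipy.spatial.distance.cosine` returns the cosine distance, so the similarity is one minus that value.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical term-count vectors; doc_b is doc_a repeated twice.
doc_a = np.array([3, 2, 0, 5])
doc_b = np.array([6, 4, 0, 10])

# SciPy returns the cosine *distance*, i.e. 1 - cosine similarity.
cos_dist = distance.cosine(doc_a, doc_b)
cos_sim = 1.0 - cos_dist

# The vectors point the same way, so the similarity is 1 (up to rounding)
# even though their Euclidean distance is clearly non-zero.
print("cosine similarity:", cos_sim)
print("euclidean distance:", distance.euclidean(doc_a, doc_b))
```

Because `doc_b` is just `doc_a` scaled by 2, the angle between them is zero, which is the size-independence described above.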

Let us also look at the internal implementation in SciPy:
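The same quantity can be written directly with NumPy. This is a sketch in the spirit of SciPy's function, not a copy of its source; the helper name `cosine_distance` is made up here, and it assumes both vectors are non-zero.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two non-zero 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Dot product divided by the product of the vector lengths is cos(theta).
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - cos_theta

# Perpendicular vectors have cosine similarity 0, so the distance is 1.
print(cosine_distance([1, 0], [0, 1]))
```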

2) Manhattan distance:

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, in the plane it is the absolute difference between the x-coordinates plus the absolute difference between the y-coordinates.

Formula: In a plane with $p_1$ at $(x_1, y_1)$ and $p_2$ at $(x_2, y_2)$, the Manhattan distance is $|x_1 - x_2| + |y_1 - y_2|$.

Let's see how we can do this in SciPy:
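A minimal sketch, assuming SciPy is installed; the two points are made up for illustration. SciPy exposes the Manhattan distance under the name `cityblock`.

```python
from scipy.spatial import distance

# Two hypothetical points in the plane.
p1 = (1, 2)
p2 = (4, 6)

# Sum of the absolute coordinate differences.
manhattan = distance.cityblock(p1, p2)
print(manhattan)  # |1 - 4| + |2 - 6| = 7
```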

Let's also look at the internal implementation in SciPy:
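The computation itself is a one-liner in NumPy. This is an illustrative sketch rather than SciPy's actual source, and the function name `manhattan_distance` is an assumption.

```python
import numpy as np

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.abs(u - v).sum()

print(manhattan_distance((1, 2), (4, 6)))  # 3 + 4 = 7.0
```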

3) Euclidean distance:

The Euclidean distance between two points in the plane, in 3-dimensional space, or more generally in n-dimensional space is the length of the straight segment connecting the two points. It is the most obvious way of representing the distance between two points.

The Pythagorean theorem can be used to calculate the distance between two points, as shown in the figure below: the segment is the hypotenuse of a right triangle whose legs are the coordinate differences.

Formula: If the points $(x_1, y_1)$ and $(x_2, y_2)$ are in 2-dimensional space, then the Euclidean distance between them is $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.

Let's see how we can do this in SciPy:
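A minimal sketch, assuming SciPy is installed; the two points are made up so that the result is a familiar 3-4-5 right triangle.

```python
from scipy.spatial import distance

# Two hypothetical points in the plane.
p1 = (1, 2)
p2 = (4, 6)

# Straight-line distance between the two points.
euclid = distance.euclidean(p1, p2)
print(euclid)  # sqrt(3**2 + 4**2) = 5.0
```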

Let's also look at the internal implementation in SciPy:
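Written out with NumPy, the formula looks as follows. This is an illustrative sketch rather than SciPy's actual source; the function name `euclidean_distance` is an assumption, and it works for vectors of any length, not only 2-D points.

```python
import numpy as np

def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sqrt(np.sum((u - v) ** 2))

print(euclidean_distance((1, 2), (4, 6)))  # sqrt(9 + 16) = 5.0
```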

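4) Minkowski distance:

Minkowski distance is a generalisation of the Euclidean and Manhattan distances: setting the order p to 1 gives the Manhattan distance, and p = 2 gives the Euclidean distance.

Formula: The Minkowski distance of order p between two points $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ is defined as $D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$.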

Let's see how we can do this in SciPy:
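A minimal sketch, assuming SciPy is installed; the vectors are made up for illustration. The order is passed through the `p` argument of `scipy.spatial.distance.minkowski`.

```python
from scipy.spatial import distance

# Two hypothetical vectors.
u = [1, 2]
v = [4, 6]

d1 = distance.minkowski(u, v, p=1)  # same value as the Manhattan distance
d2 = distance.minkowski(u, v, p=2)  # same value as the Euclidean distance
d3 = distance.minkowski(u, v, p=3)  # a higher order, for comparison

print(d1, d2, d3)
```

For these vectors, `d1` matches `distance.cityblock(u, v)` and `d2` matches `distance.euclidean(u, v)`, which is a quick way to check the generalisation.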

Let's also look at the internal implementation in SciPy:
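The definition translates almost word for word into NumPy. This is an illustrative sketch rather than SciPy's actual source; the function name `minkowski_distance` is an assumption, and `p` is assumed to be at least 1.

```python
import numpy as np

def minkowski_distance(u, v, p):
    """(sum_i |u_i - v_i|**p) ** (1/p) for two 1-D vectors and order p >= 1."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

print(minkowski_distance([1, 2], [4, 6], p=1))  # Manhattan case
print(minkowski_distance([1, 2], [4, 6], p=2))  # Euclidean case
```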

5) Jaccard similarity:

The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.

Formula: $J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$

Let's see how we can do this in SciPy:
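A minimal sketch, assuming SciPy and NumPy are installed. Each set is encoded as a boolean membership vector (made up for illustration), and `scipy.spatial.distance.jaccard` returns the Jaccard dissimilarity, so the similarity is one minus that value.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical membership vectors: True means the item is in the set.
a = np.array([True, True, False, True])
b = np.array([True, False, False, True])

# SciPy returns the Jaccard *dissimilarity*, i.e. 1 - Jaccard similarity.
jaccard_dist = distance.jaccard(a, b)
jaccard_sim = 1.0 - jaccard_dist

# Intersection has 2 items, union has 3, so the similarity is 2/3.
print("jaccard similarity:", jaccard_sim)
```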

Let's also look at the internal implementation in SciPy:

Congratulations! You have successfully learnt about common distance and similarity measures in Machine Learning.
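For boolean vectors, the dissimilarity can be sketched with NumPy as the number of positions where exactly one vector is True, divided by the number of positions where at least one is True. This is an illustration, not SciPy's actual source; the function name `jaccard_distance` is an assumption, and the sketch assumes at least one True entry so the denominator is non-zero.

```python
import numpy as np

def jaccard_distance(u, v):
    """Jaccard dissimilarity between two boolean 1-D vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    # Positions present in exactly one of the two sets.
    differing = np.logical_xor(u, v).sum()
    # Positions present in at least one set (size of the union).
    union = np.logical_or(u, v).sum()
    return differing / union

a = [True, True, False, True]
b = [True, False, False, True]
print(1.0 - jaccard_distance(a, b))  # similarity = |A ∩ B| / |A ∪ B|
```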

Resources:

SciPy implementation of distance: https://github.com/scipy/scipy/blob/v0.14.1/scipy/spatial/distance.py#L199
