Understanding Euclidean Distance and Cosine Similarity

Vikram Ojha
Published in Analytics Vidhya
3 min read · Apr 3, 2020

Photo by Paolo Nicolello on Unsplash

Finding similarity is one of the most fascinating ideas in NLP. The goal is to measure how similar two sentences are to each other, or how similar two images, two documents, or two voice recordings are.

There are five popular techniques, which are mentioned here.

We will start with Euclidean distance and then focus mainly on cosine similarity. I will also present Python code for both methods.

Euclidean Distance

Euclidean distance is the straight-line distance between two points, computed with the Pythagorean theorem. The smaller the Euclidean distance between two points, the more similar those points are.

As the table above shows, the Euclidean distance between the two extreme points, p1 and p4, is 5.099, while the distance between the nearby points p2 and p3 is 1.414.

The beauty of Euclidean distance is that it also lets us measure the distance between two points in n-dimensional space.
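
For reference, the general n-dimensional form can be written as follows. The symbols p, q, and n are notation introduced here for illustration; they are not taken from the original table.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```

With n = 2 this reduces to the Pythagorean theorem applied to the horizontal and vertical differences between the two points.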

Code
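
A minimal sketch of a loop-based implementation is shown here. The function name `euclidean_distance` follows the text, but the sample points are hypothetical and are not the p1 to p4 values from the table.

```python
import math


def euclidean_distance(p, q):
    """Return the Euclidean distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    total = 0.0
    # Accumulate the squared difference along every dimension
    for a, b in zip(p, q):
        total += (a - b) ** 2
    return math.sqrt(total)


# Hypothetical sample points (a 3-4-5 right triangle)
point_a = (1, 1)
point_b = (4, 5)
print(euclidean_distance(point_a, point_b))  # 5.0
```

Because the loop walks over every coordinate pair, the same function works unchanged for points with more than two dimensions.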

Euclidean distance, Python implementation

Alternatively, we can implement the same euclidean_distance function using a list comprehension.
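
One possible list-comprehension version is sketched here. The sample points are again hypothetical, and note that `zip` silently stops at the shorter input, so this compact form assumes both points have the same length.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance written with a list comprehension."""
    # Build the list of squared differences, sum it, then take the root
    return math.sqrt(sum([(a - b) ** 2 for a, b in zip(p, q)]))


# Hypothetical 3-dimensional sample points
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
```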

Euclidean distance using list comprehension

Cosine Similarity

As the figure above shows, cosine similarity, as the name suggests, is the cosine of the angle between two points, with each point treated as a vector from the origin.

The higher the cosine similarity, the more similar the two points are in vector space.
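
Written out, the standard definition looks like this. The vector names A and B and the index i are notation chosen here for illustration.

```latex
\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
           = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \, \sqrt{\sum_{i} B_i^2}}
```

The value ranges from -1 to 1 in general. For vectors with only non-negative entries, such as word counts, it stays between 0 and 1.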

Cosine similarity is commonly used to find the similarity between two documents or two sentences. Now suppose we have two documents, A and B, where A is a snippet of B (A ⊆ B). If we pick a word that appears in both documents, say "cricket", it will most likely occur far fewer times in A than in B. The Euclidean distance can be misleading here because of this large difference in counts, so in this case we would use cosine similarity, which helps us solve this problem.
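
A small numeric illustration of this situation is sketched here. The word-count vectors are hypothetical, and as a simplifying assumption document B is taken to use every word exactly ten times as often as document A, which a real snippet would only approximate.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical counts for the words ("cricket", "bat", "ball")
doc_a = [2, 1, 3]      # short snippet
doc_b = [20, 10, 30]   # longer document, same proportions (assumed)

print(euclidean_distance(doc_a, doc_b))  # large, about 33.67
print(cosine_similarity(doc_a, doc_b))   # very close to 1.0
```

The two vectors point in the same direction, so the cosine similarity stays near 1 even though the raw counts, and therefore the Euclidean distance, differ greatly.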

Code
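
A minimal pure-Python sketch of cosine similarity is shown here. The function name `cosine_similarity` follows the text; the zero-vector check and the sample points are assumptions added for illustration, not part of the original code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical sample points
print(cosine_similarity((1, 0), (0, 1)))  # perpendicular vectors: 0.0
print(cosine_similarity((1, 2), (2, 4)))  # same direction: approximately 1.0
```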

Cosine similarity, taking the same points in vector space

In this article, I have left some points open, such as why cosine similarity works and how it neutralizes the difference in word counts when document A ⊆ B. I will explain this in my next article.

Thanks

Vikram Ojha
Senior R&D Engineer, ABB | Data Science Enthusiast | Areas of interest: Transfer Learning, NLP, Deep Learning | https://www.linkedin.com/in/vikramojha/