Euclidean Distance and Cosine Similarity. Which One to Use and When?

Gueter Josmy Faure
Sep 3, 2020 · 4 min read
Photo by Markus Winkler on Unsplash

This is a quick, straight-to-the-point introduction to Euclidean distance and cosine similarity, with a focus on NLP.

Euclidean Distance

The Euclidean distance metric lets you measure how far apart two points, or two vectors, are from each other.

Now suppose you are a high school student with three classes: a math class, a philosophy class, and a psychology class. You want to check the similarity between these classes based on the words your professors use in class. For the sake of simplicity, let’s consider just two words: “theory” and “harmony”. You could then create a table like this to record the occurrence of these words in each class:

              Math   Philosophy   Psychology
theory          60           20           25
harmony         10           40           70

In this table, the word “theory” occurs 60 times in math class, 20 times in philosophy class, and 25 times in psychology class, whereas the word “harmony” occurs 10, 40, and 70 times in math, philosophy, and psychology classes, respectively. Let’s plot this data on a 2D plane.

Word vectors in 2D plane

The Euclidean distance is simply the distance between the points as shown in the graph below:

Euclidean distances d1 and d2 between the word vectors

You can clearly see that d1, the distance between psychology and philosophy, is smaller than d2, the distance between philosophy and math. But how do you calculate d1 and d2?

The generic formula is the following:

d(v, w) = √( Σᵢ (vᵢ − wᵢ)² )

In our case, for d1, d(v, w) = d(philosophy, psychology), which is:

d1 = √((20 − 25)² + (40 − 70)²) = √(25 + 900) = √925 ≈ 30.41

And for d2, d(philosophy, math):

d2 = √((20 − 60)² + (40 − 10)²) = √(1600 + 900) = √2500 = 50

As expected, d2 > d1.

How to do this in Python?

import numpy as np

# define the vectors
math = np.array([60, 10])
philosophy = np.array([20, 40])
psychology = np.array([25, 70])

# calculate d1
d1 = np.linalg.norm(philosophy - psychology)

# calculate d2
d2 = np.linalg.norm(philosophy - math)
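Running this gives d1 ≈ 30.41 and d2 = 50.0, matching the hand calculations above.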

Cosine Similarity

Suppose you only have 2 hours of psychology class per week, and 5 hours each of math class and philosophy class. Because you attend more of these two classes, the words “theory” and “harmony” will occur more often there than in the psychology class. Here is the updated table:

              Math   Philosophy   Psychology
theory          80           50           15
harmony         45           60           20

And the updated 2D graph:

Updated word vectors in 2D plane

Using the formula we gave earlier for Euclidean distance, we find that, in this case, d1 is greater than d2. But we know psychology is closer to philosophy than it is to math. The different class hours inflate the word counts and trick the Euclidean distance metric. Cosine similarity solves this problem.

Instead of calculating the straight line distance between the points, cosine similarity cares about the angle between the vectors.

Angles α and β between the word vectors

Zooming in on the graph, we can see that the angle α is smaller than the angle β. That’s all cosine similarity wants to know. In other words, the smaller the angle, the closer the vectors are to each other.

The generic formula goes as follows:

cos(β) = (v · w) / (‖v‖ ‖w‖)

β is the angle between the vectors philosophy (represented by v) and math (represented by w).

cos(β) = (50 × 80 + 60 × 45) / (√(50² + 60²) × √(80² + 45²)) = 6700 / (√6100 × √8425) ≈ 0.93

Whereas cos(α) ≈ 0.99, which is higher than cos(β), meaning philosophy is closer to psychology than it is to math.

Recall that cos(0°) = 1 and cos(90°) = 0.

This implies that the smaller the angle, the greater your cosine similarity will be; and the greater your cosine similarity, the more similar your vectors are.

Python implementation

import numpy as np

# updated word-count vectors
math = np.array([80, 45])
philosophy = np.array([50, 60])
psychology = np.array([15, 20])

# cosine of the angle β between philosophy and math
cos_beta = np.dot(philosophy, math) / (np.linalg.norm(philosophy) * np.linalg.norm(math))
print(cos_beta)  # ≈ 0.93
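To complete the comparison, here is a short sketch (my addition, not from the original post) that wraps the same formula in a hypothetical cosine_similarity helper and also computes cos(α) between philosophy and psychology:

import numpy as np

math = np.array([80, 45])
philosophy = np.array([50, 60])
psychology = np.array([15, 20])

# hypothetical helper: cosine similarity between two vectors
def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

cos_beta = cosine_similarity(philosophy, math)         # ≈ 0.93
cos_alpha = cosine_similarity(philosophy, psychology)  # ≈ 0.99
print(cos_alpha > cos_beta)  # True: philosophy is closer to psychology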

Takeaway

I bet you know by now how Euclidean distance and cosine similarity work. The former considers the straight-line distance between two points, whereas the latter cares about the angle between the two vectors in question.

Euclidean distance is more straightforward and works well when your feature distributions are balanced, that is, when the vectors you compare have similar magnitudes. But most of the time we deal with unbalanced data, and in such cases it’s better to use cosine similarity, since it compares only the directions of the vectors and ignores their magnitudes. Cosine similarity works well whether or not your data is balanced.
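As a final illustration of why cosine similarity handles unbalanced data well, here is a minimal sketch (my own example, using made-up vectors a and b) showing that scaling a vector changes its Euclidean distance to another vector but leaves the cosine similarity unchanged:

import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

a = np.array([20, 40])
b = np.array([25, 70])
a_scaled = 4 * a  # same word proportions, e.g. a class with 4x the hours

# Euclidean distance changes with scale...
print(np.linalg.norm(a - b), np.linalg.norm(a_scaled - b))      # ≈ 30.41 vs. ≈ 105.48
# ...but cosine similarity does not
print(cosine_similarity(a, b), cosine_similarity(a_scaled, b))  # both ≈ 0.99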
