This is a quick, to-the-point introduction to Euclidean distance and cosine similarity, with a focus on NLP.
The Euclidean distance metric tells you how far apart two points (or two vectors) are from each other.
Now suppose you are a high school student taking three classes: a math class, a philosophy class, and a psychology class. You want to check the similarity between these classes based on the words your professors use in class. For the sake of simplicity, let's consider just two words: "theory" and "harmony". You could then create a table like this to record the occurrence of these words in each class:

|         | Math | Philosophy | Psychology |
|---------|------|------------|------------|
| theory  | 60   | 20         | 25         |
| harmony | 10   | 40         | 70         |

In this table, the word "theory" appears 60 times in math class, 20 times in philosophy class, and 25 times in psychology class, whereas the word "harmony" appears 10, 40, and 70 times in math, philosophy, and psychology classes respectively. Let's translate this data into a 2D plane, treating each class as a vector: math = (60, 10), philosophy = (20, 40), psychology = (25, 70).
The Euclidean distance is simply the distance between the points as shown in the graph below:
You can see clearly that d1 which is the distance between psychology and philosophy is smaller than d2 which is the distance between philosophy and math. But how do you calculate d1 and d2?
The generic formula for the distance between two n-dimensional vectors v and w is:

d(v, w) = √( Σᵢ (vᵢ − wᵢ)² )

In our case, for d1:

d1 = d(philosophy, psychology) = √((20 − 25)² + (40 − 70)²) = √(25 + 900) = √925 ≈ 30.4

and for d2:

d2 = d(philosophy, math) = √((20 − 60)² + (40 − 10)²) = √(1600 + 900) = √2500 = 50

As expected, d2 > d1.
How do you do this in Python?
```python
import numpy as np

# define the vectors
math = np.array([60, 10])
philosophy = np.array([20, 40])
psychology = np.array([25, 70])

# calculate d1
d1 = np.linalg.norm(philosophy - psychology)

# calculate d2
d2 = np.linalg.norm(philosophy - math)
```
Suppose you only have 2 hours of psychology class per week, but 5 hours each of math class and philosophy class. Because you attend more of these two classes, the occurrences of the words "theory" and "harmony" will be greater there than in the psychology class. Thus the updated table:

|         | Math | Philosophy | Psychology |
|---------|------|------------|------------|
| theory  | 80   | 50         | 15         |
| harmony | 45   | 60         | 20         |
And the updated 2D graph:
Using the formula we've given earlier for Euclidean distance, we now find that d1 is greater than d2 (d1 ≈ 53.2 versus d2 ≈ 33.5). But we know psychology is closer to philosophy than it is to math: the different class frequencies trick the Euclidean distance metric. Cosine similarity is here to solve this problem.
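We can check this with the same NumPy approach as before, plugging in the updated word counts:

```python
import numpy as np

# updated word counts: (theory, harmony) per class
math = np.array([80, 45])
philosophy = np.array([50, 60])
psychology = np.array([15, 20])

# Euclidean distances with the updated counts
d1 = np.linalg.norm(philosophy - psychology)  # philosophy vs psychology
d2 = np.linalg.norm(philosophy - math)        # philosophy vs math

print(d1, d2)  # d1 ≈ 53.2, d2 ≈ 33.5 — Euclidean distance now ranks math as closer
```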
Instead of calculating the straight line distance between the points, cosine similarity cares about the angle between the vectors.
Zooming in on the graph, we can see that the angle α, is smaller than the angle β. That’s all cosine similarity wants to know. In other words, the smaller the angle, the closer the vectors are to each other.
The generic formula goes as follows:

cos(β) = (v · w) / (‖v‖ ‖w‖)

where β is the angle between the vectors philosophy (represented by v) and math (represented by w), v · w is their dot product, and ‖v‖ is the norm (length) of v.
Plugging in the updated vectors gives cos(α) ≈ 0.99, which is higher than cos(β) ≈ 0.93, meaning philosophy is closer to psychology than it is to math.
This implies that the smaller the angle, the greater your cosine similarity will be and the greater your cosine similarity, the more similar your vectors are.
```python
import numpy as np

math = np.array([80, 45])
philosophy = np.array([50, 60])
psychology = np.array([15, 20])

# cosine similarity between philosophy and math
cos_beta = np.dot(philosophy, math) / (np.linalg.norm(philosophy) * np.linalg.norm(math))
print(cos_beta)  # ≈ 0.93

# cosine similarity between philosophy and psychology
cos_alpha = np.dot(philosophy, psychology) / (np.linalg.norm(philosophy) * np.linalg.norm(psychology))
print(cos_alpha)  # ≈ 0.99
```
I bet you know by now how Euclidean distance and cosine similarity work. The former considers the straight line distance between two points whereas the latter cares about the angle between the two vectors in question.
Euclidean distance is more straightforward and works well when your feature distributions are balanced. But most of the time we deal with unbalanced data, as in the class-hours example above, and in such cases it's better to use cosine similarity. Cosine similarity works well whether or not your data distribution is balanced.
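To see why, here is a small sketch (the vectors are made up for illustration). Scaling a vector up — for example, counting words over three times as many class hours — changes its Euclidean distance to other vectors, but leaves the cosine similarity unchanged, because the angle between the vectors stays the same:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([20.0, 40.0])
b = np.array([25.0, 70.0])
b_scaled = 3 * b  # same word proportions, three times the class hours

# Euclidean distance is sensitive to the scaling...
print(np.linalg.norm(a - b), np.linalg.norm(a - b_scaled))

# ...but cosine similarity is not: the angle is unchanged
print(cosine(a, b), cosine(a, b_scaled))
```

This scale invariance is exactly what makes cosine similarity robust to unbalanced word counts.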