Natural Language Processing (Part 28)-Cosine Similarity: Intuition

📚Chapter 3: Vector Space Model


In this tutorial, you’re going to learn about cosine similarity, which is another type of similarity function. It uses the cosine of the angle between two vectors to tell you whether they point in a similar direction. In this section, you will see the problem with using Euclidean distance, especially when comparing vector representations of documents or corpora of different sizes, and how the cosine similarity metric helps you overcome this problem.


Euclidean distance
Cosine Similarity
Cosine distance using Python

Section 1- Euclidean distance

To illustrate how the Euclidean distance might be problematic, let’s take the following example. Suppose that you are in a vector space where the corpora are represented by the occurrence of the words disease and eggs. Consider the representation of a food corpus, an agriculture corpus, and a history corpus. Each one of these corpora has texts related to its subject, but the word totals in the corpora differ from one another. In fact, the agriculture and the history corpora have a similar number of words, while the food corpus has a relatively small number. Let’s define the Euclidean distance between the food and the agriculture corpus as d_1 and let the Euclidean distance between the agriculture and the history corpus be d_2. The distance d_2 is smaller than the distance d_1, which would suggest that the agriculture and history corpora are more similar than the agriculture and food corpora. That conclusion is driven by corpus size rather than by content: the food vector lies far from the agriculture vector mainly because the food corpus contains far fewer words.
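The comparison of d_1 and d_2 can be sketched in a few lines of Python. The word counts below are made up for illustration (the article gives no actual numbers); they are chosen only so that the food corpus is much smaller than the other two. The helper name euclidean_distance is likewise just an assumption for this sketch.

```python
import math

# Hypothetical (disease, eggs) word counts; illustrative values only.
food = (5, 15)
agriculture = (20, 40)
history = (30, 20)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

d_1 = euclidean_distance(food, agriculture)     # food vs agriculture
d_2 = euclidean_distance(agriculture, history)  # agriculture vs history

print(f"d_1 = {d_1:.2f}")
print(f"d_2 = {d_2:.2f}")
print("d_2 < d_1:", d_2 < d_1)
```

With these made-up counts, d_1 is about 29.15 and d_2 is about 22.36, so the Euclidean distance ranks agriculture closer to history than to food.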

Section 2- Cosine Similarity

Cosine distance looks at the angle between vectors of an inner product space, so it determines whether the vectors are pointing in roughly the same direction. This makes it a good choice when the magnitude of the vectors does not matter. Note the two related terms: cosine similarity is the cosine of the angle, cos θ, while cosine distance is 1 - cos θ.

Another common method for determining the similarity between vectors is computing the cosine of their inner angle. If the angle is small, the cosine would be close to one. As the angle approaches 90 degrees, the cosine approaches zero. As you can see here, the angle Alpha between food and agriculture is smaller than the angle Beta between agriculture and history. In this particular case, the cosine of those angles is a better proxy of similarity between these vector representations than their Euclidean distance.
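To make the angle comparison concrete, here is a minimal sketch that computes the cosine of Alpha and Beta. It assumes the same kind of hypothetical (disease, eggs) counts as an illustration; the numbers and the helper name cosine_similarity are not from the original example.

```python
import math

# Hypothetical (disease, eggs) word counts; illustrative values only.
food = (5, 15)
agriculture = (20, 40)
history = (30, 20)

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

cos_alpha = cosine_similarity(food, agriculture)    # angle Alpha
cos_beta = cosine_similarity(agriculture, history)  # angle Beta

# Convert the cosines back to angles in degrees for readability.
alpha = math.degrees(math.acos(cos_alpha))
beta = math.degrees(math.acos(cos_beta))

print(f"cos(Alpha) = {cos_alpha:.3f}, Alpha = {alpha:.1f} degrees")
print(f"cos(Beta)  = {cos_beta:.3f}, Beta  = {beta:.1f} degrees")
```

For these assumed counts, Alpha comes out at roughly 8 degrees and Beta at roughly 30 degrees, so the cosine ranks food and agriculture as the more similar pair even though the food corpus is much smaller.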


Let’s assume that a = (5, 3) and b = (2, 4) are two vectors in a 2D plane.

(a . b) = (5*2) + (3*4) = 10 + 12 = 22

|a| = √(5² + 3²) = √34 = 5.83

|b| = √(2² + 4²) = √20 = 4.47

Using the cosine distance formula, d = 1 - (a . b) / (|a| * |b|),

d = 1 - 22 / (5.83 * 4.47)

d = 1 - 0.844

d = 0.156

Note: if θ = 0, the two vectors point in exactly the same direction, and

distance = 1 - cos θ

= 1 - 1

= 0
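The hand calculation can be checked with a short from-scratch sketch that uses only the standard library. The function name cosine_distance is an assumption for this example, and the vector (10, 6) is an extra test case, chosen because it points in the same direction as (5, 3).

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(theta) for two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

a = (5, 3)
b = (2, 4)

# Matches the worked example: 1 - 22 / (5.83 * 4.47)
print(round(cosine_distance(a, b), 3))  # 0.156

# (10, 6) is a scaled copy of a, so theta = 0 and the distance is
# zero up to floating-point rounding.
print(abs(cosine_distance(a, (10, 6))) < 1e-9)  # True
```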

Section 3- Cosine distance using Python

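from scipy.spatial import distance

A = (5, 3)
B = (2, 4)

# distance.cosine returns the cosine distance, i.e. 1 - cos θ
d = distance.cosine(A, B)
print('Cosine Distance:', round(d, 3))

# Subtracting the distance from 1 recovers the cosine similarity
print('Cosine Similarity:', 1 - d)

OUTPUT:
Cosine Distance: 0.156
Cosine Similarity: 0.8436614877321075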


Now you’re familiar with the main intuition behind the use of cosine similarity as a metric to compare two vector representations. Remember that the main advantage of this metric over the Euclidean distance is that it isn’t biased by the size difference between the representations. If you have two documents of very different sizes, then taking the Euclidean distance is not ideal. The cosine similarity uses the angle between the document vectors and is thus not dependent on the size of the corpora.
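The size-independence claim can be illustrated with a small sketch. It assumes a hypothetical short document and a long document with the same word proportions but ten times the counts. All vectors and helper names here are made up for illustration.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

short_doc = (5, 3)                           # hypothetical word counts
long_doc = tuple(10 * x for x in short_doc)  # same proportions, 10x the words
other = (2, 4)

# Euclidean distance changes a lot when only the document size changes.
print(f"{euclidean_distance(short_doc, other):.2f}")
print(f"{euclidean_distance(long_doc, other):.2f}")

# Cosine similarity is the same for both, up to floating-point rounding.
print(f"{cosine_similarity(short_doc, other):.4f}")
print(f"{cosine_similarity(long_doc, other):.4f}")
```

Scaling a vector changes its length but not its direction, which is why only the Euclidean distance reacts to the tenfold increase in word counts.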

