Natural Language Processing (Part 28)-Cosine Similarity: Intuition

Coursesteach
Jan 28, 2024

📚Chapter 3: Vector Space Model

Introduction

In this tutorial, you’re going to learn about cosine similarity, which is another type of similarity function. It uses the cosine of the angle between two vectors to tell you whether the vectors are close or not. In this section, you will see the problem with using Euclidean distance, especially when comparing vector representations of documents or corpora, and how the cosine similarity metric can help you overcome this problem.

Sections

Euclidean distance
Cosine Similarity
Cosine distance using Python
Summary

Section 1- Euclidean distance

To illustrate how the Euclidean distance might be problematic, let’s take the following example. Suppose that you are in a vector space where the corpora are represented by the occurrences of the words disease and eggs. Consider the representations of a food corpus, an agriculture corpus, and a history corpus. Each of these corpora contains texts related to its subject, but the word totals differ from one corpus to another. In fact, the agriculture and history corpora have a similar number of words, while the food corpus is relatively small. Let d_1 be the Euclidean distance between the food and agriculture corpora, and let d_2 be the Euclidean distance between the agriculture and history corpora. In this example, d_2 is smaller than d_1, which would suggest that the agriculture and history corpora are more similar than the agriculture and food corpora. That conclusion reflects the difference in corpus size rather than a difference in content.
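The comparison of d_1 and d_2 can be sketched in a few lines of Python. The word counts below are hypothetical values chosen only to match the description (a small food corpus, two larger corpora of similar size), and the helper name euclidean_distance is an assumption made for this illustration.

```python
import math

# Hypothetical (disease, eggs) word counts; illustrative values only.
food = (5, 15)
agriculture = (20, 40)
history = (30, 20)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


d_1 = euclidean_distance(food, agriculture)
d_2 = euclidean_distance(agriculture, history)

print(f"d_1 (food vs agriculture):    {d_1:.2f}")
print(f"d_2 (agriculture vs history): {d_2:.2f}")
print("d_2 < d_1:", d_2 < d_1)
```

With these counts d_2 comes out smaller than d_1, even though the food and agriculture vectors point in a more similar direction.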

Section 2- Cosine Similarity

Cosine similarity looks at the angle between two vectors in an inner product space, so it determines whether the vectors point in roughly the same direction. Cosine distance is defined as 1 minus the cosine similarity. Both are a good fit when the magnitude of the vectors does not matter.
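To see what "magnitude does not matter" means in practice, here is a minimal sketch that compares a short document with a hypothetical document that has the same word proportions but ten times as many words. The vectors and the helper name cosine_similarity are assumptions for illustration, and the helper assumes non-zero vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


short_doc = (2, 4)    # hypothetical word counts
long_doc = (20, 40)   # same proportions, ten times as many words

similarity = cosine_similarity(short_doc, long_doc)
euclidean = math.dist(short_doc, long_doc)

print(f"Cosine similarity:  {similarity:.4f}")
print(f"Euclidean distance: {euclidean:.2f}")
```

The two vectors point in the same direction, so the cosine similarity is 1 up to floating-point rounding, while the Euclidean distance is large simply because one document is longer.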

Another common method for determining the similarity between vectors is computing the cosine of their inner angle. If the angle is small, the cosine is close to one. As the angle approaches 90 degrees, the cosine approaches zero. In the corpus example above, the angle Alpha between food and agriculture is smaller than the angle Beta between agriculture and history. In this particular case, the cosine of those angles is a better proxy of similarity between these vector representations than their Euclidean distance.
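As a rough check of the Alpha and Beta comparison, the sketch below computes both angles for hypothetical (disease, eggs) counts. The numbers and the helper name angle_degrees are assumptions made for this illustration, not values from the original figure.

```python
import math


def angle_degrees(u, v):
    """Angle in degrees between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    cos_theta = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical (disease, eggs) word counts; illustrative values only.
food = (5, 15)
agriculture = (20, 40)
history = (30, 20)

alpha = angle_degrees(food, agriculture)
beta = angle_degrees(agriculture, history)

print(f"Alpha (food vs agriculture):    {alpha:.2f} degrees")
print(f"Beta  (agriculture vs history): {beta:.2f} degrees")
print("Alpha < Beta:", alpha < beta)
```

With these counts Alpha is roughly 8 degrees and Beta roughly 30 degrees, so the cosine ranks food and agriculture as the more similar pair.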

Example:

Let’s assume that a = (5, 3) and b = (2, 4) are two vectors in a 2D plane.

a · b = (5 × 2) + (3 × 4) = 10 + 12 = 22

|a| = √(5² + 3²) = √34 ≈ 5.83

|b| = √(2² + 4²) = √20 ≈ 4.47

Using the cosine distance formula, d = 1 − (a · b) / (|a| |b|):

d = 1 − 22 / (5.83 × 4.47)

d = 1 − 0.844

d = 0.156

Note: if θ = 0, the vectors point in the same direction and

distance = 1 − cos θ

= 1 − 1

= 0
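The hand calculation above can be reproduced with a minimal sketch that uses only the Python standard library. The function name cosine_distance is an assumption for this illustration, and the function assumes non-zero vectors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(theta) for two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


a = (5, 3)
b = (2, 4)

d = cosine_distance(a, b)
print("Cosine distance:", round(d, 3))

# A vector compared with itself has theta = 0, so the distance is 0
# (up to floating-point rounding).
print("Distance to itself:", round(cosine_distance(a, a), 3))
```

Rounded to three decimals, the result matches the 0.156 obtained by hand.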

Section 3- Cosine distance using Python

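from scipy.spatial import distance

A = (5, 3)
B = (2, 4)
# distance.cosine returns the cosine distance, i.e. 1 - cosine similarity
d = distance.cosine(A, B)
print('Cosine Distance:', round(d, 4))
print('Cosine Similarity:', round(1 - d, 4))

OUTPUT:
Cosine Distance: 0.1563
Cosine Similarity: 0.8437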

Summary

Now you’re familiar with the main intuition behind cosine similarity as a metric for comparing two vector representations. Its main advantage over the Euclidean distance is that it isn’t biased by the size difference between the representations. If you have two documents of very different sizes, the Euclidean distance is not ideal. Cosine similarity uses the angle between the document vectors and therefore does not depend on the size of the corpora.

Source

1- Natural Language Processing with Classification and Vector Spaces

2- Cosine similarity, cosine distance explained | Math, Statistics for data science, machine learning

3- Exploring the Power of Minkowski Distance in Data Analysis
