Cosine similarity between two arrays for word embeddings

TechClaw
2 min read · Jul 11, 2023


Introduction

Cosine similarity is a measure commonly used in natural language processing (NLP) and machine learning to determine the similarity between two vectors. When working with word embeddings, which are numerical representations of words in a high-dimensional space, cosine similarity becomes a valuable tool for assessing the similarity between words or documents. In this blog post, we will explore how to calculate the cosine similarity between two arrays representing word embeddings, along with examples to demonstrate its usage.

Understanding Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. For word embeddings, the vectors encode the semantic meaning of words in a high-dimensional space. Cosine similarity ranges from -1 to 1: a value of 1 means the vectors point in exactly the same direction, 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions.

The cosine similarity between two vectors can be calculated using the following formula:

cosine_similarity = dot_product(a, b) / (norm(a) * norm(b))

where:

  • dot_product(a, b) represents the dot product between vectors a and b.
  • norm(a) and norm(b) represent the Euclidean norm (magnitude) of vectors a and b, respectively.
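This formula translates directly into a few lines of NumPy. Below is a minimal from-scratch sketch; the example vectors are arbitrary values chosen for illustration, not the output of any real embedding model.

import numpy as np

def cosine_sim(a, b):
    # dot_product(a, b) / (norm(a) * norm(b))
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Arbitrary illustrative vectors; b is a scaled copy of a
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_sim(a, b))  # ≈ 1.0, since scaling does not change the angle

Note that cosine similarity ignores magnitude: scaling a vector leaves the result unchanged. This is one reason it is often preferred over the raw dot product when vector norms vary.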

Calculating Cosine Similarity in Python

Let’s dive into an example to illustrate how to calculate the cosine similarity between two arrays representing word embeddings.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Word embeddings for two words
word1_embedding = np.array([0.2, 0.5, 0.8, 0.3])
word2_embedding = np.array([0.4, 0.1, 0.9, 0.5])
# Reshape the arrays to match the expected input shape of cosine_similarity
word1_embedding = word1_embedding.reshape(1, -1)
word2_embedding = word2_embedding.reshape(1, -1)
# Calculate cosine similarity
similarity = cosine_similarity(word1_embedding, word2_embedding)[0][0]
print(similarity)

In this example, we have two arrays, word1_embedding and word2_embedding, representing the word embeddings of two words. We reshape the arrays using reshape(1, -1) because cosine_similarity from scikit-learn expects 2-D inputs with one row per sample. The function returns a 2-D array of pairwise similarities, so we index into it with [0][0] to extract the single value before printing it.

Output: 0.892786 (approximately)

The cosine similarity between the two word embeddings is roughly 0.89, indicating a high degree of similarity: the two vectors point in broadly the same direction.
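For a single pair of vectors, scikit-learn is not strictly required. As a lighter-weight alternative sketch (assuming SciPy is installed), you can use scipy.spatial.distance.cosine, which returns the cosine distance, i.e. 1 minus the cosine similarity:

import numpy as np
from scipy.spatial.distance import cosine

word1_embedding = np.array([0.2, 0.5, 0.8, 0.3])
word2_embedding = np.array([0.4, 0.1, 0.9, 0.5])

# cosine() returns the cosine *distance*, so subtract from 1
# to recover the similarity
similarity = 1 - cosine(word1_embedding, word2_embedding)
print(similarity)  # ≈ 0.8928, matching the scikit-learn result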

Conclusion

Cosine similarity is a useful metric for assessing the similarity between word embeddings or any other vectors in NLP and machine learning tasks. By calculating it, you can quantify the semantic similarity between words, measure document similarity, or perform other similarity-based analyses, such as the nearest-neighbor lookup sketched below. Understanding and applying cosine similarity can enhance your NLP projects and give you insight into the relationships between words or documents.
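As a final illustration, cosine_similarity also accepts a whole matrix of embeddings and returns every pairwise similarity at once, which makes simple nearest-neighbor lookups straightforward. The vocabulary and 4-dimensional vectors below are invented purely for illustration:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical embeddings for a tiny vocabulary (values are made up)
words = ["king", "queen", "apple"]
embeddings = np.array([
    [0.8, 0.6, 0.1, 0.2],  # king
    [0.7, 0.7, 0.2, 0.1],  # queen
    [0.1, 0.2, 0.9, 0.8],  # apple
])

# sims[i, j] is the cosine similarity between words[i] and words[j]
sims = cosine_similarity(embeddings)

for i, word in enumerate(words):
    # The highest score in each row is the word itself (similarity 1.0),
    # so take the second-highest entry as the nearest neighbor
    nearest = np.argsort(sims[i])[-2]
    print(f"{word} -> {words[nearest]} (similarity {sims[i, nearest]:.3f})")

In real projects, the same pattern scales to embeddings from models such as word2vec or GloVe, with thousands of rows in place of this toy vocabulary.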
