To Cosine or Not to Cosine, That Is the Question: Understanding Similarity Metrics

AlexPodles


Introduction

In the dynamic field of AI and Machine Learning, a key focus is on algorithms that specialize in embeddings and similarity measures. These algorithms are key to search engines and recommendation systems, tasked with the complex challenge of identifying similarities among vast data sets.

For example, in recommendation systems, accurately measuring the similarity between items can lead to more personalized and relevant suggestions for the user. Similarly, in natural language processing, similarity measures help in tasks such as document classification and sentiment analysis by comparing text data. By effectively navigating high-dimensional data and context to deliver accurate, relevant results, these measures play a pivotal role in enhancing user experiences across digital platforms, from improving search engine results to tailoring content in social media feeds.

This process involves two straightforward steps: first, converting real-world objects into numerical representations (embeddings) that machines can process; second, defining a function (comparator) that calculates the distance or similarity between these embeddings. With just these steps and a few lines of code using sentence-transformers, we can easily implement a basic similarity search system.

# pip install -q sentence-transformers

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I like eating ice-cream",
    "Peter goes to work every day",
    "Cats and dogs are good friends"
]

my_sentence = "Watermelon is a healthy food"

# Step 1: convert the sentences into embeddings (a shared numerical representation)
embeddings1 = model.encode(sentences, convert_to_tensor=True)
embeddings2 = model.encode(my_sentence, convert_to_tensor=True)

# Step 2: calculate similarity scores using cosine as the comparator
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output the scores
for i in range(len(sentences)):
    print("{} \t\t {} \t\t Score: {:.6f}".format(
        sentences[i], my_sentence, cosine_scores[i][0]
    ))

As an output, you can see something like this:

+---------------------------------------+-----------------------------------+----------+
| Sentence | Comparison Target | Score |
+---------------------------------------+-----------------------------------+----------+
| I like eating ice-cream | Watermelon is a healthy food | 0.271034 |
| Peter goes to work every day | Watermelon is a healthy food | 0.065666 |
| Cats and dogs are good friends | Watermelon is a healthy food | 0.160742 |
+---------------------------------------+-----------------------------------+----------+

The results are promising given the effort invested. To fully grasp the mechanics, let’s dive into the reasoning behind each step.

Step 1 involves creating embeddings, a crucial process because machines and humans process information differently. We need to establish a common representation of objects by building a transformation function. This function captures essential aspects and relationships of the objects, converting them into vector representations that machines can interpret.

[Figure: vector representation in an imaginary world]

For example, in a simplified model, a cat or dog’s characteristics — such as ‘cuteness,’ ‘likeliness to eat,’ ‘sleepiness,’ and ‘likeliness to bark’ — can be quantitatively represented in a four-dimensional vector, such as [34, 21, 44, 2] for a cat and [34, 50, 35, 56] for a dog. These vectors allow objects to be represented in a multi-dimensional space, facilitating the comparison and understanding of their characteristics by machines.
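
In code, these toy embeddings are nothing more than arrays of numbers. Here is a minimal sketch (the feature values are made up, of course):

import numpy as np

# Hypothetical 4-dimensional "pet embeddings":
# [cuteness, likeliness to eat, sleepiness, likeliness to bark]
cat = np.array([34, 21, 44, 2])
dog = np.array([34, 50, 35, 56])

print(cat.shape, dog.shape)   # (4,) (4,) -> both live in the same 4-dimensional space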

Often, embeddings are far more complex than our simple four-dimensional example. For instance, OpenAI’s text-embedding-3-small has 1536 dimensions, while Google’s textembedding-gecko offers a 768-dimensional vector. These high-dimensional embeddings lead to a more nuanced ‘understanding’ of the world.

Step 2 focuses on comparing the distances between object vectors to determine their similarity, utilizing various well-known mathematical operations. Among the most basic algorithms for this are cosine similarity, the dot product, and Euclidean distance.

Cosine similarity evaluates how closely two vectors align in their direction within a vector space, using the cosine of the angle between them as a measure. This method is particularly useful for determining the similarity between vectors in high-dimensional spaces, such as those used in text analysis or when comparing document embeddings. The key points are:

  • When vectors point in the same direction (are highly similar), the angle between them is 0°, resulting in a cosine similarity of cos(0°) = 1.
  • If vectors point in opposite directions (are highly dissimilar), their angle is 180°, giving cos(180°) = -1.
  • For orthogonal vectors, which represent no similarity, the angle is 90°, leading to cos(90°) = 0.

Cosine similarity thus provides a normalized measure of vector orientation, with values ranging from -1 (complete dissimilarity) to 1 (identical direction), offering a nuanced understanding of similarity that does not consider magnitude. This makes it an invaluable tool for comparing the conceptual similarity between objects represented as vectors, irrespective of their size.
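
As a minimal sketch of how this is computed (assuming NumPy and two-dimensional toy vectors, purely for illustration):

import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])
print(cosine_similarity(a, a))                       #  1.0 -> same direction
print(cosine_similarity(a, -a))                      # -1.0 -> opposite direction
print(cosine_similarity(a, np.array([2.0, -1.0])))   #  0.0 -> orthogonal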

It is important to note that the cosine similarity between v and w is the same as between k and w (where k points in the same direction as v but has a larger magnitude), since cosine similarity only considers vector orientation.

The dot product, while similar to cosine similarity, is essentially an unnormalized measure of vector similarity. It calculates how much one vector extends in the direction of another, essentially measuring the vectors’ alignment. The more aligned or similar two vectors are, the higher their dot product. In contrast to cosine similarity, which normalizes this value to account only for direction (resulting in a range from -1 to 1), the dot product reflects both direction and magnitude. This measure is particularly useful for understanding vector similarity in terms of projection: the extent to which one vector ‘projects’ onto another. For example, if two vectors are pointing in the same direction, their dot product reaches its maximum, indicating high similarity or alignment.

In this case, similarity between k and w is more than between v and w because the magnitude of the vectors is also taken into consideration.
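
A minimal sketch of that difference, using illustrative vectors where k simply points in the same direction as v but is twice as long (the actual vectors from the figure are not reproduced here):

import numpy as np

v = np.array([1.0, 2.0])
k = 2 * v                          # same direction as v, twice the magnitude
w = np.array([2.0, 3.0])

# Cosine similarity ignores magnitude: v and k look identical to it
cos_vw = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
cos_kw = np.dot(k, w) / (np.linalg.norm(k) * np.linalg.norm(w))
print(cos_vw, cos_kw)              # same value for both pairs

# The dot product grows with magnitude: k scores higher than v
print(np.dot(v, w), np.dot(k, w))  # 8.0 vs 16.0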

Euclidean distance calculates the straight-line distance between two vectors in n-dimensional space. While effective in low-dimensional spaces, its usefulness diminishes in higher dimensions due to the curse of dimensionality, which can make distances in multi-dimensional space less meaningful.
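
And a minimal sketch of Euclidean distance on the same illustrative vectors:

import numpy as np

v = np.array([1.0, 2.0])
w = np.array([2.0, 3.0])

# Straight-line distance: square root of the summed squared component differences
print(np.linalg.norm(v - w))   # sqrt((1-2)**2 + (2-3)**2) = sqrt(2) ≈ 1.414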

Case Study

When deciding which similarity metric to utilize, one might commonly encounter recommendations for cosine similarity. But is it the go-to metric for all similarity measurements? The short answer is no; the long answer is, it depends on various factors, including the structure of our embeddings.

We’ll explore this further in the next case study.

We will develop a method to identify the two most similar colors in a given collection, applying our understanding of similarity metrics in a real-world context.

The RGB color model is a good starting point here: many embedding models are high-dimensional, and understanding their mechanics can be challenging. In contrast, the RGB model simplifies matters by situating us within a more familiar 3-dimensional space.

In the RGB model, every color is a combination of three primary colors: red, green, and blue. This simple, 3-dimensional space allows for straightforward embedding creation, easy calculation and interpretation of distances, and intuitive comparison of color similarities.

With all that theoretical foundation, let’s move to a practical application.

Since we are working with colors, let's define a simple helper function to visualize our embeddings:

from IPython.display import HTML, display

def display_colored_square(rgb, size=100):
    # Render a size x size square filled with the given [R, G, B] color
    r, g, b = rgb[0], rgb[1], rgb[2]
    color = f'rgb({r},{g},{b})'
    square_style = f'width: {size}px; height: {size}px; background-color: {color};'
    display(HTML(f'<div style="{square_style}"></div>'))

We will begin with the following simple colors:

import numpy as np

colors = [
    np.array([250, 180, 3]),
    np.array([120, 100, 2]),
    np.array([120, 130, 2]),
    np.array([100, 90, 90])
]

for color in colors:
    display_colored_square(color)

Now, let’s proceed to calculate similarities among our color embeddings, using three metrics: cosine similarity, the dot product, and Euclidean distance:

from numpy import dot
from numpy.linalg import norm

result = []
for i in range(len(colors)):
    for j in range(i + 1, len(colors)):
        vec1 = colors[i]
        vec2 = colors[j]

        dot_product = dot(vec1, vec2)                                 # higher is better
        cosine_similarity = dot_product / (norm(vec1) * norm(vec2))   # higher is better
        euclidean = norm(vec1 - vec2)                                 # lower is better
        result.append([vec1, vec2, cosine_similarity, dot_product, euclidean])

res = max(result, key=lambda row: row[2])
display_colored_square(res[0])
display_colored_square(res[1])
print(f"Cosine similarity of vectors: {res[2]}")
print("\n")

res = max(result, key=lambda row: row[3])
display_colored_square(res[0])
display_colored_square(res[1])
print(f"Dot product of vectors: {res[3]}")
print("\n")

res = min(result, key=lambda row: row[4])
display_colored_square(res[0])
display_colored_square(res[1])
print(f"Euclidean distance of vectors: {res[4]}")
print("\n")

Let’s check the most similar results returned by each metric:

[Figure: colors and their similarity scores]

Interestingly, each similarity metric interprets the closeness of colors differently. Cosine similarity and the dot product, being closely related, report a level of similarity that doesn't quite match our visual expectations, even when indicating nearly 99.74% cosine similarity or the highest dot product value.

In contrast, the results from Euclidean distance tend to align more with how we visually perceive color similarity. Why does it seem to mirror our visual interpretation more closely? Let’s investigate further.

The main limitation of using cosine similarity and the dot product for color comparisons lies in their primary focus on the proportionality of [R, G, B] components, rather than their absolute color values. Cosine similarity calculates the cosine of the angle between two vectors, effectively measuring their directional alignment without regard to magnitude. This means that even if one color is merely a brighter or darker version of another (indicative of the same hue but varying in saturation or brightness), they will have a high cosine similarity score, suggesting identical hues but overlooking critical aspects like brightness and saturation.

While the dot product takes into account the vectors’ magnitudes, its value is influenced by both the vectors’ directions and their lengths. Consequently, two color vectors pointing in the same direction but of different lengths — representing different overall intensities — will still produce a significant dot product. This suggests an alignment in their color proportions, not a comprehensive measure of similarity.

On the other hand, Euclidean distance approaches color comparison geometrically, measuring the physical distance between two points in the RGB space. This metric evaluates differences across all three RGB components simultaneously, capturing the total variance in color. It mirrors how we perceive color differences — considering hue, saturation, and brightness — thereby providing a more accurate reflection of color similarity as experienced in the real world. Understanding these distinctions is crucial for applications requiring precise color matching, highlighting Euclidean distance’s value in capturing the full spectrum of color differences.
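
To make the contrast concrete, here is a minimal sketch that takes the second color from our list and a twice-as-bright version of it (a made-up color, purely for illustration):

import numpy as np

color = np.array([120, 100, 2])
brighter = 2 * color                           # same RGB proportions, roughly twice as bright

cos_sim = np.dot(color, brighter) / (np.linalg.norm(color) * np.linalg.norm(brighter))
euclidean = np.linalg.norm(color - brighter)

print(f"Cosine similarity: {cos_sim:.4f}")     # ~1.0000: "identical" according to direction
print(f"Euclidean distance: {euclidean:.1f}")  # ~156.2: visually, quite a different color

Cosine similarity treats the brighter color as identical to the original, while Euclidean distance reports the gap we actually see.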

To better grasp the concept of proportionality between vectors and how it affects similarity metrics, let’s visualize proportional and non-proportional vectors in a 2D space. This simplification helps illustrate the principle that even when vectors have different magnitudes (sizes), their directionality can indicate a high level of similarity according to certain metrics.

import numpy as np
import matplotlib.pyplot as plt

def print_diff(vec_a, vec_b):
    print(f"Measuring {vec_a} and {vec_b}")
    dot_product = np.dot(vec_a, vec_b)
    cos_sim = dot_product / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    euclidean = np.linalg.norm(vec_a - vec_b)
    print(f"Dot Product: {dot_product}")
    print(f"Cosine Similarity: {cos_sim:.2f}")
    print(f"Euclidean: {euclidean}")
    print("\n")

vec_a = np.array([2, 4])
vec_b = np.array([1, 2])
vec_c = np.array([1, 1])

fig, ax = plt.subplots()
ax.quiver(0, 0, vec_a[0], vec_a[1], angles='xy', scale_units='xy', scale=1, color="red", label=f"Vec A: {vec_a}")
ax.quiver(0, 0, vec_b[0], vec_b[1], angles='xy', scale_units='xy', scale=1, color="blue", label=f"Vec B: {vec_b}")
ax.quiver(0, 0, vec_c[0], vec_c[1], angles='xy', scale_units='xy', scale=1, color="green", label=f"Vec C: {vec_c}")

plt.xlim(-1, max(vec_a[0], vec_b[0]) + 1)
plt.ylim(-1, max(vec_a[1], vec_b[1]) + 1)
plt.axhline(y=0, color='k')
plt.axvline(x=0, color='k')
plt.grid(True, which='both')
plt.legend()

plt.title("Visualization of Vector Proportionality")
plt.xlabel("Channel 1")
plt.ylabel("Channel 2")
plt.gca().set_aspect('equal', adjustable='box')

plt.show()

print_diff(vec_a, vec_b)
print_diff(vec_a, vec_c)
print_diff(vec_b, vec_c)

The output illustrates our scenario: the proportional vectors ([2, 4] and [1, 2]) score higher in cosine similarity than vectors that are physically closer to each other ([1, 2] and [1, 1]), despite the intuitive expectation that the latter pair is more similar.

[Figure: vector similarity in 2D space]

There are a few key points that can help us understand the proper way to measure similarity:

  1. Understanding the Embedding Space. Geometric properties are important here. The structure of the embedding space (whether it is linear, spherical, or something else) greatly influences which similarity metric is most appropriate. For example, in spaces where directionality is more informative than magnitude, cosine similarity, which measures the cosine of the angle between two vectors, might be more appropriate than Euclidean distance. The scale and density within the embedding space can also affect the choice of metric. Metrics that are sensitive to scale might not perform well if the embeddings are not normalized (see the sketch after this list).
  2. Similarity definition. What shapes similarity in the context of a specific application? If we are looking for colors that are similar in their proportions, cosine similarity can still serve us well. But if we are looking for something that is visually similar, it might not be what we need.
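
As a side note on the normalization point above, here is a minimal sketch (with hypothetical vectors) showing that once embeddings are L2-normalized, the dot product coincides with cosine similarity and Euclidean distance becomes a simple function of it:

import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

a = l2_normalize(np.array([2.0, 4.0]))
b = l2_normalize(np.array([1.0, 1.0]))

cos_sim = np.dot(a, b)              # on unit vectors, the dot product IS cosine similarity
euclidean = np.linalg.norm(a - b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos_sim
print(f"{cos_sim:.4f}  {euclidean:.4f}  {np.sqrt(2 - 2 * cos_sim):.4f}")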

Conclusion

To summarize, selecting a similarity metric for comparing embeddings is not a one-size-fits-all problem. It requires a deep understanding of the properties of the embedding space, a clear definition of what constitutes similarity in the specific context, and consideration of practical constraints related to computation and scalability. By carefully analyzing these factors, one can choose a metric that not only captures the intended notion of similarity but also performs well in practice, thereby enabling more effective and meaningful comparisons between the entities represented by the embeddings.
