All Pairs Cosine Similarity in PyTorch

PyTorch provides a cosine_similarity function to compute the pairwise cosine similarity between corresponding vectors in two tensors. However, there’s no built-in method to compute the cosine similarity between every pair of vectors in a single list of vectors. We’ll explore a simple and efficient way to do this in PyTorch.

Dhruv Matani
7 min read · Jun 8, 2023

Co-authored with Naresh and Gaurav.

Cosine similarity formula. Source: Cosine Similarity (Wikipedia)

Table of contents

Introduction

  1. PyTorch API for Cosine Similarity
  2. Where is cosine similarity typically used in Machine Learning (ML)?

Computing all pairs Cosine Similarity in PyTorch

  1. Indexing a tensor with None
  2. Expanding a tensor using .expand(…)
  3. All-pairs cosine similarity
  4. The last trick: Broadcasting
  5. A note on efficiency

Conclusion

Introduction

From Wikipedia,

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval [−1,1].
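
To make the definition concrete, here’s a small sketch (with made-up vectors) that computes cosine similarity directly from the formula above and checks it against PyTorch’s built-in function:

import torch
import torch.nn.functional as F

a, b = torch.randn(5), torch.randn(5)

# Cosine similarity from the definition: the dot product divided by
# the product of the vector lengths (L2 norms).
manual = torch.dot(a, b) / (a.norm() * b.norm())

# PyTorch's built-in pairwise version (dim=0 because a and b are 1-d tensors).
builtin = F.cosine_similarity(a, b, dim=0)

print(torch.allclose(manual, builtin))  # True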

PyTorch API for Cosine Similarity

torch.nn.functional.cosine_similarity(x1, x2, dim=1, eps=1e-8) -> Tensor

This computes the pairwise cosine similarity between x1 and x2 along the specified dimension. That is, if x1 and x2 each have shape (10, 4, 5) and we compute the cosine similarity along the last dimension (of size 5), the result will have shape (10, 4).

For example,

import torch
import torch.nn.functional as F

x, y = torch.randn(10, 4, 5), torch.randn(10, 4, 5)
print(F.cosine_similarity(x, y, dim=2).shape)

Would print

torch.Size([10, 4])

This is because when we pass cosine_similarity 3-dimensional tensors and ask it to compute the similarity along the 3rd dimension (dimension index 2), it collapses that dimension into a single value.
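
To make the “collapse” concrete, here’s a quick spot-check (reusing the x and y above) that entry [i, j] of the result is simply the cosine similarity between the two length-5 vectors x[i, j] and y[i, j]:

# The [3, 1] entry of the pairwise result should equal the cosine
# similarity of the corresponding length-5 vectors.
sim = F.cosine_similarity(x, y, dim=2)
print(torch.allclose(sim[3, 1], F.cosine_similarity(x[3, 1], y[3, 1], dim=0)))  # True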

Where is cosine similarity typically used in Machine Learning (ML)?

From this page,

Cosine similarity is a measure of similarity between two data points in a plane. Cosine similarity is used as a metric in different machine learning algorithms like the KNN for determining the distance between the neighbors, in recommendation systems, it is used to recommend movies with the same similarities and for textual data, it is used to find the similarity of texts in the document. So in this article let us understand why cosine similarity is a popular metric for evaluation in various applications.

The reason the author(s) got interested in this metric is its use in SimCLR for contrastive learning of visual representations. When computing the NT-Xent (Normalized Temperature-scaled Cross Entropy) loss, the first step is to compute the all-pairs cosine similarity between all the feature vectors produced by the model.

We won’t dive into the details of SimCLR or the NT-Xent loss in this article. Let’s simply assume that computing the all-pairs cosine similarity between a set of N feature vectors is a necessary step.

Next, we will take a detailed look at how we can mechanically compute all-pairs cosine similarity in PyTorch.

Computing all pairs Cosine Similarity in PyTorch

A quick internet search reveals that folks have been having a hard time finding a succinct and efficient way to perform an all-pairs cosine similarity operation.

In fact, the problem is considered tricky enough that there’s a dedicated metric for it on the torchmetrics page.

Fortunately, there’s a one-line solution to this problem (a variation of which is mentioned on this PyTorch GitHub issue).

If “x” is the input tensor with 2 dimensions (batch, vector-length), then the one-line solution is:

F.cosine_similarity(x[None, :, :], x[:, None, :], dim=-1)

There’s a lot going on in there and it may not be obvious how it really works, so the rest of this post will focus on dissecting the various sub-parts of the solution by building it up from scratch.

Here’s a link to a notebook implementing all the steps mentioned in this article.

In the following subsections, we’ll learn about the following concepts as they apply to succinctly computing the all-pairs cosine similarity.

  1. Indexing a tensor with “None”
  2. Using tensor.expand() to expand tensors along singleton dimensions
  3. Leveraging PyTorch broadcast semantics to implicitly expand tensors along singleton dimensions

Indexing a tensor with None

The first thing we need to understand is what happens when you index a PyTorch tensor using None. This stackoverflow topic has the answer.

Similar to NumPy, you can insert a singleton dimension (“unsqueeze” a dimension) by indexing this dimension with None. In turn, n[:, None] will have the effect of inserting a new dimension on dim=1. This is equivalent to n.unsqueeze(dim=1).

x = torch.randn(3)
# Indexing with None does the same thing as unsqueezing the tensor
# at that dimension. After this indexing operation, the tensors
# x_row_dup and x_col_dup will have 1 additional dimension at
# dimensions 0 and 1 respectively.
x_row_dup, x_col_dup = x[None,:], x[:,None]
print(x, x.shape)
print(x_row_dup, x_row_dup.shape)
print(x_col_dup, x_col_dup.shape)

Prints the following.

tensor([-1.2756,  1.1559, -0.0660]) torch.Size([3])
tensor([[-1.2756,  1.1559, -0.0660]]) torch.Size([1, 3])
tensor([[-1.2756],
        [ 1.1559],
        [-0.0660]]) torch.Size([3, 1])
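
If you’d like to verify the equivalence with unsqueeze mentioned in the quote above, a quick check (using the same x) looks like this:

# Indexing with None inserts a singleton dimension, exactly like unsqueeze.
print(torch.equal(x[None, :], x.unsqueeze(0)))  # True
print(torch.equal(x[:, None], x.unsqueeze(1)))  # True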

Expanding a tensor using .expand(…)

The PyTorch expand(…) API is used to expand certain dimensions of a tensor by repeating the values along those dimensions. Note that you can only expand a dimension whose size is 1 (a singleton dimension). Let’s see an example below.

x_row_dup, x_col_dup = x_row_dup.expand(3, 3), x_col_dup.expand(3, 3)
print("x stretched across rows")
print(" - - - - - - - - - - - -")
print(x_row_dup, x_row_dup.shape)
print("")
print("x stretched across columns")
print(" - - - - - - - - - - - - - ")
print(x_col_dup, x_col_dup.shape)

Prints the following.

x stretched across rows
 - - - - - - - - - - - -
tensor([[-1.2756,  1.1559, -0.0660],
        [-1.2756,  1.1559, -0.0660],
        [-1.2756,  1.1559, -0.0660]]) torch.Size([3, 3])

x stretched across columns
 - - - - - - - - - - - - -
tensor([[-1.2756, -1.2756, -1.2756],
        [ 1.1559,  1.1559,  1.1559],
        [-0.0660, -0.0660, -0.0660]]) torch.Size([3, 3])
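
One detail worth pointing out: expand() does not copy any data. It returns a view in which the expanded dimension has a stride of 0, so all the “repeated” rows or columns share the same underlying memory. You can see this by inspecting the strides:

# The expanded dimension gets stride 0, so no memory is copied.
print(x_row_dup.stride())  # (0, 1)
print(x_col_dup.stride())  # (1, 0)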

All-pairs cosine similarity

Let’s assume our input tensor has 3 elements, namely (A, B, C). To compute the all-pairs cosine similarity, we will first expand this tensor along 3 rows and 3 columns.

Unsqueeze: Our input tensor (A, B, C) has shape (3). We will first unsqueeze it along dimension 0 and dimension 1 to make it look like the following:

(A, B, C) unsqueezed along dimension 0 will look like ((A, B, C)) and have shape (1, 3).

(A, B, C) unsqueezed along dimension 1 will look like ((A), (B), (C)) and have shape (3, 1).

Expand: Now, we will expand these tensors along their singleton dimensions (dimensions with value 1) to make both the tensors square.

((A, B, C)) expanded to have shape (3, 3) will look like:

((A, B, C),
(A, B, C),
(A, B, C))

((A), (B), (C)) expanded to have shape (3, 3) will look like:

((A, A, A),
(B, B ,B),
(C, C, C))

If we perform the pairwise cosine similarity (which the PyTorch API can already do), then we get the all-pairs cosine similarity as shown in the figure below.

Figure 1: All Pairs cosine similarity computed between 2 expanded versions of the input tensor (A, B, C). Source: Author(s)

That’s it! Here’s an example that shows it for the tensors we have been using so far.

# Add a dummy dimension at the end so that we can perform cosine
# similarity on that last dimension.
x_row_dup = x_row_dup.reshape(3, 3, 1)
x_col_dup = x_col_dup.reshape(3, 3, 1)
x_cosine_similarity = F.cosine_similarity(x_row_dup, x_col_dup, dim=-1)
print(x_cosine_similarity)

Will print.

tensor([[ 1., -1.,  1.],
        [-1.,  1., -1.],
        [ 1., -1.,  1.]])

But wait! Why are all the values 1 or -1?! It turns out that the cosine similarity of two vectors of size 1 is always +1 or -1. That’s because the angle between two single-element vectors is either 0 or 180 degrees, depending on whether they point in the same or the opposite direction.
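
A quick check with made-up single-element vectors shows this behavior:

# Single-element vectors can only point in one of two directions along
# the number line, so their cosine similarity is always +1 or -1.
a = torch.tensor([2.0])
b = torch.tensor([-3.0])
print(F.cosine_similarity(a, b, dim=0))  # tensor(-1.)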

Let’s try the same thing with a vector of size 2 instead of 1.

x = torch.randn(3, 2)
x_row_dup, x_col_dup = x[None,:,:], x[:,None,:]
x_row_dup, x_col_dup = x_row_dup.expand(3, 3, 2), x_col_dup.expand(3, 3, 2)
x_cosine_similarity = F.cosine_similarity(x_row_dup, x_col_dup, dim=-1)
print(x_cosine_similarity)

Will print.

tensor([[1.0000, 0.9512, 0.9826],
        [0.9512, 1.0000, 0.9920],
        [0.9826, 0.9920, 1.0000]])

This definitely looks better!
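
To convince yourself that entry [i, j] of this matrix really is the cosine similarity between rows i and j of x, a quick spot-check (with the same x) looks like this:

# Entry [0, 1] should equal the cosine similarity between the first and
# second rows of x.
direct = F.cosine_similarity(x[0], x[1], dim=0)
print(torch.allclose(x_cosine_similarity[0, 1], direct))  # True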

The last trick: Broadcasting

While we used .expand(…) here, we’d like to mention that it was not strictly necessary, since most of the operations defined in PyTorch support a concept called dimension broadcasting. From the documentation,

If a PyTorch operation supports broadcast, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).

Hence, when the cosine similarity operation is fed 2 tensors of shape (1, 3, 2) and (3, 1, 2), it will broadcast both of them to (3, 3, 2), implicitly performing the tensor expansion step that we explicitly performed above.

When we run the following code.

x_cosine_similarity = F.cosine_similarity(x[None,:,:], x[:,None,:], dim=-1)
# This should print the same matrix as above.
print(x_cosine_similarity)

We get the following output.

tensor([[1.0000, 0.9512, 0.9826],
        [0.9512, 1.0000, 0.9920],
        [0.9826, 0.9920, 1.0000]])

This is in fact the same as what we got above.
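
If you want a programmatic confirmation rather than eyeballing the two printed matrices, a quick check (reusing the expanded tensors from the previous section) might look like this:

# The broadcast version and the explicit expand() version agree.
expanded_version = F.cosine_similarity(x_row_dup, x_col_dup, dim=-1)
broadcast_version = F.cosine_similarity(x[None, :, :], x[:, None, :], dim=-1)
print(torch.allclose(expanded_version, broadcast_version))  # True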

A note on efficiency

Compared to many of the solutions mentioned in this discussion, this solution doesn’t use any explicit for loops; a side-by-side sketch follows the list below. Every time you write an explicit for loop, you run into the following problems:

  1. There’s some non-trivial CPU computation happening, potentially starving the GPU
  2. If you’re using a for loop, you’re probably leaving some opportunity for parallel GPU execution on the table. This will impact your overall GPU utilization, and hence the time it takes to run your computation
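
For illustration, here’s a hedged sketch comparing the explicit double-loop version (the kind of code we want to avoid) with the vectorized one-liner; both produce the same matrix, but the loop issues one tiny operation per pair instead of a single batched call:

import torch
import torch.nn.functional as F

x = torch.randn(8, 16)

# Naive version: an explicit Python double loop over all pairs. Each
# iteration does only a tiny amount of work, which keeps the CPU busy
# and leaves the GPU (if any) under-utilized.
loop_sim = torch.empty(8, 8)
for i in range(8):
    for j in range(8):
        loop_sim[i, j] = F.cosine_similarity(x[i], x[j], dim=0)

# Vectorized version: a single broadcasted call.
vectorized_sim = F.cosine_similarity(x[None, :, :], x[:, None, :], dim=-1)

print(torch.allclose(loop_sim, vectorized_sim))  # True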

Conclusion

We saw how to succinctly compute the all-pairs cosine similarity between every vector in a list of vectors. Along the way we learned about indexing PyTorch tensors using None to unsqueeze the tensor along that dimension, and how PyTorch broadcast semantics work.

We hope this will help you with your deep learning adventures!

