Why Use InfoNCE Loss in Self-supervised Learning

Priyanshu maurya

To better understand InfoNCE loss, let’s assume you need to train a model that converts textual or pictorial inputs into lower-dimensional embeddings or latent vectors. You come up with a solution: a model f that takes an input x and outputs a latent vector z. Additionally, you create a classifier model h that takes as input the latent vectors z and z* (from x*) and classifies the pair as 0 or 1, where 0 indicates they come from the same source and 1 indicates they are different. Since this is a binary classification problem, we can use binary cross-entropy loss and train both f and h using back-propagation.

Formally, we have:

  1. The embedding model f : z = f(x)
  2. The classifier model h : ŷ = h(z, z*)

where ŷ is the predicted label (0 or 1).

The binary cross-entropy loss for a single pair (z, z*) is given by:

L_BCE = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]

Here, y is the true label (0 or 1).

To train the models f and h, we minimise the binary cross-entropy loss over the training set using back-propagation.
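As a rough sketch of this baseline in PyTorch (the layer sizes, the MLP used for h, and the toy data below are illustrative assumptions, not anything prescribed by the method):

import torch
from torch import nn
import torch.nn.functional as F

# Hypothetical encoder f: x -> z (sizes are arbitrary for illustration)
f = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128))

# Hypothetical classifier h: (z, z*) -> logit for "different" (label 1)
h = nn.Sequential(nn.Linear(2 * 128, 64), nn.ReLU(), nn.Linear(64, 1))

optimizer = torch.optim.Adam(list(f.parameters()) + list(h.parameters()), lr=1e-3)

# Toy batch of input pairs and labels (0 = same source, 1 = different)
x, x_star = torch.randn(32, 784), torch.randn(32, 784)
y = torch.randint(0, 2, (32,)).float()

z, z_star = f(x), f(x_star)                            # embed both inputs
logit = h(torch.cat([z, z_star], dim=-1)).squeeze(-1)  # classify the pair
loss = F.binary_cross_entropy_with_logits(logit, y)    # BCE on the pair prediction

optimizer.zero_grad()
loss.backward()
optimizer.step()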

In this approach, we assume that the output of model f is a latent or embedding vector, meaning the knowledge resides in model f . The classifier h is then used to classify these latent vectors based on whether they come from the same source or different sources.

However, there’s an important consideration: during back-propagation, there is no explicit condition or restriction enforcing that model f should produce meaningful latent vectors by itself. It is possible that the useful representation, or the true latent vector, is formed somewhere within the combined operations of models f and h. This means that the classifier h might end up learning features that compensate for any shortcomings in the latent vectors produced by f, rather than f learning to produce high-quality embeddings independently.

Therefore, while training f and h together, it’s crucial to ensure that f is incentivized to learn meaningful representations on its own, possibly through additional regularization or constraints.

One possible solution is to use a single-layer feed-forward network (FNN) for the classification step, so that the classifier has very little capacity of its own and the representational work has to happen inside f. (This is also known as the scoring-function method.)
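A minimal sketch of that idea, assuming 128-dimensional embeddings and a single linear layer that scores a concatenated pair (a bilinear scorer is another common variant):

import torch
from torch import nn

# Single-layer scoring head: h has almost no capacity of its own,
# so the representational work must happen inside the encoder f.
score_head = nn.Linear(2 * 128, 1)

z = torch.randn(32, 128)       # embeddings of x from f
z_star = torch.randn(32, 128)  # embeddings of x* from f
score = score_head(torch.cat([z, z_star], dim=-1)).squeeze(-1)  # one logit per pair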

Alternatively, instead of a classifier, we can apply a loss directly to the embedding or latent vectors. This is where InfoNCE comes in.

To use the InfoNCE loss, we form triplets of data:

  1. Anchor: an image/text sample from some class.
  2. Positive: a sample from the same class as the anchor, or an augmented view of the anchor.
  3. Negative: a sample from a different class.

We obtain the latent/embedding vectors for these samples from the model and then apply the InfoNCE loss to them.
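A toy sketch of how such triplets might be built for images, assuming torchvision is available and using random augmented views; rolling the second view simply treats other images in the batch as negatives, in the same spirit as the implementation further below:

import torch
import torchvision.transforms as T

# Hypothetical label-preserving augmentations (crop, flip, ...)
augment = T.Compose([T.RandomResizedCrop(32), T.RandomHorizontalFlip()])

images = torch.rand(8, 3, 64, 64)                        # toy batch of images
anchor = torch.stack([augment(img) for img in images])   # one augmented view per image
positive = torch.stack([augment(img) for img in images]) # a second view of the same image
negative = positive.roll(shifts=1, dims=0)               # views of *other* images as negatives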

Now, let’s understand the InfoNCE loss.

InfoNCE, short for Information Noise-Contrastive Estimation, is a loss function commonly used in self-supervised learning, particularly in representation learning tasks. It aims to learn representations by maximizing the mutual information between semantically similar data points (positive pairs) while minimizing it with dissimilar ones (negative pairs).

  • We have a batch of data points, and for each data point (called an “anchor”), we have one positive sample and N-1 negative samples.
  • Let z_i be the representation of the anchor data point.
  • Let z_j be the representation of another data point (positive or negative).
  • We use a similarity function (often cosine similarity) to measure the agreement between two representations:

sim(z_i, z_j) = (z_i · z_j) / (‖z_i‖ ‖z_j‖)

We want to maximize the probability of the positive pair among all the possible pairs (1 positive + N−1 negatives):

P(positive | anchor) = exp(sim(z_i, z_pos) / τ) / Σ_{j=1..N} exp(sim(z_i, z_j) / τ)

where:

  • τ is a temperature parameter that controls the smoothness of the probability distribution.

The InfoNCE loss is simply the negative log-likelihood of the positive pair:

L_InfoNCE = −log [ exp(sim(z_i, z_pos) / τ) / Σ_{j=1..N} exp(sim(z_i, z_j) / τ) ]

import torch
from torch import nn
import torch.nn.functional as F

class InfoNCE(nn.Module):
    def __init__(self, temperature, device):
        super().__init__()
        self.temperature = temperature
        self.device = device

    def forward(self, query, pos, neg):
        '''
        Use the other samples in the batch as additional negatives.
        query, pos, neg : [B, E]
        where B is the batch size and E is the embedding size.
        '''
        # L2-normalize so the dot products below are cosine similarities
        query = F.normalize(query, dim=-1)
        pos = F.normalize(pos, dim=-1)
        neg = F.normalize(neg, dim=-1)

        # Cosine similarity of every query against every positive / negative
        logits_pos = query @ pos.T   # [B, B]
        logits_neg = query @ neg.T   # [B, B]

        # Concatenate into one [B, 2B] logits matrix
        logits = torch.cat((logits_pos, logits_neg), dim=1)

        # The true positive for query i sits at column i (the diagonal of logits_pos)
        local_batch_size = query.shape[0]
        labels = torch.arange(local_batch_size).to(self.device)

        # Softmax cross-entropy over the temperature-scaled similarities = InfoNCE loss
        loss = F.cross_entropy(logits / self.temperature, labels, reduction='mean')

        return loss
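A quick usage sketch of the module above with random embeddings (the batch size, embedding size, and temperature are arbitrary choices here):

criterion = InfoNCE(temperature=0.07, device='cpu')

query = torch.randn(16, 128)  # anchor embeddings   [B, E]
pos = torch.randn(16, 128)    # positive embeddings [B, E]
neg = torch.randn(16, 128)    # negative embeddings [B, E]

loss = criterion(query, pos, neg)
print(loss.item())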

Minimizing the InfoNCE loss encourages the model to:

  • Increase the similarity between the anchor and its positive sample (maximize the numerator).
  • Decrease the similarity between the anchor and all negative samples (minimize the denominator).

The intuition behind InfoNCE is to train the model to distinguish the positive sample from the negative samples based on their representations. By maximizing the mutual information between positive pairs, the model learns to extract relevant features that capture the underlying semantic similarity.

More precisely, the negative InfoNCE loss is related to a lower bound on the mutual information between the inputs and their representations; minimizing the loss maximizes this lower bound, leading to more informative representations.
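As shown in the original CPC paper by van den Oord et al. (2018), with 1 positive and N−1 negatives per anchor, the bound (written here in this article’s notation) is:

I(x; z) ≥ log(N) − L_InfoNCE

so, all else being equal, a larger number of negatives N allows for a tighter bound.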

Some pros and cons:

Pros:

  • Effective for self-supervised learning.
  • Learns representations that capture semantic similarity.
  • Relatively simple to implement.

Cons:

  • Requires careful selection of negative samples.
  • Can be computationally expensive for large batch sizes.

Applications:

InfoNCE loss finds applications in various domains, including:

  • Image recognition
  • Natural language processing
  • Recommendation systems
  • Anomaly detection

Overall, InfoNCE loss provides a powerful framework for learning informative representations by contrasting positive and negative samples, making it a valuable tool in the field of representation learning.
