SimCLR: Contrastive Learning of Visual Representations

Aakash Nain
Mar 6, 2020


Self-supervised learning is finally getting all the attention it deserves. From vision-based tasks to language modeling, self-supervised learning has paved the way for learning (much) better representations. This paper, SimCLR, presents a new framework for contrastive learning of visual representations.

Contrastive Learning

Before getting into the details of SimCLR, let’s take a step back and try to understand what “contrastive learning” is. Contrastive learning is a learning paradigm where we want to learn distinctiveness. We want to learn what makes two objects similar or different. And if two things are similar, then we want the encodings for these two things to be similar as well.

When I train a network for some task, say classification, I am already forcing my network to learn discriminative features, right?

Sometimes high-level features alone aren’t enough to learn good representations, especially when semantics come into play. For example, take a look at this Kaggle Competition. Features like the shape and color of a whale’s tail aren’t enough to uniquely identify its species because the semantics of all whale tails are very similar. We need more information about what makes two tails distinct.

Let’s take another example. Suppose a person is doing a backflip in two different photos: in one he is doing it on a beach, and in the other he is doing it in the street. If the task is to generate captions and your network outputs “A person is doing a backflip”, that isn’t a very good caption because it ignores the global context (beach vs. street).

So, if you want to learn distinctiveness, high-level features alone aren’t good enough; you want both local and global features to be taken into account for better encodings and better representations.

Proposed Contrastive Learning Framework

SimCLR learns representations by maximizing the agreement between differently augmented views of the same data example via contrastive loss in the latent space. It has four major components:

  1. Data augmentation module: This module transforms any given data example stochastically, generating two correlated views of the same example, denoted by xi and xj. The authors use three simple augmentations: random cropping followed by resizing to the original size, random color distortion, and random Gaussian blur.
  2. Base Encoder: ResNet-50 is used as the base neural network encoder for extracting representation vectors from the augmented data examples. The output of the final average pooling layer is used as the representation.
  3. Projection Head: A small neural network, an MLP with one hidden layer, maps the representations from the base encoder to a 128-dimensional latent space where the contrastive loss is applied. ReLU is the activation used in this projection head. (A minimal code sketch of the encoder and projection head follows this list.)
  4. Contrastive Loss Function: Given a set of examples including a positive pair of examples (xi and xj), the contrastive prediction task aims to identify xj in the given set for a given xi.
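To make these components concrete, here is a minimal sketch of the base encoder and projection head in PyTorch. This is my own illustration, not the authors’ code (the official implementation is in TensorFlow); names such as SimCLRModel and projection_dim are made up.

```python
import torch.nn as nn
import torchvision

class SimCLRModel(nn.Module):
    """Base encoder (ResNet-50) + projection head, as described above (illustrative sketch)."""

    def __init__(self, projection_dim: int = 128):
        super().__init__()
        # Base encoder: ResNet-50 without its classification layer.
        # On older torchvision versions use `pretrained=False` instead of `weights=None`.
        resnet = torchvision.models.resnet50(weights=None)
        feature_dim = resnet.fc.in_features        # 2048 for ResNet-50
        resnet.fc = nn.Identity()                  # keep the globally average-pooled features
        self.encoder = resnet
        # Projection head: MLP with one hidden layer and ReLU, mapping to the latent space.
        self.projector = nn.Sequential(
            nn.Linear(feature_dim, feature_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feature_dim, projection_dim),
        )

    def forward(self, x):
        h = self.encoder(x)      # representation used for downstream tasks
        z = self.projector(h)    # latent vector used by the contrastive loss
        return h, z
```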

Algorithm and Experimental Settings

Looking at Figure 2 of the paper, by now you must have realized that the algorithm is pretty simple and straightforward. Here is an overview:

  • Sampling: For a minibatch of N examples, augmentation is performed on each data point, resulting in 2N data points. Negatives are not selected explicitly. Instead, for a given positive pair, the remaining 2(N−1) examples within the minibatch are treated as negative examples.
  • Similarity metric: For comparing the representations produced by the projection head, cosine similarity is used, defined as sim(u, v) = uᵀv / (‖u‖ ‖v‖).
  • Loss function: Based on this similarity, the loss for a positive pair of examples (i, j) is then defined as ℓ(i, j) = −log [ exp(sim(zi, zj)/τ) / Σ_{k=1}^{2N} 𝟙[k≠i] exp(sim(zi, zk)/τ) ].

Here zi and zj are the output vectors obtained from the projection head, 𝟙[k≠i] ∈ {0, 1} is an indicator function evaluating to 1 iff k ≠ i, and τ denotes the temperature parameter. The loss is termed the normalized temperature-scaled cross-entropy loss (NT-Xent).
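Here is a compact sketch of this loss in PyTorch (illustrative, not the official implementation). It assumes the 2N projected vectors are stacked so that rows 2k and 2k+1 are the two augmented views of example k.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss over 2N projections; rows 2k and 2k+1 are the two views of example k."""
    z = F.normalize(z, dim=1)                        # l2-normalize so dot products are cosine similarities
    sim = z @ z.t() / temperature                    # (2N, 2N) matrix of scaled similarities
    n2 = z.shape[0]
    mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))       # drop sim(i, i) from every denominator
    targets = torch.arange(n2, device=z.device) ^ 1  # 0<->1, 2<->3, ...: each row's positive partner
    return F.cross_entropy(sim, targets)             # averaged over all 2N anchors
```

For a minibatch of N images you would stack the projections of both views into a (2N, 128) tensor and call nt_xent_loss on it.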

  • Batch Size: The authors experimented with batch sizes N ranging from 256 to 8192. A batch size of 8192 gives 16382 negative examples per positive pair from both augmentation views. Given that SGD/Momentum doesn’t tend to work well beyond a certain batch size, the authors used the LARS optimizer for all batch sizes. With 128 TPU v3 cores, training a ResNet-50 with a batch size of 4096 for 100 epochs takes ~1.5 hours.
  • Batch Normalization: In distributed training with data parallelism, the mean and variance for BN are typically aggregated locally per device. If the positive pairs are computed on the same device, this can lead to local information leakage. To avoid this, the authors aggregate the BN mean and variance over all devices during training.
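In a PyTorch data-parallel setup, the same idea can be sketched with SyncBatchNorm. This only illustrates the principle; the paper’s own training runs on TPUs and aggregates the BN statistics across TPU cores.

```python
import torch

# Assumes torch.distributed.init_process_group(...) has already been called
# and that SimCLRModel is the encoder + projection head sketched earlier.
model = SimCLRModel().cuda()
# Replace every BatchNorm layer so that mean/variance are aggregated over all devices,
# closing the local-information-leakage loophole described above.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(model)
```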

Hold on a sec! If you are just evaluating pairs of examples as positive or negative, i.e. whether two views come from the same example or not, why not consider using the logistic loss? And what’s the deal with the temperature parameter? You thought I wouldn’t notice that? You fool! 😒 🙄

Nice catch. The authors tried the logistic loss as well as the triplet margin loss. Here is the comparison of all three:

The authors noted that the gradients suggest:

  1. ℓ2 normalization along with temperature scaling effectively weights different examples, and an appropriate temperature can help the model learn from hard negatives (a tiny numeric illustration follows this list).
  2. Unlike cross-entropy, other objective functions do not weigh the negatives by their relative hardness. As a result, we need to apply semi-hard negative mining for these objective functions.
  3. Without normalization and proper temperature scaling, performance degrades a lot. In fact, without ℓ2 normalization, the contrastive task accuracy is higher, but the resulting representation is worse under linear evaluation. See the results below:
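To get a feel for the temperature’s role, here is a tiny NumPy toy with made-up similarity values. It shows how the softmax weighting over negatives sharpens as τ decreases, so hard negatives contribute much more to the gradient.

```python
import numpy as np

def negative_weights(sims, tau):
    """Softmax weights each negative receives for a given temperature tau."""
    w = np.exp(np.asarray(sims) / tau)
    return w / w.sum()

sims = [0.9, 0.5, 0.1]                   # one hard negative (0.9) and two easier ones
print(negative_weights(sims, tau=1.0))   # ~[0.47, 0.32, 0.21] -> negatives weighted almost uniformly
print(negative_weights(sims, tau=0.1))   # ~[0.98, 0.02, 0.00] -> the hard negative dominates
```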

Some Observations

Apart from the importance of ℓ2 normalization and temperature scaling, the authors noted a few important things:

  1. Composite augmentations: The authors tried various augmentations like color distortion, random cropping and resizing, Gaussian blur, etc., but it turned out that no single transformation suffices to learn good representations. One composition of augmentations that stands out is random cropping plus random color distortion. If you don’t distort the colors, most patches share a similar color distribution, and color histograms alone would suffice to distinguish images (a sketch of such a composed pipeline follows this list).
  2. Unsupervised contrastive learning benefits more from bigger models: As the model size increases, the gap between supervised and unsupervised learning shrinks. A more difficult task demands higher capacity, hence bigger models. (That doesn’t mean you need trillions of params!)
  3. A non-linear projection head improves the representations: This is more of a common hunch: not all components of the representation are equally important, and you want to extract the valuable information into a much smaller dimension before applying the contrastive loss, for better performance on the given task.
  4. Contrastive learning benefits more from larger batch sizes and longer training: Bigger batch size provides more negative examples, facilitating faster convergence. The same goes for longer training.
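Such a composed augmentation pipeline can be sketched with torchvision as below; the exact strengths and probabilities here are illustrative and only loosely follow the paper’s description.

```python
from torchvision import transforms

color_jitter = transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)        # brightness, contrast, saturation, hue
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                            # random crop, then resize back to the original size
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([color_jitter], p=0.8),                # random color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23)], p=0.5), # random Gaussian blur
    transforms.ToTensor(),
])

# Two stochastic passes over the same image give the correlated views xi and xj:
# xi, xj = simclr_augment(img), simclr_augment(img)
```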

Here are some of the results for the above-mentioned points.

Conclusion

Overall this is a good paper. In my opinion, finding the right loss function was one of the most important contributions here. Other observations, like composing augmentations and using bigger models, have already been explored in previous works (e.g., AugMix and Noisy Student).

References

  1. A Simple Framework for Contrastive Learning of Visual Representations
  2. Ranking loss functions
  3. Contrastive self-supervised learning
