Self-Supervised Representation Learning

A study of unsupervised ways of training deep learning models in the image domain.

Fractal AI Research
12 min read · May 18, 2022

Self-supervised learning is a way of training a deep learning model with labels obtained from the data itself. Initially, generative modelling approaches such as auto-encoders and GANs were used to achieve this, but they failed to reach results on par with supervised training. Recent developments in contrastive learning have improved results considerably, with some of the papers discussed below, such as SwAV (Facebook) and SimCLR (Google), reaching SOTA on image classification tasks on the ImageNet dataset.

Self-supervised learning in computer vision is relatively new compared to NLP. In NLP, word2vec, language models, and similar methods use self-supervised learning as a pretext task and have achieved SOTA in many downstream tasks such as language translation and sentiment analysis. The NLP-progress repository keeps a good track of progress in NLP.

This blog post is divided into the following parts:

  • How to measure performance?
  • When to use Self-Supervised techniques?
  • A brief discussion of research papers.

How to measure performance?

Some of the painful aspects of working in the image domain are:

  • some domains have thousands of classes.
  • classes change frequently, or new classes keep getting introduced.
  • every single image needs to be labelled.

To understand whether our self-supervised learning techniques are helping us, we need a performance metric that tells us if we are going in the right direction. Our goals should be to:

  • Achieve higher performance compared to the supervised way of training the model.
  • Reduce labelling time and effort.

In the papers discussed below, achieving higher accuracy on ImageNet than supervised training while using 10x or 100x fewer labels is considered a good way to measure the performance of the model.

Note: Even when the accuracy of a self-supervised model is lower than its supervised counterpart, the trade-off between labelling effort and accuracy still needs to be taken into account in some cases.
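As a concrete reference point, most of the papers below report accuracy from a linear evaluation protocol: freeze the self-supervised backbone and train only a linear classifier on top of its features. Here is a minimal PyTorch sketch of that idea; the ResNet-50 backbone, the 2048-d feature size, and the 1000 classes are illustrative assumptions on our part, not any specific paper's setup.

```python
import torch
import torch.nn as nn
import torchvision

# Assumption: any frozen backbone producing fixed-size features works here; we use a
# randomly initialised torchvision ResNet-50 (2048-d features) purely as a stand-in
# for a self-supervised pre-trained encoder.
backbone = torchvision.models.resnet50()
backbone.fc = nn.Identity()            # keep only the pooled 2048-d features
for p in backbone.parameters():
    p.requires_grad = False            # linear evaluation: the backbone stays frozen
backbone.eval()

linear_probe = nn.Linear(2048, 1000)   # e.g. the 1000 ImageNet classes
optimizer = torch.optim.SGD(linear_probe.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def linear_eval_step(images, labels):
    """One training step of the linear probe on a labelled batch
    (e.g. drawn from the 1% or 10% label subsets)."""
    with torch.no_grad():
        feats = backbone(images)       # (B, 2048), no gradients through the backbone
    logits = linear_probe(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```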

When to use Self-Supervised techniques?

We clearly don't have a definitive answer to this, and most of what we mention below comes from common sense and some of our initial experiments. We will revisit these points as our experiments and research in this domain progress.

The decision depends on

  • available compute power
  • data availability
  • development time: labelling time and training deep learning models
  • availability of pre-trained models

So we have divided this decision space into four sections, summarised in the figure below.

self-supervised vs transfer learning

In reality, we often work in domains where suitable pre-trained models are very difficult to obtain. So the need for techniques like self-supervised learning is gaining traction, and more and more research is focused on this space. Let's look at some of these papers below.

Introduction

Self-supervised representation learning is broadly divided into two parts

  • Generative
  • Discriminative

Discriminative tasks are further divided into two parts

  • Auxiliary tasks
  • Contrastive learning

Contrastive learning is leading in terms of performance as of this writing. We label each paper mentioned below with its category.

Note: The goal is not to introduce every paper in this space. We only introduce papers that allow us to think in new directions, achieve SOTA, or improve greatly over previous methods.

Research papers

Unsupervised Visual Representation Learning by Context Prediction [Auxiliary, 2015]

  • This paper, dating back to 2015, was one of the first to show improvements in object detection on VOC without using ImageNet pre-trained weights.
  • It uses spatial context as a source of freely available labels to train the network.
Source: Doersch et al., 2015
  • Images are resized to 150k-450k total pixels while preserving the aspect ratio.
  • 96x96 patches are extracted with a 48-pixel gap between them. The gap is important to prevent the network from solving the task through trivial cues such as textures or boundary patterns continuing across adjacent patches, which would let it avoid learning useful features. To further reduce such shortcuts, a random jitter of -7 to +7 pixels is applied to each patch location (see the sampling sketch after this list).
  • A problem called chromatic aberration was observed, and colour dropping (randomly dropping 2 of the 3 RGB channels and replacing them with Gaussian noise) is used as a pre-processing step to avoid it.
  • The paper uses AlexNet and VGGNet as backbones for its experiments. The centre patch and one other patch (randomly chosen from the remaining 8) are each passed through the network, their features are fused, and the pair is classified into one of the 8 relative-position classes.
  • Using ImageNet pre-trained weights, R-CNN on the VOC 2007-12 dataset achieves 68.2 mAP, while pretext pre-trained weights achieve 61.7 mAP.
  • The learned features can also be used for visual data mining; normalised correlation is used as the metric to measure similarity between features.
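To make the pretext task concrete, here is a hedged sketch of the patch sampling described above. The helper name and the exact bounds handling are our own illustrative choices; only the patch size (96), gap (48), and jitter (7) come from the description of the paper.

```python
import random

def sample_context_pair(img, patch=96, gap=48, jitter=7):
    """Sample (centre patch, neighbour patch, relative-position label) from one
    image tensor/array of shape (3, H, W). Hypothetical helper following the
    settings above: 96x96 patches, 48-pixel gap, +/-7 pixel jitter. Assumes the
    image is comfortably larger than ~400 px on each side."""
    _, H, W = img.shape
    step = patch + gap                          # distance between neighbouring patch corners
    margin = step + jitter                      # keep every neighbour inside the image
    cy = random.randint(margin, H - patch - margin)
    cx = random.randint(margin, W - patch - margin)

    # the 8 neighbour positions, indexed 0..7: this index is the classification target
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    label = random.randrange(8)
    dy, dx = offsets[label]
    ny = cy + dy * step + random.randint(-jitter, jitter)
    nx = cx + dx * step + random.randint(-jitter, jitter)

    centre = img[:, cy:cy + patch, cx:cx + patch]
    neighbour = img[:, ny:ny + patch, nx:nx + patch]
    return centre, neighbour, label
```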

Image Colorization - Cross-Channel Encoder [Auxiliary, 2016]

  • Image colorization can be used as a pretext task to achieve self-supervision. Most of the data in many domains comes in colour, so training data is practically free: we can take any colour image in Lab space and use the L channel as input and the a, b channels as the target to train a deep learning model (a minimal sketch of building such pairs follows this list).
Source: Zhang, Isola, Efros. VGGNet used for colorization.
  • Colour prediction is treated as a multi-modal classification problem, since many objects can take several plausible colours. In addition, colour rebalancing, i.e. re-weighting the loss term towards rare colours, is used to train the network.
  • 65.6% accuracy is achieved on the ImageNet classification task when the network is pre-trained on ImageNet using colorization as the pretext task and all layers are then fine-tuned. When only the last layer (fc8) is fine-tuned, colorization pre-trained weights achieve only 52.4% accuracy, while ImageNet pre-trained weights achieve 76.8%.
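A hedged sketch of how such "free" training pairs can be built, assuming scikit-image is available. The scaling constants are common conventions rather than the paper's exact recipe, and the real method additionally quantises the ab space into bins for classification.

```python
import numpy as np
from skimage import color  # scikit-image, assumed available

def make_colorization_pair(rgb):
    """Turn any RGB image (H, W, 3), uint8 or float in [0, 1], into a free
    (input, target) pair for the colorization pretext task: the L channel is the
    input, the a/b channels are the target."""
    lab = color.rgb2lab(rgb)                        # convert to Lab colour space
    L = (lab[..., :1] / 100.0).astype(np.float32)   # lightness, roughly in [0, 1]
    ab = (lab[..., 1:] / 110.0).astype(np.float32)  # chroma, roughly in [-1, 1]
    return L, ab
```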

Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles [Auxiliary, 2017]

  • Contrary to context prediction, where only two patches (the centre and one of its 8 neighbours) are used at a time, the jigsaw puzzle task is solved by observing all the patches at the same time.
  • It uses a context-free network: the image is first resized to 256 pixels preserving the aspect ratio and then cropped to 225 x 225. The authors then obtain nine 75 x 75 patches and randomly crop a 64 x 64 region from each patch before passing it through the network. The fc6 features of all nine patches are concatenated and passed to the fc7 layer.
Context-free network (CFN)
  • For a 3 x 3 grid there are 9 patches to arrange, giving 9! = 362,880 possible orderings, and a classification layer of that cardinality is infeasible. So the authors pre-defined only 64 permutations and trained the network to solve those puzzles. The 64 permutations are chosen to have the highest Hamming distance between patch orderings (an ablation study decides whether to use the average, minimum, or maximum distance); a greedy selection sketch follows this list.
  • Colour jittering, gaps, normalisation, and training on both colour and black-and-white images are found to be effective. On the ImageNet classification task it achieves a top-1 accuracy of 67.6% when all layers are fine-tuned, while the supervised counterpart achieves 78.2%; accuracy drops to 34.6% when only the fully connected layers are trained.
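Here is a hedged sketch of one way to build such a permutation set with a greedy maximal-Hamming-distance selection. This is our own simplification, not the authors' exact procedure.

```python
import itertools
import numpy as np

def build_permutation_set(n_perms=64, n_patches=9, seed=0):
    """Greedily pick `n_perms` permutations of the patch positions that are far
    apart in Hamming distance, mirroring the 64 puzzle classes described above."""
    rng = np.random.default_rng(seed)
    all_perms = np.array(list(itertools.permutations(range(n_patches))))  # (9!, 9)
    chosen = [all_perms[rng.integers(len(all_perms))]]
    # running minimum Hamming distance from every candidate to the chosen set
    min_dist = (all_perms != chosen[0]).sum(axis=1)
    for _ in range(n_perms - 1):
        idx = int(min_dist.argmax())          # candidate farthest from all chosen so far
        chosen.append(all_perms[idx])
        min_dist = np.minimum(min_dist, (all_perms != all_perms[idx]).sum(axis=1))
    return np.stack(chosen)                   # (64, 9): row i defines puzzle class i

perms = build_permutation_set()
print(perms.shape)   # (64, 9)
```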

Unsupervised Representation Learning by Predicting Image Rotations [Auxiliary, 2018]

  • The ConvNet is trained to recognise the geometric transformation applied to an image, namely rotations by 0, 90, 180, and 270 degrees.
  • For a ConvNet to recognise the rotation applied to an image, it needs to understand the concept of the objects depicted in the image: their location, their type, and their pose.
Image rotation as a pretext task
  • To train the AlexNet-based RotNet model, the authors used SGD with batch size 192, momentum 0.9, weight decay 5e-4, and a learning rate of 0.01, dropping the learning rate by a factor of 10 after epochs 10 and 20 and training for 30 epochs in total. During training, all four rotated copies of an image are fed to the RotNet model simultaneously, in the same mini-batch (see the sketch after this list).
  • It achieves 36.5% accuracy on the ImageNet classification task when a logistic regression is trained on top of the conv5 features of RotNet, and 43.8% if 3 non-linear layers with normalisation are added on top of conv5.
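Building the rotated mini-batch is simple enough to sketch directly. A hedged PyTorch snippet follows; the helper name is ours.

```python
import torch

def make_rotation_batch(images):
    """Given a batch of images (B, C, H, W), return the four rotated copies stacked
    into one mini-batch of size 4B, plus rotation labels 0..3 for 0/90/180/270
    degrees, matching the training setup described above."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

x = torch.randn(8, 3, 224, 224)
batch, labels = make_rotation_batch(x)
print(batch.shape, labels.shape)   # torch.Size([32, 3, 224, 224]) torch.Size([32])
```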

Data-Efficient Image Recognition with Contrastive Predictive Coding AKA CPC 2.0 [Contrastive, 2020]

  • First in this list, this paper is an improvement over the original paper (van den Oord et al., 2018) on generalised self-supervision for data with a spatial mode (vision) or temporal mode (text, audio, video). We discuss only CPC 2.0 here because it has various improvements over the initial version and is specifically designed for images.
  • Using only 1% of the ImageNet labels and training on CPC representations, the model reaches 78% top-5 accuracy. Using a ResNet-161 backbone, all labels, and a frozen backbone for classification, it obtains 71.5% top-1 and 90.1% top-5 accuracy.
contrastive predictive coding
  • A batch of images, say (2, 3, 256, 256), is read and a grid of overlapping patches, say (98, 3, 64, 64), is extracted from it (patch size 64, stride 32, giving 7x7 patches per image; see the sketch after this list). The patches are sent to a ResNet encoder f(x), whose output of shape (98, 1024) is then reshaped to (2, 1024, 7, 7).
  • The authors use a context predictor network referred to as a masked ConvNet (it only looks at part of the feature grid at a time), g(phi), which takes a subset of the ResNet output, say the [0:3, 0:3] block, and tries to predict [3:3+k, 0:3] in the x-direction and [0:3, 3:3+k] in the y-direction.
  • Patches from other parts of the image or from a different image are treated as negative examples, while the adjacent patches are the positive examples. Using these predictions and targets, InfoNCE (noise contrastive estimation) is used to compute the loss.
  • The context predictor network is discarded for downstream tasks; only the ResNet encoder is used to extract features from images.
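A hedged sketch of the patch-grid extraction step, reproducing the shapes mentioned above; the helper name is ours.

```python
import torch

def extract_patch_grid(images, patch=64, stride=32):
    """Cut each 256x256 image into an overlapping 7x7 grid of 64x64 patches
    (stride 32): a (2, 3, 256, 256) batch becomes (98, 3, 64, 64) before going
    through the ResNet encoder f(x)."""
    B, C, _, _ = images.shape
    patches = images.unfold(2, patch, stride).unfold(3, patch, stride)  # (B, C, 7, 7, 64, 64)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()            # (B, 7, 7, C, 64, 64)
    return patches.view(-1, C, patch, patch)                            # (B*49, C, 64, 64)

x = torch.randn(2, 3, 256, 256)
print(extract_patch_grid(x).shape)   # torch.Size([98, 3, 64, 64])
```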

Momentum Contrast for Unsupervised Visual Representation Learning (MoCo) [Contrastive, March 2020]

  • MoCo achieves 60.6% top-1 accuracy on linear classification of ImageNet using ResNet-50, and 68.6% with a 4x-wider ResNet-50. It consistently performs well on other detection/segmentation downstream tasks.
  • The pretext task: two different views of the same image form a positive pair, whereas views of different images form negative pairs. This is contrastive learning.
  • Unlike other methods, this paper proposes to maintain a dynamic memory bank of negative samples, continuously updated as a queue. Accuracy increases with the size of the memory bank and saturates at around 16,384-65,536 samples.
MoCo Network
  • There are two encoders: a query encoder (fq) and a key encoder (fk). In experiments, keeping fq and fk fully independent, or making them identical, both failed, so the authors introduced a momentum rule to update the weights of fk:
Updating the key encoder weights with momentum: θ_k ← m·θ_k + (1 - m)·θ_q
  • The authors tried different momentum values and found 0.999 to work well, suggesting that a slowly evolving key encoder is core to making good use of the memory bank.
  • The pseudo-algorithm of MoCo is shown below; a simplified PyTorch sketch of the momentum update and loss follows it.
MoCo pseudo-algorithm
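A hedged, simplified sketch of the two pieces above: the momentum update and the queue-based contrastive loss. The feature dimension (128), temperature (0.07), and queue size are illustrative, and the dequeue/enqueue bookkeeping and distributed-training details are omitted.

```python
import torch
import torch.nn.functional as F

m, T = 0.999, 0.07                                      # momentum and (illustrative) temperature
queue = F.normalize(torch.randn(128, 16384), dim=0)     # memory bank: one 128-d key per column

@torch.no_grad()
def momentum_update(f_q, f_k):
    """theta_k <- m * theta_k + (1 - m) * theta_q; no gradients ever flow into f_k."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def moco_loss(q, k):
    """q: (N, 128) query features, k: (N, 128) key features, both L2-normalised.
    The positive key sits at index 0 of the logits; the queue provides negatives."""
    l_pos = (q * k).sum(dim=1, keepdim=True)   # (N, 1) similarity with the positive key
    l_neg = q @ queue                          # (N, K) similarity with the queued negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / T
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```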

A Simple Framework for Contrastive Learning of Visual Representations (SimCLR) [Contrastive, July 2020]

  • A linear classifier trained on the self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, it achieves 85.8% top-5 accuracy, outperforming AlexNet with 100x fewer labels.
  • SimCLR discards memory banks and instead uses large batch sizes and longer training. It uses a stochastic data augmentation module that randomly transforms any given example into two correlated views of the same example.
  • A base encoder f(.) extracts representations. SimCLR introduces a small network called the projection head g(.) that maps the representations to the space where the contrastive loss is applied; the projection head is an MLP with one hidden layer.
  • As mentioned, SimCLR does not use a memory bank. Instead it takes a large batch of N images and augments them into 2N data points, so each image has 1 positive pair and 2(N-1) negative pairs. Cosine similarity is used to compare the vectors, and the final loss is computed across all positive pairs, both (i, j) and (j, i), in a mini-batch. It is termed NT-Xent (the normalized temperature-scaled cross-entropy loss). The pseudo-algorithm is shown below, followed by a compact sketch of the loss.
SimCLR pseudo-algorithm
  • For data augmentation they use random crop and resize (with random flip), colour distortion, and Gaussian blur. They use ResNet-50 as the base encoder network and a 2-layer MLP projection head to project the representation to a 128-dimensional latent space. The NT-Xent loss is optimised with LARS at a learning rate of 4.8 (= 0.3 x BatchSize/256) and a weight decay of 1e-6. They train with batch size 4096 for 100 epochs, using a linear warm-up for the first 10 epochs and then decaying the learning rate with a cosine schedule without restarts.
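A hedged, compact sketch of the NT-Xent loss described above. It is a straightforward reading of the definition rather than the reference implementation, and the default temperature is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """NT-Xent over a batch of N positive pairs. z_i, z_j: (N, D) projections of
    the two augmented views of the same images."""
    N = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)    # (2N, D) unit vectors
    sim = z @ z.t() / temperature                           # (2N, 2N) scaled cosine similarities
    mask = torch.eye(2 * N, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))              # a sample is never its own positive
    # the positive for row i is row i+N, and vice versa
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)])
    return F.cross_entropy(sim, targets)
```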

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (SwAV) [Contrastive, July 2020]

  • After self-supervised training, linear classification on ImageNet with 10% of the labels achieves 70.2% top-1 accuracy, while with 1% of the labels it reaches 53.9% top-1 accuracy. With 100% of the labels it reaches 75.3% top-1 accuracy, while supervised training achieves 76.5%. All experiments use the ResNet-50 architecture.
  • SwAV compares cluster assignments instead of comparing features directly. Take two differently augmented views xi and xj of a batch of images, cluster the features of xi, use those cluster assignments as targets when computing the loss on the features of xj, then do the same in the other direction and aggregate the two losses.
Contrastive instance learning vs the SwAV architecture
  • First an image is taken and two views xi and xj of it are obtained. These are passed through an encoder network to get latent vectors zi and zj, which are then passed through a projection head similar to the one used in SimCLR, giving 128-dimensional embeddings. These are normalised and passed through a prototypes layer (a linear layer) to obtain 3000-dimensional prototype scores, which are then run through the Sinkhorn algorithm to compute codes (cluster assignments); a sketch of this step follows the list.
  • The paper also introduces a new data-augmentation strategy called multi-crop, mixing views at different resolutions instead of using only two full-resolution views. SwAV works with both small and large batches.
  • SwAV converges much faster and requires much less memory compared to MoCo v2 (no memory bank of 65,536 features is needed) and SimCLR (very large batches are not required).
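A hedged sketch of the Sinkhorn-Knopp step that turns prototype scores into codes, following the standard formulation used in SwAV-style methods. The epsilon and iteration count here are typical values, not necessarily the paper's exact settings.

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn a (B, K) matrix of prototype scores into soft cluster assignments
    ('codes'): each row becomes a distribution over the K prototypes while the
    batch is spread evenly across prototypes."""
    Q = torch.exp(scores / eps).t()          # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)      # rows: equal total mass per prototype
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)      # columns: each sample is a distribution
        Q /= B
    return (Q * B).t()                       # (B, K), rows sum to 1

codes = sinkhorn(torch.randn(32, 3000))      # e.g. 32 samples, 3000 prototypes
print(codes.shape, codes.sum(dim=1)[:3])     # rows sum to ~1
```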

End Notes

This is it. In reviewing these papers, we found that one of the key metrics to look for when trying to bring these techniques into your workflow is ImageNet classification accuracy with 1% and 10% of the labels. It allows us to compare self-supervised models with their supervised counterparts. Recent papers publish this metric, whereas the earlier papers only reported the accuracy obtained when fine-tuned on 100% of the labels.

We have not yet reviewed any generative approaches in this blog. Going forward we will review BigBiGAN and other autoencoder-based self-supervision papers (their performance is not on par with contrastive learning).

We will also review more papers in the contrastive learning space over time. Please suggest any important breakthrough papers in self-supervised learning that we have missed; we would like to look at them and update this blog.

Thanks

Work by Prakash Jay and Abhishek Chopde
