Exploring the Essence of SimCLR

Joanne Jons
Published in WiCDS
4 min read · Jan 11, 2021

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

All human beings have the ability to identify objects. We recognise shapes, colours, animals, birds and all kinds of things, even though we might not remember what the object actually looks like up close and in detail. We can tell that there’s a deer in an image based on certain features like the number of legs, the shape of the body, the shape of the antlers and so on. This ability is the basis of certain machine learning concepts like contrastive learning and self-supervised learning, which pave the way for the interesting research that I’m writing about: SimCLR.

At the International Conference on Machine Learning (ICML 2020), a team from Google Research consisting of Ting Chen, Simon Kornblith, Mohammad Norouzi and Geoffrey Hinton published a paper titled ‘A Simple Framework for Contrastive Learning of Visual Representations’. They proposed a framework that can efficiently learn useful representations without requiring specialised architectures or a memory bank.

Now, coming back to contrastive learning: these are algorithms that build on this human ability of recognition and focus on encoding high-level features that are sufficient to distinguish objects. Contrastive learning uses a self-supervised approach. Generally, machine learning approaches are classified into three main categories: supervised, unsupervised and reinforcement learning.

Types of Machine Learning Algorithms

Self-supervised learning is a subset of unsupervised learning, as it converts an unsupervised learning problem into a supervised one. A model is trained using labels that are naturally part of the data, rather than requiring separate external labels. The idea of self-supervised learning was actually discussed way back in 1989, in the paper titled Making the World Differentiable by Jürgen Schmidhuber.

This approach is widely used in the field of natural language processing, which has shown that it is possible to achieve strong results by pre-training on a large unlabeled dataset followed by fine-tuning on a smaller labeled dataset. In the field of computer vision, many models, including Exemplar-CNN, CPC, AMDIM and MoCo, have implemented this approach.

In A Simple Framework for Contrastive Learning of Visual Representations, the researchers put forward a method that simplifies and improves upon previous approaches to contrastive learning on images. The paper also reports the major findings that enable good contrastive learning, including comparisons of how learning behaves when each component is modified in certain ways. The architecture of the SimCLR framework is quite simple. Here are the four major components:

The first component is the Data Augmentation module. It randomly transforms any given image example into two correlated views of the same image, using augmentation operations like random cropping and resizing, colour distortion and Gaussian blur.
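
To make this concrete, here is a minimal sketch of a two-view augmentation pipeline using torchvision; the crop size, jitter strengths and blur kernel below are illustrative assumptions, not the paper’s exact hyperparameters.

```python
import torchvision.transforms as T

# Two-view augmentation sketch: crop size, jitter strengths and blur kernel
# are illustrative assumptions, not the paper's exact settings.
simclr_augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

def two_views(image):
    # Two independent random transformations of the same image form a positive pair.
    return simclr_augment(image), simclr_augment(image)
```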

Now that we have two different versions of the same image, a Neural Network Base Encoder is used to extract representation vectors from them. A speciality of this framework is that there are no restrictions on the choice of network architecture for the encoder.
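
As a sketch, any backbone can play this role; below, ResNet-50 (the default backbone in the paper) with its classification layer removed serves as the encoder.

```python
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Base encoder f(.): any backbone works; ResNet-50 is shown as one choice."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.feature_dim = backbone.fc.in_features  # 2048 for ResNet-50
        backbone.fc = nn.Identity()                 # drop the classification head
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x)  # h: the representation kept after pre-training
```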

The third component is the Neural Network Projection Head. The researchers used a multi-layer perceptron with one hidden layer to map the representations to the space where the contrastive loss is applied. This introduces a non-linear transformation that improves the quality of the learned representations.
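
A sketch of such a projection head follows; the 128-dimensional output follows the paper, while the hidden width is an assumption chosen here to match the encoder’s output size.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Projection head g(.): an MLP with one hidden layer mapping h to z."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),   # the non-linear transformation
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)  # z = g(h), used only for the contrastive loss
```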

The last component is the Contrastive Loss Function. This objective function makes the representations of corresponding pairs “attract” each other and those of non-corresponding pairs “repel” each other. The contrastive loss used here is the normalized temperature-scaled cross-entropy loss (NT-Xent).
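
Below is a compact sketch of the NT-Xent computation over a batch of projected views; the temperature value is illustrative, since the paper treats it as a tunable hyperparameter.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent sketch: z1, z2 are projections of the two views, shape (N, d)."""
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N unit vectors
    sim = z @ z.t() / temperature                       # cosine similarity / tau
    sim.fill_diagonal_(float('-inf'))                   # a view is never its own pair
    # Row i's positive is the other view of the same image (index i + N or i - N).
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)])
    return F.cross_entropy(sim, targets)
```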

SimCLR’s learning algorithm integrates all the components mentioned above and combines the paper’s major findings to outperform previous methods for self-supervised and semi-supervised learning on ImageNet.
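
Putting the sketches above together, one training step might look roughly like this; the optimizer and learning rate are placeholders (the paper trains with the LARS optimizer and very large batch sizes).

```python
import torch

# Builds on the Encoder, ProjectionHead, two_views and nt_xent_loss sketches above.
encoder, head = Encoder(), ProjectionHead()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=3e-4
)  # placeholder optimizer; the paper uses LARS with large batches

def train_step(pil_images):
    views = [two_views(img) for img in pil_images]
    x1 = torch.stack([a for a, _ in views])
    x2 = torch.stack([b for _, b in views])
    z1, z2 = head(encoder(x1)), head(encoder(x2))  # augment -> encode -> project
    loss = nt_xent_loss(z1, z2)                    # contrast the two views
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```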

“A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50.”

Recently, the same group of researchers, along with Kevin Swersky, developed SimCLRv2 and published it in the paper titled Big Self-Supervised Models are Strong Semi-Supervised Learners. This version employs knowledge distillation, a shrinking process that makes the network easier to deploy. Knowledge distillation is a process by which a smaller student network learns from a pre-trained teacher network.

Knowledge Distillation
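
As a rough illustration of the idea (a standard distillation objective, not SimCLRv2’s exact recipe), the student is trained to match the teacher’s softened output distribution; the temperature value is an assumption.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # The student matches the teacher's softened class probabilities;
    # the temperature value here is an illustrative assumption.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction='batchmean') * temperature ** 2
```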

Techniques that make it possible to train neural networks effectively on relatively few labeled images are having a large impact on the fields of deep learning and computer vision, because obtaining large labeled datasets is tedious. Research in this area shines a light on problems like diagnosing medical images, detecting defects on a manufacturing line and much more. The progress from SimCLR to its second version promises new advances in this domain.
