What is Self-Supervised Learning in Computer Vision? A Simple Introduction.

Published in Analytics Vidhya

5 min read · Aug 2, 2020

“If intelligence is a cake, the bulk of the cake is self-supervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL).” — Yann LeCun head of Facebook AI

If you want to see an easy example of self-supervised learning, with code, check out my other post here.

Self-supervised learning has become a hot topic in the field of Machine Learning lately, with several giants of the field (such as Hinton and Yann LeCun) promoting its importance. In this post I am going to attempt to define Self-Supervised Learning, explain why we use it and give you an (extremely) simple example of how it can be used in practice.

First we need to define some terms:

Supervised Learning
Self-Supervised Learning
Pretext and Downstream Tasks

Supervised Learning

The typical supervised learning setup can be explained using the example data above. In this case we are dealing with a binary classification problem, where the objective is to classify data samples into class 0 or class 1 based on the feature vector X = (x1, x2).

An example of a supervised learning setup would be that we build a machine learning model that learns how to classify samples into class 0 or 1 by predicting the probability of class 0 given X, p(y=0|X), and then penalizing the system for getting it wrong. That’s supervised learning in a nutshell.
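To make this concrete, here is a minimal sketch of such a setup using logistic regression in NumPy. The toy data, the learning rate, and the function names are my own assumptions for illustration. The "penalty" here is the cross-entropy loss, and the model is trained by plain gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.5, steps=2000):
    """Fit weights w and bias b by gradient descent on the cross-entropy loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p1 = sigmoid(X @ w + b)      # predicted p(y=1|X)
        error = p1 - y               # gradient of the loss w.r.t. the logit
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b

def predict_proba_class0(X, w, b):
    """Return p(y=0|X) = 1 - p(y=1|X)."""
    return 1.0 - sigmoid(X @ w + b)

# Hypothetical toy data: four labeled samples with features X = (x1, x2).
X = np.array([[0.0, 0.2], [0.3, 0.1], [0.9, 1.0], [1.0, 0.8]])
y = np.array([0, 0, 1, 1])

w, b = train_logistic_regression(X, y)
p0 = predict_proba_class0(X, w, b)
predicted_labels = (p0 < 0.5).astype(int)   # class 1 when p(y=0|X) is below 0.5
```

The important part is that the labels y have to come from somewhere. In ordinary supervised learning, a human wrote them down.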

One key implied feature of supervised learning is that the labeled data pairs are generated by human annotation (think humans finding cats or dogs in images). This puts a limit on the amount of data that can be generated for a specific task, and as data is one of the most important limiting factors of performance, this is a problem. Labeled data constitutes a tiny fraction of all the data out there. If we only use our labeled datasets we are leaving a lot on the table.

How do we get around this problem? Enter self-supervised learning.

Self-Supervised Learning

Self-supervised learning is the concept of training an ML system on a task in which we can generate the input and target pairs (X, y) automatically, thereby sidestepping the whole problem of human data labeling.

Self-supervised learning is still supervised learning, so everything we said about supervised learning still applies. The only difference is in which tasks we are solving and how our labeled data pairs are generated.

In the self-supervised learning paradigm, we want to find some way of generating our labeled data pairs without involving any humans. We want a machine to do it. If we think of our system as consisting of a data generation algorithm and a learning algorithm, the system generates “its own labeled data”, and so we call it self-supervised. But how do we find such a data generation algorithm? We need to decide on a relevant pretext task.

Pretext and Downstream Tasks

In computer vision, pretext tasks are tasks that are designed so that a network trained to solve them will learn visual features that can be easily adapted to other downstream tasks.
A downstream task is a task that typically has real-world applications and human-annotated data.

There are many different kinds of pretext tasks. The simplest ones typically involve applying an augmentation to the input data and then training a network to predict which augmentation has been applied. Examples include rotation, color removal, and more. This way we can generate both the input and the solution to the chosen task automatically.
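As a rough sketch of how such a data generation algorithm could look, here is a rotation pretext task in NumPy. The function name `make_rotation_dataset` and the random arrays standing in for unlabeled images are assumptions made for illustration. Each image is rotated by a random multiple of 90 degrees, and the number of turns becomes the label.

```python
import numpy as np

def make_rotation_dataset(images, rng):
    """Rotate each image by a random multiple of 90 degrees.

    Returns the rotated images and the rotation index (0, 1, 2, or 3),
    which serves as the automatically generated label.
    """
    inputs, targets = [], []
    for image in images:
        k = int(rng.integers(0, 4))        # number of 90-degree turns
        inputs.append(np.rot90(image, k))  # rotates the first two axes (H, W)
        targets.append(k)
    return inputs, np.array(targets)

rng = np.random.default_rng(0)
# Stand-in for unlabeled images: 8 random arrays of shape (height, width, channels).
unlabeled_images = [rng.random((32, 32, 3)) for _ in range(8)]
X_pretext, y_pretext = make_rotation_dataset(unlabeled_images, rng)
```

No human ever looked at these images, yet we end up with (X, y) pairs that a classifier can be trained on in the usual supervised way.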

How does it work in practice? Here’s an example of a very simple setup using puzzle-solving as the pretext task:

We are working with images, and our downstream task is image classification: we want to classify images into the correct category (dog, cat, horse, etc.).

However, we only have 1000 labeled images, which proves to be too few samples for our neural network to generalize well.

Instead of initiating a new labeling effort and employing humans to do the tedious task of labeling new images, we will use self-supervised learning to let our neural net first learn some general features of images, before we fine-tune it on the target classification task.

An example of a pretext task (or auxiliary task) in computer vision is puzzle solving.

The hope here is that in order to learn how to solve these types of puzzles, our neural network will need to learn some general features of the distribution of images we are using as our self-supervised data set, and that these features can be adapted for use in our downstream task. When we then fine-tune our neural network on a downstream task, such as object classification, we hope to see that some of the learning transfers.
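Here is one way the puzzle data could be generated, sketched in NumPy. The details are my assumptions rather than a fixed recipe: a 2x2 grid of tiles, and a label that is simply the index of the permutation used to shuffle them. The names `PERMUTATIONS`, `split_into_tiles`, and `make_puzzle` are made up for this example.

```python
import itertools
import numpy as np

# All 24 orderings of the four tiles in a 2x2 grid; the label indexes this list.
PERMUTATIONS = list(itertools.permutations(range(4)))

def split_into_tiles(image):
    """Cut a (H, W, C) image into four equal tiles, in row-major order."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(2) for c in range(2)]

def make_puzzle(image, rng):
    """Shuffle the tiles of one image and return (shuffled_tiles, label)."""
    tiles = split_into_tiles(image)
    label = int(rng.integers(len(PERMUTATIONS)))
    order = PERMUTATIONS[label]
    shuffled = np.stack([tiles[i] for i in order])   # shape (4, H/2, W/2, C)
    return shuffled, label

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))      # stand-in for one unlabeled image
puzzle, label = make_puzzle(image, rng)
```

A network that receives the shuffled tiles and has to predict the label can only succeed by picking up on how image parts fit together, which is the kind of general feature we are hoping for.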

Now that we have trained a neural network that performs well on the pretext task, what we are really interested in is the feature extraction layers. This is where the lower-level and more task-agnostic features will reside, and we will repurpose them for our downstream task.

Combining the transplanted feature extraction layers with task-dependent final layers, we get our target model.
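The transplant step can be sketched with a toy network in NumPy. Everything here is an illustrative assumption: a real setup would use a convolutional network in a deep learning framework, and the pretext model would actually be trained before its layers are copied. The point is only to show which parts are reused and which are new.

```python
import numpy as np

def init_model(n_inputs, n_features, n_outputs, rng):
    """A tiny two-part network: feature extraction layer plus task-specific head."""
    return {
        "features": {"W": rng.normal(0, 0.1, (n_inputs, n_features)),
                     "b": np.zeros(n_features)},
        "head": {"W": rng.normal(0, 0.1, (n_features, n_outputs)),
                 "b": np.zeros(n_outputs)},
    }

def forward(model, x):
    """Return class scores (logits) for a batch of flattened images."""
    hidden = np.maximum(0.0, x @ model["features"]["W"] + model["features"]["b"])
    return hidden @ model["head"]["W"] + model["head"]["b"]

rng = np.random.default_rng(0)
n_inputs, n_features = 32 * 32 * 3, 64

# Pretext model with a 4-way head (for example, one of four rotations).
# In practice this model would be trained on the pretext task first.
pretext_model = init_model(n_inputs, n_features, 4, rng)

# Target model: copy the feature layer, attach a fresh 3-way head (dog, cat, horse).
target_model = init_model(n_inputs, n_features, 3, rng)
target_model["features"] = {name: value.copy()
                            for name, value in pretext_model["features"].items()}

batch = rng.random((5, n_inputs))    # stand-in for five flattened images
logits = forward(target_model, batch)
```

Fine-tuning then means training `target_model` on the small labeled dataset, starting from the copied feature weights instead of from scratch.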

Downstream Task Supervised Learning Setup

And that’s it!

Now you have a rudimentary understanding of self-supervised learning and its utility.

In summary, we can use self-supervised learning to overcome our lack of labeled data in a given task. We choose a pretext task such that we have an algorithm that can generate labeled data pairs. Then we pre-train our model on this pretext task with lots of data, hoping that the features it learns will be transferable to our downstream task, and the amazing thing is that they often are! In one sentence, self-supervised learning gets us good performance with less human-annotated data.

Written by Lars Vagnes