Self-Supervised Learning: Learning as Humans Do

Azada Henze · Published in Comsysto Reply · Feb 6, 2023

Introduction

Self-supervised learning is one of the latest breakthroughs in the field of AI. It is also the core technology behind DINO, a cutting-edge approach to computer vision from Meta AI. However, before we can apply DINO to real-world use cases, we need to understand what self-supervision implies. This article introduces self-supervised learning and is the first post in a series of articles elaborating the theoretical background behind DINO. In the upcoming posts, we will also cover vision transformers and the other technologies required for training the DINO model.

Learning from human learning

Self-supervised learning elegantly tackles a central bottleneck of machine learning: the scarcity of labeled data. Instead of requiring human-labeled data to learn meaningful representations, it creates labels automatically from the input data itself. In NLP, for instance, a model learns textual representations by masking a word in a sentence and using that word as the label. In computer vision, the model removes a part of an input image, and the learning task is to predict the missing part. Before the model can predict the generated label, it has to understand the contextual information surrounding it. The inspiration is drawn from human learning: a child acquires common world knowledge only to a small degree through supervision, and to a great extent by interacting with the world independently. This motivation is advocated by Yann LeCun, one of the fathers of deep learning and a driving force behind self-supervised learning.
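
To make this concrete, here is a minimal Python sketch (an illustration of the idea, not code from any of the systems mentioned here) showing how a masked-word training pair can be derived from raw text alone, with no human annotator involved:

```python
# Minimal sketch: deriving a (masked input, label) training pair from raw
# text, as in masked-word pretraining. No model is trained here; the point
# is that the label comes from the data itself, not from a human annotator.
import random

def make_masked_example(sentence: str, mask_token: str = "[MASK]"):
    words = sentence.split()
    idx = random.randrange(len(words))  # pick one word to hide
    label = words[idx]                  # the hidden word becomes the label
    words[idx] = mask_token
    return " ".join(words), label

masked, label = make_masked_example("self supervised models create labels from the data itself")
print(masked)  # e.g. "self supervised models create [MASK] from the data itself"
print(label)   # e.g. "labels"
```

A real masked-language model generates millions of such pairs and trains a network to recover the label from the surrounding context.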

Self-supervised Learning vs. Supervised and Unsupervised Learning

The performance of machine learning models depends heavily on the availability of training data. Self-supervised learning is a new approach in the AI community that aims at solving the challenges of representation learning that supervised and unsupervised learning algorithms fail to solve. Supervised learning remains the gold standard where label quality is concerned: hand-labeled data is highly reliable. However, manually annotating data is expensive in terms of time, money, and human effort.

In the blog post Self-supervised learning: The dark matter of intelligence, Yann LeCun and Ishan Misra describe the main advantage of self-supervised algorithms: they learn the world knowledge behind the data by interacting with the data itself. Besides avoiding the cost of manually tagging the data, self-supervised models acquire domain-specific contextual information that goes beyond the specific labels that supervised models learn from. The cross-domain knowledge transfer that distinguishes self-supervised algorithms further enriches their ability to generalize to new, unobserved domains. Needless to say, knowledge transfer has always been a challenge for supervised learning models.

At this point, it is important to note that self-supervised learning models still require human intervention to define how to learn from the data. Precisely this distinguishes self-supervised learning from unsupervised learning, which learns representations through clustering, dimensionality reduction, or grouping. In a self-supervised approach, we still have labels that boost model performance, but we avoid the cost of generating them.

How does self-supervised learning work?

As mentioned above, self-supervised learning models mask some part of the input data and then try to predict the hidden part. In other words, the supervision is a natural part of the data. In computer vision, several image manipulation methods can be used to generate the necessary training data: image inpainting, image corruption, automatic colorization, and misplacing image segments, to name a few, have all been used as labeling techniques.
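
As a toy sketch (assumed shapes and array format, not taken from any published implementation), the following code turns an unlabeled image into an inpainting training pair: the erased patch becomes the label, and the corrupted image becomes the input.

```python
# Toy sketch: turning an unlabeled image into an inpainting training pair.
# The erased patch is the prediction target; the corrupted image is the
# model input. No human annotation is involved at any step.
import numpy as np

def make_inpainting_pair(image: np.ndarray, patch: int = 8):
    h, w = image.shape[:2]
    top = np.random.randint(0, h - patch)
    left = np.random.randint(0, w - patch)
    label = image[top:top + patch, left:left + patch].copy()  # ground truth
    corrupted = image.copy()
    corrupted[top:top + patch, left:left + patch] = 0.0       # erase region
    return corrupted, label

image = np.random.rand(32, 32, 3)      # stand-in for a real photo
corrupted, label = make_inpainting_pair(image)
print(corrupted.shape, label.shape)    # (32, 32, 3) (8, 8, 3)
```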

Image inpainting is one of the simplest examples of image manipulation used in self-supervised learning. The original paper on image inpainting by Deepak Pathak et al., Context Encoders: Feature Learning by Inpainting, demonstrates the results of a CNN model trained on ImageNet and the Paris StreetView dataset.

This approach introduces semantic context encoders as a pretraining step for image classification and object detection. The authors successfully train CNN models to learn the surrounding information in an image and then predict the missing pixels. They also confirm the claim postulated by the advocates of self-supervised learning: models trained in a self-supervised manner are more scalable and more transferable to new domains. In layman's terms, just as a child can identify a panda at the zoo after seeing only a few pictures of one, a self-supervised algorithm can perform the same task without costly, error-prone manual annotation of the data.
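
At training time, the objective is simply to make the predicted pixels match the held-out ones. Below is a minimal sketch of such a reconstruction (L2) objective; note that the context-encoder paper combines it with an adversarial loss, which is omitted in this toy version:

```python
# Sketch of the reconstruction objective: minimize the pixel-wise L2
# distance between the model's predicted patch and the held-out ground
# truth. (The context-encoder paper adds an adversarial term as well.)
import numpy as np

def reconstruction_loss(predicted: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((predicted - target) ** 2))

predicted = np.random.rand(8, 8, 3)  # stand-in for a network's output patch
target = np.random.rand(8, 8, 3)     # the erased ground-truth patch
print(reconstruction_loss(predicted, target))
```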

Success Stories

Self-supervised learning is a well-researched and widely used technique at leading research organizations. Some of the most famous examples in NLP include Google's Bidirectional Encoder Representations from Transformers (BERT) and GPT-3, an autoregressive language model from OpenAI. Both learn language representations from raw text alone, using self-supervised objectives such as masked-word and next-word prediction.

As mentioned in the introduction, DINO, a self-supervised learning model with vision transformers for computer vision, was introduced by Meta AI in 2021. In future posts, we will elaborate on the remaining theoretical preliminaries of the model, such as vision transformers, and also apply it to real-world use cases.

Passion, friendship, honesty, curiosity. If this appeals to you, Comsysto may well be your future. Apply now to join us!

This blog post is published by Comsysto Reply GmbH
