Self-Supervised Learning

Merantix Momentum
Merantix Momentum Insights
Aug 10, 2022

Authors: Konstantin Ditschuneit, Alexandra Lindt

Self-supervised learning (SSL) has become one of the most popular approaches for learning representations of high-dimensional data. Instead of learning solely from labeled data as in the traditional supervised setting, the central idea is to leverage knowledge about the semantic similarity between data samples.

SSL methods are used for pretraining models on unlabeled data to significantly reduce the amount of labeled data required for fine-tuning on downstream tasks.

Since SSL methods already achieve state-of-the-art results in Natural Language Processing (NLP) as well as Computer Vision, we believe that every machine learning engineer or researcher should understand its core concepts. Inspired by Balestriero and LeCun’s insightful review (Balestriero&LeCun 2022), we provide an overview of the underlying properties hidden in today’s SSL methods.

Classification as a graph problem

In conventional supervised learning, we aim to learn a function F: X → Y based on a set of N input-output pairs (x, y) ∈ X × Y. A typical scenario of supervised learning is classification, where X is a collection of data samples x, each labeled with a category (or class) y ∈ Y. For example, X could be a set of animal pictures and Y a set of corresponding animals such as Dog, Cat, Zebra, etc.

Instead of viewing the data as sample-class pairs, we can think of classes as expressing a correspondence: Samples of the same class are connected, while samples of different classes are not.

As shown in Figure 1, this correspondence can be expressed as an undirected graph with a node for every sample x ∈ X and edges between nodes corresponding to samples with the same class label y. Notice that the graph consists of a fully connected subgraph for every class y ∈ Y.

For each graph, we can construct the symmetric adjacency matrix G ∈ {0,1}ᴺˣᴺ that represents the graph structure as

Gᵢⱼ = 1 if yᵢ = yⱼ, and Gᵢⱼ = 0 otherwise.

Since G summarizes all the knowledge contained in the labels yᵢ ∈ Y of the dataset, it can serve as a training signal for the classification task. To convert this graph representation into the usual representation of class labels, we can simply assign a one-hot vector to each subgraph.
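The construction above can be sketched in a few lines of NumPy. `label_adjacency` is a hypothetical helper name; it builds G directly from the class labels:

```python
import numpy as np

def label_adjacency(labels):
    """Symmetric adjacency matrix G in {0,1}^(N x N):
    G[i, j] = 1 iff samples i and j share the same class label."""
    y = np.asarray(labels)
    return (y[:, None] == y[None, :]).astype(int)

# Three samples of class "dog" and one of class "cat":
G = label_adjacency(["dog", "cat", "dog", "dog"])
# The "dog" nodes form a fully connected subgraph; the "cat"
# node is connected only to itself.
```

Note that each node is connected to itself (Gᵢᵢ = 1), matching the convention that every sample trivially belongs to its own class.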

Self-Supervised Learning as a graph problem

Let’s consider the self-supervised setting, where we have only unlabeled data samples x available.

Unlike in the supervised classification setting described above, the correspondence graph matrix G cannot be directly determined from the class labels of the training samples.

Instead, the approach is to construct G from X solely by using semantics-preserving transformations of the given samples. Formally, we define a set of transformations T that we assume do not alter the underlying semantics S of a sample x too much, i.e.

S(t(x)) ≈ S(x) for all t ∈ T.

As an example, think of X as a collection of bird images and of T as the set of rotations by 90°, 180°, or 270°.

We know that if we apply one of the transformations t ∈ T to an input image x, the resulting image t(x) still shows the same bird — merely rotated. It is important to notice that by defining a set of semantics-preserving transformations, we make use of meta-knowledge about our dataset: specifically, knowledge of what makes a sample semantically similar to other samples.
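As a minimal sketch of this idea, NumPy's `rot90` implements exactly the rotations by 90°, 180°, and 270° used in the bird example; the toy array below stands in for an actual image:

```python
import numpy as np

# A toy 4x4 grayscale "image" standing in for a bird photo.
image = np.arange(16).reshape(4, 4)

# The set T of semantics-preserving transformations:
# rotations by 90, 180, and 270 degrees.
transforms = [lambda x, k=k: np.rot90(x, k) for k in (1, 2, 3)]

# Each transformed view t(x) is a valid positive partner for x:
# the pixel content is unchanged, only its orientation differs.
views = [t(image) for t in transforms]
```

Each view contains exactly the same pixel values as the original, rearranged, which is the sense in which the transformation preserves content while changing appearance.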

In analogy to supervised classification, we can interpret this procedure as creating a class for each data sample x and assigning the transformed version t(x) to the original's class. In a graph representation, this corresponds to an undirected graph with a node for each sample x ∈ X and each transformed sample t(x), and an edge between the node of the original sample and that of its transform, as depicted in Figure 2.

Figure 2: Samples x ∈ X and transformations t ∈ T are used to construct sample pairs (x, t(x)) without any labels. These pairs are called positive pairs (x, x⁺), whereas pairwise unconnected nodes in the graph are called negative pairs (x, x⁻).

The resulting graph can now serve as a training signal, just as in the supervised case. Analogously, the graph can be represented by its adjacency matrix G ∈ {0,1}ᴺˣᴺ, where Gᵢⱼ = 1 if nodes i and j are connected (i.e., one sample is a transform of the other) and Gᵢⱼ = 0 otherwise.

We call this framework self-supervised because we only require the data itself and our meta-knowledge about it to construct the graph while not depending on explicitly annotated labels.

Note that in the literature, the tuple (x, x⁺) constructed through transformations applied to the same sample is frequently referred to as a positive pair or as related views. Analogously, tuples (x, x⁻) consisting of two different samples, or of a sample and the transform of a different sample, are referred to as negative pairs.
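The self-supervised adjacency matrix can be sketched analogously to the supervised one. `ssl_adjacency` is an illustrative helper assuming every original sample contributes the same number of views, listed consecutively:

```python
import numpy as np

def ssl_adjacency(n_samples, n_views):
    """Adjacency matrix over all n_samples * n_views nodes: two nodes
    are connected (a positive pair) iff they are views of the same
    original sample; all other entries are 0 (negative pairs)."""
    # Node k belongs to original sample k // n_views.
    sample_id = np.repeat(np.arange(n_samples), n_views)
    return (sample_id[:, None] == sample_id[None, :]).astype(int)

# 3 original samples, each with one transformed view -> 6 nodes.
G = ssl_adjacency(3, 2)
# G is block diagonal: one fully connected 2x2 block per sample.
```

The block-diagonal structure makes the analogy to classification explicit: each original sample plays the role of its own tiny class.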

On choosing semantics-preserving transformations

The quality of the self-supervised training signal is highly dependent on the transformations we choose.

Since self-supervised learning is usually used for pre-training, the choice depends not only on our dataset and meta-knowledge but also on the downstream task: the pre-training task that results from our chosen transformations must be a sufficiently good proxy for the actual downstream task. In the context of computer vision, these semantics-preserving transformations are also referred to as augmentations.

When working with image data, it is usually quite easy to determine transformations that do not significantly change sample semantics. One example is the aforementioned rotation, others are scaling, random cropping, adding pixel noise, color jittering, or hiding parts of the image. Nevertheless, context must be considered here as well, since not all transformations necessarily work for all types of image data or downstream tasks. To illustrate, consider the MNIST dataset, which contains handwritten digits, including some sixes and nines. As depicted in Figure 3, when rotated by 180 degrees, the semantics are not preserved for all samples.

Figure 3: Some transformations of an exemplary image from the MNIST dataset. While the addition of noise and resizing preserve the image semantics, i.e. the label “six”, the rotation by 180° does not.

For other data modalities, defining expressive positive or negative pairs might require a little more creativity than for image data. One such example is text: a minor change in the tokens of an input sequence may lead to a major change in the sequence’s semantics. Imagine that we created positive pairs by simply randomly cropping our input sequence. We could easily generate the “positive” pair

(“never give up on something you believe in”, “give up on something”)

which consists of two sentences with very different meanings. One can construct analogous examples for similarly simple transformations, such as random word deletion, synonym substitution, or word reordering. Although many approaches experiment with such transformations, the most successful ones sample positive pairs from document collections (e.g., sentences from the same paragraph or document) or use the same input sentence twice and treat the dropout layers within the language model as semantics-preserving transformations.
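The failure mode above is easy to reproduce. The sketch below uses a hypothetical `random_crop` helper that keeps a random contiguous span of tokens; one of its possible outputs is exactly the misleading "positive" pair from the example:

```python
import random

def random_crop(tokens, crop_len, rng):
    """Naive text 'augmentation': keep a random contiguous span of tokens."""
    start = rng.randrange(len(tokens) - crop_len + 1)
    return tokens[start:start + crop_len]

sentence = "never give up on something you believe in".split()
crop = random_crop(sentence, 4, random.Random(0))
# One possible crop is sentence[1:5], i.e. "give up on something",
# whose meaning contradicts the full sentence -- a bad positive pair.
```

Unlike a rotated bird image, a cropped token span can invert the meaning of its source, which is why naive cropping is a poor augmentation for text.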

The framework of creating self-supervised training signals is more general than only using transformations — we can use any type of information available about our training samples. One example of this would be the case of a video sequence, where we could assume that consecutive frames in a video capture a similar scene and therefore can be considered a positive pair.
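The video example can be sketched without any transformations at all; `temporal_positive_pairs` is an illustrative helper name, and temporal adjacency itself supplies the similarity signal:

```python
def temporal_positive_pairs(frames):
    """Treat consecutive video frames as positive pairs: no augmentation
    needed, since neighboring frames are assumed to show a similar scene."""
    return [(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]

# Frame identifiers standing in for actual video frames:
pairs = temporal_positive_pairs(["f0", "f1", "f2", "f3"])
# -> [("f0", "f1"), ("f1", "f2"), ("f2", "f3")]
```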

Outlook

In this first part of our SSL series, we presented a unifying view on supervised classification and SSL: Both have a training signal that can be expressed as an undirected graph representing the correspondence between samples. The key difference, however, is that in the case of classification, the training signal is the ground truth, while in the case of SSL, we make certain assumptions to obtain the signal. Not only do we assume that the transformations we choose do not alter the semantics of a sample, but we also implicitly assume that all samples in our data set are dissimilar. Although most likely false, the second assumption is necessary to prevent the phenomenon of dimensional collapse, i.e. the trivial solution of predicting all samples to be similar.

In the next part of our SSL series, we will have a closer look at the most commonly used SSL methods and see that there are fundamental differences in how these methods prevent dimensional collapse. Using the graph view illustrated in this post, we will explain how the methods' training objectives relate to their modeling capacity.

References

  • (Balestriero&LeCun 2022) Balestriero, Randall, and Yann LeCun. “Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods.” arXiv preprint arXiv:2205.11508, 2022.
