Self-Supervised Learning

Merantix Momentum
Merantix Momentum Insights
11 min read · Sep 27, 2022

Part 2: SSL Methods for Computer Vision

Authors: Alexandra Lindt, Konstantin Ditschuneit

Self-supervised learning (SSL) methods have recently achieved state-of-the-art results in several machine learning areas, such as Natural Language Processing (NLP) and Computer Vision (CV). Therefore, we believe that every machine learning engineer or researcher can benefit from understanding the core concepts of SSL. With our blog post series on the topic, we give an introduction to SSL by putting it in relation to regular supervised learning and detailing some of the most commonly used SSL methods.

In our previous blog post, we introduced a unified view on supervised classification and SSL: Both have a training signal that can be expressed as an undirected graph representing the correspondence between training samples. In this second part of our blog post series, we use this training signal to construct a generic SSL training structure. Starting from there, we will introduce some of the most commonly used SSL methods. Due to the number of available methods, we focus on CV-related ones in this blog post. Similar to the previous post, we are particularly interested in unifying properties among seemingly disparate methods and objectives.

The Data Correspondence Matrix as Training Signal

Given a batch of input samples [x₁ᵒ, … , xₙᵒ], an SSL method comprises three steps:

  1. Construct positive pairs (xᵢᵒ, xᵢ⁺) with t(xᵢᵒ) = xᵢ⁺ by applying a semantics-preserving transformation t to every input xᵢᵒ. Please refer to our first blog post for a detailed explanation of how such transformations are selected depending on the training data.
  2. Encode xᵢᵒ and xᵢ⁺ with parameterized encoder networks fᶿ and gᶲ to compute the embedding representations zᵢᵒ ∈ Z and zᵢ⁺ ∈ Z. Note that fᶿ and gᶲ can be the same network or two distinct networks of equal or different sizes; the only restriction is that their outputs lie in the same embedding space Z. We will use K to refer to Z’s dimensionality.
  3. Compute a loss on all embedding pairs (z₁ᵒ,z₁⁺), … , (zₙᵒ,zₙ⁺) and update the encoder networks’ parameters accordingly.

As illustrated in Figure 1, the correspondence between the embedded samples can be expressed as a symmetric matrix G ∈ {0,1}²ⁿ ˣ ²ⁿ. Specifically, for a batch of n positive sample pairs x = [x₁ᵒ, x₁⁺, … , xₙᵒ, xₙ⁺], G is defined as
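$$
G_{kl} \;=\; \begin{cases} 1 & \text{if } k \neq l \text{ and } x_k, x_l \text{ originate from the same input sample } x_i^{o}, \\ 0 & \text{otherwise,} \end{cases}
$$

i.e. G is block-diagonal with one 2 × 2 block of off-diagonal ones per positive pair (a sample is not counted as its own positive).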

G is referred to as the data correspondence matrix and serves as the training signal to optimize the encoder networks’ parameters. Just as the one-entries of G represent a positive correspondence between two samples, the zero-entries can be interpreted as expressing a negative correspondence. The corresponding sample pairs, consisting of either two different input samples or a sample and the transform of another sample, are referred to as negative pairs. Note, however, that this interpretation rests on the assumption that all samples in the input batch are dissimilar to each other, which often does not hold in practice.
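To make the three steps and the resulting training signal concrete, here is a minimal PyTorch sketch that builds a batch of positive pairs, embeds it, and constructs G. The horizontal flip, the linear encoder, and the image shape are illustrative placeholders rather than part of any specific method.

```python
import torch

def correspondence_matrix(n: int) -> torch.Tensor:
    """Data correspondence matrix G for a batch ordered [x1o, x1+, ..., xno, xn+].

    G[k, l] = 1 iff k != l and samples k and l stem from the same input,
    so G is block-diagonal with 2x2 blocks of off-diagonal ones.
    """
    block = torch.tensor([[0.0, 1.0], [1.0, 0.0]])
    return torch.block_diag(*[block for _ in range(n)])

# Illustrative placeholders for the three steps described above.
n, K = 4, 16
x_orig = torch.randn(n, 3, 32, 32)               # batch of input images
t = lambda x: torch.flip(x, dims=[-1])           # stand-in for a semantics-preserving transformation
f = g = torch.nn.Sequential(                     # shared encoder mapping images into Z (dim K)
    torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, K))

x_pos = t(x_orig)                                # step 1: construct positive partners
z_orig, z_pos = f(x_orig), g(x_pos)              # step 2: compute embeddings
G = correspondence_matrix(n)                     # training signal for the loss in step 3
```

How the loss in step 3 uses G, or only its one-entries, is exactly what distinguishes the methods discussed below.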

In the following, we take a closer look at three commonly used SSL methods that differ in how they leverage this training signal. While their loss functions all encourage similar embeddings for positive pairs, they employ different ways to prevent collapse, i.e. the trivial solution of predicting the same embedding for all samples. One is a contrastive method, which considers both positive and negative pairs in its loss function. The other two are non-contrastive, meaning they only consider positive pairs and instead use regularizers on the embedding space to prevent the trivial solution.

Contrastive Learning with SimCLR

A widely used SSL approach that seeks to distinguish between positive and negative sample pairs is called contrastive learning.

In contrastive learning, the encoder networks are trained to predict similar embeddings for samples of positive pairs and dissimilar embeddings for samples of negative pairs.

The latter prevents the encoders from collapsing to the trivial solution of simply outputting the same embedding for all input samples.

In the following, we will look in detail at the popular SimCLR objective (Chen et al. 2020), which is also illustrated in Figure 2. Other contrastive learning approaches follow the same training scheme but differ from SimCLR in the exact loss function that they use. Since contrastive approaches consider both positive and negative pairs, their training objective can be interpreted as learning the full data correspondence matrix G.

Figure 2: The SimCLR training scheme follows the general training scheme of contrastive SSL: Positive pairs are pushed together in the embedding space, while negative pairs are explicitly pushed apart.

In SimCLR, the widely used cosine similarity, hereafter denoted as cossim, is chosen as the similarity measure between embeddings. For a batch of n positive sample pairs x = [x₁ᵒ, … , xₙᵒ, x₁⁺, … , xₙ⁺] and their respective embeddings z = [z₁ᵒ, … , zₙᵒ, z₁⁺, … , zₙ⁺], SimCLR defines the relation matrix Ĝ as
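$$
\hat{G}(Z)_{ij} \;=\; \frac{\exp\!\left(\mathrm{cossim}(z_i, z_j)/\tau\right)}{\sum_{k \neq i} \exp\!\left(\mathrm{cossim}(z_i, z_k)/\tau\right)} \quad \text{for } i \neq j,
$$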

with the temperature parameter 𝜏 > 0. As cossim measures the relative angle between two vectors, the positive-pair entries of Ĝ become maximal if the embeddings of a positive pair point in the same direction while pointing away from all other embeddings. The SimCLR loss function can then be expressed as the cross-entropy loss between the data correspondence matrix G and Ĝ(Z):
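$$
\mathcal{L}_{\mathrm{SimCLR}} \;=\; -\sum_{i=1}^{2n} \sum_{j \neq i} G_{ij}\, \log \hat{G}(Z)_{ij}.
$$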

This loss becomes minimal with Ĝ=G.
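A minimal PyTorch sketch of this objective, assuming the batch layout z = [z₁ᵒ, … , zₙᵒ, z₁⁺, … , zₙ⁺] from above and not following any reference implementation, could look like this:

```python
import torch
import torch.nn.functional as F

def simclr_loss(z_orig: torch.Tensor, z_pos: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Cross-entropy between the correspondence matrix G and the relation matrix G_hat.

    z_orig, z_pos: (n, K) embeddings of the originals and their positive partners.
    """
    n = z_orig.shape[0]
    z = F.normalize(torch.cat([z_orig, z_pos], dim=0), dim=1)  # order [z1o..zno, z1+..zn+]
    sim = z @ z.T / tau                                        # (2n, 2n) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                          # exclude self-similarity from the softmax
    log_g_hat = F.log_softmax(sim, dim=1)                      # row-wise normalization yields log G_hat
    pos_index = torch.arange(2 * n).roll(n)                    # positive partner of sample i sits at i + n (mod 2n)
    return -log_g_hat[torch.arange(2 * n), pos_index].mean()   # average cross-entropy with the one-hot rows of G
```

Because every row of G contains exactly one non-zero entry, the cross-entropy with G reduces to picking out the log-probability of each sample’s positive partner.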

Non-Contrastive Self-Supervised Learning with BarlowTwins

In non-contrastive methods, the embeddings of positive pairs are pushed together, while there is no explicit force pushing apart the embeddings of negative pairs.

Instead, regularizations on the embedding space prevent the encoder from collapsing to the same embedding for all inputs. Well-known examples of non-contrastive SSL methods are BarlowTwins (Zbontar et al. 2021), BYOL (Grill et al. 2020), and SimSiam (Chen&He 2021). The methods mainly differ in their approach to regularizing the embedding space. We will focus on BarlowTwins in the following, but the main takeaways apply just as well to the other methods.

BarlowTwins formulates its loss based on the K × K cross-correlation matrix C. For a batch of embeddings [z₁ᵒ, … , zₙᵒ] and their positive partners [z₁⁺, … , zₙ⁺], the cross-correlation matrix is computed along the K dimensions of the embedding space: its entry Cᵢⱼ contains the cossim between the i-th dimension of the embeddings, zᵒᵣ(i), and the j-th dimension of their positive partners, z⁺ᵣ(j), taken over the batch:
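$$
C_{ij} \;=\; \frac{\sum_{r=1}^{n} z_r^{o}(i)\, z_r^{+}(j)}{\sqrt{\sum_{r=1}^{n} \left(z_r^{o}(i)\right)^{2}}\;\sqrt{\sum_{r=1}^{n} \left(z_r^{+}(j)\right)^{2}}}.
$$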

As apparent from the formula, the diagonal elements of C contain contributions of positive pairs on the same dimension of the embedding space Z. The off-diagonal elements contain the correlation between different dimensions of the embeddings. Using C, BarlowTwins aims at pushing the diagonal elements towards one and the off-diagonal elements towards zero with the objective
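$$
\mathcal{L}_{\mathrm{BT}} \;=\; \sum_{i=1}^{K} \left(1 - C_{ii}\right)^{2} \;+\; \alpha \sum_{i=1}^{K} \sum_{j \neq i} C_{ij}^{2}.
$$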

The first term encourages the alignment of the embeddings of positive pairs, while the second term regularizes the embedding space. By pushing the off-diagonal elements of C towards zero, the second loss term decorrelates the different dimensions of the embedding vectors and consequently reduces the redundancy between them. Figure 3 shows how the loss conceptually fits into the general SSL picture. The hyper-parameter α weighs the two objectives against each other: a low value prioritizes the similarity of positive pairs, while a high value emphasizes redundancy reduction in the embedding space.
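The following PyTorch sketch illustrates the objective; standardizing each embedding dimension over the batch (so that C contains correlations) and the default value of α are choices of this sketch, not prescribed values:

```python
import torch

def barlow_twins_loss(z_orig: torch.Tensor, z_pos: torch.Tensor, alpha: float = 5e-3) -> torch.Tensor:
    """Alignment on the diagonal of C, redundancy reduction on its off-diagonal.

    z_orig, z_pos: (n, K) embeddings of the originals and their positive partners.
    """
    n = z_orig.shape[0]
    # Standardize each embedding dimension over the batch so that C contains correlations.
    z_o = (z_orig - z_orig.mean(dim=0)) / (z_orig.std(dim=0) + 1e-6)
    z_p = (z_pos - z_pos.mean(dim=0)) / (z_pos.std(dim=0) + 1e-6)
    c = (z_o.T @ z_p) / n                                        # (K, K) cross-correlation matrix C
    on_diag = (1.0 - torch.diagonal(c)).pow(2).sum()             # push C_ii towards 1 (alignment)
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push C_ij, i != j, towards 0 (decorrelation)
    return on_diag + alpha * off_diag
```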

Figure 3: The BarlowTwins training scheme. The loss function does not explicitly use negative pairs but encourages decorrelation of the embedding dimensions via the cross-correlation matrix computed between the outputs of the encoders fᶿ and gᶲ.

VICReg

Variance-Invariance-Covariance Regularization (Bardes et al. 2022), or VICReg in short, is another non-contrastive SSL method. While conceptually similar to BarlowTwins, it introduces some intriguing characteristics.

Figure 4 illustrates the different components that make up the VICReg training loss. The invariance loss
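$$
\mathcal{L}_{\mathrm{inv}} \;=\; \frac{1}{n} \sum_{r=1}^{n} \left\lVert z_r^{o} - z_r^{+} \right\rVert_2^{2}
$$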

encourages similar embeddings for positive pairs by penalizing their squared distance. To prevent dimensional collapse, VICReg further regularizes the embeddings of the encoders with two loss terms. The first one is the variance loss, which is defined as
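$$
\mathcal{L}_{\mathrm{var}} \;=\; \frac{1}{K} \sum_{j=1}^{K} \max\!\left(0,\; 1 - \sqrt{\mathrm{Var}\!\left(z^{o}(j)\right) + \epsilon}\right)
$$

(with Var(zᵒ(j)) the variance of the j-th embedding dimension over the batch and ε a small constant for numerical stability)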

and encourages a high variance in all K embedding dimensions individually. The second regularizer, referred to as the covariance loss, is defined as
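$$
\mathcal{L}_{\mathrm{cov}} \;=\; \frac{1}{K} \sum_{i \neq j} \left[\mathrm{Cov}(Z^{o})\right]_{ij}^{2}, \qquad \mathrm{Cov}(Z^{o}) \;=\; \frac{1}{n-1} \sum_{r=1}^{n} \left(z_r^{o} - \bar{z}^{o}\right)\left(z_r^{o} - \bar{z}^{o}\right)^{\top}.
$$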

This loss term minimizes the covariance between the different embedding dimensions, computed over the batch, and thereby encourages the dimensions of the embeddings to be decorrelated from each other.

Both regularizing loss terms Lᵥₐᵣ and L𝒸ₒᵥ are defined over the batch of embeddings [z₁ᵒ, … , zₙᵒ] and have counterparts defined over their positive partners [z₁⁺, … , zₙ⁺], denoted as Lᵥₐᵣ⁺ and L𝒸ₒᵥ⁺ respectively. The complete VICReg loss is then a weighted sum of the individual parts:
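$$
\mathcal{L}_{\mathrm{VICReg}} \;=\; \alpha \left(\mathcal{L}_{\mathrm{var}} + \mathcal{L}_{\mathrm{var}}^{+}\right) \;+\; \beta \left(\mathcal{L}_{\mathrm{cov}} + \mathcal{L}_{\mathrm{cov}}^{+}\right) \;+\; \gamma\, \mathcal{L}_{\mathrm{inv}}.
$$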

The hyper-parameters α, β, and γ weigh the loss components against each other. Large values for α and β prioritize regularization, while a large γ prioritizes the invariance component. This precise control over the training objective is a key property of VICReg that distinguishes it from the other methods discussed here.
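Putting the three terms together, a minimal PyTorch sketch with illustrative (untuned) default weights could look like this:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_orig: torch.Tensor, z_pos: torch.Tensor,
                alpha: float = 25.0, beta: float = 1.0, gamma: float = 25.0) -> torch.Tensor:
    """Weighted sum of variance, covariance and invariance terms.

    z_orig, z_pos: (n, K) embeddings of the originals and their positive partners.
    """
    n, K = z_orig.shape

    def variance_loss(z: torch.Tensor) -> torch.Tensor:
        std = torch.sqrt(z.var(dim=0) + 1e-4)                # per-dimension standard deviation over the batch
        return F.relu(1.0 - std).mean()                      # hinge: push every dimension's std above 1

    def covariance_loss(z: torch.Tensor) -> torch.Tensor:
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)                            # (K, K) covariance matrix over the batch
        off_diag = cov - torch.diag(torch.diagonal(cov))
        return off_diag.pow(2).sum() / K                     # penalize covariance between different dimensions

    inv = (z_orig - z_pos).pow(2).sum(dim=1).mean()          # squared distance of positive pairs
    var = variance_loss(z_orig) + variance_loss(z_pos)       # L_var + L_var+
    cov = covariance_loss(z_orig) + covariance_loss(z_pos)   # L_cov + L_cov+
    return alpha * var + beta * cov + gamma * inv
```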

Figure 4: VICReg: The loss function considers the Euclidean distance between positive pairs and makes no explicit use of negative pairs. Instead, the outputs of each encoder are separately encouraged to have high variance and low covariance.

On the Dimensionality of the Learned Embedding Space

SSL methods are typically used for pretraining models on unlabeled data in order to reduce the amount of labeled data required for transfer learning on downstream tasks. Consequently, performance on downstream tasks is the primary measure of the usefulness of the learned embeddings.

We consider the dimensionality of the embeddings as a proxy for their usefulness in downstream tasks.

For an intuition on why this is the case, consider linear classification on the embedding space: If all embeddings lie on a straight line (i.e. occupy only one dimension of the embedding space), a linear classifier cannot distinguish more than two classes. If the embeddings lie in a two-dimensional plane, a linear classifier can distinguish up to three classes, and so on. With this in mind, a higher-dimensional embedding subspace is likely to be more valuable for downstream tasks.

While all SSL approaches discussed above make use of the same training signal, their decision to either use negative sampling or regularize the embedding space has implications for their learned embedding manifold.

As described earlier in this post, SimCLR and other contrastive methods attempt to approximate the data correspondence matrix G with the matrix Ĝ containing the relationships between the learned embeddings. Since the optimal solution is Ĝ=G, rank(G) is a natural upper bound for the dimensionality of the learned embedding manifold: at the optimum, the embedding manifold’s dimensionality approaches rank(G).

During training with mini-batch gradient descent, the matrix G is constructed per batch, which ties the achievable embedding dimensionality to the batch size n via rank(G) = 2n. This gives an intuition for one of the fundamental challenges of contrastive SSL methods: obtaining high-dimensional embeddings often requires large batch sizes during training.
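As a quick sanity check of this relationship, the rank of the block-diagonal correspondence matrix G sketched earlier can be computed directly:

```python
import torch

n = 256                                            # batch size, i.e. number of positive pairs
block = torch.tensor([[0.0, 1.0], [1.0, 0.0]])     # correspondence within one positive pair
G = torch.block_diag(*[block for _ in range(n)])   # (2n, 2n) data correspondence matrix
print(torch.linalg.matrix_rank(G).item())          # prints 512, i.e. 2n
```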

Non-contrastive methods only consider positive pairs during training, implying that they do not attempt to model the full matrix G. Consequently, the dimensionality of an optimal solution is not necessarily limited by rank(G). Instead, it heavily depends on the regularization strategy employed by the SSL approach.

In the case of BarlowTwins, Balestriero and LeCun (Balestriero&LeCun 2022) find that the covariance minimization term does not incentivize a higher dimensionality: it seems that the cross-correlation matrix does not provide a strong enough regularization during training to prevent the learned embeddings from collapsing onto a manifold of dimensionality rank(G).

Balestriero and LeCun further find that the addition of the variance-encouraging term Lᵥₐᵣ in VICReg helps to learn embedding manifolds of higher dimensionality. Through the explicit separation of its three loss terms, VICReg enables trading fidelity to the given data structure for a higher-dimensional embedding. In general, the embedding space’s dimensionality K is then the only upper limit, i.e. the learned manifold’s dimensionality can exceed rank(G) and is bounded only by K.

While these are exciting results, they should be viewed cautiously and do not imply that any method is preferable in all cases.

Conclusion

Having linked SSL with supervised classification in our previous blog post, in this second part we have used this unified view to introduce the general SSL training structure. We have explained the core difference between contrastive and non-contrastive SSL methods and illustrated it with three widely used SSL methods from the CV domain: SimCLR, BarlowTwins, and VICReg. Without going into the architectural details of these methods, we have demonstrated how their different loss functions and approaches to regularizing the embedding space lead to significant differences in the learned embedding manifolds. We further established that VICReg’s explicit control over the individual loss components sets it apart from the other methods and enables high-dimensional embedding manifolds without requiring large training batch sizes.

As a key source for this blog post has only recently been published (Balestriero&LeCun 2022), we expect further research to close the gap between theoretical research and widespread SSL applications. We further anticipate that self-supervised model pre-training will become more widespread in machine learning domains other than CV. It will be a crucial asset for the application of machine learning in domains where there is a lack of manually created labels.

References

  • (Chen et al. 2020) Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A simple framework for contrastive learning of visual representations.” In International Conference on Machine Learning, 2020.
  • (Bardes et al. 2022) Bardes, Adrien, Jean Ponce, and Yann LeCun. “VICReg: Variance-invariance-covariance regularization for self-supervised learning.” International Conference on Learning Representations, 2022.
  • (Zbontar et al. 2021) Zbontar, Jure, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. “Barlow twins: Self-supervised learning via redundancy reduction.” In International Conference on Machine Learning, 2021.
  • (Balestriero&LeCun 2022) Balestriero, Randall, and Yann LeCun. “Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods.” arXiv preprint arXiv:2205.11508, 2022.
  • (Grill et al. 2020) Grill, Jean-Bastien, et al. “Bootstrap your own latent — a new approach to self-supervised learning.” Advances in Neural Information Processing Systems, 2020.
  • (Chen&He 2021) Chen, Xinlei, and Kaiming He. “Exploring simple siamese representation learning.” In Conference on Computer Vision and Pattern Recognition, 2021.
