Personal Notes of [CVPR 2021 Tutorial] Leave Those Nets Alone: Advances in Self-Supervised Learning

Cathy Chen
14 min read · Dec 31, 2022

--

Tutorial Link: https://www.youtube.com/watch?v=MdD4UMshl1Q

In this tutorial, we focus on self-supervised methods that lead to useful representations, obtained by inventing a pretext task and/or by hiding a part or view of the original data from the network.

Contrastive learning

What is contrastive learning?

  • Take an image and sample a “positive” image and “negative” image(s) in some way. Then try to fit a scoring model such that the positive pair scores higher than any negative pair, i.e., s(x, x+) > s(x, x−).

How does contrastive learning work?

Pull the positive examples closer and push the negative examples farther apart by applying different loss formulations.

Triplet Loss

Try to keep the negative examples at least a margin m farther from the anchor than the positive example.
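
As a minimal sketch (not code from the tutorial), the margin-based triplet loss can be written as follows; the Euclidean distance and the margin value are illustrative choices:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: keep the negative at least `margin`
    farther from the anchor than the positive (Euclidean distances)."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to the positive
    d_neg = np.linalg.norm(anchor - negative)   # distance to the negative
    return max(0.0, d_pos - d_neg + margin)     # zero once the margin is satisfied

# toy embeddings (e.g. L2-normalized network outputs)
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([0.0, 1.0])
print(triplet_loss(a, p, n))  # prints 0.0: the margin constraint is already satisfied
```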

Negative Sampling in Word2Vec

Negative sampling allows only a small part of the weights to be updated for each training sample, which reduces the amount of computation in the gradient descent process.

Words with higher frequency are more likely to be selected as negative words.

  • P(w) represents the probability of retaining a word.
  • f(w) represents the frequency of the word
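
The exact formula from the slide is not reproduced in these notes. word2vec actually uses two related quantities (a probability of keeping frequent words during subsampling, and a sampling distribution for negatives); the sketch below shows only the commonly used negative-sampling distribution P(w) ∝ f(w)^0.75, taken here as an assumption:

```python
import numpy as np

# Hypothetical word frequencies f(w) (raw counts in the corpus)
freqs = np.array([900, 80, 15, 5], dtype=float)  # e.g. "the", "cat", "purrs", "loudly"

# word2vec-style negative-sampling distribution: P(w) proportional to f(w)^0.75.
# The 3/4 power damps very frequent words so that rare words still get sampled.
p_neg = freqs ** 0.75
p_neg /= p_neg.sum()

# Draw 5 negative words for one training pair
negatives = np.random.choice(len(freqs), size=5, p=p_neg)
print(p_neg, negatives)
```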

InfoNCE

Maximizes the mutual information between x and y by extracting the information they have in “common”.

After convergence, the optimal score s(x, y) is proportional to p(y|x)/p(y), regardless of the number of negative samples used in the loss!
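
A minimal numpy sketch of the InfoNCE loss for one anchor with one positive and K negatives; using a dot-product score with a temperature is an assumption here (the tutorial keeps s(x, y) generic):

```python
import numpy as np

def info_nce(x, y_pos, y_negs, temperature=0.1):
    """InfoNCE for one anchor x: cross-entropy of picking the positive y_pos
    out of {y_pos} + negatives, with a dot-product score s(x, y) = x.y / T."""
    candidates = np.vstack([y_pos[None, :], y_negs])    # (1 + K, d)
    scores = candidates @ x / temperature               # s(x, y) for every candidate
    scores -= scores.max()                              # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log-softmax over candidates
    return -log_probs[0]                                # the positive sits at index 0

x     = np.random.randn(128)
y_pos = x + 0.05 * np.random.randn(128)   # another "view" of the same sample
y_neg = np.random.randn(8, 128)           # 8 negatives
print(info_nce(x, y_pos, y_neg))
```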

3 Interpretations of contrastive learning

  1. Geometric

2. Language Modeling

  • If the vocabulary size is too big, “Adaptive Importance Sampling”, “Noise-contrastive estimation”, or “Negative Sampling” can be applied.

Example: NCE (Noise-contrastive estimation)

Given the context c, the model outputs a score for the next word w:

Convert into a logit and use a correction term for the noise distribution:

Use this score in a logistic regression model to classify between the correct word and noise

Similar to a GAN discriminator

After convergence, the model’s (self-normalized) scores match the true data distribution p(w|c).
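
The slides with the exact formulas are not reproduced in these notes; the sketch below reconstructs the standard NCE recipe as an assumption: the model score is corrected by log(k·q(w)) for a noise distribution q and k noise samples, and a logistic classifier separates the real next word from the noise words.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_logit(score_w_c, q_w, k):
    """Model score for word w in context c, corrected by the log-probability
    of w under the noise distribution q (scaled by the k noise samples)."""
    return score_w_c - np.log(k * q_w)

def nce_loss(score_pos, q_pos, scores_noise, q_noise, k):
    """Binary logistic regression: the real next word vs. k words drawn from q."""
    loss = -np.log(sigmoid(nce_logit(score_pos, q_pos, k)))        # label 1: real word
    for s, q in zip(scores_noise, q_noise):
        loss += -np.log(1.0 - sigmoid(nce_logit(s, q, k)))         # label 0: noise word
    return loss

# toy numbers: one positive word and k = 2 noise words
print(nce_loss(score_pos=3.0, q_pos=0.01, scores_noise=[0.5, -1.0], q_noise=[0.2, 0.05], k=2))
```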

NCE can also be used in computer vision:

  • The “vocabulary” is the set of all images
  • We can use a neural network to map pixels to an embedding
  • The candidate (noise) sampling distribution is p(x)

3. Information theoretic

Assume there are 4 classes of images and log P(y) is approximately 50,000 bits.
If we directly train a PixelCNN without negative examples, the loss, log P(y|x), is approximately 49,998 bits (knowing the class only removes log2(4) = 2 bits).
So the model has almost no incentive to learn the class information.

Our only way is to provide negative examples, so that the model has a loss it can actually reduce by learning.

However, a problem remains: “Contrastive losses are lazy.”

MI = log(num_of_class) * n

If the model has already learned to match the top (cat) images correctly (i.e., it has learned those 2 bits), it may still confuse a negative cat sample with the positive one, since the top image is a cat.

Then only roughly 1 in 4 samples is informative.

Contrastive losses

  • Can make very abstract predictions, but don’t need to be very specific
  • Are lazy — Every extra “bit” of mutual information is twice as hard to learn
  • Will spend more capacity on the harder examples
  • This can be a good and a bad thing

We can also analyze this from the perspective of entropy and information content.

Entropy measures the uncertainty, i.e., the expected amount of information (in bits), of a random variable.

In the following graph, since the difference within the first pair is just noise, the entropy is high. And since the Y of the second pair is audio, it carries fewer bits of information.

Augmentations introduce distractors that challenge the model.

The optimal augmentation depends on the downstream task.

Teacher-student approaches

Input Reconstruction

Perturb an image and then train a network to reconstruct the original version

  • Intuition: to do that the network must recognize the visual concepts of the image

One of the earliest methods for self-supervised representation learning
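
As an illustration (not the tutorial’s code), here is a minimal denoising-style reconstruction sketch: the network sees a perturbed input but is trained to reproduce the original; the additive-noise perturbation and the tiny linear encoder/decoder are placeholder assumptions for a real ConvNet.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(x, noise_std=0.3):
    """Corrupt the input (additive Gaussian noise here; masking regions also works)."""
    return x + noise_std * rng.standard_normal(x.shape)

# Tiny linear encoder/decoder standing in for a real network
d_in, d_code = 64, 16
W_enc = 0.1 * rng.standard_normal((d_code, d_in))
W_dec = 0.1 * rng.standard_normal((d_in, d_code))

x = rng.standard_normal(d_in)            # an "image" flattened to a vector
x_tilde = perturb(x)                     # the perturbed view given to the network
recon = W_dec @ np.tanh(W_enc @ x_tilde)

# Reconstruction loss is computed against the ORIGINAL, unperturbed input
loss = np.mean((recon - x) ** 2)
print(loss)
```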

Types of Encoder

  1. Denoising AutoEncoder
  • Easy: no need for semantics, low-level cues are enough

2. Context Encoder

Hard if you don’t recognize the object

  • Requires preservation of fine-grained information and context-aware skills
  • Input reconstruction is too hard and ambiguous
  • Lots of effort spent on “useless” details: exact color, good boundary, etc.

We may try using contrastive learning for input prediction tasks!

Formulates self-supervised tasks in terms of learned representations:

  • Recognize different views of the same image in the presence of distracting negative image views
  • Requires many negative examples, but how do we choose the negatives?
  • Impossible to know whether a sample is actually negative or actually positive (i.e., from the same object)

Hence we need Teacher Student Feature Reconstruction.

Teacher Student Feature Reconstruction

Self-Learning

  • Teacher: generate a target feature vector from a given image
  • Student: predict this target, given as input a different random view of the same image

Goal:

  • Focus on reconstructing high-level visual concepts, free of “useless” image details
  • Enforces perturbation-invariant representations without requiring negative examples

Knowledge Distillation

  • Student: trained to predict the teacher target given the same input image
  • Goal: Distill the knowledge of a pre-trained teacher into a smaller student
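
A minimal sketch of the distillation objective described above, following the common recipe of matching temperature-softened output distributions; the temperature value and the use of KL divergence are assumptions, not details given in the tutorial:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions: the student
    is trained to match the (fixed) pre-trained teacher for the same input."""
    p_t = softmax(teacher_logits, T)   # teacher target (no gradient flows through it)
    p_s = softmax(student_logits, T)
    return np.sum(p_t * (np.log(p_t) - np.log(p_s)))

teacher = [4.0, 1.0, -2.0]   # logits of the large pre-trained teacher
student = [2.5, 1.5, -1.0]   # logits of the smaller student for the same image
print(distillation_loss(student, teacher))
```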

Comparison

  1. Self-Learning
  • No access to a “good” teacher
  • The student must predict the teacher output given a different version of the image
  • The student MUST surpass the initial teacher
  • Both networks are of the same size

2. Knowledge Distillation

  • Access to a “good” teacher
  • (Typically) For exactly the same input, the outputs should match.
  • (Typically) The student is only hoped to reach the teacher
  • (Typically) The student network is smaller

Feature Reconstruction with “static” teacher

Predicting bag-of-words (BoWNet)

Feature reconstruction method defined over high-level discrete visual words:

  • Teacher: extract feature maps + convert them to Bag-of-Words (BoW) vectors
  • Student: must predict the BoW of an image, given as input a perturbed version

Bag-of-(visual)-words

Bag-of-(visual-)words are inspired by NLP.
In computer vision they are used to compute a single image-level descriptor from hundreds to thousands of local patch descriptors.

We can compute a dictionary of local features by clustering (e.g. k-means).

Teacher: BoW target generation

Compute bag-of-words from the “pixels” of the teacher feature map
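
A minimal sketch of this teacher-side step, using a hard nearest-word assignment for simplicity (the actual BoWNet assignment may be softer); the feature-map shape, vocabulary size, and random inputs are placeholders:

```python
import numpy as np

def bow_target(feature_map, vocabulary):
    """Quantize each spatial position of the teacher feature map to its nearest
    visual word and return a normalized bag-of-words histogram.
    feature_map: (H, W, C) teacher features; vocabulary: (K, C) visual words."""
    H, W, C = feature_map.shape
    feats = feature_map.reshape(-1, C)                           # one vector per "pixel"
    # squared distance of every feature to every visual word -> nearest word index
    d2 = ((feats[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                                    # (H*W,) word ids
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                                     # BoW distribution

fmap  = np.random.randn(7, 7, 32)     # e.g. a conv5 teacher feature map
vocab = np.random.randn(128, 32)      # K = 128 visual words (e.g. from k-means)
print(bow_target(fmap, vocab).shape)  # (128,)
```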

Student: BoW prediction

  • Feature extractor Fs: extract a global feature vector from the image
  • BoW prediction: implemented with a fully connected layer followed by softmax
  • Loss: cross-entropy between the predicted softmax BoW distribution and the target BoW

Model initialization and iterated training

1. Start from a self-supervised pre-trained teacher

2. Train the student on the BoW prediction task till convergence (e.g., 100s of epochs)

3. Update the teacher with the new student and repeat the training process (go to step 2)

BoW reconstruction task: enforces the learning of

1. Perturbation invariant representations

2. Contextual reasoning skills: infer words of missing image regions

Limitations of BoWNet

  • Requires pre-training the teacher with another self-supervised method
  • The teacher remains frozen throughout long training cycles
  • Leads to a suboptimal supervisory signal for the student / slow convergence

“Dynamic” teacher-student Feature Reconstruction

Bootstrap Your Own Latent (BYOL)

Feature reconstruction method:

Both inputs are perturbed (two random views of the image).

  • Teacher: extract a target feature vector from a random view of an image

The teacher parameter updates are NOT necessarily in the direction of minimizing the loss (since the gradient is stopped on the teacher side).

  • Student: predict this target, given as input a different random view of the same image

Student has an extra prediction MLP head

Bootstrap idea: builds a sequence of student representations of increasing quality

Similar to the idea mentioned above

  • Given a teacher, train a new enhanced student by predicting the teacher’s features
  • Iteratively apply this procedure by updating the teacher with the new student

The teacher model is NOT frozen!
It is called a “momentum teacher” or “mean teacher”.

Use an exponential moving average (EMA) to update the teacher online at each training step:
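
A minimal sketch of that EMA update; the momentum coefficient (written here as alpha, typically close to 1, e.g. 0.99) is an illustrative value:

```python
def ema_update(teacher_params, student_params, alpha=0.99):
    """Momentum / mean-teacher update, applied at every training step:
    teacher <- alpha * teacher + (1 - alpha) * student.
    alpha = 1 would freeze the teacher; alpha = 0 would copy the student."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

# toy example with scalar "parameters"
teacher = [0.0, 1.0]
student = [1.0, 0.0]
print(ema_update(teacher, student))   # [0.01, 0.99]
```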

Conclusion of BYOL

A mean teacher approach without any labels

  • Offers stable but slowly evolving feature targets
  • More efficient than using a fixed pre-trained teacher that is updated only after the end of each training cycle (as BoWNet does)

Detour: mean/momentum teacher in semi-supervised learning

Mean teachers have been shown to improve the results:

  • Similar to temporal ensembles of the student model but instead of averaging the predictions it averages the model weights
  • More stable and accurate version of the student

Detour: mean/momentum teacher in contrastive learning

BYOL vs Contrastive methods (SimCLR)

  • Unlike the contrastive method SimCLR, BYOL does not require negative examples
  • More robust to the choice of image augmentations and the batch-size
  • Cropping is more important for BYOL and color jittering more important for SimCLR

Question: Why does it avoid feature collapse?

The teacher parameter updates are NOT necessarily in the direction of minimizing the loss, i.e., BYOL does not explicitly optimize the loss w.r.t. the teacher parameters.

Batch Normalization (BN) in BYOL implicitly causes a form of contrastive learning

  • collapse is avoided because all samples in the mini-batch cannot take on the same value after BN

However, according to the BYOL authors, “BYOL works even without batch statistics”,
either by better tuning the network initialization or by replacing BN with Group Normalization and Weight Standardization (GN + WS).

(Hypothesis in BYOL) Collapse is avoided thanks to the student’s prediction head and the EMA teacher: the slowly evolving momentum teacher keeps the student’s predictor near-optimal, which forces the student to encode more and more information within its projected features.

SimSiam: BYOL without the momentum teacher

The teacher is identical to the student (the same network, with a stop-gradient on the teacher branch).

Momentum teacher: improves performance but not necessary for avoiding feature collapse

DINO: Emerging Properties in Self-Supervised Vision Transformers

“momentum update” means “EMA”

No prediction head — post-processing of teacher outputs to avoid feature collapse:

  • Centering by subtracting the mean feature: prevents collapsing to constant 1-hot targets
  • Sharpening by using low softmax temperature: prevents collapsing to a uniform target vector
  • Loss: Cross-Entropy (CE) instead of Mean-Squared Error (MSE)
  • Momentum teacher: avoid collapsing
  • Better without predictor
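
A minimal sketch of the centering + sharpening post-processing and the cross-entropy loss listed above; the temperature values (0.04 for the teacher, 0.1 for the student) and the EMA rate for the center are illustrative assumptions:

```python
import numpy as np

def softmax(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_teacher_targets(teacher_out, center, t_temp=0.04, center_momentum=0.9):
    """Centering (subtract a running mean of teacher outputs) prevents collapse to a
    constant one-hot target; sharpening (low temperature) prevents collapse to a
    uniform target. Returns the targets and the updated center."""
    targets = softmax(teacher_out - center, T=t_temp)                  # (B, K)
    new_center = center_momentum * center + (1 - center_momentum) * teacher_out.mean(0)
    return targets, new_center

def dino_loss(student_out, targets, s_temp=0.1):
    """Cross-entropy between the sharpened teacher targets and the student softmax."""
    log_p_s = np.log(softmax(student_out, T=s_temp))
    return -(targets * log_p_s).sum(-1).mean()

B, K = 4, 16
center = np.zeros(K)
t_out, s_out = np.random.randn(B, K), np.random.randn(B, K)
targets, center = dino_teacher_targets(t_out, center)
print(dino_loss(s_out, targets))
```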

OBoW: Dynamic version of BoWNet

  • Fully online bag-of-visual-words generation
  • Representation learning based on enhanced contextual reasoning

Teacher components: (1) network parameters, (2) visual-words vocabulary

  • BoWNet: offline pre-trained; fixed during student training
  • OBoW: both are updated online together with the student

OBoW avoids feature collapse!

Since the BoW targets are computed using a constantly updated set of randomly sampled local features, OBOW by construction does not suffer from feature collapsing, thus making it robust to the momentum coefficient used for the momentum teacher updates.

α = 1 means the teacher is updated only from its own parameters (i.e., it stays frozen)

Teacher: Queue-based vocabulary from randomly sampled local features

Online updating of queue-based vocabulary. At each training step:

  • Randomly select one feature vector per training image as a visual word
  • Add it to a K-sized queue while removing its oldest item/word
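
A minimal sketch of this queue update; the queue size K, the feature dimension, and the random feature maps are placeholders:

```python
from collections import deque
import numpy as np

K, C = 8, 32                       # vocabulary (queue) size and feature dimension
vocab_queue = deque(maxlen=K)      # the oldest word is dropped automatically when full

def update_vocabulary(teacher_feature_maps):
    """At each training step: pick ONE random local feature per image from the
    teacher's feature maps and push it into the queue as a new visual word."""
    for fmap in teacher_feature_maps:              # fmap: (H, W, C)
        H, W, _ = fmap.shape
        i, j = np.random.randint(H), np.random.randint(W)
        vocab_queue.append(fmap[i, j])             # (C,) randomly sampled local feature

# toy batch of 4 teacher feature maps
update_vocabulary([np.random.randn(7, 7, C) for _ in range(4)])
print(len(vocab_queue), vocab_queue[0].shape)      # 4 words so far, each of dimension C
```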

Student: Dynamic bag-of-visual-word prediction

  • BoWNet: fixed linear prediction layer for BoW prediction
  • OBOW: constantly updated vocabulary → requires dynamic generation of prediction weights.

Representation learning based on enhanced contextual reasoning

  1. Predicting BoW from small crops of the original image

2. Multi-scale BoW reconstruction targets (conv5 and conv4 layers of ResNet)

  • Also using the conv4 further promotes the learning of context-aware features.

Evaluation Experiment

Evaluating ResNet50 self-supervised pre-trained networks

Conclusions

  • Feature “reconstruction” self-supervised methods are gaining increased attention
  • They manage to learn SOTA self-supervised representations without requiring negatives, surpassing even supervised representations
  • However, it’s not entirely clear why they avoid feature collapse

Recent trends: mid-way between contrastive and feature reconstruction

  • “Whitening for self-supervised representation learning”, arXiv 2020
  • “Barlow Twins: self-supervised learning via redundancy reduction”, ICML 2021
  • “VICReg: Variance-Invariance-Covariance Regularization for self-supervised learning”, arXiv 2021

Clustering-style methods can be seen as teacher-student approaches.

Clustering-style self-supervised learning

Use clustering to generate pseudo-labels.

DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features

After obtaining the pseudo-labels, we can train the network in the usual supervised way.
(The inputs can be randomly cropped to encourage invariance.)

Although DeepCluster helps in SSL, the accuracy of the clustering itself is not that high (~45%).
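
A minimal sketch of one DeepCluster round under stated assumptions: scikit-learn’s KMeans stands in for the clustering step, random vectors stand in for features extracted by the current ConvNet, and a linear softmax classifier trained with a few gradient steps stands in for the full supervised training phase with random crops:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for features extracted by the current network over the dataset
features = np.random.randn(1000, 128)

# Step 1: cluster the features; cluster indices become the pseudo-labels
k = 10
pseudo_labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)

# Step 2: train on the pseudo-labels in a supervised way (here: a linear
# softmax classifier on the frozen features, updated by plain gradient steps)
W, lr = np.zeros((k, 128)), 0.1
for _ in range(100):
    logits = features @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(len(features)), pseudo_labels] -= 1.0   # gradient of softmax cross-entropy
    W -= lr * probs.T @ features / len(features)

# Steps 1 and 2 are alternated: re-extract features, re-cluster, re-train.
print("pseudo-label counts:", np.bincount(pseudo_labels, minlength=k))
```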

Limitation of DeepCluster

  • Doesn’t scale.
    For a huge dataset, we can only afford ~2 epochs → the clustering can be refined only once.
  • Doesn’t actually need k-means.
    K-means gives centroids and cluster assignments (the pseudo-labels).
    We only need the assignments, not the centroids.
  • Needs tricks to avoid collapse (a single cluster into which all samples fall)
  • The importance of random cropping is implicit

To overcome these limitations, SwAV was proposed.

SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Pseudo Labels in SwAV
Cluster assignments are obtained by computing the similarity of each feature to the centroids (prototypes).
A neural network head can directly output these scores.

The total assignment received by each prototype within the batch must be (roughly) the SAME → this equipartition constraint prevents collapse.
→ The Sinkhorn-Knopp algorithm is used to adjust the scores.

Since the pseudo-labels are computed per minibatch, the method is scalable.
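
A minimal sketch of the Sinkhorn-Knopp normalization that enforces this equipartition within a minibatch, loosely following the SwAV recipe; the epsilon value and the number of iterations are illustrative choices:

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn a (batch x prototypes) score matrix into soft assignments whose columns
    receive roughly equal total mass, so no single prototype can absorb the whole
    batch (the equipartition constraint that prevents collapse)."""
    Q = np.exp(scores / eps).T                       # (K, B): prototypes x samples
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True); Q /= K    # rows: equal mass per prototype
        Q /= Q.sum(axis=0, keepdims=True); Q /= B    # columns: unit mass per sample
    return (Q * B).T                                 # (B, K): each sample's assignment sums to 1

scores = np.random.randn(16, 8)           # similarities of 16 samples to 8 prototypes
q = sinkhorn(scores)
print(q.sum(axis=1)[:3], q.sum(axis=0))   # rows sum to ~1; column sums are roughly equal
```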

Full Framework:

Also inspired by SimCLR:
For each minibatch, two augmentations (views) of every image are taken, and the cluster assignment computed from one view is predicted from the other (“swapped” prediction).

MultiCrop

Experiment: Linear benchmark on ImageNet

Although DeepCluster-v2 exceeds SwAV, we still prefer SwAV for its scalability and lower risk of collapse.

By 2020, SSL exceeds supervised learning on this benchmark.
However, recent SSL methods are very similar to each other → performance saturates.

So researchers seek progress in an orthogonal direction,
and ViT and DeiT are proposed.

ConvNets & Vision Transformers (ViT)

Recently, Vision Transformers (Dosovitskiy et al. 2020) have emerged as an alternative to ConvNets. (E.g. DINO)

K-NN: the k-nearest-neighbor classification algorithm

1. Determine the value of k
2. Compute the distance between the query sample and every stored sample
3. Find the k nearest neighbors of the query and assign it to the class that is most common among them
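
A minimal numpy sketch of those three steps, as used for evaluating frozen features with a k-NN classifier; the Euclidean distance, k = 5, and the toy feature bank are illustrative choices:

```python
import numpy as np

def knn_predict(query, bank_features, bank_labels, k=5):
    """Steps from the notes: compute the distance to every stored sample,
    take the k nearest, and vote for the majority class among them."""
    dists = np.linalg.norm(bank_features - query, axis=1)   # step 2: distances
    nearest = np.argsort(dists)[:k]                          # step 3: k nearest neighbors
    votes = np.bincount(bank_labels[nearest])
    return votes.argmax()                                    # majority class wins

# toy feature bank: two classes of frozen embeddings
bank = np.vstack([np.random.randn(20, 8) + 2.0, np.random.randn(20, 8) - 2.0])
labels = np.array([0] * 20 + [1] * 20)
print(knn_predict(np.random.randn(8) + 2.0, bank, labels, k=5))   # very likely prints 0
```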

So SwAV is improved into DINO by adding:

  1. EMA (momentum update).
  2. Teacher’s output: centering by subtracting the mean feature, which prevents collapsing to constant 1-hot targets.
  3. Teacher’s output: sharpening.

DINO: (See the detail in last chapter)

Self-Attention visualizations

We look at the self-attention of the [CLS] token of the last block

DINO has good interpretability.

The content after 2:47 hasn’t been finished yet…

Multi-modal approaches

What is next?

Reference
