Paper explained — Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning [BYOL]
BYOL first appeared in the paper "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning" by Grill, Jean-Bastien, et al. in 2020.
One of the most successful approaches to self-supervised learning in computer vision is to learn representations that are invariant to image distortions. However, a recurrent issue with these methods is the existence of trivial constant solutions, known as a collapse. To prevent such problems, researchers tend to use negative samples (as in SimCLR): the distance between the projections of positive pairs (views of the same image) is minimized, while the distance between negative pairs (views of different images) is maximized.
In this paper, the authors propose a new approach to self-supervised image representation learning called BYOL that requires neither memory banks nor specialized architectures, and, interestingly, no negative samples.
Method:
In this method we use:
- Two encoders that take an image as input and output its representation. These two models are referred to as the online encoder and the target encoder. Both encoders share the same architecture: a ResNet-50.
- Two projector networks that project the representations of the encoders into a lower-dimensional projection space. They are also referred to as the online and target projectors.
- A predictor model that takes an online projection as an input and tries to predict the target projection.
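The components above can be sketched with toy stand-ins: random linear maps replace the ResNet-50 encoders and the MLP projectors/predictor, and all names and dimensions below are illustrative, not from the paper. The point is the asymmetry: only the online branch has a predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Stand-in for an encoder/projector/predictor: one random linear map."""
    W = rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ W

# Toy dimensions for illustration only (the paper uses a ResNet-50 with a
# 2048-d representation and 256-d projections).
encode_online, encode_target = linear(32, 16), linear(32, 16)    # f_theta, f_eps
project_online, project_target = linear(16, 8), linear(16, 8)    # g_theta, g_eps
predict = linear(8, 8)                                           # q_theta (online branch only)

v = rng.normal(size=(4, 32))                   # a batch of 4 augmented views
z_online = project_online(encode_online(v))    # online projection z_theta
prediction = predict(z_online)                 # q_theta(z_theta)
z_target = project_target(encode_target(v))    # target projection; no predictor here
print(prediction.shape, z_target.shape)        # (4, 8) (4, 8)
```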
We first sample a batch of images uniformly from the dataset, denoted x, and create two randomly augmented views of each image using random transformation functions; let's call these views v and v'.
The views v and v' are passed into the online encoder fθ and the target encoder fϵ respectively, which output the representations yθ = fθ(v) and y'ϵ = fϵ(v'). These representations are then passed to the online projector gθ and the target projector gϵ to output the projections zθ = gθ(yθ) and z'ϵ = gϵ(y'ϵ).
Finally, the online projection zθ is passed to the predictor qθ in order to predict the target projection z'ϵ. Both qθ(zθ) and z'ϵ are L2-normalized, and the goal is to make qθ(zθ) as close as possible to z'ϵ under the mean squared error:
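Because both vectors are L2-normalized, this mean squared error reduces to 2 − 2 · cos(qθ(zθ), z'ϵ). A minimal NumPy sketch of the loss (per-sample loss, averaged over the batch):

```python
import numpy as np

def byol_loss(prediction, target_projection):
    """Normalized MSE between the online prediction q_theta(z_theta)
    and the target projection z'_eps:
        L = || q_norm - z_norm ||^2 = 2 - 2 * cos(q, z')
    computed per sample, then averaged over the batch."""
    q = prediction / np.linalg.norm(prediction, axis=-1, keepdims=True)
    z = target_projection / np.linalg.norm(target_projection, axis=-1, keepdims=True)
    return np.mean(np.sum((q - z) ** 2, axis=-1))

# Vectors pointing the same way give zero loss; opposite ways give the max of 4.
print(byol_loss(np.array([[1.0, 0.0]]), np.array([[2.0, 0.0]])))   # 0.0
print(byol_loss(np.array([[1.0, 0.0]]), np.array([[-3.0, 0.0]])))  # 4.0
```

Only the direction of the two vectors matters, which is why the normalization step is essential.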
Note that:
- The predictor is only applied to the online network: we want to predict the target projection from the online projection, not the opposite. This asymmetry is one of the main reasons why the method doesn't collapse even without the use of negative views.
- The loss is symmetrized by also separately passing v' through the online network <fθ, gθ, qθ> and v through the target network <fϵ, gϵ>; the total loss is the sum of both terms:
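Written out, the symmetrized objective from the paper is:

$$\mathcal{L}^{\mathrm{BYOL}}_{\theta,\epsilon} = \mathcal{L}_{\theta,\epsilon}(v, v') + \widetilde{\mathcal{L}}_{\theta,\epsilon}(v', v)$$

where the second term is the same normalized MSE with the roles of the two views swapped.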
- At each training step, we perform a stochastic optimization step with respect to the parameters of the online network only. The parameters of the target network are instead updated as an exponential moving average of the online network's parameters:
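The update from the paper is ϵ ← τϵ + (1 − τ)θ, where τ is the target decay rate. A minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.996):
    """Exponential-moving-average update of the target network:
        eps <- tau * eps + (1 - tau) * theta
    Only the online parameters theta receive gradients; the target
    network trails the online one through this update."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

online = [np.array([1.0, 2.0]), np.array([3.0])]   # theta (toy values)
target = [np.array([0.0, 0.0]), np.array([0.0])]   # eps   (toy values)
target = ema_update(target, online, tau=0.9)
print(target)  # [array([0.1, 0.2]), array([0.3])]
```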
Batch size:
Self-supervised image representation learning methods that use negative views suffer a performance drop with small batch sizes, likely due to the reduced number of negative examples. BYOL, on the other hand, doesn't depend on negative sampling, which keeps its performance stable across batch sizes from 256 to 4096.
Image augmentation:
Contrastive methods like SimCLR have been shown to be very sensitive to the choice of image augmentations. For instance, SimCLR's performance degrades sharply when color distortion is removed from the augmentation set and only image crops are used: in that case, the contrastive task can mostly be solved by matching color histograms.
BYOL's performance, on the other hand, is much less affected than SimCLR's when color distortions are removed from the set of image augmentations.
Target decay rate:
The parameters of the target network are an exponential moving average of the online network's parameters, which makes the target a more stable version of the online network:
- If decay_rate = 0: the target network is instantaneously updated with the parameters of the online network, which destabilizes training.
- If decay_rate = 1: the target network is never updated and keeps its initial random weights; training is stable but the representation never improves.
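These two extremes follow directly from the update ϵ ← τϵ + (1 − τ)θ, shown here on toy scalar parameters:

```python
theta, eps = 5.0, 1.0  # toy online / target parameters

tau = 0.0
print(tau * eps + (1 - tau) * theta)  # 5.0: target copies online instantly

tau = 1.0
print(tau * eps + (1 - tau) * theta)  # 1.0: target keeps its initial state
```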
The following table shows how the performance changes with different target decay rates.
All values of decay_rate between 0.99 and 0.999 yield top-1 accuracy above 68.4% at 300 epochs.
The importance of the predictor:
The authors showed that adding a target network to SimCLR improves its performance (+1.6 points), which demonstrates the benefit of a target network in such methods. Performance increases slightly more when SimCLR uses both a target network and a predictor, which is essentially BYOL with negative samples.
Please note that:
- BYOL with negative samples performs worse than the original BYOL without negative samples!
- If we remove either the target network or the predictor, BYOL results in a collapsed solution.