Neural Networks Intuitions: 10. BYOL- Paper Explanation

Raghul Asokan
Published in The Startup · 7 min read · Nov 9, 2020

Welcome everyone!

The tenth article in my series “Neural Networks Intuitions” is about an intriguing and efficient technique to learn representations from unlabelled data — Bootstrap Your Own Latent (BYOL) — A New Approach to Self-Supervised Learning.

For some background on transfer learning and self-supervised learning, please check my previous article on Self-supervised Learning and SimCLR.

Let us understand the problem first and later see how BYOL solves it in an elegant way :)

Problem:

Consider the problem of image classification with limited labelled data. Since we all know neural nets are data hungry, training a classifier on such a limited dataset causes the network to overfit the training set, i.e. it does not generalize well to unseen examples.

So how can we solve this problem of overfitting caused by limited data?

Solutions:

  1. Data augmentation
  2. Transfer Learning
  3. Self-supervised Learning

All of the above topics are covered in the article linked at the beginning. Let me give an introduction to self-supervised learning once again and then dive into BYOL!

Self-Supervised Learning:

Self-supervised learning is the technique of learning rich, useful representations from unlabelled data which can then be used for downstream tasks, i.e. used as an initialization that is fine-tuned (either the whole network or only the linear classifier) on limited labelled data.

Recent SOTA self-supervised learning methods are:

  1. SimCLR
  2. BYOL
  3. SwAV
  4. VirTex
  5. Adversarial training

Contrastive Learning:

Now before going into BYOL, we need to understand how contrastive learning (or SimCLR, to be more specific) learns representations from unlabelled data.

Contrastive learning is one of many paradigms that fall under deep distance metric learning, where the objective is to learn a distance in a low-dimensional space that is consistent with the notion of semantic similarity. In simple terms (considering the image domain), it means learning similarity among images such that the distance is small for similar images and large for dissimilar images.

I won’t be getting into the details of Contrastive Learning as I have already written an article on distance metric learning. Check out my previous blog post on Distance Metric Learning!

Gist of the approach(from the above article):

  1. Create similar and dissimilar sets for every image in the dataset.
  2. Pass two images (from a similar/dissimilar set) through the same neural network and extract low-dimensional embeddings/representations.
  3. Compute the Euclidean distance between the two embeddings.
  4. Minimize a loss such that the distance is small for similar pairs and large for dissimilar pairs.
  5. Repeat 1–4 for a large number of pairs (all pairs may be infeasible) until the model converges.

The loss function used here is called the contrastive loss:

The Contrastive Loss Function:

L(Y, X1, X2) = (1 − Y) · ½ · Dw² + Y · ½ · max(0, m − Dw)²

where Dw is the Euclidean distance between the two representations, Y = 0 for a similar pair, Y = 1 for a dissimilar pair, and m is the margin.
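
To make this concrete, here is a minimal sketch of the contrastive loss in PyTorch (the framework, the `margin` value and the 0/1 labelling convention are my own assumptions, not from the article):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    """Classic pairwise contrastive loss.

    label = 0 for a similar pair, 1 for a dissimilar pair (illustrative convention).
    """
    d_w = F.pairwise_distance(emb1, emb2)                    # Euclidean distance Dw
    loss_similar = (1 - label) * 0.5 * d_w.pow(2)            # pull similar pairs together
    loss_dissimilar = label * 0.5 * torch.clamp(margin - d_w, min=0).pow(2)  # push dissimilar pairs apart, up to the margin
    return (loss_similar + loss_dissimilar).mean()
```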

And the architecture where the same network (i.e. sharing the same set of parameters) is used to extract low-dimensional representations for both images in a pair is called a Siamese architecture.

Now, there is one question which may arise with respect to the training procedure!

Why do we need to show dissimilar pairs during training? Why not simply minimize the loss function over a set of similar pairs?

Let us assume that the network is trained with only similar/positive pairs.

Since the network is now only trained to output zero (for similar pairs), it can simply learn a constant function (e.g. map every input to the same embedding, say by setting its weights to 0). This leads to a collapsed solution.

What does this collapsed representation mean?

It means the representations learnt by the network are not ideal and cannot be used for this task of similarity learning, i.e. the network does not learn discriminative features.

Contrastive learning addresses this problem by introducing negative pairs during training (but at the cost of requiring labelled negative pairs).

Great, we have understood what contrastive learning and collapsed representations are and how contrastive learning circumvents this problem by converting prediction(of representations) into discrimination.

But now where does Contrastive Learning fit into Self-supervised Learning?

In a self-supervised learning setting where there are no labels, every image and its augmented view are considered a positive pair, the rest of the images in the batch are treated as negatives, and the network is trained with a (modified) contrastive loss. This is what basically happens in SimCLR, and representations learnt in this manner are much more useful for downstream tasks with few labels.
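
For context, here is a rough sketch of such a modified (SimCLR-style, NT-Xent) contrastive loss in PyTorch; the function name, temperature value and batch layout are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: projected embeddings of two augmented views of the same batch (N x D)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)         # 2N x D, l2-normalized
    sim = z @ z.t() / temperature                              # scaled cosine similarities as logits
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))                 # a view is never compared with itself
    # the positive for view i is its other augmented view (i + n for the first half, i - n for the second)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```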

Note: Contrastive learning can ultimately be seen as a prediction problem where the objective is to predict the representation of a given image, with the target representations provided by the positive and negative images respectively. This perspective is important as it helps in understanding how BYOL works!

Collapsed Representations Problem:

Now that we know what the collapsed representation problem is, let us look at approaches other than contrastive learning to solve it.

Solution: A fixed randomly initialized target network

One approach to solve this problem of collapsed representations is to use a fixed randomly initialized network as a target network and train another network to learn representations.

To make it more clear:

  1. Take a fixed, randomly initialized network and call it the ‘target’ network.
  2. Take a trainable network and call it the ‘online’ network.
  3. Pass an input image through the target and online networks and extract the target and predicted embeddings respectively.
  4. Minimize the distance between both embeddings (e.g. Euclidean distance or a cosine similarity loss).
  5. Repeat steps 3 and 4 for all (unlabelled) images in the dataset.
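
Here is a minimal sketch of this fixed-random-target setup, assuming PyTorch/torchvision, an arbitrary 128-dimensional embedding size and a hypothetical `unlabelled_loader` that yields batches of image tensors:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

online = resnet18(num_classes=128)        # trainable 'online' network
target = resnet18(num_classes=128)        # fixed, randomly initialized 'target' network
for p in target.parameters():
    p.requires_grad = False               # the target is never trained

optimizer = torch.optim.SGD(online.parameters(), lr=0.1)

for images in unlabelled_loader:          # hypothetical loader over unlabelled images
    with torch.no_grad():
        target_emb = target(images)       # target embeddings
    online_emb = online(images)           # predicted embeddings
    loss = F.mse_loss(online_emb, target_emb)   # minimize the distance between embeddings
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```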

Even though this approach does not result in a collapsed solution, it does not by itself produce very useful representations, since we are relying on a random network for the targets.

But it is important to note that even a network trained using this approach achieves 18.8% top-1 accuracy under linear evaluation on ImageNet, whereas the randomly initialized network only achieves 1.4% by itself.

Okay, this is good. But can we do better than this?

Now that we have evidence that this online network is better than the target network, can we iteratively start using this online network as our target network for subsequent steps and continue to train the online network?

Yes, that’s exactly what happens in BYOL :)

Bootstrap Your Own Latent (BYOL):

Approach:

  1. Take two networks with the same architecture: a ‘target’ network (randomly initialized and not updated by gradients) and a trainable ‘online’ network.
  2. Take an input image t and create an augmented view t1.
  3. Pass image t through the online network and image t1 through the target network, and extract the predicted and target embeddings respectively.
  4. Minimize the distance between both embeddings.
  5. Update the target network as an exponential moving average of the previous online networks.
  6. Repeat steps 2–5.
(Image source: https://arxiv.org/abs/2006.07733)

The online network consists of an encoder f, a projector g and a predictor q; this can be seen as a backbone network (encoder f) with fully connected layers (projection + prediction) on top. After training, only the encoder f is used for generating representations.

The loss function used is the mean squared error between the l2-normalized online prediction p (the output of the predictor q) and the l2-normalized target projection z′ (the output of the target network's projector):

L = ‖ p/‖p‖₂ − z′/‖z′‖₂ ‖² = 2 − 2 · ⟨p, z′⟩ / (‖p‖₂ · ‖z′‖₂)

In the paper this loss is also symmetrized by swapping the two views between the online and target networks.
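
A small sketch of this loss, assuming PyTorch tensors of shape (batch, dim):

```python
import torch.nn.functional as F

def byol_loss(online_prediction, target_projection):
    """MSE between l2-normalized vectors; for unit vectors ||p - z||^2 = 2 - 2 * <p, z>."""
    p = F.normalize(online_prediction, dim=-1)
    z = F.normalize(target_projection, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()
```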

After each training step, the following update is made to the target network's weights ξ, using the online network's weights θ and a decay rate τ:

ξ ← τ · ξ + (1 − τ) · θ

i.e. the target network is updated as an exponential moving average of the online network's weights.
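
A sketch of this exponential moving average update for PyTorch modules (the decay rate value is illustrative; the paper uses a value close to 1):

```python
import torch

@torch.no_grad()
def update_target(online, target, tau=0.99):
    """Target weights become an exponential moving average of the online weights."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_(p_online, alpha=1 - tau)
```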

Great! Now where does BYOL fit into Self-supervised Learning?

In a self-supervised learning setting where there are no labels, every image and its augmented view are passed through the online and target networks respectively, and the representations learnt this way can be used for downstream tasks with only a small amount of labelled data. One major advantage of BYOL is that it does not require negative pairs at all, whereas contrastive methods rely on negatives (explicitly labelled in general metric learning, or drawn from the rest of the batch in SimCLR).

Now back to some questions!

How does this approach of using subsequent online networks as the target network produce useful embeddings without causing a collapsed solution?

The fact that a network trained with a randomly initialized network as its target produces better representations than the random network itself (why exactly it is better seems to be an open question :)) serves as the motivation for using subsequent online networks as target networks, and hence for learning good representations.

How and where can we apply BYOL in a real-world setting?

For any image classification dataset with a small amount of labelled data, one can learn useful representations with BYOL from the unlabelled data, use them as initialization and then fine-tune a classifier (either the entire network or just the FC layer) on the limited labelled data, as sketched below.
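
A hedged sketch of this downstream step, assuming PyTorch, a ResNet-18 encoder, a checkpoint file name `byol_encoder.pt`, 10 classes and a `labelled_loader`, all of which are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

encoder = resnet18(num_classes=128)                       # same architecture as the BYOL encoder
encoder.load_state_dict(torch.load("byol_encoder.pt"))    # hypothetical BYOL-pretrained weights
encoder.fc = nn.Linear(512, 10)                           # replace the head with a 10-class linear classifier

# Linear evaluation: freeze everything except the new classifier head.
for name, p in encoder.named_parameters():
    p.requires_grad = name.startswith("fc.")

optimizer = torch.optim.Adam(encoder.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for images, labels in labelled_loader:                    # hypothetical loader over the small labelled set
    logits = encoder(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```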

That’s all in this article on BYOL. I hope you all got a good idea of how BYOL works and how it can serve as a new approach to self-supervised learning.

Do check out this excellent paper tutorial on BYOL by Yannic Kilcher:

https://www.youtube.com/watch?v=YPfUiOMYOEE

Cheers :)
