Self-Supervised Learning (BYOL explanation)

Viceroy · Published in unpack · Feb 1, 2021

Tl;dr – It’s a form of unsupervised learning where we let the AI derive its own labels from the data itself.

tl;dr of BYOL, the most famous Self-Supervised Learning model

Imagine we have a lot of unlabeled images, like those collected by Google Maps. Rather than manually going through every snapshot frame by frame and labeling cars, people, birds and traffic lights, we delegate the task to our self-supervised AI, which autonomously identifies labels by creating a supervised task out of the unlabelled data. That supervised task being: find and label the objects in this image.

Context Encoders: Feature Learning by Inpainting by Deepak Pathak et al.

Another example: given incomplete data, predict what the missing content should be. If given the first half of a sentence, …

Learning Physical Intuition of Block Towers by Example by Adam Lerer et al.

In this first image, we see an approach by Berkeley researchers Deepak Pathak et al., who used a “context encoder” to fill in large missing areas of an image where it cannot get “hints” from nearby pixels. It must create windows somewhere in this blank space, and the top rail of a door.

This is similar to the idea of helping computers understand object permanence, something that develops in infants around the 2 month mark. Also note that pre-training this model is significantly faster than pre-training on ImageNet (14 hours instead of 3 days), but not as accurate (30% semantic segmentation instead of 48%).

And below that we see an example of simulated future outcomes of block towers with energy-based unsupervised learning. In image A, the Facebook AI researchers’ model estimates that the column of blocks will fall backwards towards the north-west. Notice how the blocks become blurrier the more unsure the model is about the outcome: the uncertainty shows up as an averaged, blurred prediction.

Depending on the size of the reinforcement reward (i.e. the probability of the outcome/label being correct, or, as Yann LeCun would frame it, the energy assigned to that outcome), we can let the deep CNN run unsupervised and “attain human-level performance at predicting how towers of blocks will fall”.

This is particularly useful for self-driving cars to “predict the future” without having to crash into people, walls or mountain cliffs to understand what might happen after a hard right.

OK, now that we see its usefulness, let’s understand self-supervised learning better. There are a number of self-supervised methods out there, but none more famous (and powerful) than Google DeepMind’s Bootstrap Your Own Latent (BYOL), which was trained on 512 Cloud TPUs. $$$$$$$$$…

Bootstrap Your Own Latent (BYOL) by Google Deepmind and Imperial College

BYOL achieved higher performance than state-of-the-art contrastive methods without using negative samples. Why is it important that there are no negative samples?

Many questions with arbitrary answers arise when using negative samples. Where do we get the negative samples from? Should they be uniformly sampled? Should we keep a buffer? Should we order them? What is considered “hard enough” (hard negative mining)? These are pernicious dilemmas that have slowed progress since 2010, until BYOL showed, ten years later, that a self-supervised model can perform just as well as supervised models without any negatives.

How does BYOL work?

Bootstrap Your Own Latent (BYOL) by Google Deepmind and Imperial College

There are two neural networks that interact and learn from each other. The first, in blue, is the online network, where θ is the set of parameters that the model learns; ε denotes the target parameters (in the aptly named target network). Find the GitHub implementation here.

Given input x, we apply an augmentation function t (e.g. random crop, horizontal flip, color jitter, blur, and so on) to create “augmented views”: distortions of the original image that do not change its essence (aka its semantics).
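As a rough sketch of what such an augmentation function could look like (using PyTorch/torchvision rather than the authors’ own code, and with illustrative parameters that are not BYOL’s exact recipe):

```python
from torchvision import transforms

# A rough sketch of an augmentation function t using torchvision
# (the specific parameters below are illustrative, not BYOL's exact recipe).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),              # random crop
    transforms.RandomHorizontalFlip(),              # horizontal flip
    transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),     # color jitter
    transforms.GaussianBlur(kernel_size=23),        # blur
    transforms.ToTensor(),
])

# Two "views" of the same image x: v = t(x) and v' = t'(x).
# Because the transforms are random, calling augment twice gives two
# different distortions of the same underlying image.
```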

We map these two slightly different views through two slightly different encoders (e.g. ResNet-50), f_θ and f_ε, to create a representation layer. This is usually where many neural networks stop and an ML model (LR, RF, etc.) is applied to make predictions on this simple vector representation of the input.
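Concretely, a minimal sketch of the two encoders (using torchvision’s ResNet-50; the variable names here are mine, not the paper’s):

```python
import copy
import torch
from torchvision.models import resnet50

# Online encoder f_θ and target encoder f_ε share the same architecture;
# the target starts as a copy of the online network.
f_online = resnet50()
f_online.fc = torch.nn.Identity()     # keep the 2048-d representation
f_target = copy.deepcopy(f_online)

# y_theta     = f_online(v)           # online representation y_θ
# y_prime_eps = f_target(v_prime)     # target representation y'_ε
```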

After each step, we use an Exponential Moving Average of the online parameters (θ) to update the target parameters (ε). That way, ε is a lagging average of θ, an idea that comes from Momentum Contrast (MoCo), where a slowly moving “momentum” encoder provides the targets. The reasoning being that we need a “stable” representation to serve as a target.

Exponential moving average update, ε ← τ·ε + (1 − τ)·θ, given a target decay rate τ ∈ [0, 1]
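A minimal sketch of that update, assuming the two networks from the sketch above and a decay rate tau:

```python
import torch

# Update the target parameters ε as an exponential moving average of the
# online parameters θ: ε ← τ·ε + (1 − τ)·θ.
@torch.no_grad()
def ema_update(online_net, target_net, tau=0.99):
    for p_online, p_target in zip(online_net.parameters(),
                                  target_net.parameters()):
        p_target.mul_(tau).add_((1.0 - tau) * p_online)
```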

So again, we take our two slightly different images and run them through two slightly different encoders to get a vector y_θ and a vector y’_ε.

The projector (g(y_θ) → z_θ) is not strictly necessary to the BYOL architecture. It is primarily helpful for changing the dimensionality (in the paper, from 2048 down to 256 via a 4096-unit hidden layer), or to whatever works best for your model/GPU/budget. We can skip it in most cases.
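A sketch of what g might look like as a small MLP (the 2048 → 4096 → 256 dimensions follow the paper’s description; adjust to taste):

```python
import torch.nn as nn

# Projector g: maps the 2048-d representation y to a smaller projection z.
# Dimensions below follow the paper's MLP (2048 -> 4096 -> 256), but you can
# change them to fit your model/GPU/budget.
projector = nn.Sequential(
    nn.Linear(2048, 4096),
    nn.BatchNorm1d(4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 256),
)
# z_theta = projector(y_theta)
```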

Finally, we look at the prediction layer.

The goal is to find a predictor function q_θ that undoes the effect of the target network’s augmentation (t’) and minimizes the Mean Squared Error (MSE) between the online prediction and the target. Expanding according to the BYOL architecture (and omitting the projectors for brevity), we stop when q_θ(f_θ(t(x))) ≈ f_ε(t’(x)). Note that q_θ must do this without knowing the target parameters (ε) or which augmentation function (t’) was used on the input image (x).

Loss function of BYOL: the Mean Squared Error between the normalized prediction and the normalized target projection, L = ‖q̄_θ(z_θ) − z̄’_ε‖² = 2 − 2·⟨q_θ(z_θ), z’_ε⟩ / (‖q_θ(z_θ)‖₂ ‖z’_ε‖₂), where the bars denote L2-normalization.
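A minimal sketch of that loss in PyTorch (normalize both vectors, then take the squared distance, which reduces to 2 minus twice the cosine similarity):

```python
import torch.nn.functional as F

# BYOL loss: MSE between the L2-normalized online prediction q_θ(z_θ)
# and the L2-normalized target projection z'_ε.
def byol_loss(p_online, z_target):
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()
```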

Then, symmetrize the loss by swapping the roles of the two views, feeding the target network’s view to the online network and vice versa, to get a second loss term. Top it off by minimizing their sum (L^BYOL = L(y_θ) + L(y_ε)) with respect to θ via a stochastic optimization method. Finally, we get the essential dynamics of BYOL!

Minimize L^BYOL = L(y_θ) + L(y_ε) with respect to θ only, but not ε, as depicted by the stop-gradient in the original BYOL architecture above.
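Putting the pieces together, one symmetrized training step might look like the sketch below. It reuses augment, f_online, f_target, projector, byol_loss and ema_update from the sketches above; the predictor and target projector are my own additions here (the paper gives the predictor the same MLP shape as the projector), and the torch.no_grad block plays the role of the stop-gradient:

```python
import copy
import torch
import torch.nn as nn

# Predictor q_θ (same MLP shape as the projector) and the target projector g_ε.
predictor = nn.Sequential(nn.Linear(256, 4096), nn.BatchNorm1d(4096),
                          nn.ReLU(inplace=True), nn.Linear(4096, 256))
g_target = copy.deepcopy(projector)

def training_step(x, optimizer):
    # Schematic; in practice the augmentations usually run in the data loader.
    v, v_prime = augment(x), augment(x)            # two views t(x), t'(x)

    # Online branch: encoder -> projector -> predictor (gradients flow here).
    p1 = predictor(projector(f_online(v)))         # q_θ(z_θ) for view v
    p2 = predictor(projector(f_online(v_prime)))   # swapped view

    # Target branch: no gradients, i.e. the stop-gradient on ε.
    with torch.no_grad():
        z1 = g_target(f_target(v_prime))           # target projection z'_ε
        z2 = g_target(f_target(v))

    # Symmetrized loss, L^BYOL = L(y_θ) + L(y_ε), minimized w.r.t. θ only.
    loss = byol_loss(p1, z1) + byol_loss(p2, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # updates θ
    ema_update(f_online, f_target)                 # ε follows θ by EMA
    ema_update(projector, g_target)                # target projector too
    return loss.item()
```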

This function, q_θ, is therefore trained to estimate the expected value E[f_θ(t(x))] over all augmentations t. In short, it is trained to ignore augmentations and find the underlying semantics of the image, no matter what garbage input it was given.

In human speak, BYOL is saying: “I don’t care if you crop part of it out, make it grayscale, rotate it 60 degrees, or blur it, the dog in this picture is still a dog”.

Conclusion

BYOL is a form of Self-Supervised Learning with the following steps:

  1. input an unlabeled image
  2. augment differently (random crop, rotate, etc.)
  3. run augmented images through separate encoders (ResNet 50)
  4. try to predict the target (ε) with the parameters we can adjust (θ).

This ultimately makes the final representation independent from the augmentations, which means the final prediction can only include things that are not destroyed by the augmentation — we only retain the semantic information.
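Once training is done, a common way to use the result (a sketch of the standard linear evaluation setup, reusing f_online from above) is to throw away the target network, projector and predictor, keep only the online encoder f_θ as a frozen feature extractor, and fit a simple classifier on top:

```python
import torch.nn as nn

# Keep only the trained online encoder f_θ; freeze it and train a linear
# classifier on its 2048-d representations (standard linear evaluation).
f_online.eval()
for p in f_online.parameters():
    p.requires_grad = False

linear_probe = nn.Linear(2048, 1000)     # e.g. 1000 ImageNet classes
# logits = linear_probe(f_online(images))
```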

It is important to remark that self-supervised learning is indeed a subset of unsupervised learning. Unsupervised learning usually refers to the black-box processes of clustering or dimensionality reduction, whereas supervised learning has explicit, traceable targets for regression and classification tasks. By delegating object labeling (and thereafter classification) to the AI, we forgo the one thing that made it supervised learning — manually defining the ground truths according to human judgement.

One of the great things about Self-Supervised Learning is that, by trying to predict part of the input from the input itself, we end up learning about the intrinsic properties / semantics of the object. This is pivotal to identifying objects (e.g. a building) regardless of the distortions we encounter (shadows, blurs, or even obstruction).

We can now approximate AI somewhere within the range of a 2–12 month old infant. It has developed object permanence and shape constancy.

Self-Supervised Learning By Yann LeCun
