Self-Supervision and how it changes the way we train AI models.

Vincent
Sogeti Data | Netherlands
10 min read · Jun 13, 2022

Self-supervised learning

In our modern world, data is everywhere. Every day, over 80 years' worth of video is uploaded to YouTube, people send a collective 50 million tweets, and 95 million photos are uploaded to Instagram. Yet, the availability of data is still often the most limiting factor in the development of production machine learning applications. In some cases, we can blame restrictive data access policies; more often, the blame lies in the large costs associated with labeling data.

This is where self-supervised learning can offer a solution. Self-supervised learning enables us to train models that understand our data without having to rely on expensive labeling, thus enabling us to tap into this large, continuously expanding pool of unlabeled data. ImageNet, a large benchmark computer vision dataset, required the labor of almost 50,000 crowdsourced workers to annotate 14 million images and took three years to complete. In comparison, one can relatively cheaply crawl the internet to collect an even larger number of unlabeled images. This enables us to train our models on much larger datasets than we could feasibly label ourselves.

Self-supervised learning?

Before we start, let’s revisit two of the traditional paradigms in machine learning: supervised, and unsupervised learning.

  • In supervised learning, the machine uses certain feature variables to estimate a given set of target variables. The resulting model can then be used to predict these target variables for new data points (given the same set of feature variables). These models rely on labeled data points during training.
  • In unsupervised learning, the machine tries to identify (hidden) relations between data points in an attempt to group similar items together. These algorithms do not require any labeled data points.

Self-supervised learning is a term for algorithms that fit right in between these definitions. As with unsupervised models, self-supervised models do not require their input data to be labeled. Self-supervised models, however, create their own target labels whilst training. The exact mechanism for inferring the labels differs from method to method, though the underlying idea is always that part of the unlabeled data is used, or transformed, to serve as a target label.
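
To make this concrete, consider one classic self-labeling trick for images: rotate each unlabeled image by a random multiple of 90 degrees and ask the model to predict which rotation was applied. The rotation index is a label we get for free. Below is a minimal sketch of my own in PyTorch (the helper name and tensor sizes are made up purely for illustration):

```python
import torch

def make_rotation_batch(images):
    """Turn unlabeled images into self-supervised (input, target) pairs.

    Each image is rotated by 0, 90, 180 or 270 degrees; the rotation index
    becomes the target label, without any human annotation.
    """
    inputs, targets = [], []
    for img in images:                        # img: (C, H, W) tensor
        k = torch.randint(0, 4, (1,)).item()  # pick one of four rotations
        inputs.append(torch.rot90(img, k, dims=(1, 2)))
        targets.append(k)
    return torch.stack(inputs), torch.tensor(targets)

# Any unlabeled image batch works, e.g. eight random 3x32x32 "images".
x, y = make_rotation_batch(torch.rand(8, 3, 32, 32))
# x and y can now be used to train an ordinary four-way classifier.
```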

Pre-training and representation learning

Training a model for its intermediate features (read: knowledge) is called representation learning and is a powerful tool to create models for more concrete machine learning tasks. In layman's terms, modern machine learning models consist of many consecutive stages that build on each other's knowledge. Each stage transforms the input data in such a manner that the next stage is a bit closer to predicting the target label. The final stage in these models is often a very simple classifier, requiring the previous stages to embed as much useful information as possible into the final layer's input.

Often the intended purpose of self-supervised models is not to be the best at their self-supervised objective, but to later use the learned information on related tasks that require a similar understanding of the data. These so-called "downstream" tasks can build upon the knowledge of the pre-trained model and rely on its information-dense representations for further processing. This process is called transfer learning, as the knowledge from one model can be transferred to another task. In many cases, only the last element, the classifier, has to be retrained to get good performance.

Schematic overview of how transfer learning works in relation to the standard ML workflow. (From https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a)

Ever since the first performant ImageNet CNNs came out, people have realized that the features models learn from classifying ImageNet images can be used as a starting point for training models on other datasets. These pre-trained CNNs have already learned to extract meaningful representations from ImageNet's images. Self-supervision provides another method to build strong base models for transfer learning without being limited to labeled data for the pre-training task. A self-supervised language model like BERT can gather a deep understanding of the English language by "reading" the entirety of English Wikipedia. The resulting model is usable for many text-based downstream tasks.
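
As a minimal sketch of this transfer-learning recipe (my example, using torchvision's ImageNet-pretrained ResNet-18 and a hypothetical 10-class downstream task), one can freeze the pre-trained stages and retrain only the final classifier:

```python
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet (the "pre-training" step).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pre-trained stages; we reuse their representations as-is.
for param in backbone.parameters():
    param.requires_grad = False

# Swap only the final classifier for our (hypothetical) 10-class task.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# An optimizer over backbone.fc.parameters() now retrains just the classifier,
# while everything the model learned from ImageNet stays untouched.
```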

How does it work?

Though self-supervised methods are based on supervised principles, the absence of labels prevents us from directly training a classifier or regression model on the data. These methods require some form of self-labeling mechanism to enable the self-supervision. Though there are many strategies, in practice this means the task is to predict one part of the data based on related parts of the data.

We can divide the strategies into three general categories of self-supervision mechanics.

Generative self-supervised models

Generative models are a type of model that can generate new data instances, similar to those found in the training data. Examples you might be familiar with are the models behind the websites ThisPersonDoesNotExist.com or TalkToTransformer.com, both great demonstrations of the generative capabilities of modern AI techniques.

Most generative models can be considered self-supervised, as they are only trained on the data samples themselves, without labels. Conceptually, this is a very straightforward method of self-supervision: use the actual data samples as your model's target. A generative model like an autoencoder ("self"-encoder) receives a data sample as input, compresses this into a condensed embedding and is then tasked to reconstruct the original data sample again. This architecture requires the condensed embedding (representation) to be as informative as possible about the contents of the input.

An example of an image autoencoder. (From Zhang, Yifei, 2018, http://users.cecs.anu.edu.au/~Tom.Gedeon/conf/ABCs2018/paper/ABCs2018_paper_58.pdf)
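
For illustration, here is a minimal autoencoder sketch in PyTorch. The layer sizes are arbitrary choices of mine; the essential point is that the loss compares the reconstruction against the input itself:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress the input into a small embedding, then reconstruct it."""

    def __init__(self, input_dim=784, embed_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, embed_dim))
        self.decoder = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)          # the condensed representation
        return self.decoder(z)       # the attempted reconstruction

model = AutoEncoder()
x = torch.rand(16, 784)              # a batch of unlabeled samples
loss = nn.functional.mse_loss(model(x), x)   # the input itself is the target
```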

Real-world example: GPT

Language modelling, the study of determining the probability of text sequences, is arguably the biggest success story for self-supervised learning. One of the first models to use self-supervision for a large-scale pre-training step was OpenAI's GPT. This generative language model was trained by "reading" a large number of books, after which it could be successfully finetuned for many different downstream tasks. In the researchers' own words, "this approach works surprisingly well"! The model was trained to select which word was most likely to appear after a given beginning of a text, requiring it to effectively condense the whole preceding context for the final word classifier. This condensed representation of sentences proved to be an excellent base for other classifiers as well. With little tuning, a wide variety of derived task-specific classifiers reached state-of-the-art performance on their respective benchmarks.
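
The self-labeling step behind this kind of training is nothing more than shifting the text by one position, so that every token's target is the token that follows it. A rough sketch (the toy token IDs and the tiny stand-in model are mine, purely for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100   # toy vocabulary size

# A deliberately tiny stand-in; a real GPT attends to all preceding tokens.
language_model = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))

tokens = torch.tensor([[12, 7, 55, 31, 9, 2]])  # one tokenized sentence

inputs  = tokens[:, :-1]   # e.g. "the cat sat on the"
targets = tokens[:, 1:]    # e.g. "cat sat on the mat": each label is the next token

logits = language_model(inputs)                 # (batch, seq_len - 1, VOCAB)
loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
```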

The next iterations of GPT, GPT-2 and GPT-3, expanded on this concept by training on much more data and by using roughly 10x and 1000x larger models, respectively. With these newer GPT models, the researchers discovered something even more interesting. These models had seen so much data and learned such rich relations within it that they could perform many downstream tasks without any finetuning. In many cases, it sufficed to prompt the model with the task description in plain language, possibly accompanied by one or more example solutions. The newer GPT models could then provide their answers by finishing the prompts with the continuations they deemed most likely.

GPT-3 Zero, One and Few-shot learning examples (Brown et al., 2020. Language Models are Few-Shot Learners. arXiv:2005.14165)
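
To give an impression of what such prompting looks like, a few-shot prompt could be structured roughly like the snippet below (the task and example reviews are my own invention, not taken from the paper):

```python
# A hypothetical few-shot prompt: the "training examples" live inside the
# prompt itself, and the model answers by simply continuing the text.
prompt = """Classify the sentiment of each review.

review: The food was cold and the staff were rude.
sentiment: negative

review: Absolutely loved the atmosphere, will come back!
sentiment: positive

review: Friendly service and great coffee.
sentiment:"""

# A GPT-style model would be expected to continue with "positive",
# without receiving a single gradient update for this task.
```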

Data-masking self-supervised models

Generative modeling makes for a strong self-supervision mechanism that can work for many different forms of data. This mechanism, however, does not always produce strong representation learners. A model such as GPT can only build representations from left to right, whilst text often has bidirectional relations. More generally, these generative models mainly focus on achieving high reconstruction accuracy and diversity. This requires knowledge of the data, but a higher reconstruction accuracy does not necessarily translate into a better internal understanding of the data.

Luckily, there is a small conceptual tweak we can make that greatly improves our model's representational capabilities. Instead of generating data points resembling the source data, we can also provide the model with a corrupted input sample and task it with reconstructing the missing pieces. In essence, these models are denoising autoencoders. Reconstructing these missing pieces requires the model to have a deep understanding of the context around them. These models still have reconstruction accuracy as their target, but the data corruption makes the task more difficult. A higher corruption rate almost always leads to a lower accuracy, but can make the model learn stronger representations.
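
Compared to the plain autoencoder sketched earlier, the change is tiny: corrupt the input before encoding it, but keep the clean sample as the target. One simple corruption scheme (of many) is randomly zeroing values; the helper below is my own illustration:

```python
import torch

def corrupt(x, drop_prob=0.3):
    """Randomly zero a fraction of the input values: one simple corruption."""
    mask = (torch.rand_like(x) > drop_prob).float()
    return x * mask

x = torch.rand(16, 784)      # clean, unlabeled samples
noisy_x = corrupt(x)
# Training now pairs the corrupted input with the clean target, e.g. with the
# AutoEncoder sketched earlier:  loss = mse_loss(model(noisy_x), x)
```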

Real-world example: BERT

Swiftly following GPT, Google researchers presented a model that resolved one of its major representational limitations: its left-to-right mechanism. They replaced the unidirectional left-to-right modelling subtask with a bidirectional one. This "masked language model" (MLM), called BERT, does not reconstruct texts word-for-word, as GPT does. Instead, 15% of its input tokens are masked, and the model must predict what they originally were. These masked words can be anywhere in the text, and the model can use all the surrounding words as context. Though this reduces BERT's ability to produce novel text from prompts (like GPT), it enables BERT to use context from both before and after the unknown token, greatly improving its representational capabilities. As with GPT, the actual word prediction is a classification problem performed by a simple classifier on top of the actual language model. To optimally perform this self-supervision task, the language model should thus learn to condense as much information about the context into the classifier's input as possible.

The resulting language model beat all other methods (including the original GPT) on many different language understanding benchmarks.

BERT’s token-masking mechanism. (From https://bangliu.github.io/survey/2019/07/01/NLP-Pretraining/)
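
A rough sketch of BERT's self-labeling mechanic: pick roughly 15% of the token positions, replace them with a mask id, and compute the loss only at those positions, with the original tokens as labels. The tiny stand-in model and toy vocabulary below are my own placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID = 100, 0     # toy vocabulary; id 0 plays the role of [MASK]

# Tiny per-token stand-in; a real masked language model attends to the
# surrounding context on both sides of each mask.
mlm = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))

tokens = torch.randint(1, VOCAB, (4, 16))   # a batch of unlabeled token sequences
mask = torch.rand(tokens.shape) < 0.15      # choose roughly 15% of positions

inputs = tokens.clone()
inputs[mask] = MASK_ID                      # corrupt the chosen positions

logits = mlm(inputs)                        # (batch, seq_len, VOCAB)
# The loss is only computed at the masked positions; the original tokens
# there act as the labels.
loss = F.cross_entropy(logits[mask], tokens[mask])
```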

Contrastive self-supervised models

Where both previous methods were generative, contrastive models take a discriminative approach to self-supervision. The generative and data-masking models essentially predict data points from other, directly related data points. But what if we do not have multiple related data points, or the individual data points are not informative by themselves (such as pixels in an image)? That is where our last category of models, contrastive learners, comes in.

Contrastive learners do not try to predict tokens, as data-masking or generative models do, but instead aim to maximize a similarity function between related (positive) samples whilst minimizing this function between other, randomly picked (negative) samples.

For example, most image datasets contain just single images, without any known relation between samples. A contrastive model, however, requires at least one related sample per sample. A model like SimCLR builds these relations itself by performing two random, but different, augmentations of each image. These augmentations heavily distort the source, but still show the same concept. The augmented views are processed by identical networks that output a latent representation for their respective inputs.

Graphical depiction of SimCLR’s contrastive process. (From https://github.com/google-research/simclr)
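
A simplified sketch of the contrastive objective (close to, but not exactly, SimCLR's NT-Xent loss, which also uses the negatives within each view; the encoder and augmentations are left as placeholders):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    """Pull each pair (z1[i], z2[i]) together, push z1[i] away from all
    other samples in z2 (a simplified InfoNCE-style loss)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    similarities = z1 @ z2.T / temperature    # (N, N) cosine similarities
    labels = torch.arange(z1.size(0))         # the match sits on the diagonal
    return F.cross_entropy(similarities, labels)

# In SimCLR, z1 and z2 would be the encoder outputs for two different random
# augmentations of the same image batch; here random vectors stand in for them.
z1, z2 = torch.rand(8, 128), torch.rand(8, 128)
loss = contrastive_loss(z1, z2)
```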

The contrastive concept can also be applied to samples of different modalities. OpenAI CLIP, for example, uses contrastive learning between related samples of text and images. This method is not fully self-supervised, however, as it is trained on matched text-image pairs.

Data Fairness

Earlier in this blog, I mentioned how self-supervision allows our models to train on much larger datasets than could feasibly be built if they required human labeling. Besides being much cheaper to construct, the absence of human labelers removes an important source of induced bias from these datasets. This means that these large-scale unlabeled datasets can also yield fairer models than those trained on human-labeled datasets, given that proper care is taken during data collection.

An example of these fairer models is Facebook AI Research (FAIR)'s large-scale self-supervised SEER model. The model was trained on 1 billion public Instagram posts from across the globe. Though the dataset is, for obvious reasons, biased towards Instagram's demographic, this demographic spans many different ethnicities and cultures, with an even split between genders. The researchers tested the biases in their model by training a simple gender classifier on their pre-trained SEER models and on a model pre-trained on the ImageNet dataset. Both the ImageNet- and SEER-based models performed very well on white males, but the SEER models scored significantly higher when it came to detecting the gender of women and people with a darker skin tone.

Gender retrieval performance of SEER across different demographics. (From https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision/)
Geographical and gender distributions of SEER's pretraining data. (From Goyal, P., et al., 2022, arXiv:2202.08360)

Though FAIR's researchers showed that self-supervised pre-training can result in fairer models, this is sadly not always the case. Training on uncurated internet-scale datasets also brings in internet-scale biases. This is especially impactful for the recent massive self-supervised language models. As anyone who has ever opened Twitter or Reddit knows, the internet contains a lot of polarizing opinions. Machine learning models do not know what is right, only which words are statistically most likely to co-occur in their training data. This results in models like GPT-3 being over three times as likely to complete sentences about Muslims in a violent manner as sentences about Christians (Abid, A., Farooqi, M. & Zou, J., 2021). Whilst one could say the online data represents our society, one could hardly call these outcomes fair.

GPT-3 is over three times more likely to produce violent outputs for prompts about Muslims than about Christians, and even more so compared to Jews, Buddhists or atheists. (From Abid, et al., 2021)

Wrap-Up (My thoughts about the future)

Self-supervision is a powerful method to help our models better understand the world around them. I not only expect self-supervised methods to stay, but also to define how we approach many of our future ML projects. Their ability to free us from the tedious task of manually labelling data enables better transfer learning and less biased data. This allows us to use more, and more varied, data than ever before, in a more powerful way. Will all models eventually be built from a self-supervised base? Probably not. Will it help us build models that are simply not feasible today? In my opinion, most definitely.

Sources

Abid, A., Farooqi, M., & Zou, J. (2021). Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6), 461–463. https://doi.org/10.1038/s42256-021-00359-2
