The Paradigm Shift of Self-Supervised Learning

Carlos E. Perez
Published in Intuition Machine · May 23, 2019

Photo by Nicolas Cool on Unsplash

“If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake.” — Yann LeCun

By 2016, Yann LeCun had begun to hedge his use of the term “unsupervised learning”. At NIPS 2016, he started referring to it with the even more nebulous term “predictive learning”:

A key element we are missing is predictive (or unsupervised) learning: the ability of a machine to model the environment, predict possible futures and understand how the world works by observing it and acting in it.

I have always had trouble with the use of the term “Unsupervised Learning”. In 2017, I predicted that Unsupervised Learning would not progress much, saying “there seems to be a massive conceptual disconnect as to how exactly it should work” and that it was the “dark matter” of machine learning. That is, we believe it exists, but we just don’t know how to see it.

In 2018, I conjectured that “progress in unsupervised learning will be incremental, but it will be primarily driven by meta-learning algorithms.” Unfortunately, the term “Meta-Learning” had become a catch-all phrase for algorithms that we ourselves did not understand how to create. However, meta-learning and unsupervised learning are related in a very subtle way that I hope to discuss in greater detail in the future.

In early 2019, I changed my tune: it was now high time to discard the notion of “Unsupervised Learning” (UL). That is, there is something fundamentally flawed with our understanding of the benefits of UL. My conclusion was that a change in perspective would be required. The conventional form of UL (i.e. clustering and partitioning) is in fact an easy task. This is because of its divorce (or decoupling) from the downstream fitness, goal or objective function. However, recent successes in the NLP space with ELMo, BERT, and GPT-2 in extracting novel structure residing in the statistics of natural language have led to massive improvements in many downstream NLP tasks that use these embeddings. To arrive at a successful UL-derived embedding, one can employ existing priors that tease out the implicit relationships found in data. These unsupervised learning methods create new NLP embeddings that make explicit the relationships that are intrinsic to natural language.
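To make the idea of reusing such embeddings concrete, here is a minimal sketch (my own illustration, not something prescribed by the work quoted here) that pulls contextual token embeddings from a pretrained BERT checkpoint via the Hugging Face transformers library; a downstream task would simply train a small classifier on top of these vectors:

```python
# A minimal sketch of reusing a self-supervised NLP embedding downstream.
# The `transformers` library and the "bert-base-uncased" checkpoint are
# illustrative choices, not ones made in the article.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Self-supervised learning predicts the data itself.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token, learned without any task-specific labels; training a
# small classifier on top of these is the typical downstream use.
embeddings = outputs.last_hidden_state   # shape: (1, num_tokens, 768)
```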

Geoffrey Hinton expresses this clearly in a recent interview in “Architects of Intelligence”:

If I’m just trying to predict what happens next, that’s supervised learning because what happens next acts as the label, but I don’t need to add extra labels. There’s this thing in between unlabeled data and labeled data, which is predicting what comes next.

Yann LeCun has also begun to notice this paradigm shift when he wrote in his Facebook feed:

I now call it “self-supervised learning”, because “unsupervised” is both a loaded and confusing term.

In a manner inspired by these NLP methods, a self-supervised learning system attempts to predict parts of its input based on the other parts of its input. LeCun further writes:

That’s also why more knowledge about the structure of the world can be learned through self-supervised learning than from the other two paradigms: the data is unlimited, and the amount of feedback provided by each example is huge.
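As a toy illustration of this “predict part of the input from the rest” idea, here is a BERT-style masked-prediction sketch of my own; the model, vocabulary size, and masked position are arbitrary, and the point is only that the training label is carved out of the unlabeled data itself:

```python
# A toy sketch of self-supervision: hide one piece of the input and train a
# model to predict it from the remaining pieces. Purely illustrative.
import torch
import torch.nn as nn

vocab, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab, dim),
                      nn.Flatten(),
                      nn.Linear(dim * 8, vocab))  # logits over the hidden token

tokens = torch.randint(0, vocab, (16, 8))          # unlabeled sequences
masked = tokens.clone()
target = masked[:, 3].clone()                      # the "label" comes from the data itself
masked[:, 3] = 0                                   # hide one position (token 0 plays the [MASK] role)

loss = nn.functional.cross_entropy(model(masked), target)
```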

It’s important not to overlook that LeCun uses the term self-supervised and not the more commonly used term semi-supervised. This is why learning through intervention is so valuable. An agent that interacts with its environment can exploit knowledge of its own actions, which serve as the supervision signal. An agent that interacts while it learns can recognize which parts of the world change as a result of its actions and which remain the same (i.e. it can predict the consequences of its actions). This is why embodied learning is essential for AI. Embodied learning is what enables self-supervision. Embodied learning is learning as if an agent were its own teacher. Schematically, this may look as follows:

Source: https://psyarxiv.com/eh5b6/

It is an interesting coincidence that DeepMind recently (April 10) published a blog post that addresses the same subject. In “Unsupervised learning: the curious pupil”, the DeepMind authors contend that:

the bulk of what is learned by an algorithm must consist of understanding the data itself, rather than applying that understanding to particular tasks.

This implies that there exists a relationship among the inputs, so that other inputs can be predicted when only a partial input is available. Léon Bottou of Facebook contributes additional insight here with his new framework, which is based on two ideas. The first idea is that if you can get rid of all the spurious correlations, then you are left with the ones that hold regardless of context (i.e. the “invariant” ones). The second idea is that data should be separated from the context in which it was collected. In essence, Bottou argues that a more sophisticated approach to disentangling context from data leads to the discovery of richer causal relationships.
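Here is a toy illustration of the first idea (my own simplification, not Bottou’s actual framework): generate the same kind of data in two synthetic “environments” and observe that the causal correlation keeps its sign while the spurious, context-dependent one flips:

```python
# A toy illustration of "keep the correlations that hold regardless of
# context". The data-generating process and environments are invented
# for this sketch.
import numpy as np

rng = np.random.default_rng(0)

def environment(spurious_sign, n=1000):
    cause = rng.normal(size=n)
    target = cause + 0.1 * rng.normal(size=n)                # stable, causal link
    spurious = spurious_sign * target + rng.normal(size=n)   # context-dependent link
    return cause, spurious, target

for name, sign in [("env A", +1.0), ("env B", -1.0)]:
    cause, spurious, target = environment(sign)
    print(name,
          "corr(cause, target)=%.2f" % np.corrcoef(cause, target)[0, 1],
          "corr(spurious, target)=%.2f" % np.corrcoef(spurious, target)[0, 1])

# Only the causal correlation keeps its sign across environments; the
# spurious one flips, which is what separating data from context exposes.
```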

Just a few days ago (May 15), Vincent Vanhoucke, Principal Scientist at Google, blogged about “The Quiet Semi-Supervised Revolution”. Vanhoucke emphasizes:

It’s an exciting time to be revisiting the value of semi-supervised learning in practical settings. Seeing one’s long-held assumptions challenged is a great indicator of the amazing progress happening in the field.

A paradigm shift occurs when one’s long-held beliefs are challenged. Indeed, there is a paradigm shift underway that is changing the very foundations of Deep Learning at its core. Don’t take my word for it; rather, take the word of the more prominent names quoted above: LeCun, Hinton, Bottou, DeepMind, Vanhoucke, etc. This is an earth-shifting development in the Deep Learning field. This could in fact be the approach that we have been waiting for, the one that will get us to the next level in the capability maturity model. Self-supervision permits Deep Learning models to require less data and, as a consequence, can lead toward the capabilities we find in biological intelligence.

A recent paper from DeepMind, “Data-Efficient Image Recognition with Contrastive Predictive Coding” (Olivier J. Hénaff, Ali Razavi, Carl Doersch, S. M. Ali Eslami, Aaron van den Oord), demonstrates the effectiveness of self-supervision. Their approach shows that with as few as 13 examples per class, they achieve results that outperform the original AlexNet and do no worse than a ResNet:

Now compare this graph of experimental data with the one Vanhoucke presented in his recent blog post (note: the colors are switched):

https://towardsdatascience.com/the-quiet-semi-supervised-revolution-edec1e9ad8c
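For readers curious what the contrastive objective behind Contrastive Predictive Coding looks like, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives; the random encoder outputs, batch size, and temperature are illustrative stand-ins, not the paper’s actual architecture or hyperparameters:

```python
# A minimal sketch of an InfoNCE-style contrastive loss with in-batch
# negatives, in the spirit of Contrastive Predictive Coding. Illustrative only.
import torch
import torch.nn.functional as F

def info_nce_loss(pred, target, temperature=0.1):
    """pred: (B, D) context-based predictions; target: (B, D) true future embeddings.
    Each prediction should score its own target higher than every other
    target in the batch (the in-batch negatives)."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(pred.size(0))             # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: random "context" and "future" embeddings stand in for encoder outputs.
pred, target = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce_loss(pred, target)
```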

Sufficiently advanced to be magical? Self-supervision is where the magic will be found.

Demis Hassabis, May 4, 2019 lecture

Further Reading

Exploit Deep Learning: The Deep Learning AI Playbook
