Self-Supervised Learning: Challenges and Next Steps?

Jayant Kumar · Published in The Startup · 3 min read · Sep 13, 2020

Self-supervised learning is a family of methods for learning representations (of images, text, audio, and so on) in which the data itself, rather than human annotation, provides the supervision. One of the main ideas is to force the network to learn useful features by making predictions on so-called pretext tasks.
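To make the pretext-task idea concrete, here is a minimal numpy sketch of one classic pretext task, rotation prediction. The "labels" are generated from the data itself; the classifier that would consume them is omitted.

```python
import numpy as np

def make_rotation_pretext_batch(image):
    """Generate a self-supervised pretext batch: each input is the image
    rotated by 0/90/180/270 degrees, and the 'free' label is the rotation
    index. A network trained to predict the rotation must learn about
    shape and orientation -- no human labels required."""
    inputs = [np.rot90(image, k) for k in range(4)]
    labels = list(range(4))  # supervision comes from the data itself
    return inputs, labels

# usage: a toy 4x4 "image"
img = np.arange(16).reshape(4, 4)
xs, ys = make_rotation_pretext_batch(img)
```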

A robot exploring its environment and learning from trial and error

Recently I had a chance to attend a virtual workshop on self-supervised learning (https://sslwin.org/). Ideas and methods in SSL go back decades, but it is an especially active area of research now, as many researchers are realizing the limits of supervised learning on "fixed datasets". Many prominent researchers in the SSL area gave talks at the workshop, and I thought it would be good to share a summary of, and my view on, this sought-after topic.

Self-Supervision as a Path to a Post-Dataset Era — Alexei Alyosha Efros

Prof. Efros gave a nice overview of the "pre-dataset era" in computer vision, followed by "the dataset era (2000-present)", to emphasize how datasets played a crucial role in speeding up progress in the field: they made it easy to benchmark multiple methods and to know whether a new method was in fact improving the state of the art (SOTA).

He talks about how training on fixed datasets became the norm, and how issues such as "dataset bias" started creeping in.

We are raising a generation of algorithms who can only cram for the test (set). — Alyosha Efros

He then talks about going beyond fixed datasets and proposes continual, online learning, which he calls Test-Time Training. I especially liked his points about "smoothness" being an important constraint for this sort of learning. He also notes that video is a great frontier for self-supervised learning because of its streaming nature. He presents an interesting work on self-supervised learning of visual correspondence using a graph constructed from a "palindrome" of frames.

Self-Supervision & Modularity: Cornerstones for Generalization in Embodied Agents — Deepak Pathak

The next talk, by Deepak, touches on similar points. He focuses first on the question "Why do we need self-supervision?" and then poses three questions about SSL's (1) goal, (2) setup, and (3) efficiency. He attributes the slow progress in robotics and embodied agents to continually changing test data.

Self-supervised intrinsic motivation to explore the environment could be the key in learning skills needed to perform various tasks

By predicting the consequences of its actions and re-training on scenarios where the prediction was bad, the agent is able to continually learn on its own. The disagreement among multiple forward models serves as the notion of curiosity.
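A minimal numpy sketch of disagreement-based curiosity, with toy linear forward models standing in for the learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

class ForwardModel:
    """Tiny linear forward model: predicts next state from (state, action)."""
    def __init__(self, dim):
        self.W = rng.normal(scale=0.1, size=(dim, 2 * dim))
    def predict(self, state, action):
        return self.W @ np.concatenate([state, action])

def disagreement_reward(models, state, action):
    """Intrinsic reward = variance (disagreement) across an ensemble of
    forward models' predictions of the next state. High disagreement
    marks transitions the agent has not yet learned, so maximizing this
    reward drives it to explore them -- no extrinsic reward needed."""
    preds = np.stack([m.predict(state, action) for m in models])
    return preds.var(axis=0).mean()

# usage: a 4-model ensemble in a 3-dim toy state space
models = [ForwardModel(3) for _ in range(4)]
s, a = rng.normal(size=3), rng.normal(size=3)
r_intrinsic = disagreement_reward(models, s, a)
```

Once all the models agree on a transition, the reward there drops to zero and the agent moves on.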

He further discusses recent work on incorporating multi-step "planning" into exploration (ICML 2020). They showed that the agent performs as well as an oracle that knows the rewards in the environment. In another very interesting work, they showed that a robot trained in this manner could imitate a human from a single demonstration. Finally, he discusses the idea of bringing modularity to hardware, where a controller for each limb/motor could be trained and shared across all motors/limbs.

Multi-view Invariance and Grouping for Self-Supervised Learning — Ishan Misra

Ishan talks about two key properties that are important for learning representations: (1) multi-view invariance and (2) grouping. Throughout his talk, he evaluates different approaches against these two properties.

He argues that "pretext"-task-based representation learning (pre-2019) doesn't actually produce semantically meaningful representations. His own work, PIRL, uses contrastive learning (CL) to learn features that are invariant to a pretext task. It achieves "multi-view invariance" to some extent, but it is weak on grouping.

In contrastive learning for SSL, the positives lack the notion of grouping, since each positive is just another view of the same sample.
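To see why, here is a minimal numpy sketch of an InfoNCE-style contrastive loss of the kind these methods use: the only positive for an anchor is another view of the same sample, while every other sample, however semantically similar, lands on the negative side.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor embedding.
    The positive is a different view of the SAME sample -- hence the
    'no grouping' point: semantically similar but distinct samples
    still count as negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # the positive should win the softmax

# usage: an anchor, a slightly perturbed "second view", random negatives
rng = np.random.default_rng(0)
z = rng.normal(size=8)
negs = [rng.normal(size=8) for _ in range(5)]
loss_aligned = info_nce(z, z + 0.01 * rng.normal(size=8), negs)
```

The loss is low when the two views of the same sample agree, regardless of whether any of the negatives are semantically close to the anchor.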

In his work AVID-CMA, the authors used combined video and audio features to group similar segments, introducing "grouping" into the CL loss. He then presents SwAV, an online algorithm that uses a swapped prediction mechanism: it predicts the cluster assignment (codes) of one view from the representation of another view.
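A toy numpy sketch of the swapped-prediction idea for a single pair of views. Note that real SwAV computes the target codes with the Sinkhorn-Knopp algorithm over a batch; a plain softmax stands in for that step here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def swav_swapped_loss(z1, z2, prototypes, temperature=0.1):
    """SwAV-style swapped prediction for one pair of views. Each view's
    representation is scored against shared prototype vectors; view 1
    must predict view 2's cluster assignment and vice versa. (Plain
    softmax 'codes' here are a stand-in for Sinkhorn-Knopp.)"""
    p1 = softmax(z1 @ prototypes.T / temperature)  # view-1 prediction
    p2 = softmax(z2 @ prototypes.T / temperature)  # view-2 prediction
    q1 = softmax(z1 @ prototypes.T)  # stand-in "codes" (targets)
    q2 = softmax(z2 @ prototypes.T)
    # swapped: each view predicts the OTHER view's code
    return -(q2 @ np.log(p1) + q1 @ np.log(p2)) / 2

# usage: 4 prototype vectors in an 8-dim embedding space
rng = np.random.default_rng(0)
C = rng.normal(size=(4, 8))
z = rng.normal(size=8)
loss_same = swav_swapped_loss(z, z, C)                    # identical views
loss_diff = swav_swapped_loss(z, rng.normal(size=8), C)   # unrelated views
```

Because the targets are cluster assignments rather than individual samples, views of different but similar samples can share a code, which is exactly the grouping that plain contrastive positives lack.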

There were other interesting talks in the workshop and I plan to summarize them in the next part of this post.

