Most surprising ideas from NeurIPS 2019

Bharat Prabhakar
5 min read · Feb 5, 2020

The Conference and Workshop on Neural Information Processing Systems (abbreviated as NeurIPS and formerly NIPS) is a machine learning and computational neuroscience conference held every December — Wikipedia.

There are a lot of blogs and articles on the internet that do an excellent job of summarizing the work published at NeurIPS (definitely a lot better than I could!). For example, I highly recommend checking this out for excellent notes.

As for this post, I’ve picked papers that I personally found extremely interesting and, to a large degree, very counter-intuitive. These are definitely not theoretically groundbreaking, but they challenge the status quo in a very fascinating manner.

So here are my top four picks:

  1. Training Agents using Upside-Down Reinforcement Learning
  2. Weight Agnostic Neural Networks
  3. Putting an End to End-to-End: Gradient Isolated Learning of Representations
  4. Learning by Cheating

Training Agents using Upside-Down Reinforcement Learning

Summarized Abstract

Traditional Reinforcement Learning (RL) algorithms either predict rewards with value functions or maximize them using policy search. We study an alternative: Upside-Down Reinforcement Learning (Upside-Down RL or UDRL), that solves RL problems primarily using supervised learning techniques. Here we present the first concrete implementation of UDRL and demonstrate its feasibility on certain episodic learning problems. Experimental results show that its performance can be surprisingly competitive with, and even exceed that of traditional baseline algorithms developed over decades of research.

Full Paper

Why’s this interesting?

Personally, I found this to be the coolest idea from the conference (and a really simple one too!). The bottom line is that, at this moment, solving RL problems is incredibly difficult, like really difficult. There are all sorts of issues with existing algorithms, and you can read an excellent article discussing them here — Deep RL Doesn’t Work Yet. The unfortunate truth is that, compared to Supervised Learning, we’re very far from having a really good general-purpose learning algorithm for Reinforcement Learning problems. RL problems are traditionally formulated as Markov Decision Processes (MDPs), and this paper proposes to completely change the learning problem formulation itself, i.e., from MDP → Supervised Learning. Concretely, UDRL turns the reward into an input: the agent is trained, with ordinary supervised learning on its own recorded episodes, to map an observation plus a command like “achieve this return within this horizon” to the action it actually took when that outcome was observed. This could be game-changing, because supervised learning is an extremely well-studied space, and as a result we have incredibly stable learning mechanisms; the clever re-formulation lets us frame an RL problem as a classification problem. It’s too soon to say if this will work for all kinds of problems, but there’s hope. And if it does, we could finally have a family of algorithms that work off the shelf for RL problems. Just as we’re comfortable modelling “classification” problems today, we might finally be able to say the same for “RL” problems.

There is also concurrent work out of Berkeley focusing on the same core idea. In their words: “the method we get from this is not as effective as state-of-the-art RL, but it is simple — just supervised learning conditioned on rewards”.
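
To make the reformulation concrete, here’s a minimal PyTorch sketch of the training loop as I understand it. The BehaviorFunction class, its network shape, and the train_step helper are my own illustrative placeholders, not the paper’s code: the behaviour function maps a state plus a command (desired return, desired horizon) to an action, and is fit with plain supervised learning on segments of recorded episodes.

```python
import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    """Maps (state, command) -> action logits, where the command is the
    desired return and desired horizon. Names and sizes are illustrative."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden),   # +2 for (return, horizon)
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, desired_return, desired_horizon):
        command = torch.stack([desired_return, desired_horizon], dim=-1)
        return self.net(torch.cat([state, command], dim=-1))

def train_step(model, optimizer, states, returns, horizons, actions):
    """Plain supervised learning: predict the action that was actually taken
    in a recorded segment whose observed (return, horizon) matches the command."""
    logits = model(states, returns, horizons)
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At evaluation time, you simply issue the command you want (e.g., “achieve a return of 200 within 100 steps”) and act on the predicted action distribution.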

Weight Agnostic Neural Networks

Summarized Abstract

Not all neural network architectures are created equal, some perform much better than others for certain tasks. But how important are the weight parameters of a neural network compared to its architecture? We propose a search method for neural network architectures that can already perform a task without any explicit weight training. To evaluate these networks, we populate the connections with a single shared weight parameter sampled from a uniform random distribution, and measure the expected performance. We find architectures that can achieve much higher than chance accuracy on MNIST using random weights.

Full Paper · Slides · Blogpost

Why’s this interesting?

Their core thesis is that a neural network’s architecture alone can be powerful enough to solve a problem, irrespective of the values of its weights. And we’ve seen hints of this already: images are processed best by convolutional architectures, and LSTM-like networks work very well on sequential data. But when it comes to tabular data (the kind that most enterprises deal with for various predictions — LTV, churn, etc.), we still resort to standard feed-forward architectures. My hypothesis is that we’re yet to discover the right “architectural block” for this kind of structured/tabular data that will just make Deep Learning magically work. This work reinforces my belief, because the architecture alone clearly seems to make a crucial difference. There’s been some recent work in this direction by researchers at Google, like TabNet, but it’ll be interesting to see what sorts of architectural blocks can be discovered programmatically.
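
As a toy illustration (not the paper’s actual search code), here’s what the evaluation trick might look like in PyTorch: tie every parameter of a candidate network to a single shared scalar and score the architecture by its average performance across several values of that scalar. The fixed grid of weights below stands in for samples from the paper’s uniform distribution, and score_fn is assumed to be whatever evaluation you care about, e.g. accuracy on a held-out batch.

```python
import torch
import torch.nn as nn

def evaluate_architecture(make_net, score_fn,
                          shared_weights=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    """Score an architecture by its average performance when every parameter
    (weights and, for simplicity here, biases) is tied to one shared value."""
    scores = []
    for w in shared_weights:
        net = make_net()
        with torch.no_grad():
            for p in net.parameters():
                p.fill_(w)            # a single shared weight for the whole net
        scores.append(score_fn(net))  # e.g. accuracy on a validation batch
    return sum(scores) / len(scores)

# A stand-in candidate; the paper searches over topologies rather than fixing one.
make_net = lambda: nn.Sequential(nn.Linear(784, 32), nn.Tanh(), nn.Linear(32, 10))
```

An architecture that scores well no matter what the shared weight is must be encoding the solution in its wiring, which is exactly the paper’s point.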

Putting an End to End-to-End: Gradient Isolated Learning of Representations

This paper won the Honorable Mention Outstanding New Directions Paper Award. Check out the list of all the other award-winning papers here.

Summarized Abstract

We propose a novel deep learning method for local self-supervised representation learning that does not require labels nor end-to-end backpropagation but exploits the natural order in data instead. Inspired by the observation that biological neural networks appear to learn without backpropagating a global error signal, we split a deep neural network into a stack of gradient-isolated modules. Each module is trained to maximally preserve the information of its inputs. Despite this greedy training, we demonstrate that the proposed method enables optimizing modules asynchronously, allowing large-scale distributed training of very deep neural networks on unlabelled datasets.

Full Paper

Why’s this interesting?

This work is interesting for the long run. As the size of collected datasets explodes, models are inevitably going to get bigger and deeper, and we’re going to see more and more self-supervised learning (see: pre-trained language models). This algorithm would allow distributed training in the truest sense: you could place each module (one or more layers) on a different system and train them asynchronously, making training mega models (for self-supervised tasks) incredibly scalable and compute-efficient. I’m still surprised that this works!
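
Here’s a hedged sketch of the gradient-isolation mechanism in PyTorch. The per-module loss below is a trivial placeholder (the actual paper maximizes mutual information between nearby patches with an InfoNCE objective); the part that matters is the .detach() between modules, which guarantees that no error signal ever crosses a module boundary.

```python
import torch
import torch.nn as nn

# Three gradient-isolated modules, each with its own optimizer and local loss.
modules = nn.ModuleList([
    nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
    nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
    nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
])
optimizers = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in modules]

def local_loss(z):
    # Trivial placeholder objective; Greedy InfoMax uses an InfoNCE loss here.
    return z.pow(2).mean()

def train_batch(x):
    h = x
    for module, opt in zip(modules, optimizers):
        z = module(h)
        loss = local_loss(z)
        opt.zero_grad()
        loss.backward()      # gradients stay within this module
        opt.step()
        h = z.detach()       # the next module sees a constant, gradient-free input
    return h
```

Since modules exchange only forward activations and never gradients, each one could in principle live on a different machine and update at its own pace, which is what makes the asynchronous, distributed-training story plausible.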

Learning by Cheating

This paper was actually published at the CoRL 2019 conference, but Vladlen Koltun (Chief AI Scientist at Intel) presented it during his talk at one of the NeurIPS workshops.

Summarized Abstract

Vision-based urban driving is hard. The autonomous system needs to both learn to see the world (vision) and learn to act (control). We simplify the problem by decomposing it into two stages. We first train an agent that has access to privileged vision information. This privileged agent cheats by observing the ground-truth layout of the environment and the positions of all traffic participants and only learns how to act. In the second stage, the privileged agent acts as a teacher that trains a purely vision-based sensorimotor agent. The resulting sensorimotor agent does not have access to any privileged information and does not cheat. This two-stage training procedure is counter-intuitive at first, but, for the first time, achieves a 100% success rate on all tasks in the CARLA autonomous driving benchmark.

Full Paper · Explainer Video

Why’s this interesting?

The conventional wisdom these days is to put everything into a single giant neural network and train it end to end in one pass. This paper goes against that: it demonstrates how inefficient that is, and shows an ingenious way of solving the self-driving problem better by breaking it into modular learning stages. I’m very curious to see if this trick can be applied to other large and complex machine learning systems that can be modularised into different components.
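
For intuition, here’s a deliberately simplified sketch of the two-stage recipe; the MLPs, tensor shapes, and MSE loss are my placeholders for the paper’s actual CARLA observations, waypoint outputs, and on-policy supervision. Stage one teaches a privileged agent that already “sees” perfectly, so it only has to learn to act; stage two distils it into a vision-only agent that can query the frozen teacher for a target action at any state it visits.

```python
import torch
import torch.nn as nn

# Placeholder networks: the privileged agent sees ground-truth state,
# the sensorimotor agent sees pixels only. Shapes are illustrative.
privileged = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
vision = nn.Sequential(nn.Linear(3 * 96 * 96, 128), nn.ReLU(), nn.Linear(128, 2))

def stage1_step(opt, state, expert_action):
    """Stage 1: the privileged agent imitates expert actions. Since it is
    handed a perfect view of the world, it only has to learn how to act."""
    loss = nn.functional.mse_loss(privileged(state), expert_action)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def stage2_step(opt, image, state):
    """Stage 2: the vision agent imitates the frozen privileged teacher, which
    can label any visited state on demand (much denser supervision than rewards)."""
    with torch.no_grad():
        target = privileged(state)          # teacher's action for this state
    loss = nn.functional.mse_loss(vision(image.flatten(1)), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The design choice worth noticing is that each stage solves only one of the two hard problems (acting, then seeing), instead of forcing a single network to learn both from sparse driving feedback.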

Originally published at https://bprabhakar.github.io on February 5, 2020.
