Robot Ants and Traumatized Rats: Highlights from NIPS Days 3/4
By Wednesday morning, things at NIPS were starting to fall into a pattern: scribbling notes on the floor of overfull convention-center rooms, hearing casual discussion of eigenvectors and linear mappings in the hallway, becoming acutely aware of the inadequacies of the NIPS bulletin board app for lunch-meeting coordination.
Two days lie behind, two ahead. Maybe this is just what life will be now, some part of you thinks: this heady mix of exhaustion, exhilaration, and new ideas. And, hey, maybe eventually the conference center will learn to moderate the room temperature correctly.
Robots Learning to Adapt
Pieter Abbeel’s keynote talk began with an almost touching scene: a video of a robot delicately spoon-feeding a human. A moment later, Abbeel noted that this wasn’t an autonomous robot, but rather one controlled by a human operator; the mechanical engineering is there, he said, and now the burden is on the AI community to bring fully autonomous, usefully dexterous robots into being. He broke his talk into six areas and, in each, discussed some directions of potential progress.
- Faster RL: Here he focused specifically on how long it takes an RL algorithm to learn a task, compared to how long it takes a human. From Abbeel’s perspective, if we have to train RL systems on each task individually, we’ll never achieve broadly useful robotic agents. That motivates his work in meta-learning, a subfield of ML that tries to optimize models for the task of learning well across a broad range of tasks, rather than performing well on any single one. One system, called RL², uses the concept of “fast weights” and “slow weights”: fast weights are trained on a single task, while slow weights are optimized over multiple tasks, with their gradient calculated from performance across all tasks, given a certain amount of training on each. Another approach out of Abbeel’s lab, called Model-Agnostic Meta-Learning (MAML), frames the problem more in terms of transfer learning: it optimizes a model’s initial weights so that “fine-tuning” on new problems goes well (see the first sketch after this list). A good overview of these methods can be found here, in a blog post by Chelsea Finn, the primary author on the MAML paper.
- Reasoning over longer time horizons: Many real-world problems involve chains of goals, and time horizons significantly longer than RL has been able to handle effectively so far, in part because of the difficulty of credit assignment and backpropagation through very long sequences of time steps. One suggestion Abbeel brought up here was the notion of subpolicies. For example, an agent might learn the subpolicies “run left”, “run right”, and “jump”, and then build a policy out of successive combinations of those subpolicies. This inherently means learning on different scales: a master policy, updated less frequently, that decides which subpolicy to use, and subpolicies, updated more frequently, that handle the details of how to perform each behavior. The upshot is that the master policy has to backpropagate through far fewer time steps between a reward and the beginning of the chain.
- Efficient Imitation Learning: In some situations, the best way to tell a robot how to behave is by demonstrating the desired behavior, rather than describing it. Although I admit I don’t yet have a clear idea of the mechanisms of the approach, the framing of the problem Abbeel is using in this domain is one-shot imitation learning: trying to design models that can, with a single demonstration of a new task, perform well on the task being demonstrated.
- Lifelong Learning: We may want to design robots that can continually adapt to new environments and maintain plasticity of learning over time. One way to do that is to train robots against “competitors” who also evolve over time, so that we optimize the network toward configurations that allow quick learning in new scenarios. This is very much in line with the idea of self-play that has been so effective in the AlphaGo and AlphaGo Zero systems, but wraps a meta level around that training protocol: training agents over multiple iterations of self-play-over-time, and optimizing toward the ones that perform best over that whole time frame.
- Effectively Leverage Simulation: The real world is expensive, and it’s slow. When training an agent, you may much prefer to train it in a code-simulated world, where episodes can be run and reward accumulated very quickly. One approach here is “domain randomization”: when you’re training to, say, pick up blocks, randomize some irrelevant aspects of the task at each episode (for example, the background color of the walls). The hope is that, by forcing the model to be robust across many configurations of “reality”, you make the jump from the simulated version of the task to the real-world one easier to handle.
- Maximize Signal from Real World Experience: Because, as mentioned, the real world is expensive, it may be inefficient to have a single goal, and to give your agent reward only when it achieves that goal. One alternative is Hindsight Experience Replay, where you parameterize your estimate of an action’s value in terms of (state, action, goal), instead of just state and action. In this framework, assuming you’re learning in an off-policy way (read: one that can deal with the fact that the distribution of behaviors you see was not generated by your current policy), you can relabel the agent’s experience after the fact with respect to the goal it actually achieved (see the second sketch after this list). This way, if a robot brings you a glass of milk instead of coffee, you don’t just say “you failed”; you say, “well, in case you ever need to grab a glass of milk in the future, this is how”.
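To make the meta-learning idea a little more concrete, here is a minimal sketch of the MAML mechanic on toy scalar tasks. The quadratic losses, learning rates, and task distribution are illustrative assumptions of mine, not the paper’s setup; what it does share with MAML is the core move of differentiating through the inner gradient step.

```python
import numpy as np

# Toy MAML: each "task" is to minimize L_t(w) = (w - theta_t)^2 for a
# task-specific optimum theta_t. We look for an initialization w0 such
# that ONE inner gradient step on a new task already performs well.

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.01     # inner ("fast") and outer ("slow") learning rates
w0 = 0.0                    # the shared initialization being meta-learned

for _ in range(2000):
    theta = rng.normal(5.0, 2.0)             # sample a task
    w_fast = w0 - alpha * 2 * (w0 - theta)   # inner step: fast weights
    # Outer gradient: d/dw0 of (w_fast - theta)^2, differentiating THROUGH
    # the inner update; dw_fast/dw0 = (1 - 2 * alpha) for this loss.
    outer_grad = 2 * (w_fast - theta) * (1 - 2 * alpha)
    w0 -= beta * outer_grad

print(w0)   # drifts toward ~5.0, the mean of the task optima
```

The `(1 - 2 * alpha)` factor is where the “gradient through a gradient” enters; in a real network that scalar becomes a full Jacobian of the inner update.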
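And here’s a toy rendering of the hindsight relabeling trick from the last bullet. The 1-D gridworld, the random policy, and the buffer layout are all my own illustrative assumptions; the point is simply that every trajectory yields some rewarded transitions once it’s relabeled with the goal it actually achieved.

```python
import random

GRID = 10  # cells on a line; a "goal" is a cell index

def rollout(horizon=8):
    """Random walk on the line; returns the trajectory and the final state."""
    state, traj = 0, []
    for _ in range(horizon):
        action = random.choice([-1, +1])
        next_state = max(0, min(GRID - 1, state + action))
        traj.append((state, action, next_state))
        state = next_state
    return traj, state

replay = []
for _ in range(100):
    goal = random.randrange(GRID)          # the goal we *wanted*
    traj, achieved = rollout()             # the state we *actually reached*
    for (s, a, s2) in traj:
        # Standard transition: sparse reward w.r.t. the intended goal...
        replay.append((s, a, s2, goal, 1.0 if s2 == goal else 0.0))
        # ...plus a hindsight copy: pretend the achieved state was the goal,
        # turning a "failure" into supervision ("this is how you fetch milk").
        replay.append((s, a, s2, achieved, 1.0 if s2 == achieved else 0.0))

print(sum(r for *_, r in replay), "rewarded transitions out of", len(replay))
```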
Representing the State of the World
Like many of the other spotlight talks, Yael Niv’s talk was interdisciplinary; in her case, she brought a neuroscience and cognitive science lens to the problem of learning encodings that represent the state of the world.
One key thing I took away from her talk was her theory about the way inference about causality can shape our internal representations. She gave a few examples. One experiment involved boxes appearing on screen, each of which contained some number of circles, colored either yellow or red. Subjects would guess how many circles a box contained, and afterwards the true number was revealed. In one condition, the yellow boxes had 65 circles on average and the red 35. In the other condition, yellow was still at 65, but red was at 55. The investigators’ hypothesis, which was borne out, was that in the former case subjects would learn that there really are two different underlying distributions, and their guesses for yellow boxes would average out to 65. By contrast, when the yellow and red means were close together, subjects wouldn’t be confident enough about separate causes to realize there were two different distributions in play, and on that task their guesses for the yellow boxes averaged out to 60, the average of the red and yellow means. In other words, in the first case the subjects learned that keeping track of red vs. yellow was relevant to the task of counting circles, and in the second case, they did not.
Another fascinating experiment along similar lines involved rats, and the question of whether you can get a rat to “unlearn” the link between a bell and a shock, as measured by the rat’s fear response. Somewhat counterintuitively, even if you switch the rat to a regime of bells with no shocks for a long time, a single bell-plus-shock, or even just a shock, can very quickly re-instantiate the fear response connected to the bell. The theory here is that the rat never really unlearned the connection; it just assumed it was now operating in a new regime where that learning was no longer relevant. To test this, Niv’s lab tried gradually scaling down the probability with which a bell led to a shock. And, indeed, they found that in this scenario the rats truly “unlearned” the traumatic association, presumably because they were updating their beliefs about a single regime, as opposed to collecting information about a new regime.
States and Corporations as Artificial Agents
The man I heard speak on Thursday night, David Runciman, was a statistically unusual fellow to find at a conference about statistical methods: he’s a lecturer in politics at Cambridge.
His talk argued that, instead of focusing only on Artificial Intelligences, we ought also to be thinking about Artificial Agents, which he defined as non-human entities with decision-making powers and long spans of temporal continuity. His canonical examples of such agents were states and corporations.
He further suggested that, much as we fear a coming digital singularity, we have in a meaningful way already seen a “singularity” with regard to artificial agents, one that facilitated the exponential growth of the industrial revolution. The association between the rise of modern states and corporations and economic growth rests on the premise that these entities could build projects of a scale and complexity no single human ever could. Corporations and states can take on risk (because people expect them to be around to pay their debts more reliably than any single human), make long-term plans, and organize human effort in ways that weren’t possible before.
Even more saliently, the objectives of these actors (gaining profit and obeying market incentives, in the corporate case; gaining territory and prestige, in the state case) can be powerfully at odds with human flourishing, because these actors often aren’t designed to account for negative side effects (or, in economics-speak: externalities).
Runciman’s “so what” from this analogy: in the several hundred years during which states and corporations have become the dominant actors in our lives, shaping individual humans’ lives in dramatic ways, we haven’t really figured out how to regulate them effectively. How do you effectively punish a state, or a corporation? How do you shape their behavior? These are questions to which years and years of regulation and international-relations literature have not produced a convincing answer, although democratic control of states is probably one of the best solutions tried so far.
It’s certainly true that corporations are technically very different from states, and insofar as you care about the technical minutiae of things like value alignment, the analogy becomes less useful. But I think it has more relevance than we typically give it credit for.
Other Interesting Ideas
- A way to compress gradients to fewer bits, without performance loss: To train bigger models, you often need to train in a distributed way; for really big models with many, many parameters, the gradient updates you send to a central parameter server can become the performance bottleneck. This paper suggests taking each gradient value (a real-valued number) and probabilistically mapping it to {-1, 0, +1}, which you can represent far more efficiently (a toy version of the trick appears after this list).
- A possible explanation for worse results on bigger batch sizes: It’s a known, though not particularly well understood, fact in deep learning that training with large batch sizes generalizes less well than with small ones, which is unfortunate from a parallelization perspective. This paper suggests the batch size itself is less the issue than the resulting smaller number of discrete gradient updates (since updates per epoch = training set size / batch size: a million examples at batch size 100 gives 10,000 updates per epoch, but at batch size 10,000 only 100), and random-walk dynamics suggest you need a minimum number of distinct gradient updates to converge well.
- A GAN that avoids mode collapse by using Bayesian methods: The key characteristic of Bayesian methods is that everything is a distribution rather than a point estimate. This GAN learns a distribution over generator parameters rather than a single set of weights, which allows the distribution the generator produces to represent multiple modes, rather than one single example (a common failure mode of GANs).
- Two papers that address the problem of neural nets that take sets as input. This is relevant whenever you want a model that can take an arbitrary number of inputs whose order may vary; a very simple example is taking the sum of an arbitrary number of inputs. Both systems operate on the principle of learning functions applied to pairwise combinations of input elements, and then learning from the (transformed) sum of those pairwise combinations. As best I can tell, the difference between Relational Nets and Deep Sets is that the latter also includes the representation of each individual element, along with the pairwise sum, in its final output layer. This means you can learn characteristics of an individual in the context of a set, rather than just shared properties of the entire set (a minimal permutation-invariant sketch appears after this list).
- A paper suggesting you can get uncertainty estimates without Bayesian methods: It uses an ensemble of “vanilla” neural nets, plus training on adversarial examples, and shows that the ensemble is successfully less confident on examples unlike those it has seen, a desirable property for a model’s uncertainty estimates (a toy analogue appears below).
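Here’s a back-of-envelope version of the gradient-compression trick from the first bullet, as I understand it: ternarize each coordinate stochastically so the compressed gradient still equals the true gradient in expectation. The single max-based scale below is a simplification of mine; the paper’s layer-wise details are omitted.

```python
import numpy as np

def ternarize(grad, rng=np.random.default_rng()):
    """Unbiased stochastic quantization of a gradient to {-1, 0, +1} * scale."""
    scale = np.abs(grad).max()
    if scale == 0.0:
        return grad.copy()
    # Keep coordinate i (as sign(g_i) * scale) with probability |g_i| / scale.
    keep = rng.random(grad.shape) < np.abs(grad) / scale
    return scale * np.sign(grad) * keep

g = np.random.randn(5)
avg = np.mean([ternarize(g) for _ in range(20000)], axis=0)
print(np.round(g, 3))
print(np.round(avg, 3))   # close to g: the estimator is unbiased
```

Each worker then only needs to ship two bits per coordinate plus one float for the scale, instead of a full 32-bit float per coordinate.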
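For a feel of what permutation invariance buys in the set-input papers, here is a minimal untrained network in the sum-pooling mold: encode each element, pool with a sum, decode the pooled vector. (I’ve used per-element encodings only, leaving out the pairwise-combination machinery described above for brevity, and the weights are random, so this shows the invariance property rather than learned behavior.)

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(1, 16))    # per-element encoder phi
W_rho = rng.normal(size=(16, 1))    # set-level decoder rho

def set_net(xs):
    phi = np.tanh(xs[:, None] @ W_phi)   # apply phi to each element
    pooled = phi.sum(axis=0)             # sum-pooling: order cannot matter
    return float(pooled @ W_rho)         # rho on the pooled representation

xs = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
print(set_net(xs), set_net(xs[::-1]))    # identical for any permutation
```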
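And a toy analogue of the non-Bayesian-uncertainty idea, with bootstrapped cubic fits standing in for the paper’s adversarially trained neural nets: the ensemble members agree where there is data and fan out where there isn’t, and that spread serves as the uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=40)
y = np.sin(x) + rng.normal(0, 0.1, size=40)

# "Train" an ensemble: each member sees a different bootstrap resample.
fits = []
for _ in range(20):
    idx = rng.integers(0, len(x), size=len(x))
    fits.append(np.polyfit(x[idx], y[idx], 3))

for query in [0.0, 5.0]:   # in-distribution vs. far outside the data
    preds = [np.polyval(f, query) for f in fits]
    # the std (the model's "doubt") is tiny at 0.0 and huge at 5.0
    print(query, round(np.mean(preds), 3), round(np.std(preds), 3))
```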
Quote of the Day(s): “[if you need a model built] You could get a human expert in machine learning, but that might not work; from what I hear, all of us are at corporate parties all the time” — A session on self-tuning neural networks
Mood of the Day(s): “NIPS is a marathon, not a sprint”