Do Androids Dream of Success?

Defining a machine’s dream sequence to improve learning

Ni Lao
Mosaix
Mar 1, 2019


Image courtesy of Robot Dreams (The Robot Series), Byron Preiss Visual Publications, 2012

Reinforcement Learning with Optimized Dreaming

At Mosaix, we build artificial intelligence to better understand human language. Recently, we’ve been looking at the impact of dreams, or optimized memory replay strategies, on learning and “success.” The results, published in our latest research paper, have led to faster training and higher accuracy.

Learning Through Reinforcement. Humans A+ : Androids C-.

Growing up, we have all been molded by some form of reinforcement learning (RL) or learning through feedback in the form of rewards and punishments. This positive/negative reinforcement is used in methods such as house training a puppy. If your puppy goes to the bathroom outside, you give him a doggie treat to reward him and reinforce the behavior. However, if he goes in your living room, then no reward is provided because you don’t want a repeat performance of that particular behavior.

Reinforcement learning applied to house training a puppy. Image Credit: @mrdbourke

There has been a recent surge in applying RL to various domains, including program synthesis, dialogue generation, deep architecture search, Atari games, and continuous control. However, even though current trial-and-error training procedures rely on a huge number of attempts, performance is still hit or miss. In contrast, humans can reliably learn from only a few examples.

“Whenever someone asks me if reinforcement learning can solve their problem, I tell them it can’t. I think this is right at least 70% of the time,” said Alex Irpan from Google Brain Robotics in his blog post. AlphaGo “was an unambiguous win for deep RL, and that doesn’t happen very often… It’s disappointing that deep RL is still orders of magnitude above a practical level of sample efficiency.” For example, the best learning algorithm (DeepMind RainbowDQN) “passes median human performance on 57 Atari games at about 18 million frames (around 90 hours) of gameplay, while most humans can pick up a game within a few minutes.”

Image courtesy of Alex Irpan from Google Brain Robotics

Classical textbooks (such as Sutton & Barto) introduce RL algorithms that use “on-policy” or real-time optimization: the agent’s model only gets updated while it is interacting with the environment according to the current model. This “learning through practice” strategy, though mathematically correct, is data-inefficient and often fails to find good solutions. It also leads to the “classic exploration-exploitation problem that has dogged reinforcement learning since time immemorial. Your data comes from your current policy. If your current policy explores too much, you get junk data and learn nothing. Exploit too much and you burn-in behaviors that aren’t optimal.”
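To make the on-policy constraint concrete, here is a minimal REINFORCE-style sketch in PyTorch. The environment `env` and the tiny policy network are illustrative assumptions, not from any paper we cite; the point is simply that every gradient update uses only trajectories just collected under the current policy, and nothing is ever replayed.

```python
# Minimal on-policy REINFORCE sketch (assumes a hypothetical gym-style `env`
# with reset()/step() and a small policy network; illustrative only).
import torch
import torch.nn as nn
from torch.distributions import Categorical

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode(env):
    """Collect one trajectory under the *current* policy; old data is never reused."""
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    return log_probs, rewards

def reinforce_update(log_probs, rewards, gamma=0.99):
    # Compute discounted returns, then apply the REINFORCE gradient estimator.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```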

Experience Replay and Its Past Limitations

Naturally, we might think that learning through reflection on past experiences (or “experience replay” in RL) should improve the data efficiency and stability of learning. This strategy has been widely adopted in various deep RL algorithms, but its theoretical analysis and empirical evidence are still lacking.

For example, studies of various RL tasks (e.g., Atari games, MuJoCo tasks, physical robot arms, and the DeepMind control suite) all set their replay buffer size to 10^6. However, a recent study shows “that both a small replay buffer and a large replay buffer can heavily hurt the learning process.” This is because the algorithm does not consider the usefulness of past experiences and deems them all equally important, which is inefficient and hinders learning.
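For illustration, here is a minimal sketch of the kind of uniform replay buffer discussed above (our own illustrative code, not taken from any of the cited papers): every stored transition is equally likely to be replayed, no matter how informative it is.

```python
# A minimal uniform replay buffer: all experiences are treated as equally important.
import random
from collections import deque

class UniformReplayBuffer:
    def __init__(self, capacity=10**6):        # mirrors the common 10^6 setting cited above
        self.buffer = deque(maxlen=capacity)   # oldest experiences are silently evicted

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling: a rare high-reward transition is no more likely to be
        # replayed than any of the thousands of uninformative ones.
        return random.sample(self.buffer, batch_size)
```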

Expanding on our puppy house-training example: in the early days of training, if the puppy spends most of his time in the house, he does not distinguish between the thousands of no-reward experiences indoors and his few positive experiences outdoors. All of his experiences are valued and replayed equally, which can be overwhelming and hurt the learning process.

Brain overloaded with experiences. Experience replay with no optimization.

But we know that dogs and humans don’t weight all experiences equally. How do we prioritize and sample past experiences to optimize learning?

Mammals Learn from Past Experiences by Dreaming

In the early 2000s, scientists from MIT discovered that animals have complex dreams and are able to retain and recall long sequences of events while they are asleep. While Rapid Eye Movement (REM) replay typically lasts several minutes and plays back experiences at approximately real-time speed, Slow Wave Sleep (SWS) replay is intermittent and brief, with each episode compressing the behavioral sequence approximately 20-fold. Related work on humans also suggests that the amount of REM and SWS sleep is correlated with subsequent performance improvements on learned tasks.

One fundamental question that arises is: “Given that animals accumulate numerous experiences daily, which experiences should they dream about to achieve optimal learning?”

Image courtesy of Kote on Drawception.com, 2012

Recent studies in sleep and dreaming indicate that by consolidating memory traces with high emotional/motivational value, “sleep and dreaming may offer a neurobehavioral substrate for the offline reprocessing of emotions, associative learning, and exploratory behaviors, resulting in improved memory organization, waking emotion regulation, social skills, and creativity.” A recent experiment on rats at University College London also found that when the animals rest, their brains simulate journeys to a desired outcome, such as a tasty treat in their maze environment. The scientists concluded that “such goal-biased replay may support preparation for future experiences in novel environments.”

Reinforcement Learning with Optimized Replay Strategy

Research on how humans and animals learn so efficiently and reliably gives rise to our framework of reinforcement learning with a memory of past experiences, or “experience replay.” The general idea is that learning (specifically, the optimization of a policy-encoding deep neural model) is more effective if memories of interesting experiences can be incorporated. Because high-reward experiences remain in the memory, they will not be forgotten and can be repeatedly revisited during training when necessary.
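As a toy illustration of this idea (a sketch with our own names, not the paper’s implementation), a per-task memory might store only high-reward experiences so that they are never evicted and can be revisited at will:

```python
# Toy per-task memory of high-reward experiences (illustrative sketch).
class HighRewardMemory:
    def __init__(self, max_per_task=100, reward_threshold=1.0):
        self.max_per_task = max_per_task
        self.reward_threshold = reward_threshold
        self.memory = {}  # task_id -> list of (experience, reward) pairs

    def maybe_store(self, task_id, experience, reward):
        # Only experiences that reach the threshold enter the memory, e.g. programs
        # that produced the correct answer for this particular question.
        if reward < self.reward_threshold:
            return
        bucket = self.memory.setdefault(task_id, [])
        if all(stored != experience for stored, _ in bucket):
            bucket.append((experience, reward))
            bucket.sort(key=lambda pair: pair[1], reverse=True)
            del bucket[self.max_per_task:]   # keep only the highest-reward entries

    def get(self, task_id):
        return self.memory.get(task_id, [])
```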

Puppies and humans automatically start to weight their experiences and dream about them more frequently to repeat or avoid certain outcomes.

By focusing on more important memories, learning is easier to achieve.

Our previous work applies this methodology and linearly interpolates the maximum likelihood (ML) training objective and the reinforcement learning (RL) objective. The RL objective represents the expected performance of an agent, which is the goal of the task. The ML objective, on the other hand, measures how similar the agent’s behavior is to an oracle behavior, which both speeds up and stabilizes training. When applied to the task of question answering over a large knowledge graph (WebQuestionsSP), it achieves state-of-the-art performance with weak supervision. Despite its effectiveness, this interpolation strategy introduces bias (it does not directly optimize the RL objective) and also assumes that, for each task (a question to be answered, such as “which is the second largest city in US”), we can identify the single best-performing experience (a generated logical program that computes the correct answer). A schematic version of the interpolated objective is sketched below.
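The sketch below is schematic only: the function and argument names and the mixing weight `alpha` are illustrative assumptions, not taken from the paper. It simply shows a weighted sum of an ML term on the best-known experience and a REINFORCE-style expected-reward term on freshly sampled experiences.

```python
# Schematic ML+RL interpolation (illustrative names; not the paper's exact formulation).
import torch

def interpolated_loss(log_prob_best, log_probs_sampled, rewards_sampled, alpha=0.5):
    """Mix a maximum-likelihood term on the best known experience with a
    REINFORCE-style expected-reward term on freshly sampled experiences."""
    ml_loss = -log_prob_best                                   # imitate the best-found program
    rl_loss = -(log_probs_sampled * rewards_sampled).mean()    # policy-gradient surrogate
    return alpha * ml_loss + (1.0 - alpha) * rl_loss

# Example usage with dummy tensors:
loss = interpolated_loss(
    log_prob_best=torch.tensor(-0.7),
    log_probs_sampled=torch.tensor([-1.2, -0.9, -2.3]),
    rewards_sampled=torch.tensor([1.0, 0.0, 1.0]),
)
```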

In our new paper, we further develop the experience replay technique by removing the ML bias and the assumption of a single most effective experience. Here, we still consider the task of weakly supervised program synthesis from natural language, but on a more challenging dataset, WikiTableQuestions. This task involves more complex programs that compute answers from Wikipedia tables. It is challenging because of its large search space and noisy reward signal: given a question, many programs may generate the correct answer, but only one of them corresponds to the semantics of the original question.

Our new approach optimizes the agent’s model by separately handling the experiences inside the memory and those freshly generated by the agent’s current model. Because the memory is often very large, one experience is sampled from it at each training step. We showed that as long as this sampling is done according to both the agent’s current model probabilities and the experiences’ reward values, the overall training objective remains unbiased. Furthermore, since important experiences are sampled more often, the variance of the gradient estimates is reduced, which leads to faster training.
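The sketch below illustrates one possible reading of this sampling rule (hypothetical names; not the exact estimator from the paper): an in-memory experience is drawn with probability proportional to its current model probability times its reward, and the sampling probability is returned so the gradient estimate can be reweighted to stay unbiased.

```python
# Sketch of reward- and probability-weighted sampling from memory (illustrative only).
import numpy as np

def sample_from_memory(memory, model_log_probs):
    """memory: list of (experience, reward); model_log_probs: log p(experience)
    under the *current* model, recomputed whenever the model changes."""
    rewards = np.array([reward for _, reward in memory], dtype=float)
    weights = np.exp(np.asarray(model_log_probs, dtype=float)) * rewards
    probs = weights / weights.sum()
    idx = np.random.choice(len(memory), p=probs)
    # Return the sampling probability along with the experience so the caller
    # can reweight its gradient contribution.
    return memory[idx], probs[idx]
```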

On the challenging WikiTableQuestions benchmark we achieved an accuracy of 46.2% on the test set, significantly outperforming the previous state-of-the-art of 43.7%. Interestingly, on the Salesforce WikiSQL benchmark, we also achieved an accuracy of 70.9% without the supervision of gold programs, outperforming several strong fully-supervised baselines.

Image courtesy of Robot Dreams (The Robot Series), Byron Preiss Visual Publications, 2012

Acknowledgement: Thanks to Chen and collaborators for the great work. Also thanks to Esther Lee, John Torres, Wenyun Zuo, Cheng He, and Sumang Liu for their help in preparing this article.

References

  1. Do Androids Dream of Electric Sheep?, Philip K. Dick, Doubleday, 1968
  2. Temporally Structured Replay of Awake Hippocampal Ensemble Activity during Rapid Eye Movement Sleep, Kenway Louie and Matthew A. Wilson, Neuron, 2001
  3. Memory of Sequential Experience in the Hippocampus during Slow Wave Sleep, Albert K. Lee and Matthew A. Wilson, Neuron, 2002
  4. Rats dream about their tasks during slow wave sleep, MIT News, 2002
  5. Sleep and dreaming are for important matters, L. Perogamvros, T. T. Dang-Vu, M. Desseilles, and S. Schwartz, Frontiers in Psychology, 2013
  6. Do Rats Dream of a Journey to a Brighter Future?, Neuroscience News, June 26, 2015
  7. Hippocampal place cells construct reward related sequences through unexplored space, H. Freyja Ólafsdóttir, Caswell Barry, Aman B. Saleem, Demis Hassabis, and Hugo J. Spiers, eLife, June 26, 2015
  8. A Deeper Look at Experience Replay, Shangtong Zhang and Richard S. Sutton, 2017
  9. Deep Reinforcement Learning Doesn’t Work Yet, Alex Irpan, 2018
  10. Memory Augmented Policy Optimization for Program Synthesis with Generalization, Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, and Ni Lao, 2018
  11. Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision, Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao, 2017


Ni Lao
Mosaix

I am the chief scientist at mosaix.ai. Before that I worked at Google for 5.5 years on question answering systems, search and knowledge graphs.