Towards Reinforcement Learning Inspired By Humans Without Human Demonstrations

Leave any 12-year-old alone with an Atari video game for the afternoon and chances are she will have mastered it before dinner. How do people learn to achieve high reward so quickly, and how can we enable artificial agents to do the same? Some hypothesize that people learn and leverage structured models of how the world works (see, for example, [1, 2]), models that represent the world in terms of objects rather than pixels, and that artificial agents could benefit from doing the same [3].

Inspired by such ideas, we present the Strategic Object-Oriented RL (SOORL) algorithm, which to our knowledge is the first algorithm to achieve positive rewards on the notoriously hard Atari game Pitfall! without access to human demonstrations, and it does so within 50 episodes. SOORL uses stronger prior knowledge than standard deep RL algorithms (access to objects in the environment and a class of potential dynamics models), but much weaker information than methods that require trajectories of decent human play.
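
Concretely, "access to objects" means the agent perceives each frame as a handful of labeled objects with positions rather than a grid of pixels. Below is a minimal sketch of what such an object-level state might look like; the class and attribute names are ours for illustration and are not SOORL's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GameObject:
    """One detected object in the current frame (hypothetical representation)."""
    obj_class: str  # object category, e.g. "player", "vine", "crocodile"
    x: float        # horizontal position in the frame
    y: float        # vertical position in the frame


@dataclass
class ObjectState:
    """Object-level state: a short list of objects instead of a pixel array."""
    objects: List[GameObject]

    def relational_features(self) -> List[Tuple[str, float, float]]:
        # Player-relative offsets that a simple dynamics model can condition on.
        player = next(o for o in self.objects if o.obj_class == "player")
        return [(o.obj_class, o.x - player.x, o.y - player.y)
                for o in self.objects if o is not player]
```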

Snapshot of the first three rooms in Pitfall!

SOORL goes beyond prior object-oriented RL work through two key ideas:

  1. The agent actively selects a simple model of how the world works, one that makes the world look deterministic (see the sketch after this list).
  2. The agent makes decisions with an optimistic model-based planning approach that explicitly assumes it will not have enough computation to find a perfect plan, even if it knows exactly how the world works.
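
To make the first idea concrete, here is a minimal sketch (not SOORL's exact criterion) of how an agent could pick, from a hand-specified class of candidate dynamics models, the simplest one that has so far looked deterministic: score each candidate by how often its single predicted next state matched what actually happened, with a small penalty for complexity. The `predict` and `num_parameters` interface is a hypothetical one we assume for illustration.

```python
def select_dynamics_model(candidates, transitions, complexity_penalty=0.1):
    """Pick the candidate model that best explains the transitions seen so far.

    candidates  : models with .predict(state, action) -> predicted next state
                  and .num_parameters (a rough complexity measure); hypothetical API
    transitions : list of (state, action, next_state) tuples from real experience
    """
    best_model, best_score = None, float("-inf")
    for model in candidates:
        # Fraction of transitions this deterministic model predicted exactly.
        correct = sum(model.predict(s, a) == s_next for s, a, s_next in transitions)
        accuracy = correct / max(len(transitions), 1)
        # Prefer simpler models: near-ties go to the model with fewer parameters.
        score = accuracy - complexity_penalty * model.num_parameters
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```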

Both are inspired by the challenges humans face: given little experience and bounded computational capacity, humans must quickly learn to make good decisions. Toward this, our first idea observes that, in contrast to sophisticated and data-intensive deep neural network models, simple deterministic models of what happens to the player's avatar after a particular keyboard press require little experience to estimate and reduce the computational cost of planning (since only one next state is possible); though often wrong, they may frequently be sufficient for achieving good behavior. Second, in complex, sparse-reward video games, play can require hundreds to thousands of steps, and exact planning at the per-decision level is intractable for any agent with a reasonably bounded amount of computation, including a 12-year-old video gamer. We therefore combine a popular and powerful lookahead planning method (Monte Carlo Tree Search) with object-oriented optimism to perform strategic optimistic exploration that guides the agent toward the parts of the world it knows little about.
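
As a rough sketch of the second idea, the snippet below uses plain random-rollout lookahead (a simplified stand-in for the full MCTS that SOORL actually uses) through a learned deterministic model, and adds an optimistic bonus whenever a simulated step involves an object interaction the agent has never experienced. The model interface (`actions`, `step`, `is_terminal`, `interaction_signature`) and the bonus scale are assumptions made for illustration.

```python
import random

OPTIMISM_BONUS = 1000.0  # assumed scale; in practice a tuning choice


def rollout(model, state, depth, visit_counts, gamma=0.99):
    """Return the discounted return of one optimistic rollout through the model.

    Steps whose object interaction has never been seen in real experience
    receive a bonus, so plans that visit unfamiliar situations look attractive.
    """
    if depth == 0 or model.is_terminal(state):
        return 0.0
    action = random.choice(model.actions(state))
    next_state, reward = model.step(state, action)
    key = (model.interaction_signature(state, action), action)
    bonus = OPTIMISM_BONUS if visit_counts.get(key, 0) == 0 else 0.0
    return reward + bonus + gamma * rollout(model, next_state, depth - 1,
                                            visit_counts, gamma)


def plan_action(model, state, visit_counts, num_rollouts=100, depth=50, gamma=0.99):
    """Choose the root action with the best average optimistic rollout return."""
    best_action, best_value = None, float("-inf")
    for action in model.actions(state):
        next_state, reward = model.step(state, action)
        returns = [rollout(model, next_state, depth - 1, visit_counts, gamma)
                   for _ in range(num_rollouts)]
        value = reward + gamma * sum(returns) / num_rollouts
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```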

As a challenge problem we consider Pitfall!, perhaps the hardest Atari video game still unsolved by artificial agents. The first positive rewards in Pitfall! come only after traversing multiple rooms, each reached only through careful maneuvering, so the agent must both explore strategically and reason far into the future when making decisions.

Our SOORL agent reached an average of 17 rooms in Pitfall! within 50 episodes (averaged over 100 runs). In comparison, DDQN [6], a strong baseline that uses pixel input and no strategic exploration, reached an average of 6 rooms after 2000 episodes.

SOORL discovers 17 rooms on average and 25 rooms in the best run

The histogram below shows the distribution of best-episode performance during training (each training run lasts only 50 episodes) for each of the 100 SOORL runs with different random seeds.

Histogram of the best episode performance of 100 different runs (best during the first 50 episodes of each run)

As can be seen, SOORL most often scores no better than prior deep RL methods, which achieve at best a reward of 0 even after 500 or 5000 training episodes, compared to our 50. In those cases SOORL typically explores much further, reaching more rooms than alternative approaches, but does not achieve a better best-episode score than their evaluation runs. However, in several runs SOORL obtains immediate rewards of 2000 (in room -17) and of 4000 (in room 6), achieving, to the best of our knowledge, the first positive scores on this game when learning without demonstrations. The best known results with human demonstrations are substantially higher (60k) [4]; while very exciting, that work requires substantially more prior knowledge and, in particular, greatly reduces the exploration challenge by providing a trusted worked example to build on.

Below are examples of interesting maneuvers learned by the SOORL agent.

Agent getting past different types of obstacles: vine (left), crocodiles (middle), and shifting quicksand (right).

SOORL still has many limitations. Perhaps the most critical shortcomings are that it requires a class of reasonable candidate dynamics models to be specified in advance (SOORL then performs model selection over this set), and that it does not learn and leverage a value function during tree search, which was an important component of early world-class artificial Go players [5]. We expect incorporating a value function will be crucial for improved and more consistent performance.

Yet despite these weaknesses, these results are an exciting demonstration that a model-based RL agent can quickly learn to act well in extremely challenging sparse-reward video games like Pitfall! by strategically planning to learn more about how a world, assumed to be simple, works in order to make good decisions.

A preliminary version of our paper is available here.

The video clip below shows the SOORL agent reaching the gold treasure (a 4000-point bonus):

Video clip of the agent getting past obstacles to get to the Gold reward.

References

[1] Tsividis, Pedro A., Thomas Pouncy, Jacqueline L. Xu, Joshua B. Tenenbaum, and Samuel J. Gershman. “Human Learning in Atari.” In Proceedings of the AAAI Spring Symposium on Science of Intelligence, 2017.
[2] Lake, Brenden M., Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. “Building Machines that Learn and Think Like People.” Behavioral and Brain Sciences 40 (2017).
[3] Diuk, Carlos, Andre Cohen, and Michael L. Littman. “An Object-Oriented Representation for Efficient Reinforcement Learning.” In Proceedings of the International Conference on Machine Learning (ICML), 2008.
[4] Aytar, Yusuf, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, and Nando de Freitas. “Playing Hard Exploration Games by Watching YouTube.” arXiv preprint arXiv:1805.11592 (2018).
[5] Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, et al. “Mastering the Game of Go Without Human Knowledge.” Nature 550, no. 7676 (2017): 354.
[6] Van Hasselt, Hado, Arthur Guez, and David Silver. “Deep Reinforcement Learning with Double Q-Learning.” In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
