DON’T GUESS YOUR NEXT MOVE. PLAN IT!

Deep RL: a Model-Based approach (part 1)

Enrico Busto - EN
Published in Analytics Vidhya
Nov 30, 2020 · 4 min read


Deep Reinforcement Learning doesn’t really work… Yet

Image from: Jason Leung on Unsplash

Using Deep Reinforcement Learning (DRL), we can train an agent to solve a task without explicitly programming it. This approach is so general that, in principle, we can apply it to any sequential decision-making problem. For example, in 2015, a research team developed a DRL algorithm called DQN to play Atari games. They used the same method across 57 different games, each with its own goals, its own enemies, and its own set of moves available to the agent. Their agent learned to solve many of these games, and in some cases it even surpassed human-level performance.

Examples of some Atari games played by a DQN agent. Source: Synced, “SLM Lab: New RL Research Benchmark & Software Framework”
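To make “the same method across 57 games” a little more concrete, here is a minimal, hypothetical sketch (in PyTorch) of the epsilon-greedy, temporal-difference update at the heart of DQN. The toy fake_env_step function, the network sizes, and all constants are placeholders, not DeepMind’s actual setup; a real DQN also uses a convolutional network on game frames, a replay buffer, and a target network.

```python
# Minimal DQN-style sketch: one Q-network maps observations to a value per
# action, and the same epsilon-greedy loop and TD update are reused unchanged
# from game to game. The random "environment" below stands in for an emulator.
import random
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, eps = 8, 4, 0.99, 0.1

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fake_env_step(action):
    """Placeholder for env.step(action): returns next observation and reward."""
    return torch.randn(obs_dim), random.random()

obs = torch.randn(obs_dim)
for frame in range(1_000):  # a real DQN needs on the order of 10^8 frames
    # epsilon-greedy action selection
    if random.random() < eps:
        action = random.randrange(n_actions)
    else:
        action = q_net(obs).argmax().item()

    next_obs, reward = fake_env_step(action)

    # one-step TD target and Q-learning loss (replay buffer / target net omitted)
    with torch.no_grad():
        target = reward + gamma * q_net(next_obs).max()
    loss = (q_net(obs)[action] - target) ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    obs = next_obs
```

The crucial point for this article is not the update itself, but how many times this loop has to run before the agent plays well.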

The DRL community has achieved incredible results recently. For example, in 2016, DeepMind successfully trained a DRL agent that beat the world champion of Go.

In 2019, OpenAI released OpenAI Five: the first AI able to beat the world champions in an e-sports game called Dota 2. In the same year, OpenAI also trained a real-world robot hand to solve a Rubik’s cube.

Source: YouTube — Solving Rubik’s Cube with a Robot Hand: Uncut

Nevertheless, DRL also shows critical limitations. One of them is that the algorithms require far too many interactions with the environment before learning a good strategy. This problem is called sample inefficiency. To understand it better, let’s look at a practical example.

But first, try to guess: how much experience does a DQN agent need to achieve human performance in Atari games? (By “experience” we mean every action performed in the game. Since the agent chooses a new action for each frame, we measure its experience as the number of frames it has received.)

The following plot gives some hints:

Source: Rainbow: Combining Improvements in Deep Reinforcement Learning, arXiv

In the plot above, the gray line shows the result obtained by the original DQN architecture (ignore all the other algorithms for the moment). On the x-axis we have the required frames (notice that they are in the hundreds of millions). On the y-axis we have the “median human-normalized score”: each of the 57 Atari games yields a score normalized so that 100% matches the human score, and the median of those 57 values is reported. DeepMind repeated this process for each of the presented algorithms.

Answer: from this plot, we can see that the original DQN algorithm requires hundreds of millions of frames. The plot also shows that, even though DQN surpasses human performance in some games, that is not true in general.
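As a side note, here is a small, hypothetical sketch of how such a median human-normalized score is typically computed: each game’s raw score is rescaled so that random play sits at 0% and human play at 100%, and the median across games is reported. The per-game numbers below are made-up placeholders, not values from the paper.

```python
# Hypothetical illustration of a median human-normalized score.
from statistics import median

def human_normalized(agent_score, random_score, human_score):
    """Rescale a raw game score: 0% = random play, 100% = human play."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# (agent_score, random_score, human_score) for a few made-up games
games = [(420.0, 150.0, 700.0), (12.0, 1.0, 9.0), (3000.0, 250.0, 6000.0)]

per_game = [human_normalized(a, r, h) for a, r, h in games]
print(per_game)          # per-game normalized scores, in percent
print(median(per_game))  # the aggregate value reported on the y-axis
```

A value above 100% on a given game means the agent beats the human reference on that game, which is exactly how an agent can be superhuman on a few games while its median stays below 100%.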

Since 2015, researchers have proposed many improvements; some of them appear in the previous plot, each with a different color. In 2017, researchers from DeepMind showed how to combine all of them to achieve the best results. This new version of DQN is called Rainbow: it overtakes the original version after just 7 million frames, reaches overall human-level performance after 18 million frames, and overtakes all the other baselines after 44 million.

All the games run at 60 frames per second, so 18 million frames correspond to about 83 hours of play experience. Note that these 83 hours only approximate the time spent playing; the full training required a lot more!
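A quick back-of-the-envelope check of that 83-hour figure, using only the numbers quoted above (this counts in-game time only, not wall-clock training time):

```python
# 18 million frames at 60 frames per second, expressed in hours of gameplay.
FPS = 60                 # Atari games run at 60 frames per second
frames = 18_000_000      # frames Rainbow needs to reach human-level median score
hours = frames / FPS / 3600
print(hours)             # -> about 83.3 hours of play experience
```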

Considering that most humans need just a few minutes to become comfortable with an Atari game, 83 hours is a long time.

With more complex games, the situation is even worse: “OpenAI Five plays 180 years worth of games against itself every day, learning via self-play.”

For this reason, most recent DRL success stories are related to video games: researchers use virtual environments to speed up or even parallelize the training.

Performing millions of experiments with a real-world robot in a reasonable amount of time and without hardware wear and tear is unrealistic.

Even for the real-world robot hand that solved the Rubik’s cube, the researchers used a simulator: in that case, they used it to pre-train the agent. This approach could be one possible solution, but how to reliably transfer the learned policy to the real world is still an open research problem. Moreover, building a simulator every time we need to train an agent for a new task is infeasible.

That’s why there is no way to train a real-world agent to solve non-trivial problems in a reasonable amount of time.

Source: YouTube — Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning

In the next article, Deep RL: a Model-Based approach (part 2), we will see how reinforcement learning works and introduce the model-based approach, showing how it improves DRL sample efficiency.

This article was written in collaboration with Luca Sorrentino.



Enrico Busto - EN
Founding Partner and CTO @ Addfor S.p.A. We develop Artificial Intelligence Solutions.