Conquering OpenAI Retro Contest 2: Demystifying Rainbow Baseline

Flood Sung
Published in IntelligentUnit
Apr 22, 2018

0 About OpenAI baselines

A Retro demo played by a Rainbow agent

In OpenAI’s tech report on the Retro Contest, they use two deep reinforcement learning algorithms, Rainbow and PPO, as baselines to test the Retro environment, and they also provide the code. From the report we can see that Rainbow is a very strong baseline: it achieves a relatively high score even without joint training (pre-training on the training set):

Captured from OpenAI tech report

This finding raises our curiosity about Rainbow. Can we do something based on it to improve the score?

Therefore, we will introduce the basics of Rainbow in this blog.

1 What is Rainbow?

Rainbow is a DQN-based, off-policy deep reinforcement learning algorithm that combines several improvements. At the time of writing, it is the state-of-the-art algorithm on Atari games:

Captured from Rainbow paper

In fact, Rainbow combines seven algorithms:

(1) DQN (Deep Q-Network)

(2) DDQN (Double Deep Q-Network)

(3) N-Step Q-Learning

(4) Prioritized Experience Replay

(5) Dueling Q-Network

(6) Distributional RL

(7) Noisy Network

Let’s analyze each of the above algorithms step by step.

2 DQN

DQN extends the Q-learning algorithm by using a neural network to represent the Q value. From the perspective of supervised learning, we have to train the Q-network with a proper loss, say the MSE loss:
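A common way to write this loss (where θ denotes the Q-network parameters and y is the target defined next) is:

L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\left(y - Q(s,a;\theta)\right)^2\right]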

So the question is how to construct y, the target Q value. With the Q-learning algorithm, we can construct it from the next Q value:
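In its usual 1-step form (θ⁻ denotes the target-network parameters introduced below, and the (1 − done) factor handles terminal transitions):

y = r + (1 - \text{done}) \, \gamma \max_{a'} Q(s', a'; \theta^-)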

Therefore, we can train the Q-network by randomly sampling transitions (state, action, reward, next_state, done). DQN creates a replay buffer to store these transitions. The whole algorithm is shown below:

In addition to the basic Q-network, they also maintain a target Q-network for more stable training. The Nature DQN paper also provides a good Q-network architecture, which consists of 3 convolution layers without pooling and 2 fully connected layers:
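As a rough sketch of that architecture (written in PyTorch rather than the baseline’s TensorFlow code, and assuming the usual 84×84 inputs with 4 stacked frames):

import torch
import torch.nn as nn

class NatureDQN(nn.Module):
    """Sketch of the Nature DQN architecture: 3 conv layers + 2 fully connected layers."""
    def __init__(self, num_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q value per action
        )

    def forward(self, x):
        # Scale raw pixel values to [0, 1] before the conv stack.
        return self.fc(self.conv(x / 255.0))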

3 Double Q-Learning

Q-Learning is a great off-policy RL algorithm; however, it suffers from an overestimation bias because it always uses the maximum Q value in its update. Double Q-Learning handles this problem by replacing the max target Q value with the target Q value of the action that maximizes the online Q value:
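Written out, with θ the online network and θ⁻ the target network, the double-Q target is:

y = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^-\big)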

This trick reduces the influence of the overestimation issue and thus improves DQN’s performance.

4 N-step Q-Learning

It is natural to calculate target Q values with N-step returns, which incorporate more real reward information and often give a better learning target than 1-step returns:
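A sketch of the N-step target, which replaces the single reward with a discounted sum of the next N rewards:

y = \sum_{k=0}^{N-1} \gamma^k r_{t+k} + \gamma^N \max_{a'} Q(s_{t+N}, a'; \theta^-)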

5 Prioritized Experience Replay

In the basic DQN algorithm, all transitions are stored in a replay buffer and sampled uniformly at random for training. This mechanism suffers from low efficiency, since many transitions are in fact not very useful. Therefore, it is natural to assign priorities to transitions and sample them according to these priorities.

The simplest priority is based on the Q loss: the larger the loss, the more frequently such a transition should be sampled to train the network.
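A minimal sketch of proportional prioritized sampling (the function name and the simple O(N) sampling are illustrative assumptions; real implementations such as anyrl-py’s buffer use a sum-tree for efficiency):

import numpy as np

def sample_batch(td_errors, batch_size, alpha=0.5, beta=0.4, eps=1e-6):
    # Priority p_i = (|td_error_i| + eps)^alpha: larger loss -> sampled more often.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    # Importance-sampling weights correct the bias from non-uniform sampling.
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights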

6 Dueling Networks

This idea changes the Q-network’s architecture by using the relation between Value and Advantage: Q(s,a) = V(s) + A(s,a).

Captured from Dueling Networks’ paper

So it just changes the Q-network’s architecture to have two branches: one branch outputs the state Value, and the other outputs the Advantages. They are then summed, with the average Advantage subtracted, to produce the final Q values:
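In formula form, the two streams are combined as (subtracting the mean advantage keeps the decomposition identifiable):

Q(s, a) = V(s) + \Big(A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')\Big)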

The idea is pretty simple and effective, which helped it win the ICML 2016 best paper award.

7 Distributional RL

This idea studies reinforcement learning from another perspective and comes up with a super cool solution. Current reinforcement learning algorithms always use the average (expected) Q value as the target. However, returns can vary a lot across situations, so average Q values are not always accurate. Check out DeepMind’s blog for a more detailed explanation: https://deepmind.com/blog/going-beyond-average-reinforcement-learning/

The idea is: can we directly learn a distribution over Q values rather than just the average? They solve this by having the network directly output a distribution Z over dozens of possible values instead of a single Q value, and then update the Q-network with the distributional Bellman equation:
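In schematic form, where =_D denotes equality in distribution:

Z(s, a) \stackrel{D}{=} r + \gamma Z(s', a')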

which is almost the same as the basic Bellman equation, just with Q replaced by Z (a distribution over returns).

They choose to output 51 (a magic number) possible Q values in their algorithm, and they name this algorithm C51 (such a fancy name)!

To compare distributions, they use the KL divergence as the loss instead of the MSE loss.
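A tiny sketch of how the 51-atom output is turned back into a scalar Q value for action selection (V_MIN and V_MAX are the assumed support bounds; C51 uses [-10, 10] on Atari):

import numpy as np

NUM_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
atoms = np.linspace(V_MIN, V_MAX, NUM_ATOMS)  # fixed support z_1 .. z_51

def expected_q(probs):
    """probs: (num_actions, 51) softmax outputs of the network for one state."""
    return (probs * atoms).sum(axis=1)  # Q(s, a) = sum_i z_i * p_i(s, a)

# The greedy action is the argmax over these expected values, while training
# minimizes the KL divergence between the predicted distribution and the
# projected target distribution.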

8 Noisy Nets

One final issue with DQN is exploration. Basic DQN implements a simple ε-greedy mechanism, which is hand-designed and inefficient for exploration. Noisy Nets instead augment the final linear layers of the Q-network with a noisy stream. The Q-network can learn to rely on or ignore this noise during training, and thus adjusts its exploration automatically in different states.
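Concretely, a noisy linear layer replaces y = Wx + b with:

y = (\mu^{w} + \sigma^{w} \odot \varepsilon^{w})\,x + (\mu^{b} + \sigma^{b} \odot \varepsilon^{b})

where μ and σ are learned parameters and ε is freshly sampled noise, so the network can learn per-weight how much noise (i.e. exploration) to keep.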

9 Best practice for Rainbow

OpenAI’s Rainbow baseline is well implemented and relies heavily on this open-source RL framework: https://github.com/unixpickle/anyrl-py

The code is well organized and I highly recommend reading or using it.

The code structure for reinforcement learning is pretty clear:

(1) Env module: encapsulates all kinds of games with Gym, so the Agent can easily interact with the Env.

(2) Agent module: the key module; it implements the training process.

(3) Network module: defines all kinds of network architectures for the Q-network.

(4) Replay Buffer module: stores transitions.

(5) Rollout module: runs the interactions between the Agent and the Env.

You can train Sonic games with Rainbow locally by changing the following code in sonic_util.py:

from retro_contest.local import make  # train locally

def make_env(stack=True, scale_rew=True):
    # env = grc.RemoteEnv('tmp/sock')  # test on the OpenAI server
    # Train locally. Here we could add a new method to automatically load
    # different game levels; as written, we can only train on a single level.
    env = make(game='SonicTheHedgehog-Genesis', state='LabyrinthZone.Act1')
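Building on the comment above, a minimal way to train on multiple levels is to sample a (game, state) pair at random each time an environment is created. The level list below is just an illustrative subset; the actual training levels are listed in the contest’s train.csv:

import random
from retro_contest.local import make

# Illustrative subset of training levels (check train.csv for the full list).
TRAIN_LEVELS = [
    ('SonicTheHedgehog-Genesis', 'LabyrinthZone.Act1'),
    ('SonicTheHedgehog-Genesis', 'GreenHillZone.Act1'),
    ('SonicTheHedgehog2-Genesis', 'EmeraldHillZone.Act1'),
]

def make_random_env():
    # Pick a random level for each new environment instance.
    game, state = random.choice(TRAIN_LEVELS)
    return make(game=game, state=state)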

One last secret for everyone: you can achieve a 4800+ score on the contest leaderboard just by naively running the Rainbow baseline! So far, no one has submitted a solution that beats Rainbow.
