Implementation of DQN, Double DQN, and Dueling DQN with keras-rl (2020)
Double Q-learning
Another augmentation to the standard Q-learning model we just built is the idea of Double Q-learning, which was introduced by Hado van Hasselt (2010, 2015). The intuition behind this is quite simple. Recall that, so far, we have been estimating our target values for each state-action pair using the Bellman equation and checking how far off the mark our predictions are at a given state, like so:
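Reconstructing this target from the description above (a sketch; \theta_t denotes the network's parameters, a notational assumption), the standard DQN target and loss can be written as:

y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_t)

L_t(\theta_t) = \big( y_t - Q(s_t, a_t; \theta_t) \big)^2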
However, a problem arises from estimating the maximum expected future reward in this manner. As you may have noticed earlier, the max operator in the target equation (y_t) uses the same Q-values both to select an action for a sampled state and to evaluate that action. This introduces a propensity for the overestimation of Q-values, which can eventually spiral out of control. To compensate for such possibilities, van Hasselt et al. (2016) implemented a model that decouples the selection of actions from their evaluation. This is achieved using two separate neural networks, each parametrized to estimate a part of the overall equation. The first network is tasked with selecting the actions to take at given states, while a second network is used to generate the targets by which the first network's predictions are evaluated as the loss is computed iteratively. Although the formulation of the loss at each iteration does not change, the target label for a given state can now be represented by the augmented Double DQN equation, as shown here:
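Reconstructed from the description above (a sketch; \theta_t denotes the online network's parameters and \theta^- the target network's), the Double DQN target reads:

y_t = r_t + \gamma \, Q\big(s_{t+1}, \operatorname{arg\,max}_{a} Q(s_{t+1}, a; \theta_t); \theta^-\big)

The online network picks the greedy action for the next state, while the target network supplies the value of that action.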
As we can see, the target network has its own set of parameters to optimize, (θ⁻). This decoupling of action selection from action evaluation has been shown to compensate for the overoptimistic value estimates learned by the naïve DQN. As a consequence, our loss converges faster and learning is more stable.
In practice, the target network's weights can also be fixed and only slowly or periodically updated, to avoid destabilizing the model through bad feedback loops between the target and prediction networks. This technique was notably popularized by yet another DeepMind paper (Lillicrap, Hunt, Pritzel, Heess, et al., 2016), where the approach was found to stabilize the training process.
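The two update schemes can be sketched as follows (θ are the online weights and θ⁻ the target weights; the notation is an assumption consistent with the equations above):

\theta^- \leftarrow \tau\,\theta + (1 - \tau)\,\theta^- \quad \text{(soft update, with a small rate } \tau)

\theta^- \leftarrow \theta \quad \text{every } C \text{ steps (hard, periodic update)}

In keras-rl, the target_model_update argument used in the snippets below is typically interpreted as the soft-update rate τ when it is smaller than 1, and as a hard-update interval (in steps) when it is 1 or greater.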
The DeepMind paper by Lillicrap, Hunt, Pritzel, Heess, et al., Continuous Control with Deep Reinforcement Learning (2016), can be accessed at https://arxiv.org/pdf/1509.02971.pdf.
You may implement the Double DQN through the keras-rl module by using the same code we used earlier to train our Space Invaders agent, with a slight modification to the part that defines your DQN agent:
double_dqn = DQNAgent(model=model,                    # the same Keras model as before
                      nb_actions=nb_actions,
                      policy=policy,
                      memory=memory,
                      processor=processor,
                      nb_steps_warmup=50000,          # steps collected before learning starts
                      gamma=.99,                      # discount factor
                      target_model_update=1e-2,       # soft target update rate (tau)
                      train_interval=4,
                      delta_clip=1.,                  # Huber loss clipping threshold
                      enable_double_dqn=True,         # decouple action selection from evaluation
                      )
All we have to do is set the Boolean argument enable_double_dqn to True, and we are good to go! Optionally, you may also want to experiment with the number of warm-up steps (that is, the steps taken before the model starts learning) and the frequency with which the target model is updated.
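For reference, here is a minimal sketch of how such an agent is compiled and trained with keras-rl. The optimizer settings, the env variable (a Gym Space Invaders environment built earlier in the chapter), and the step counts are illustrative assumptions rather than the chapter's exact values:

from keras.optimizers import Adam

# Compile the agent with an optimizer and a monitoring metric,
# then train it directly on the Gym environment.
double_dqn.compile(Adam(lr=.00025), metrics=['mae'])
double_dqn.fit(env, nb_steps=1750000, log_interval=10000)

# Evaluate the learned policy over a handful of episodes.
double_dqn.test(env, nb_episodes=10, visualize=False)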
Dueling network architecture
The last variation of the Q-learning architecture that we shall implement is the Dueling network architecture (https://arxiv.org/abs/1511.06581). As the name might suggest, here we figuratively make a neural network duel with itself, using two separate estimators: one for the value of a state and one for the value of a state-action pair. You will recall from earlier in this chapter that we estimated the quality of state-action pairs using a single stream of convolutional and densely connected layers. However, we can actually split up the Q-value function into a sum of two separate terms. The reason behind this segregated architecture is to allow our model to separately learn which states are (or are not) valuable, without having to specifically learn the effect of each action performed at each state:
At the top of the preceding diagram, we can see the standard DQN architecture. At the bottom, we can see how the Dueling DQN architecture bifurcates into two separate streams, where the state and state-action values are estimated separately, without any extra supervision. Hence, Dueling DQNs use separate estimators (that is, densely connected layers) for both the value of being at a state, V(s), and the advantage of performing one action over another at a given state, A(s,a). These two terms are then combined to predict the Q-value for a given state-action pair, ensuring that our agent chooses optimal actions in the long run. While the standard Q function, Q(s,a), only allowed us to estimate the value of selecting actions at given states, we can now measure the value of states and the relative advantage of actions separately. Doing so can be helpful in situations where performing an action does not alter the environment in a meaningful way.
The combination of the value and advantage functions is given in the following equation:
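Reconstructing the aggregation described by Wang et al. (2016) from the surrounding text (a sketch; θ denotes the shared convolutional parameters, while α and β parametrize the advantage and value streams, respectively):

Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \Big)

Subtracting the mean advantage keeps the two streams identifiable, and it is this averaging variant that the dueling_type='avg' setting below refers to.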
DeepMind researchers (Wang et al., 2016) tested such an architecture on an early car racing game (Atari Enduro), where the agent is instructed to drive along a road on which obstacles may sometimes appear. The researchers noted how the state value stream learns to pay attention to the road and the on-screen score, whereas the action advantage stream only learns to pay attention when specific obstacles appear on the game screen. Naturally, it only becomes important for the agent to perform an action (move left or right) once an obstacle is in its path. Otherwise, moving left or right is of no importance to the agent. On the other hand, it is always important for our agent to keep its eyes on the road and on the score, which is handled by the state value stream of the network. Hence, in their experiments, the researchers showed how this architecture can lead to better policy evaluation, especially when an agent is faced with many actions of similar consequence.
We can implement Dueling DQNs using the keras-rl module for the very same Space Invaders problem we viewed earlier. All we need to do is redefine our agent, as shown here:
dueling_dqn = DQNAgent(model=model,
                       nb_actions=nb_actions,
                       policy=policy,
                       memory=memory,
                       processor=processor,
                       nb_steps_warmup=50000,
                       gamma=.99,
                       target_model_update=10000,      # hard target update every 10,000 steps
                       train_interval=4,
                       delta_clip=1.,
                       enable_dueling_network=True,    # split the head into value and advantage streams
                       dueling_type='avg'              # combine streams by subtracting the mean advantage
                       )
Here, we simply have to set the Boolean argument enable_dueling_network to True and specify a dueling type; 'avg' corresponds to the mean-subtracting aggregation shown earlier.
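Under the hood, enable_dueling_network rewires the model's output layer into the two streams described above. Purely for intuition, here is a rough, standalone sketch of an equivalent dueling head built by hand in Keras; the input shape, layer sizes, and nb_actions value are illustrative assumptions rather than the chapter's exact model:

from keras.layers import Input, Conv2D, Flatten, Dense, Lambda
from keras.models import Model
from keras import backend as K

nb_actions = 6  # illustrative: the discrete action count for Space Invaders

frames = Input(shape=(84, 84, 4))                        # stacked, preprocessed frames
x = Conv2D(32, 8, strides=4, activation='relu')(frames)
x = Conv2D(64, 4, strides=2, activation='relu')(x)
x = Conv2D(64, 3, strides=1, activation='relu')(x)
x = Flatten()(x)

v = Dense(512, activation='relu')(x)                     # value stream
v = Dense(1)(v)                                          # V(s)

a = Dense(512, activation='relu')(x)                     # advantage stream
a = Dense(nb_actions)(a)                                 # A(s, a)

# 'avg' aggregation: Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a'))
q = Lambda(lambda streams: streams[0] + streams[1]
           - K.mean(streams[1], axis=1, keepdims=True))([v, a])

dueling_model = Model(inputs=frames, outputs=q)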