Implementing DQNClipped and DQNReg with Stable Baselines

AurelianTactics
Mar 2, 2021

For a mini-project I decided to implement some algorithms from the paper Evolving Reinforcement Learning Algorithms into code. The paper’s main idea is to evolve new Reinforcement Learning (RL) algorithms by representing each algorithm as a graph, allowing various mutations, and selecting the best-performing ones.

Figure from the paper: https://arxiv.org/pdf/2101.03958.pdf

Some highlights from the paper for me:

  • The evolved algorithms can bootstrap from a known algorithm (like Deep Q Network (DQN)) or from scratch.
  • The algorithms can generalize to new environments they were not trained on.
  • The authors relate the best performing new algorithms, DQNClipped and DQNReg, to similar techniques used by Conservative Q Learning and Munchausen-DQN.
  • The paper uses some tricks to get evolution to perform better, such as regularized evolution (dropping the oldest members from the population) and quick CartPole trials as a fast evaluation method (evaluating the population is often a bottleneck in evolutionary approaches). A rough sketch of the regularized evolution loop follows this list.
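
Here is a minimal, hypothetical sketch of that regularized evolution loop (in the style of Real et al.), not the paper’s actual search code; mutate and evaluate are placeholder callables, with evaluate standing in for something like a short CartPole run:

import random
from collections import deque

def regularized_evolution(init_candidate, mutate, evaluate,
                          population_size=100, tournament_size=10, cycles=1000):
    # Seed the population by mutating the initial candidate (e.g. a DQN-like loss graph).
    population = deque()
    for _ in range(population_size):
        cand = mutate(init_candidate)
        population.append((cand, evaluate(cand)))
    best = max(population, key=lambda p: p[1])
    for _ in range(cycles):
        # Tournament selection: mutate the best of a random sample.
        sample = random.sample(list(population), tournament_size)
        parent = max(sample, key=lambda p: p[1])[0]
        child = mutate(parent)
        score = evaluate(child)
        population.append((child, score))
        # Regularized evolution: age out the oldest member, not the worst one.
        population.popleft()
        if score > best[1]:
            best = (child, score)
    return best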

DQNClipped and DQNReg

The rest of this post is about implementing DQNClipped and DQNReg in an existing RL library, Stable Baselines 3 (SB). The two algorithms are simple modifications of DQN. Choosing which existing RL library to use might have been harder than the coding for DQNReg. There are dozens of RL libraries to choose from. This paper has a table mentioning 24 of them, and that table isn’t even close to comprehensive. I chose SB since it’s well established, has active developers, and I plan to do a pull request down the line.

Here are the basics of implementing the algorithms in SB. If I end up completing the pull request I’ll write a further post on that.

DQNClipped and DQNReg use the following loss functions:

From the paper. Reward r, state s, action a, timestep t. Q is the Q network being trained by the DQN, Qtarg is the target network.
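
Written out, transcribed to match the code below rather than copied from the paper’s figure (so treat the exact notation as my reading of it):

\delta_t = Q(s_t, a_t) - y_t, \qquad y_t = r_t + \gamma \max_a Q_{\text{targ}}(s_{t+1}, a)

L_{\text{DQNReg}} = k \, Q(s_t, a_t) + \delta_t^2, \qquad k = 0.1

L_{\text{DQNClipped}} = \max\big[Q(s_t, a_t),\; \delta_t^2 + y_t\big] + \max\big[\delta_t,\; \gamma \, Q_{\text{targ}}(s_{t+1}, a^*)^2\big], \qquad a^* = \arg\max_a Q_{\text{targ}}(s_{t+1}, a)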

Basically that means replacing this line of DQN code with different loss functions:

loss = F.smooth_l1_loss(current_q_values, target_q_values)

DQNReg loss function:

import torch as th

def dqn_reg_loss(current_q, target_q, weight=0.1):
    """
    In DQN, replaces the Huber/MSE loss between the train and target networks.
    :param current_q: Q(st, at) of the training network
    :param target_q: max Q value from the target network, including the reward and gamma: r + gamma * max_a Q_target(st+1, a)
    :param weight: scalar weighting term that regularizes the Q value. The paper defaults to 0.1 but
        theorizes that tuning this per env to some positive value may be beneficial.
    """
    # weight * Q(st, at) + delta^2
    delta = current_q - target_q
    loss = th.mean(weight * current_q + th.pow(delta, 2))

    return loss
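
A quick sanity check with dummy tensors (illustrative values only, not from the paper):

q = th.tensor([1.0, 2.0])   # Q(st, at) from the training network
y = th.tensor([1.5, 1.0])   # r + gamma * max_a Q_target(st+1, a)
print(dqn_reg_loss(q, y))   # mean(0.1 * q + (q - y)^2) = tensor(0.7750)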

DQNClipped loss function:

def dqn_clipped_loss(current_q, target_q, target_q_a_max, gamma):
    """
    :param current_q: Q(st, at) of the training network
    :param target_q: max Q value from the target network, including the reward and gamma: r + gamma * max_a Q_target(st+1, a)
    :param target_q_a_max: Q_target(st+1, a) where a is the action of the max Q value
    :param gamma: the discount factor
    """
    # max[current_q, delta^2 + target_q] + max[delta, gamma * (target_q_a_max)^2]
    delta = current_q - target_q
    left_max = th.max(current_q, th.pow(delta, 2) + target_q)
    right_max = th.max(delta, gamma * th.pow(target_q_a_max, 2))
    loss = th.mean(left_max + right_max)

    return loss
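
With both functions in hand, the swap inside SB’s DQN train() loop could look roughly like the sketch below. The loss_type attribute is hypothetical (something the modified class would need to carry), not part of the SB API, and target_q_a_max has to be pulled from the target network’s next-state Q values before the reward and gamma are folded in:

# Hypothetical dispatch inside the train() loop of a DQN subclass:
if self.loss_type == "dqn_reg":
    loss = dqn_reg_loss(current_q_values, target_q_values)
elif self.loss_type == "dqn_clipped":
    loss = dqn_clipped_loss(current_q_values, target_q_values, target_q_a_max, self.gamma)
else:
    # default SB behavior: Huber loss
    loss = F.smooth_l1_loss(current_q_values, target_q_values)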

Here’s the repo of my implementation. The utils.py and dqn.py files are the modifications of the SB files that allow the two algorithms to be specified. Example usage and hyperparameters are in this file.

The paper reported the following results on four classic control environments: CartPole, LunarLander, MountainCar, and Acrobot.

To test that my implementation works, I ran 5 random seeds on those environments for the three loss functions (standard DQN, DQNClipped, and DQNReg) and got similar results (see plots below). I copied the hyperparameters mentioned in the paper; for those not mentioned, I made my best guesses. Those hyperparameters were successful for all environments except MountainCar. SB’s DQN did not work there with my hyperparameter settings or the default settings, so I used the settings from RL Zoo (a repo with tuned hyperparameters and trained agents for SB).
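
For reference, the seed sweep looks roughly like this with stock SB (swap in the modified dqn.py to select the new losses); the timestep budget and hyperparameters below are placeholders, not the paper’s exact settings:

from stable_baselines3 import DQN

# Placeholder sweep: 5 seeds per environment with stock SB DQN.
for env_id in ["CartPole-v1", "LunarLander-v2", "MountainCar-v0", "Acrobot-v1"]:
    for seed in range(5):
        model = DQN("MlpPolicy", env_id, seed=seed, verbose=0)
        model.learn(total_timesteps=50_000)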

Results from my implementation of DQNClipped and DQNReg

To contribute to SB I still need to do a few things. The SB contrib repo has its own steps I’ll need to follow, and I’ll try to do another post on that. I’ll probably also want to run more trials to make sure my results match the paper’s results on more difficult environments. In the next post I’ll likely go over the contrib process and using a cloud service to spin up the trials.

Stuff I Wasn’t Sure About

As usual, there were some things I was uncertain about when trying to put the paper into code, such as:

  • The full hyperparameters for the DQN
  • Whether the delta formula uses the current state or the next state for the target Q network. The formula clearly says the current state (which would differ from the next state used by DQN), but the language in a few parts of the paper makes me think it’s the next state (such as calling delta² ‘the normal DQN loss’).
