Deep Q-learning from Demonstrations (DQfD) in Keras

AurelianTactics
Jul 16, 2018

In an earlier post, I wrote about a naive way to use human demonstrations to help train a Deep Q-Network (DQN) for Sonic the Hedgehog. After that mostly unsuccessful attempt, I read an interesting paper called Deep Q-learning from Demonstrations. The paper has a ‘Related Work’ section that discusses many other attempts to use human demonstrations with DQN and how the DQfD method compares. I decided to try to put the DQfD paper into code. I had never turned a paper into code before, and it was an interesting learning experience.

I chose Keras because I saw examples of DQfD implemented in TensorFlow (most notably TensorForce) and wanted to push myself to do as much of my own DQfD code as possible, without reading or copying from another implementation. If you want an out-of-the-box implementation of DQfD to use on your own project, I recommend using someone else’s implementation. While I did get a working model up and running, I made some assumptions where I wasn’t quite clear on what the paper did.

To others trying to put a paper into code, I don’t have any ground-breaking advice. I watched a Siraj video called ‘Research to Code’ as I was finishing up this project. His advice is basically ‘Carefully read the paper from start to finish as many times as needed.’ I guess that’s the key. Read carefully. Read repeatedly. I took notes with questions about things I was unsure of. I made a workflow list to help organize how things should be done and to identify trouble spots.

One thing I hadn’t expected coming in, and that is helpful to know going forward, is how much reading of cited papers is needed. While I only intended to implement one paper, there were times when I had four papers open and was trying to better understand and code parts of those papers into this project. A simple line like:

…which we calculate using the forward view, similar to A3C (Mnih et al. 2016).

means it’s time to open up another paper and carefully dig into that research too (or, hopefully, find a code implementation of that work).

DQfD at a Glance

  • Pre-train the network with expert demonstrations.
  • Store the demonstrations into an expert replay buffer.
  • Have the agent gather more transitions for a new replay buffer. Train the network on a mixture of the new replay buffer and the expert replay buffer.
  • Train the network with a special loss function made up of four parts: the normal DQN loss, an n-step DQN loss, L2 regularization, and a supervised large margin classification loss on expert actions (see the sketch after this list).
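For reference, the paper combines those four pieces into one objective as a weighted sum. A rough sketch (the lambda weights are hyperparameters; in Keras the L2 term ends up coming from layer regularizers rather than an explicit term):

# hedged sketch of the combined DQfD objective from the paper
# lambda_nstep, lambda_slmc, lambda_l2 are hyperparameters
total_loss = dq_loss + lambda_nstep * nstep_loss + lambda_slmc * slmc_loss + lambda_l2 * l2_loss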

Those are the key points in my opinion. It’s hard to summarize a paper without getting too lost in jargon or making the summary too long, and I think that’s a reasonable distillation of the research. I believe there is some flexibility in the DQfD method, as a recent paper (Observe and Look Further) used DQfD without pre-training the network and with a fixed sampling ratio between the expert and new replay buffers (while DQfD uses prioritized sampling).

DQfD in Keras

See my github repo here. I have two python scripts, which probably should be broken out into more files. One of the scripts is for Prioritized Experience Replay (PER), which lets you sample more efficiently from the replay buffer (the original DQN paper samples uniformly at random). The script is from the anyrl implementation, with a couple of modifications to allow saving and loading the replay buffer with some pickle dumps.
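The saving and loading is nothing fancier than pickling the buffer to disk, roughly like this (the file name and variable names are illustrative):

import pickle

# hedged sketch: persist the expert PER buffer between the pre-training run
# and the RL-training run; 'expert_buffer.p' is an illustrative file name
with open('expert_buffer.p', 'wb') as f:
    pickle.dump(expert_buffer, f)

with open('expert_buffer.p', 'rb') as f:
    expert_buffer = pickle.load(f)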

Let’s walk through the keras_dqfd.py script at a high level. I am doing this implementation on OpenAI’s Retro Gym, specifically Sonic the Hedgehog. The code is flexible enough to work on any Retro Gym game and, with some modifications, on other gym environments. Basic code flow is:

  • Create the environment
  • Turn human demonstrations (created with retro movies) into transitions for the expert replay buffer
  • Pre-train the network
  • Create a Reinforcement Learning (RL) agent, gather more transitions for a new replay buffer, and train the agent using the two replay buffers.

Starting with the if __name__ == ‘__main__’ section, you’ll notice that the code doesn’t quite flow this way. In Retro Gym you can only create one environment per process, and in this case I needed two slightly different environments. The one for expert training does not use stochastic sticky keys, while the one for RL agent training does. Sticky keys means there is a 25% chance that the agent continues its previous action for an additional frame, adding some stochasticity to a deterministic environment. My work-around is to let a call to this script do one of two things: either gather the expert demos and pre-train the network, or do the RL agent training. In the latter case, the pre-trained network and expert replay buffer are loaded.
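If sticky actions are new to you, the idea is just a wrapper that sometimes repeats the previous action. A minimal sketch (not the exact wrapper my script uses) looks something like this:

import random
import gym

class StickyActionEnv(gym.Wrapper):
    """With probability p, repeat the previous action instead of the chosen one."""
    def __init__(self, env, p=0.25):
        super(StickyActionEnv, self).__init__(env)
        self.p = p
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.p:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)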

Turning the human demonstrations into the expert replay buffer is done by the fill_expert_buffer, parse_demo, and add_transitions functions. fill_expert_buffer locates the human demonstration files and passes them to parse_demo, which goes over them frame by frame. add_transitions is used in parse_demo and in the RL agent training to take the states, rewards, next states, etc. and put them into a PER buffer. Feel free to ask any questions.
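The frame-by-frame parsing follows the movie playback pattern from the Gym Retro docs. A rough sketch (the file name is illustrative, and exact API names vary a bit between gym-retro versions):

import retro

movie = retro.Movie('human_demo.bk2')  # a recorded .bk2 demonstration
movie.step()
env = retro.make(game=movie.get_game(), state=None,
                 use_restricted_actions=retro.Actions.ALL)  # retro.ACTIONS_ALL in older versions
env.initial_state = movie.get_state()
obs = env.reset()

while movie.step():
    # read the buttons pressed on this frame of the recording
    keys = [movie.get_key(i, 0) for i in range(env.num_buttons)]
    next_obs, reward, done, info = env.step(keys)
    # here the real script maps keys to a discrete action and calls add_transitions
    obs = next_obs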

Some of the functions rely on Sonic- or Retro Gym-specific helpers:

  • make_env: creates the environment
  • SonicDiscretizer: limits the action space the RL agent can choose from (sketched after this list)
  • AllowBacktracking: shapes the reward so it is only gained by reaching a new maximum distance, rather than penalizing Sonic for backtracking in a level.
  • define_action_dict: when parsing human demonstrations, turns those actions into the same action keys later used by SonicDiscretizer.
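SonicDiscretizer and AllowBacktracking will look familiar if you’ve seen OpenAI’s retro-baselines sonic_util.py. Roughly, SonicDiscretizer maps a handful of useful button combinations to a small discrete action space; a sketch along those lines:

import gym
import numpy as np

class SonicDiscretizer(gym.ActionWrapper):
    """Wrap a retro env so the agent picks from a few useful button combos."""
    def __init__(self, env):
        super(SonicDiscretizer, self).__init__(env)
        buttons = ["B", "A", "MODE", "START", "UP", "DOWN",
                   "LEFT", "RIGHT", "C", "Y", "X", "Z"]
        combos = [['LEFT'], ['RIGHT'], ['LEFT', 'DOWN'], ['RIGHT', 'DOWN'],
                  ['DOWN'], ['DOWN', 'B'], ['B']]
        self._actions = []
        for combo in combos:
            arr = np.array([False] * len(buttons))
            for button in combo:
                arr[buttons.index(button)] = True
            self._actions.append(arr)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        # translate the discrete action index back into a button array
        return self._actions[a].copy()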

After parsing the human demonstrations, training the expert network begins. train_expert_network trains the network on expert data (the pre-training step), while train_network trains the RL agent. Both feed samples from replay buffers into inner_train_function. The difference is that train_network uses a mixture of expert and gathered samples (I fix the mixture like the ‘Observe and Look Further’ paper, while the DQfD paper mixes based on PER priority; a sketch of that mixing follows below) and has some additional code that lets the RL agent gather more samples. inner_train_function implements Double DQN and updates the PER replay buffers with their new priorities.
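The fixed mixture is nothing more elaborate than drawing a set share of each training batch from each buffer. A hypothetical helper to illustrate the idea (the fraction and names are made up; the real script works directly with the PER buffers’ sample format):

def sample_mixed_batch(expert_buffer, agent_buffer, batch_size, expert_fraction=0.25):
    # hedged sketch: a fixed share of each batch comes from the expert buffer,
    # the rest from the agent's own buffer (the DQfD paper instead lets PER
    # priorities decide the mixture)
    num_expert = int(batch_size * expert_fraction)
    return (expert_buffer.sample(num_expert) +
            agent_buffer.sample(batch_size - num_expert))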

The build_model function is where the Keras happens. A simple DQN Convolutional Neural Network (CNN) is augmented with Dueling DQN and the four losses from DQfD. The L2 regularization loss is added by specifying L2 regularizers on the network layers:

layer_1 = Conv2D(..., kernel_regularizer=regularizers.l2(l2_reg),
                 bias_regularizer=regularizers.l2(l2_reg))(scale_img)

The DQ loss (the basic loss from a DQN) and the n-step loss are a bit trickier to implement in Keras. They both use the same Dueling DQN layers. This would be straightforward if they used the same model but different weights, or were concatenated together and jointly fed into the model. However, they need to be treated as separate inputs that share the same layers. The Keras docs detail how to do this in the ‘Shared Layers’ section.

I set up the Dueling DQN and have an input_img layer as an entry point. input_img is fed into the Dueling DQN and outputs cnn_output. I encapsulate this into a model that can be shared (called cnn_model). The Dueling DQN network:

input_img = Input(shape=(img_rows, img_cols, img_channels), name='input_img', dtype='float32')
#... the CNN layers here, dueling DQN part etc
cnn_output = Lambda(dueling_operator, name='cnn_output')([dueling_values, dueling_actions])
#the Dueling DQN is encapsulated into a model
cnn_model = Model(input_img, cnn_output)

Then, for the DQ loss and the n-step loss, I create inputs for those two paths that, combined with a call to cnn_model, feed into input_img and thus share the layers of the same model. The calls to share layers:

#inputs that are fed into the Keras model
input_img_dq = Input(shape=(img_rows, img_cols, img_channels), name='input_img_dq', dtype='float32')
input_img_nstep = Input(shape=(img_rows, img_cols, img_channels), name='input_img_nstep', dtype='float32')
#share layers of the same model
dq_output = cnn_model(input_img_dq)
nstep_output = cnn_model(input_img_nstep)

Outside build_model, in inner_train_function, when I call the model to train I feed in the DQ images and the n-step images as inputs. Keras takes the DQ images through the input_img_dq line, feeds them into cnn_model’s input_img, and runs them through the Dueling DQN network (cnn_model). The same thing happens with the n-step images fed into input_img_nstep.
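Concretely, that call ends up looking like a standard multi-input, multi-output Keras training step. This is a sketch with assumed variable names rather than the exact ones in the script; the zeros are dummy targets for the SLMC output, since that loss is computed inside the model itself:

import numpy as np

# hedged sketch of the training call (assumed setup): dq_targets and
# nstep_targets are the Double DQN and n-step targets for the two outputs
losses = model.train_on_batch(
    [dq_states, nstep_states, is_expert, expert_actions, expert_margins],
    [dq_targets, nstep_targets, np.zeros((batch_size, 1))])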

That leads us to our fourth and final loss: the supervised large margin classifier (SLMC) loss. The SLMC loss requires the outputs from the DQ loss (see the dq_output line). This loss is only applied to transitions sampled from the expert replay buffer. It adds a margin to the values of actions the expert did not take. This was the trickiest loss for me to implement because I couldn’t find a proper way to slice the tensors in Keras (I needed something like the np.arange indexing I use in NumPy). I used one line of TensorFlow (tf.gather_nd) to take the dq_output and extract only the parts of the tensors that align with the expert actions:

exp_val = tf.gather_nd(sa_values, exp_act) #so close to Keras only
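For context, tf.gather_nd wants (row, action) index pairs, so the expert action indices get paired with their batch row indices first, which is the np.arange-style trick done with tensors. A sketch with assumed intermediate names (sa_values and input_expert_action come from the model; rows is illustrative):

import keras.backend as K
import tensorflow as tf

# hedged sketch: pair each batch row with its expert action index so gather_nd
# pulls out Q(s, a_expert) for every transition in the batch
rows = K.reshape(K.arange(0, K.shape(sa_values)[0]), (-1, 1))
exp_act = K.concatenate([rows, K.cast(input_expert_action, 'int32')], axis=-1)
exp_val = tf.gather_nd(sa_values, exp_act)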

The model then takes the two image batches (DQ images and n-step images) and three arrays dealing with expert inputs (for the SLMC loss), and outputs the DQ predictions, the n-step predictions, and the SLMC loss.

model = Model(inputs=[input_img_dq, input_img_nstep, input_is_expert,
                      input_expert_action, input_expert_margin],
              outputs=[dq_output, nstep_output, slmc_output])
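One way to wire up those three outputs in Keras is to give each output its own loss and use the loss weights as the lambda hyperparameters from the paper. This is a sketch of that general pattern, not the exact losses in my script; since the SLMC output is already a loss value, its “loss” can simply pass it through:

from keras.optimizers import Adam

def pass_through_loss(y_true, y_pred):
    # the SLMC output is computed inside the model, so just return it
    return y_pred

# hedged sketch: 'mse' stands in for whatever TD-error loss you prefer;
# lambda_nstep and lambda_slmc are hyperparameter floats
model.compile(optimizer=Adam(lr=0.0001),
              loss=['mse', 'mse', pass_through_loss],
              loss_weights=[1.0, lambda_nstep, lambda_slmc])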

I trained the network on the first level of Sonic 1. While the DQfD agent was not nearly as successful as the Rainbow DQN implementation (the DQfD agent struggles at the loop), it is able to complete the level with no hyperparameter search. I used the same hyperparameters as the DQfD paper and the Rainbow DQN implementation. To match or possibly outperform the Rainbow DQN, some hyperparameter search would likely be beneficial.

DQfD agent completes first level of Sonic
