Basic TRFL Usage: Target Network Updating

AurelianTactics · 3 min read · Jan 9, 2019
https://deepmind.com/blog/trfl/

4/26/19 Update: for more TRFL usage please see this post https://medium.com/aureliantactics/learn-reinforcement-learning-with-tensorflow-and-trfl-eaea1b8c2e2d

Continuing my blog series on TRFL, a Reinforcement Learning (RL) library developed by DeepMind, here's a post on doing target network updates the TRFL way. TRFL is not an end-to-end RL library; rather, it contains useful RL building blocks. In my prior post I walked through a Colab notebook that did Double Deep Q Learning. In that notebook I updated the target network with plain TensorFlow code in its own cell. In this post I'll use TRFL to do the target network updates, and then show the flexibility of TRFL by updating the target network a couple of other ways.

Target Network Updates

Colab notebook link

Target networks are used in deep RL to improve the stability of training. In the original Deep Q-Network (DQN) paper, RL agents learned to play Atari at a human level using only screen images as inputs and the same hyperparameters across all of the Atari games tested. The DQN algorithm trains two networks: the main training network and a target network. The loss is the squared difference between a target computed from the target network's Q-value estimates and the main training network's Q-value (the squared error is often replaced with a Huber loss nowadays). Every so many steps, the target network's weights are replaced by the main training network's weights and training continues. Here's the TensorFlow way to update the target network, from Denny Britz's excellent RL repo:
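Here's a minimal sketch of that approach (not the exact code from the repo). It assumes the two networks were built inside variable scopes named "main" and "target"; those names are placeholders for this example.

import tensorflow as tf

# Collect each network's variables by scope name, pair them up,
# and build one assign op per variable: target <- main.
def build_copy_ops(main_scope="main", target_scope="target"):
    main_vars = sorted(
        tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=main_scope),
        key=lambda v: v.name)
    target_vars = sorted(
        tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=target_scope),
        key=lambda v: v.name)
    return [t_var.assign(m_var) for m_var, t_var in zip(main_vars, target_vars)]

copy_ops = build_copy_ops()
# Every so many training steps:
# sess.run(copy_ops)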

The TRFL way of doing the same update requires a few modifications. The trfl.update_target_variables(target_variables, source_variables, tau=1.0) method requires the target network's and the main training network's variables. In the QNetwork class I added a method to return those variables:
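Here's a sketch of what that can look like; the layer sizes and the method name get_qnetwork_variables are illustrative, not the exact code from the notebook.

import tensorflow as tf

class QNetwork:
    def __init__(self, name, state_size=4, action_size=2, hidden_size=64):
        self.name = name
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name="inputs")
            hidden = tf.layers.dense(self.inputs_, hidden_size, activation=tf.nn.relu)
            self.output = tf.layers.dense(hidden, action_size, activation=None)

    def get_qnetwork_variables(self):
        # All trainable variables created under this network's variable scope.
        return [v for v in tf.trainable_variables() if v.name.startswith(self.name)]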

As in prior examples, using TRFL is as simple as calling the TRFL method in the graph and then running the resulting operation in the session. In this example, call trfl.update_target_variables() in the graph; the method returns an op to run in sess.run():
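A sketch of how that fits together, reusing the QNetwork class from above (the network names and update_target_op are illustrative):

import tensorflow as tf
import trfl

mainQN = QNetwork(name="main_qn")
targetQN = QNetwork(name="target_qn")

# tau=1.0 copies the main network's weights into the target network wholesale.
update_target_op = trfl.update_target_variables(
    targetQN.get_qnetwork_variables(),
    mainQN.get_qnetwork_variables(),
    tau=1.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop ...
    # every so many steps:
    sess.run(update_target_op)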

The benefit of using TRFL is the flexibility it gives you in how you do the target network updates. TRFL has a tau parameter that lets you do Polyak averaging updates to the target network. Rather than replacing the entire set of target network weights every so many steps, Polyak averaging blends the two networks by tau and slowly moves the target network toward the main training network. That is, instead of replacing the full target network weights every 2000 steps (tau=1.0 in the last example), every step you take 1/2000 of the main training network's weights and 1999/2000 of the target network's weights. Here's the TRFL way of doing this:
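A minimal sketch of that call, reusing the mainQN and targetQN networks from above (tau of 1/2000 mirrors the example in the paragraph):

import trfl

# Blend a small fraction of the main network into the target network.
# Run this op every training step instead of every 2000 steps.
soft_update_op = trfl.update_target_variables(
    targetQN.get_qnetwork_variables(),
    mainQN.get_qnetwork_variables(),
    tau=1.0 / 2000.0)

# In the training loop:
# sess.run(soft_update_op)

Under the hood, trfl.update_target_variables builds one assign op per variable pair: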

# TRFL source code (inner helper; use_locking is an argument of the
# enclosing TRFL function):
def update_op(target_variable, source_variable, tau):
  if tau == 1.0:
    return target_variable.assign(source_variable, use_locking)
  else:
    return target_variable.assign(
        tau * source_variable + (1.0 - tau) * target_variable, use_locking)

Another commonly used approach is to update the target network only every so often, as in TD3. This can be done in TRFL with trfl.periodic_target_update(target_variables, source_variables, update_period, tau=1.0). Use trfl.periodic_target_update() in the graph and run the returned op in sess.run(). Example:
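A sketch of that usage, again reusing the networks from above (update_period=2000 is an illustrative value):

import trfl

# TRFL keeps an internal counter and only applies the copy once every
# update_period calls, so you can run this op on every training step.
periodic_update_op = trfl.periodic_target_update(
    targetQN.get_qnetwork_variables(),
    mainQN.get_qnetwork_variables(),
    update_period=2000,
    tau=1.0)

# In the training loop, every step:
# sess.run(periodic_update_op)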
