Basic TRFL Usage: Q-Learning and Double Q-Learning

AurelianTactics · Dec 31, 2018

4/26/19 Update: for more TRFL usage please see this post https://medium.com/aureliantactics/learn-reinforcement-learning-with-tensorflow-and-trfl-eaea1b8c2e2d

Earlier this year DeepMind released a Reinforcement Learning (RL) package called TRFL. TRFL does not provide end-to-end RL algorithms; rather, the package contains “building blocks for implementing Reinforcement Learning agents.” The docs page has a summary of TRFL’s contents, from basics like Q-learning to more advanced methods like V-trace. Even if you aren’t interested in learning to use a new library, I recommend skimming the docs page. It contains a good summary and links to relevant research papers and sections from Sutton & Barto’s RL book. It can be a good refresher on things you may have already read or give you some new papers to check out. Personally, I reread a few sections like Q-lambda and the multistep forward view to get a better understanding and learned about Persistent Q-learning for the first time.

This post goes over some basic TRFL usage. In a series of three notebooks I apply Q-learning in the tabular case, deep Q-learning, and double Q-learning. All code is available in Colab notebooks.

Tabular Q-Learning

Colab notebook link

As a refresher, tabular Q-learning is Q-learning in a discrete state space. In this notebook I took the OpenAI Gym CartPole environment and turned the observation space into a number of discrete bins. This lets you build a Q-table of state-action value estimates and apply Q-learning updates to the table:

[Q-learning update rule, image from https://en.wikipedia.org/wiki/Q-learning]
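For reference, the standard tabular update rule that image shows (restated here since the image may not render):

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)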

I indexed the q_table by observation_1 (21 bins), observation_2 (41 bins), and action (2 actions). In code:

# Q(s, a) += alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
q_table[obs_vel_index, obs_angle_index, action] += alpha * (
    reward + gamma * max_q_value - q_table[obs_vel_index, obs_angle_index, action])

The same thing can be done with trfl.qlearning. The linked notebook contains install instructions, the TRFL example from the GitHub repo, and optional cells to render CartPole as the environment runs. The last three cells show how to do tabular Q-learning with TRFL. This isn’t a very practical or likely intended use of TRFL, since TensorFlow is overkill for this example. Still, tabular Q-learning can be done with TRFL because trfl.qlearning exposes the TD-error (the part labelled ‘learned value’ in the image above). The first step is to set up the TensorFlow graph with TRFL:
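A minimal sketch of that graph setup, assuming a batch size of 1 and placeholder names of my own choosing (not necessarily the notebook’s):

import tensorflow as tf
import trfl

num_actions = 2  # CartPole has two discrete actions

# Placeholders for a single transition (batch size 1 in the tabular case).
q_tm1 = tf.placeholder(tf.float32, [1, num_actions])  # Q-values of the current state
a_tm1 = tf.placeholder(tf.int32, [1])                  # action taken
r_t = tf.placeholder(tf.float32, [1])                  # reward received
pcont_t = tf.placeholder(tf.float32, [1])              # discount (0 at terminal states)
q_t = tf.placeholder(tf.float32, [1, num_actions])     # Q-values of the next state

# trfl.qlearning returns (loss, extra); extra.td_error is the TD-error we need.
loss, q_learning = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)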

The second step is to run the TensorFlow session to get the TD-error and update the q-table with that value.
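Roughly, that session step looks like this (q_table is a NumPy array indexed as above; next_vel_index, next_angle_index, done, and the loop variables are assumed names):

import numpy as np

with tf.Session() as sess:
    # Feed one transition and fetch the TD-error from the graph.
    td_error = sess.run(q_learning.td_error, feed_dict={
        q_tm1: q_table[obs_vel_index, obs_angle_index][np.newaxis, :],
        a_tm1: [action],
        r_t: [reward],
        pcont_t: [0.0 if done else gamma],
        q_t: q_table[next_vel_index, next_angle_index][np.newaxis, :]})
    # Tabular update: nudge the stored value toward the TD target.
    q_table[obs_vel_index, obs_angle_index, action] += alpha * td_error[0]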

Deep Q-Learning

Colab notebook link

This is a more practical look at what TRFL can do, since most deep Q-learning needs a function approximator, such as a TensorFlow neural network, which is exactly what TRFL is built to work with. Using CartPole, I modified a publicly available deep Q-learning tutorial to use TRFL. Again, implementing TRFL was as simple as setting up the graph:
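The change inside the network’s graph definition looks roughly like this (attribute names such as self.output, action_size, and learning_rate are assumptions based on the tutorial’s structure, not exact code from the notebook):

# Placeholders for a batch of transitions.
self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
self.rewards_ = tf.placeholder(tf.float32, [None], name='rewards')
self.discounts_ = tf.placeholder(tf.float32, [None], name='discounts')
self.targetQs_ = tf.placeholder(tf.float32, [None, action_size], name='target_qs')

# self.output is the network's Q(s, .); targetQs_ holds Q(s', .) from the target pass.
self.loss, _ = trfl.qlearning(self.output, self.actions_, self.rewards_,
                              self.discounts_, self.targetQs_)
self.loss = tf.reduce_mean(self.loss)
self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)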

And modifying the session:
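The training step then feeds those placeholders from the replay batch (again, mainQN and the batch arrays are assumed names):

loss, _ = sess.run([mainQN.loss, mainQN.opt], feed_dict={
    mainQN.inputs_: states,
    mainQN.actions_: actions,
    mainQN.rewards_: rewards,
    mainQN.discounts_: gamma * (1.0 - dones),  # zero discount at terminal states
    mainQN.targetQs_: target_qs})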

These examples are admittedly simple and maybe a bit overkill. I think the real advantage of TRFL comes from the package’s reliability and from its more advanced methods. RL is notoriously hard, and having a reliable building block that you don’t have to unit test or worry about can make running experiments easier.

Double Q-Learning

Colab notebook link

In double Q-learning, you use two Q-networks to get a better estimate of the next state’s value. Rather than taking the max Q-value of the next state (as in Q-learning), you take the argmax action from one network and use that action to get a Q-value from the other network. Deep Q-Network (DQN) and Double DQN (DDQN) targets for comparison:

  • DQN: target = reward(s,a) + discount*max(Qtarget(s’,a’))
  • DDQN: target = reward(s,a) + discount*Qtarget(s’, argmax(Qtrain(s’,a’)))

This reduces the overestimation error; see the Double DQN paper for more details. Double Q-learning with TRFL only required a few modifications: I changed the TRFL call from qlearning to double_qlearning and added one more parameter (the Q-values to select the argmax from):
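A sketch of the change, continuing the placeholder names used above (the selectorQs_ name is mine; it carries the training network’s Q-values for the next states):

# Extra placeholder: Q-values used to pick the argmax action at s'.
self.selectorQs_ = tf.placeholder(tf.float32, [None, action_size], name='selector_qs')

# double_qlearning selects the action with selectorQs_ and evaluates it with targetQs_.
self.loss, _ = trfl.double_qlearning(self.output, self.actions_, self.rewards_,
                                     self.discounts_, self.targetQs_, self.selectorQs_)
self.loss = tf.reduce_mean(self.loss)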
