Tensorflow Implementation of TD3 in OpenAI Baselines

AurelianTactics
Oct 26, 2018

When I’m looking for new research papers to read, it’s often hard to tell what is worth reading. How reproducible are the results? Will this paper actually have a lasting impact in the field of Reinforcement Learning (RL)? With those concerns in mind, it is nice when a paper like the Twin Delayed Deep Deterministic Policy Gradient algorithm (TD3) is written, used by other RL researchers, and released with code. The algorithm came to my attention through its use in the HIRO paper and a Berkeley AI Research paper/blog post. The authors’ code is written in PyTorch. I’ve done a TensorFlow implementation of TD3 that extends the OpenAI baselines version of Deep Deterministic Policy Gradient (DDPG).

To those of you familiar with deep RL, a one line description of TD3: ‘TD3 does for DDPG what Double Q learning does for DQN.’ Let’s unpack this line a bit in an unnecessarily long blog post.

Background

In Q learning, the RL agent tries to find the optimal actions to take at a given state. The agent repeatedly interacts with the environment, trying different actions and reassessing the value of each state, action pair based on the rewards it receives. The Q learning formula:

Q(s_t, a_t) ← Q(s_t, a_t) + α · [r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

(https://en.wikipedia.org/wiki/Q-learning; s is the state and a is the action at time t, α is the learning rate, and the bracketed term r_t + γ · max_a Q(s_{t+1}, a) is the ‘learned value’)

In tabular Q learning, every state, action pair is given a value and the agent is able to visit each state, action pair frequently enough to assign accurate values. The RL agent can then solve the problem by taking the optimal action in every state. However, in most problems the state, action space is too large for our limited time and computational resources. Instead we use a function approximator (often a neural net in deep RL) to represent the state, action space.

That leads to an issue in Q learning: how certain are we that the Q learning values are correctly showing the optimal state, action pair? We want an action gap between the optimal action and other actions. Function approximation error can keep us from distinguishing between optimal actions and actions that only look good. Double Q learning helps Deep Q Networks mitigate this. The authors of TD3 explain why this doesn’t work for actor-critic methods like DDPG and propose TD3 as a method that does work.

TD3

My main takeaways from TD3:

  • Function approximation error can be mitigated by having a critic-pair network and a critic-pair target network, each of which has two outputs. TD3 uses the minimum of the critic-pair target network’s two outputs to prevent overly optimistic value estimates.
  • Delay updates. Delay policy network (the actor network) updates until the value estimate has converged, i.e. update the actor network less frequently than the critic network. Delay target network updates too.
  • Reduce variance with replay buffer action noise regularization inspired by expected SARSA.

Critic-Pair Network and Loss

Full code for the critic-pair network and the relevant gist:

Note that the difference between the TD3 critic-pair network and the base DDPG critic network is that the critic-pair network has twice the layers and two outputs. These critic-pair layers run in parallel to each other and produce two separate outputs.
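To make the structure concrete, here’s a minimal sketch (mine, not the actual baselines code; layer sizes and names are just examples) of what a critic-pair network looks like in TensorFlow 1.x. Two stacks of fully connected layers take the same state and action inputs, keep separate weights, and produce two Q-value outputs:

```python
import tensorflow as tf  # TensorFlow 1.x style, matching OpenAI baselines

def critic_pair(obs, action, num_hidden=300, scope='critic_pair', reuse=False):
    """Two parallel Q heads: same state/action inputs, separate weights, two outputs."""
    with tf.variable_scope(scope, reuse=reuse):
        x = tf.concat([obs, action], axis=-1)
        # first critic head
        h1 = tf.layers.dense(x, num_hidden, activation=tf.nn.relu, name='q1_fc1')
        h1 = tf.layers.dense(h1, num_hidden, activation=tf.nn.relu, name='q1_fc2')
        q1 = tf.layers.dense(h1, 1, name='q1_out')
        # second critic head, same architecture but separate parameters
        h2 = tf.layers.dense(x, num_hidden, activation=tf.nn.relu, name='q2_fc1')
        h2 = tf.layers.dense(h2, num_hidden, activation=tf.nn.relu, name='q2_fc2')
        q2 = tf.layers.dense(h2, 1, name='q2_out')
    return q1, q2
```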

Aside: recall that this is an extension of OpenAI baselines, so we can use their **network_kwargs syntax in the self.network_builder to make neural nets of various sizes and hidden layers (like num_layers=2 and num_hidden=300 to add two fully connected layers of 300 hidden units each).
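For example (hedged, since the exact plumbing depends on the baselines version), the mlp builder in baselines/common/models.py accepts those keyword arguments:

```python
from baselines.common.models import get_network_builder

# Returns a network function that applies a 2-layer MLP with 300 hidden
# units per layer to an input tensor.
network_fn = get_network_builder('mlp')(num_layers=2, num_hidden=300)
```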

When training the TD3 critic-pair, the base DDPG code only needs a few modifications. The full code is here and a gist highlighting the important parts compared to base DDPG is below.

In DDPG we have a critic network, a critic target network, an actor network, and an actor target network (TD3 replaces the critic network with a critic-pair network and the critic target network with a critic-pair target network). We use the critic target network to create targets for training the critic: it is used to find the next state value (see the term labelled ‘learned value’ in the Q learning formula above). The key lines are 3, 10–12, and 22 (OpenAI baselines has some additional code to implement PopArt and parameter noise, which complicates the code below a bit).

The parts under ‘if self.td3_variant’ are the TD3 code; the ‘else’ branches are the base DDPG code. Line 3 has the two outputs of the critic-pair network. Those are the predicted values of the critic-pair network and are used with line 22 to compute the critic-pair network loss (gist and explanation below). Lines 10–12 are the main contribution of TD3: function approximation error is lowered by taking the minimum of the two outputs of the critic-pair target network. That minimum value is used in line 22 as the target value for the critic loss.
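Sketching what lines 10–12 compute, with illustrative names rather than the exact baselines variables, and assuming the critic-pair target network has already been evaluated at the next state with the actor target network’s action:

```python
import tensorflow as tf  # TensorFlow 1.x style

def td3_target(q1_target_next, q2_target_next, rewards, terminals, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of the two
    critic-pair target outputs (names here are illustrative)."""
    min_q_next = tf.minimum(q1_target_next, q2_target_next)
    # terminals is 1.0 at episode ends, so there is no bootstrapping past a terminal state
    return rewards + (1.0 - terminals) * gamma * min_q_next
```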

Now that we have a critic pair with two outputs instead of a single output, how is the loss handled? Add together the mean squared errors between self.target_Q (line 22 above) and each of the two critic-pair network outputs:

self.normalized_critic_tf and self.normalized_critic_tf2 are the critic-pair network outputs. normalized_critic_target_tf is the normalized value of self.critic_target, which is a placeholder for self.target_Q.
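As a rough sketch of that loss with illustrative names (in the actual code the target values arrive through the self.critic_target placeholder, so no gradient flows through them):

```python
import tensorflow as tf  # TensorFlow 1.x style

def critic_pair_loss(q1_pred, q2_pred, target_q):
    """Both critic heads regress toward the same target; the two
    mean squared errors are summed into a single critic loss."""
    loss_q1 = tf.reduce_mean(tf.square(q1_pred - target_q))
    loss_q2 = tf.reduce_mean(tf.square(q2_pred - target_q))
    return loss_q1 + loss_q2
```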

Delayed Policy and Target Network Updates

The second key point of TD3 is delayed updates, controlled by hyperparameters in two places. One delay is how frequently the actor network is updated compared to the critic-pair network. Full code and relevant gist:
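The idea boils down to something like this training-step sketch (hypothetical names like policy_delay are mine, not the baselines hyperparameter names):

```python
def train_step(sess, step, critic_train_op, actor_train_op, feed_dict, policy_delay=2):
    """Update the critic-pair every step but the actor only every `policy_delay`
    steps (the TD3 paper uses a delay of 2)."""
    sess.run(critic_train_op, feed_dict=feed_dict)
    if step % policy_delay == 0:
        sess.run(actor_train_op, feed_dict=feed_dict)
```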

The second delay is how frequently the actor target network and critic-pair target networks are updated. Updating target networks with the trained network was one of the key parts of the original DQN paper. That paper used hard updates where the target network parameters were fully replaced by the trained network parameters every x steps. Here we use soft updates where the target network parameters are gradually updated. The target networks slowly take the values of the trained networks based on the parameter tau (tau is between 0 and 1). At each update step, the target networks retain 1-tau of their current parameter values and gain tau of the trained network parameter values. Full code and relevant gist:
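A minimal sketch of those soft updates, assuming matching lists of trained and target variables (names and the tau value are illustrative):

```python
import tensorflow as tf  # TensorFlow 1.x style

def make_soft_update_op(trained_vars, target_vars, tau=0.005):
    """target <- (1 - tau) * target + tau * trained, applied variable by variable."""
    updates = [tf.assign(tgt, (1.0 - tau) * tgt + tau * src)
               for src, tgt in zip(trained_vars, target_vars)]
    return tf.group(*updates)
```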

The delayed update can be run with base DDPG and does not require a TD3 critic-pair model be used.

Replay Buffer Action Noise Regularization

The third and final key part of the paper is the TD3 regularization method. When training the critic, samples are taken from the replay buffer. The TD3 regularization takes the stored action values from the replay buffer, adds some noise to the actions, and then trains with the noisy actions. The idea from the paper is that “similar actions should have similar values.” TD3 adds optional hyperparameters for the standard deviation and clip range of this Gaussian noise. TD3 regularization can be run with base DDPG and does not require a TD3 critic-pair model be used. Full code and relevant gist:

The ‘else’ branch is base DDPG. Lines 2 and 3 create and clip the noise based on the hyperparameters, and on line 8 the noise is added to the batch['actions'] sampled from the replay buffer.
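A sketch of that step, assuming a NumPy batch of actions and hypothetical names for the noise hyperparameters:

```python
import numpy as np

def smooth_buffer_actions(actions, noise_std=0.2, noise_clip=0.5,
                          action_low=-1.0, action_high=1.0):
    """Add clipped Gaussian noise to actions sampled from the replay buffer,
    then clip back to the valid action range (hyperparameter names are illustrative)."""
    noise = np.clip(np.random.normal(0.0, noise_std, size=actions.shape),
                    -noise_clip, noise_clip)
    return np.clip(actions + noise, action_low, action_high)
```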

Aside: There are other ways noise can be used in baselines’ DDPG. Normal or OU noise can be added to actions to aid exploration. When selecting actions to test in the environment, action space noise can be added to the policy actions. Orthogonally, parameter space noise can be added to the parameter values of the policy network to also help with exploration.

Results

Ideally I would run OpenAI’s baselines DDPG, the authors’ TD3 implementation, and my implementation of TD3 on the MuJoCo environments mentioned in the paper. I’d run experiments across multiple seeds, perform hyperparameter searches, ablate various parts (how do the three TD3 parts perform independently?), and test untried interactions (how does PopArt/parameter noise function with the three TD3 parts?). However, I do not have a MuJoCo license and performing the full experiment properly would take a substantial investment of time and computing power.

Instead I tested my TD3 implementation on OpenAI gym’s Pendulum and MountainCarContinuous. Pendulum is not a solvable game, but I considered it essentially ‘solved’ when average episode rewards were greater than -200, and MountainCarContinuous solved when average scores were greater than 90. Some rough results follow that do not include multiple random seeds, hyperparameter search, or an attempt to compare like hyperparameters to like. These are simply some results to show that all three implementations work and to note that hyperparameter search is necessary to get TD3 delayed updates and TD3 regularization to improve results.

Baselines DDPG solves Pendulum in roughly 32,000 steps and the authors’ TD3 in roughly 10,000 steps; my TD3 implementation took about 62,000 steps. MountainCarContinuous was solved by baselines DDPG in about 36,000 steps, not solved with the authors’ TD3 default settings (though solved with highly unoptimized hyperparameters in roughly 300,000 steps), and solved by my implementation in 32,000 steps.

My implementation of TD3 performed best when using the TD3 critic-pair network and about the same when optionally using the delayed updates. The unoptimized TD3 regularization did not help, and performance was significantly worse when those hyperparameters were set too large. Hyperparameter optimization is necessary to get the best results with the TD3 delayed updates and TD3 regularization, and to tell whether those two optional methods help results at all.

Pendulum results (figure)
