Introduction to Deterministic Policy Gradient (DPG)

Cheng Xi Tsou · Published in Geek Culture · Aug 26, 2021

In this post, I will be exploring the concepts from the paper Deterministic Policy Gradient Algorithms (Silver et al.), implementing the COPDAC (Compatible Off-Policy Deterministic Actor-Critic) algorithm proposed in the paper, and training the agent on the continuous-control MountainCar environment.

Taken from OpenAI gym documentation

Introduction

In the paper Deterministic Policy Gradient Algorithms, Silver et al. propose a new class of algorithms for dealing with continuous action spaces. The paper derives the deterministic policy gradient theorem from first principles. An advantage of the deterministic policy gradient is that, compared to the stochastic policy gradient, the deterministic version is simpler and can be computed more efficiently. The paper then proposes on-policy and off-policy Actor-Critic algorithms that make use of the deterministic policy gradient. The paper builds on several concepts I’ve written about, such as Policy Gradients, Actor-Critic methods, and Off-Policy Actor-Critic, so I will not go into detail about them as they come up.

Deterministic Policy Gradient Theorem

Similar to the stochastic policy gradient, our goal is to maximize a performance measure function J(θ) = E[r_γ | π], the expected total discounted reward following policy π, where θ denotes the parameters of the policy we are updating. Here is an expanded version of the performance measure function, where we integrate rewards over the state distribution ρ and over the actions taken in each state.

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)
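Written out in my own notation (so take the exact symbols with a grain of salt), the stochastic objective is roughly:

```latex
J(\pi_\theta) = \int_S \rho^{\pi}(s) \int_A \pi_\theta(s, a)\, r(s, a)\, da\, ds
              = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\big[\, r(s, a) \,\big]
```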

For a deterministic policy, this is a bit different. Unlike a stochastic policy, we define our policy as a function μ_θ: S → A that maps a state in the state space S to an action in the action space A, following parameters θ. Here, instead of taking an integral over actions, we only need to integrate over the state space, since the action is deterministic. We can then write this as an expectation, where we sample a state from the state distribution ρ.

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)
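In my notation, the deterministic version drops the inner integral over actions:

```latex
J(\mu_\theta) = \int_S \rho^{\mu}(s)\, r\big(s, \mu_\theta(s)\big)\, ds
              = \mathbb{E}_{s \sim \rho^{\mu}}\big[\, r\big(s, \mu_\theta(s)\big) \,\big]
```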

The majority of model-free learning algorithms are based on policy iteration, a general framework that alternates between policy evaluation and policy improvement. Policy evaluation estimates an action-state function Q, and policy improvement updates the policy with respect to this function. An example of this is how we build a policy in Q-learning: after learning a policy evaluation function Q, we build a policy using argmax to select the action with the highest Q value.

For the deterministic policy, we will do something similar. Since our action space is continuous, we move our policy in the direction of the Q function’s gradient. Instead of maximizing Q by taking the argmax action, the policy converges towards a maximum of the Q value. We can write this gradient as an expectation:

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)

Then, we can apply the chain rule once, so that the gradient of Q is taken with respect to the action a and the gradient of μ with respect to the parameters θ:

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)

Here’s the expanded form of the deterministic policy gradient:

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)
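Putting it together, my transcription of the deterministic policy gradient theorem is roughly:

```latex
\nabla_\theta J(\mu_\theta)
  = \int_S \rho^{\mu}(s)\, \nabla_\theta \mu_\theta(s)\,
    \nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)}\, ds
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\,
    \nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)} \right]
```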

The proof of the deterministic policy gradient theorem is similar in structure to the proof of the stochastic policy gradient theorem detailed in (Sutton et al. 1999). Looking at the deterministic policy gradient, you may notice that it closely resembles the stochastic policy gradient. The paper introduces a theorem showing that the deterministic policy gradient is really just a special case of the stochastic policy gradient.

Recall the policy gradient theorem:

Taken from Sutton & Barto, 2017
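For comparison, the stochastic policy gradient theorem reads roughly:

```latex
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\!\left[
      \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \right]
```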

We can parameterize the stochastic policy π with the deterministic policy μ_θ and a variance parameter σ, similar to a typical policy parameterization for continuous actions. The theorem states that as σ approaches 0, the stochastic policy gradient converges to the deterministic policy gradient.

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)

The proof for this theorem can be found in Appendix C of the paper. This theorem is important as it shows that other techniques such as compatible function approximation, natural policy gradients, and experience replay can be applied to deterministic policy gradients.

Deterministic Policy Gradient Algorithms

With the deterministic policy gradient, we can derive both on-policy and off-policy Actor-Critic algorithms. The paper begins with a simple on-policy algorithm that uses SARSA to update the Critic. As in stochastic Actor-Critic methods, we have an Actor that updates the policy, which in this case is deterministic, and a Critic that approximates the true action-state function Q by learning a set of parameters. Here are the update equations for the algorithm:

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)
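In my notation, these updates are roughly (α_w and α_θ are the Critic and Actor learning rates):

```latex
\delta_t = r_t + \gamma\, Q^{w}(s_{t+1}, a_{t+1}) - Q^{w}(s_t, a_t)          % SARSA TD error
w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^{w}(s_t, a_t)                % Critic
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\,
               \nabla_a Q^{w}(s_t, a_t)\big|_{a=\mu_\theta(s)}                % Actor
```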

Similar to the on-policy algorithm, we can also use the deterministic policy gradient for an off-policy algorithm. Instead of using trajectories sampled from the policy, we can use a stationary behavior policy to get our actions. We can derive the expectation of the performance measure gradient in a similar way to a stochastic off-policy Actor-Critic which I wrote about in my previous post:

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)

This is the same expectation as the on-policy gradient, except the states are sampled from a separate behavior policy β. Note that the policy gradient does not use an importance weight since we don’t take the integral over actions. We can also avoid the importance weight for the Critic by using Q-learning:

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)
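In my notation, the Q-learning Critic and Actor updates are roughly the same as before, except the Critic bootstraps from the target policy’s action μ(s′) rather than the behavior policy’s next action:

```latex
\delta_t = r_t + \gamma\, Q^{w}\big(s_{t+1}, \mu_\theta(s_{t+1})\big) - Q^{w}(s_t, a_t)   % Q-learning TD error
w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^{w}(s_t, a_t)
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\,
               \nabla_a Q^{w}(s_t, a_t)\big|_{a=\mu_\theta(s)}
```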

There is still a big problem with the two algorithms proposed above. We substituted the action-state function in the deterministic policy gradient with an approximation, but this does not guarantee that the approximated gradient will follow the true gradient. The paper therefore proposes a class of compatible function approximators that ensure the approximation follows the true gradient.

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)
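Paraphrasing the theorem (in my notation), the two conditions are roughly: the critic’s action-gradient at a = μ(s) must be linear in w through the features ∇_θμ(s), and w must minimize the mean-squared error between that gradient and the true one:

```latex
\text{1)}\quad \nabla_a Q^{w}(s, a)\big|_{a=\mu_\theta(s)} = \nabla_\theta \mu_\theta(s)^{\top} w

\text{2)}\quad w \ \text{minimizes}\ \ MSE(\theta, w)
      = \mathbb{E}\big[\, \epsilon(s;\theta,w)^{\top}\epsilon(s;\theta,w) \,\big],
\quad \epsilon(s;\theta,w) = \nabla_a Q^{w}(s,a)\big|_{a=\mu_\theta(s)}
      - \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}
```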

Here is the proof for the theorem:

First, if w minimizes the MSE, then the gradient of the MSE with respect to w is 0. We know this because if w minimizes the error, then w must be at a minimum, and therefore the gradient is 0. Then, we have the following:

For lines (1) to (3), we expand the gradient of MSE. Then, we show that ∇_w ϵ(s, w) = ∇_θ μ(s, θ). We expand the gradient of ϵ in (4) and (5), then use condition 1 of the theorem for (6).

Now, we can plug in ∇μ in (8) and expand ϵ in (9). Then, we distribute ∇μ in (10), showing that an approximator Q_w satisfying the two conditions is compatible with the deterministic policy gradient theorem. Note that the compatible function approximator works off-policy too!
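Compactly, the whole argument is roughly (using condition 1, which gives ∇_w ϵ = ∇_θμ):

```latex
\nabla_w MSE = 0
\;\Rightarrow\; \mathbb{E}\big[\nabla_\theta\mu_\theta(s)\,\epsilon(s;\theta,w)\big] = 0
\;\Rightarrow\; \mathbb{E}\big[\nabla_\theta\mu_\theta(s)\,\nabla_a Q^{w}(s,a)\big|_{a=\mu_\theta(s)}\big]
              = \mathbb{E}\big[\nabla_\theta\mu_\theta(s)\,\nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\big]
              = \nabla_\theta J(\mu_\theta)
```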

The paper then proposes a form for the approximator Q_w that satisfies the two conditions of the theorem.

Here, we have the base form for the action-state function. The components of the function are simple: the advantage function A estimates the advantage of choosing action a over μ(s) and the value function V is the value of the state s. For the value function, we can just use a linear approximator with parameters v and state features ϕ(s).

For the advantage function, we can also use a linear approximator with parameters w and define the action-state features ϕ(s, a).

The action-state features here are proportional to the advantage of taking action a over μ(s), and also satisfy condition 1 of the compatibility theorem. Here is the complete form for the approximator:
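Written out in my notation (so treat the exact symbols as approximate), the pieces combine as:

```latex
\phi(s, a) = \nabla_\theta \mu_\theta(s)\,\big(a - \mu_\theta(s)\big)              % action-state features
A^{w}(s, a) = \phi(s, a)^{\top} w                                                   % advantage
V^{v}(s) = \phi(s)^{\top} v                                                         % value baseline
Q^{w}(s, a) = \big(a - \mu_\theta(s)\big)^{\top} \nabla_\theta \mu_\theta(s)^{\top} w + \phi(s)^{\top} v
```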

Note that given an action space of size m and n policy parameters, ∇μ(s) will be an n × m Jacobian matrix, ϕ(s, a) will be an n × 1 vector, and w will also be an n × 1 vector. Now we must satisfy condition 2 of the theorem: w must minimize the mean-squared error between the gradient of Q_w and the true gradient. The problem here is similar to that of learning a value function. Since unbiased samples of the true gradient are too hard to obtain, we will just learn Q_w with standard approximation methods such as SARSA or Q-learning. The paper does acknowledge that although such an approximation will not exactly minimize the mean-squared error, it can still approximately satisfy the second condition.

With all this, the paper proposes a compatible off-policy deterministic Actor-Critic algorithm (COPDAC-Q) that uses a Q-learning critic:

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)
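In my notation, the per-step COPDAC-Q updates are roughly (α_θ, α_w, α_v are the actor, advantage, and value learning rates):

```latex
\delta_t = r_t + \gamma\, Q^{w}\big(s_{t+1}, \mu_\theta(s_{t+1})\big) - Q^{w}(s_t, a_t)
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta\mu_\theta(s_t)\,\big(\nabla_\theta\mu_\theta(s_t)^{\top} w_t\big)
w_{t+1} = w_t + \alpha_w\, \delta_t\, \phi(s_t, a_t)
v_{t+1} = v_t + \alpha_v\, \delta_t\, \phi(s_t)
```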

A problem you may notice is that off-policy Q-learning with function approximation tends to diverge. An improvement is to use gradient temporal-difference methods, which guarantee convergence in the off-policy setting, the same thing we did for Off-Policy Actor-Critic. The paper then proposes another algorithm, COPDAC-GQ, that uses a Gradient Q-learning critic:

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)
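Roughly, COPDAC-GQ keeps the same actor update but adds a second set of weights u for the gradient-TD correction (this is my transcription, so double-check it against the paper):

```latex
\delta_t = r_t + \gamma\, Q^{w}\big(s_{t+1}, \mu_\theta(s_{t+1})\big) - Q^{w}(s_t, a_t)
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta\mu_\theta(s_t)\,\big(\nabla_\theta\mu_\theta(s_t)^{\top} w_t\big)
w_{t+1} = w_t + \alpha_w\, \delta_t\, \phi(s_t, a_t)
          - \alpha_w\, \gamma\, \phi\big(s_{t+1}, \mu_\theta(s_{t+1})\big)\,\big(\phi(s_t, a_t)^{\top} u_t\big)
v_{t+1} = v_t + \alpha_v\, \delta_t\, \phi(s_t)
          - \alpha_v\, \gamma\, \phi(s_{t+1})\,\big(\phi(s_t, a_t)^{\top} u_t\big)
u_{t+1} = u_t + \alpha_u\, \big(\delta_t - \phi(s_t, a_t)^{\top} u_t\big)\, \phi(s_t, a_t)
```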

Finally, the paper uses a natural policy gradient, which chooses the steepest gradient ascent direction with respect to the Fisher information metric. We use a deterministic version of the Fisher information metric, which is a special limiting case of the stochastic version when the stochastic policy’s variance approaches 0.

Taken from Deterministic Policy Gradient Algorithms (Silver et al. 2014)

Luckily, with our compatible function approximator, we can easily see that the steepest ascent direction with respect to the Fisher information metric is simply w.
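To see why, note that with the compatible approximator the policy gradient and the deterministic Fisher metric M_μ(θ) share the same outer product of ∇_θμ, so roughly:

```latex
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}\big[\nabla_\theta\mu_\theta(s)\,\nabla_\theta\mu_\theta(s)^{\top}\big]\, w
  = M_\mu(\theta)\, w
\;\Rightarrow\; M_\mu(\theta)^{-1}\, \nabla_\theta J(\mu_\theta) = w
```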

Implementing COPDAC-Q

Now, let’s try to implement DPG! I will be implementing COPDAC-Q with a natural policy gradient. The implementation is done in PyTorch, and we will be using linear approximations for the policy, the Q function, and the value function.
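As a rough sketch (my own simplified layout, not the exact code from the notebook), the three linear approximators look something like this; with a linear policy, ∇_θμ(s) is just the state features, so the COPDAC-Q updates above reduce to a few lines of tensor arithmetic on theta, w, and v:

```python
import torch

STATE_DIM = 2  # MountainCar state: [position, velocity]

# Linear parameters (names are mine):
theta = torch.zeros(STATE_DIM)  # policy:    mu(s)   = theta . phi(s)
v = torch.zeros(STATE_DIM)      # value:     V(s)    = v . phi(s)
w = torch.zeros(STATE_DIM)      # advantage: A(s, a) = (a - mu(s)) * (grad_theta mu(s) . w)

def phi(s):
    return s  # state features; in the full implementation these are normalized states

def mu(s):
    return theta @ phi(s)  # scalar action

def q(s, a):
    # Compatible critic: Q(s, a) = (a - mu(s)) * (grad_mu . w) + V(s),
    # where grad_mu = phi(s) for a linear policy
    return (a - mu(s)) * (phi(s) @ w) + phi(s) @ v

# Example evaluation at an arbitrary (state, action) pair
s = torch.tensor([-0.5, 0.0])
print(mu(s), q(s, torch.tensor(0.3)))
```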

Firstly, a quick introduction to the MountainCar environment. The agent is a car starting at the bottom of a valley, and it wants to reach the green flag at the top of the hill. The state space is continuous (car position and car velocity), and the action space is also continuous: a power coefficient used to calculate the driving force. The reward is 100 for reaching the flag, and, unlike the paper’s version where each step simply yields a reward of -1, the agent is penalized for energy usage at each step. The environment is considered solved when the agent achieves an average score of 90 over 100 episodes.
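For reference, here is a quick look at the environment’s spaces (assuming the standard gym id MountainCarContinuous-v0):

```python
import gym

env = gym.make("MountainCarContinuous-v0")
print(env.observation_space)  # Box(2,): [car position, car velocity]
print(env.action_space)       # Box(1,): power coefficient in [-1.0, 1.0]
```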

Here are some of the hyperparameters used:

  • γ (discount factor) = 0.99
  • MAX_STEPS (per episode) = 5000
  • SOLVED_SCORE = 90
  • α_θ = 0.005
  • α_w = 0.03
  • α_v = 0.03
  • BATCH_SIZE = 8
  • NumPy random seed = 0
  • PyTorch random seed = 0

Something interesting is that the random seed here actually does matter: I wasn’t able to get my implementation to work for a long time, but changing the random seed made it work.

For this implementation, we will take advantage of experience replay and do a batch update every 10 steps. The behavior policy will simply sample randomly from the action space. We will also normalize the states using the mean and standard deviation obtained from sampling 10,000 random states. Note that we choose a higher learning rate for policy evaluation, since we want the critic to converge faster than the policy so that the policy is moved in the correct direction.
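The normalization step looks something like this (a sketch, assuming we sample states uniformly from the observation space):

```python
import gym
import numpy as np

env = gym.make("MountainCarContinuous-v0")

# Sample 10,000 random states and compute their mean/std for normalization
samples = np.array([env.observation_space.sample() for _ in range(10000)])
state_mean, state_std = samples.mean(axis=0), samples.std(axis=0)

def normalize(state):
    # Scale raw [position, velocity] observations to roughly zero mean, unit variance
    return (state - state_mean) / (state_std + 1e-8)
```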

Let’s first train our agent for 100 episodes with a batch size of 1. Here’s the training history:

After 100 episodes, our agent was able to achieve an average score of 82.06 over 10 playthroughs with a variance of 0.009. As the algorithm is deterministic, the low variance makes sense, since the only uncertainty is the agent’s random starting state. The grey line is the training score from the behavior policy, which does randomly get lucky, and the blue line is our target policy score. A fascinating thing is that even though the policy initially goes in the wrong direction around episode 17/18, it is still able to recover extremely quickly. I hypothesize that this is around the time when the Q and value functions start to converge, so the policy can learn fast. Here’s what it looks like:

Now let’s train our agent with a batch size of 8. Here’s the training history:

Our agent was able to achieve an average score of 90.10 with a variance of 0.0203. Looking at the score history, it may seem like a mistake. The target policy starts at a high score of nearly 100 after 1 episode. However, after looking at the trained parameters and making sure there were no NaN values, I tried to retrain the agent for 1 episode. Surprisingly, the agent had an average score of 98 ± 0.03 and can solve the MountainCar environment in just 1 episode! Here’s what it looks like:

Here’s the interesting thing. For the 1 episode agent, the cart takes a much longer time to reach the flag but has a higher score. If we were to evaluate this with the DPG paper’s metric, then this would have a lower score as it takes a longer time to reach the flag. However, the cart uses less energy here to build up momentum to reach the flag, resulting in a higher score with energy usage penalties. For the 100 episode agent, I would guess that the agent overtrains and diverges to a suboptimal solution as a result of using a Q-learning critic.

Another thing to note is that my results were not always consistent; in some trials the agent was not able to solve the environment. After modifying the reward to also reward the change in mechanical energy, ΔME = Δ(0.5mv² + mgh), the agent was able to solve the environment consistently.
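Concretely, the shaping term looks something like this (a sketch; the sin(3x) height profile and the gravity constant are my assumptions about MountainCar’s physics rather than exact values):

```python
import numpy as np

def mechanical_energy(state, gravity=0.0025):
    # 0.5*m*v^2 + m*g*h with the mass folded into the constants;
    # the hill height is assumed to vary as sin(3 * position)
    position, velocity = state
    return 0.5 * velocity**2 + gravity * np.sin(3 * position)

def shaped_reward(reward, state, next_state):
    # Reward the change in mechanical energy between consecutive steps
    return reward + (mechanical_energy(next_state) - mechanical_energy(state))
```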

Final Thoughts

I was pretty excited to tackle this implementation, as I wanted to replicate the results from the DPG paper, and DPG is also an important component of the DDPG algorithm. In the end, I was able to successfully solve the MountainCar environment with my PyTorch implementation using only linear approximations. Something I learned in this process is that the random seed does matter: my implementation was not working for a long time until I changed the seed. A change I would make in the future is to aggregate results from different seeds, since doing everything on a single seed may give biased results. Intuitively, you would think that an algorithm shouldn’t be dependent on the random seed, but that is the nature of Reinforcement Learning. In the next post, I’ll take concepts from DPG and DQN and explore DDPG, an algorithm that uses deep learning.

Code: https://github.com/chengxi600/RLStuff/blob/master/Deterministic%20Policy%20Gradients/COPDAC-Q.ipynb

References:

  • Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. ICML 2014.
  • Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 1999.
  • Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd ed., draft).