Reinforcement Learning: Introduction to Policy Gradients

Cheng Xi Tsou
Published in Nerd For Tech · Jul 14, 2021 · 9 min read

In the previous posts, I have been working with one form of reinforcement learning, Q-learning, where the agent finds an optimal policy that maximizes the total reward over a trajectory of states. I later expanded on Q-learning with deep neural networks and explored an implementation of DQN. In this post, I want to explore a different approach to reinforcement learning called policy gradients, which aims to optimize the policy directly. Unlike Q-learning, the agent does not select the best action based on state-action values but instead samples from a probability distribution over actions. I will be showing the proof of the policy gradient theorem and a naive algorithm, REINFORCE (Williams, 1992), that uses this derivation. Surprisingly, Williams was a professor at Northeastern University, where I’m studying right now!

Introduction

First, let's define some terms. Unlike an epsilon-greedy algorithm, which chooses the max-value action with some noise, here we select an action by sampling from the current policy: π(a | s, θ) = Pr{Aₜ = a | Sₜ = s, θₜ = θ}, the probability of taking action a at timestep t given that the state is s and the policy parameters are θ.

Now, the agent will learn the policy based on the gradient of a performance measure function J(θ) with respect to θ. We will be using gradient ascent to adjust the policy parameters to find the optimal policy: θₜ₊₁ = θₜ + α∇J(θₜ).

When dealing with the policy function π, we can preserve exploration by ensuring that π(a | s, θ) ∈ (0, 1), so the policy never becomes fully deterministic. One common parameterization learns a numerical preference h(s, a, θ) for each state-action pair and passes the preferences through a softmax to obtain a probability distribution:
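In symbols (following Sutton & Barto's notation for action preferences):

π(a | s, θ) = e^h(s, a, θ) / Σ_b e^h(s, b, θ)

where the denominator sums over all actions b, so the probabilities for each state sum to 1.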

There are some advantages to selecting actions according to a softmax over action preferences rather than with an epsilon-greedy strategy. First, action preferences allow the agent to approach a deterministic policy, whereas epsilon-greedy must maintain some degree of noise to encourage exploration. Action preferences are also focused on finding an optimal stochastic policy rather than on estimating action values. This means that if the optimal policy is deterministic, the probability of the best action will be driven toward 1, and if the optimal policy is stochastic, the action preferences will settle into the appropriate probability distribution. With an epsilon-greedy strategy, there is no natural way to represent a stochastic policy.

Another advantage is that the action probabilities change smoothly as a function of the policy parameters. With an epsilon-greedy strategy, a small change in an estimated Q value can flip which action has the maximum value and therefore abruptly change the action that is selected. This can dramatically overestimate the importance of a selected action and lead to convergence on a suboptimal policy.

Now let’s see how the agent can learn the parameters of an optimal policy. Remember that we will be adjusting our parameters based on a performance measure function J(θ). This brings us to the policy gradient theorem, which lets us write the gradient of J(θ) with respect to θ without needing the derivative of the state distribution.
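In symbols, the theorem states that

∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π(a | s, θ)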

Taken from Sutton & Barto, 2017

μ(s) here is the on-policy state distribution under our stochastic policy π, q_π is the action-value function for policy π, and π(a | s, θ) is the probability of taking action a in state s under parameters θ.

This removes the need to know how the state distribution of an unknown environment changes with the policy parameters. The following proof (Sutton & Barto, 2017, Section 13.2) is a detailed walkthrough, with help from Lilian Weng’s article on policy gradient algorithms.

Derivation of Policy Gradient Theorem

Define J(θ) = v_πθ(s₀), the value of the start state under the policy with parameters θ. The derivation below works with the gradient of v_π at a generic state s and specializes to the start state at the end. Then we have:
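Written out (the line numbers here match the ones referenced below):

∇J(θ) = ∇v_π(s)                                   (1)
      = ∇[ Σ_a π(a | s, θ) q_π(s, a) ]            (2)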

We can expand the value function into the form of an expectation: a sum, over all actions in the action space, of the probability of each action under policy π multiplied by the value of that state-action pair under π.

Then in (3), we apply the product rule, d/dx[f(x)g(x)] = f′(x)g(x) + f(x)g′(x). In (4), we expand the q function. As stated before, the q function gives us a value for a state-action pair. If the environment is stochastic, then the next state is also stochastic, so we multiply the probability of reaching new state s′ and receiving reward r given the state-action pair by the reward plus the value of the next state s′, and we sum over all possible next states and rewards. This gives the expected value over the successor states. Notice that this expansion reflects a fundamental concept in reinforcement learning: we must account for the immediate reward as well as future utility.
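In symbols, steps (3) and (4) are:

= Σ_a [ ∇π(a | s, θ) q_π(s, a) + π(a | s, θ) ∇q_π(s, a) ]                                     (3)
= Σ_a [ ∇π(a | s, θ) q_π(s, a) + π(a | s, θ) ∇ Σ_{s′, r} p(s′, r | s, a)( r + v_π(s′) ) ]     (4)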

We can simplify p(s′, r | s, a) to p(s′ | s, a), since summing p(s′, r | s, a) over all rewards r gives p(s′ | s, a). We can also move the gradient past p(s′, r | s, a), because the transition probabilities do not depend on θ (the gradient is with respect to θ). The reward term drops out as well, since r is a constant with respect to θ.
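After these simplifications, the expansion becomes:

∇v_π(s) = Σ_a [ ∇π(a | s, θ) q_π(s, a) + π(a | s, θ) Σ_{s′} p(s′ | s, a) ∇v_π(s′) ]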

Recall line (1) of our derivation, ∇J(θ) = ∇v_π(s). Notice the similar term ∇v_π(s′) at the end of our expanded equation: ∇v_π(s) is recursive. Let's expand the recursion a bit.

As you can see, in line (8) we replaced the term ∇v_π(s′) with the same expansion we derived so far, this time applied to the transition from s′ to s′′. We can define this transition probability to simplify the equation a bit.
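Following Lilian Weng's notation, we write this transition as:

Pr(s → s′, k, π)

For example, Pr(s → s, k=0, π) = 1, and Pr(s → s′, k=1, π) = Σ_a π(a | s, θ) p(s′ | s, a).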

The function above is the probability of transitioning from state s to state s′ in k steps while following policy π. We can then rewrite steps (6) and (7) using this notation, remembering to keep the summation over all possible next states s′.
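With this notation, and writing ϕ(s) = Σ_a ∇π(a | s, θ) q_π(s, a) for the non-recursive term (the shorthand that appears later in the proof), the expansion becomes:

∇v_π(s) = ϕ(s) + Σ_{s′} Pr(s → s′, k=1, π) ∇v_π(s′)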

We can simplify even further. If we distribute the terms and sum over the intermediate state s′, we can combine Pr(s → s′, k=1) with Pr(s′ → s′′, k=1) to get Pr(s → s′′, k=2).

After repeatedly unrolling the recursive bit, we can write this as a summation.
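In symbols, the unrolled recursion is:

∇v_π(s) = Σ_{x ∈ S} Σ_{k=0}^{∞} Pr(s → x, k, π) ϕ(x)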

Here we sum, over all reachable states x and over all numbers of steps k, the probability of transitioning from starting state s to state x in k steps, weighted by ϕ(x). Note that at k = 0, Pr(s → s, k=0, π) is trivially 1, so we have not left out the first ϕ(s) term.

Now we can move towards the conclusion of the proof.
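Sketching these final steps (the equation numbering in the original derivation is only approximated here):

∇J(θ) = ∇v_π(s₀)
      = Σ_s Σ_{k=0}^{∞} Pr(s₀ → s, k, π) ϕ(s)
      = Σ_s η(s) ϕ(s)
      = ( Σ_{s′} η(s′) ) Σ_s [ η(s) / Σ_{s′} η(s′) ] ϕ(s)
      ∝ Σ_s μ(s) ϕ(s),    where μ(s) = η(s) / Σ_{s′} η(s′)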

On line (12), we have the probability of reaching end state s from starting state s₀ in any number of steps k, which we abbreviate as η(s). Then on line (14), we normalize η(s) so that it sums to 1 over states, turning it into a proper probability distribution. This is exactly μ(s), the on-policy state distribution under our stochastic policy. When we expand the ϕ(s) function, we have proven the relation:
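∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π(a | s, θ)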

Taken from Sutton & Barto, 2017

REINFORCE algorithm

Now, with the policy gradient theorem, we can come up with a naive algorithm that uses gradient ascent to update our policy parameters. The theorem gives a sum over all states and actions, but when we update our parameters we will only be using a sample gradient, since there is no practical way to compute the gradient over every possible state and action. We can still write this as an expectation, because the expected value of the sample gradient is equal to the actual gradient. Thus we have the following:
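In symbols (the numbering matches the references below):

∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π(a | s, θ)          (1)
      = E_π[ Σ_a q_π(Sₜ, a) ∇π(a | Sₜ, θ) ]           (2)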

Notice that we have replaced s with Sₜ in (2). This is because we are formulating the equation in terms of a sample gradient: instead of summing over all states weighted by μ(s), we take an expectation over the states Sₜ visited while following π. Now, let's also replace the summation over all actions with a sample action Aₜ.

In (3), we multiply and divide by π(a | Sₜ, θ) so that each action's value is weighted by the probability of selecting that action. Then in (4), we can drop the summation over all actions and replace a with a sample action Aₜ, because the sum has become an expectation over actions drawn from π. In (5), we replace the state-action value q_π(Sₜ, Aₜ) with Gₜ, the cumulative discounted return from timestep t, since its expected value given Sₜ and Aₜ is exactly q_π(Sₜ, Aₜ). Finally, we can simplify the expression using the fact that ∇ln π = ∇π / π (from d/dx[ln(x)] = 1/x).
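Sketching these steps in symbols:

= E_π[ Σ_a π(a | Sₜ, θ) q_π(Sₜ, a) ∇π(a | Sₜ, θ) / π(a | Sₜ, θ) ]     (3)
= E_π[ q_π(Sₜ, Aₜ) ∇π(Aₜ | Sₜ, θ) / π(Aₜ | Sₜ, θ) ]                    (4)
= E_π[ Gₜ ∇π(Aₜ | Sₜ, θ) / π(Aₜ | Sₜ, θ) ]                              (5)
= E_π[ Gₜ ∇ln π(Aₜ | Sₜ, θ) ]                                           (6)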

We can then use the term inside the expectation to perform stochastic gradient ascent on our policy parameters. This update gives us the naive REINFORCE algorithm.
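The algorithm, paraphrased, looks roughly like this:

REINFORCE: Monte Carlo policy gradient (episodic)
  Input: a differentiable policy parameterization π(a | s, θ); step size α > 0
  Initialize the policy parameters θ
  Loop forever (for each episode):
    Generate an episode S₀, A₀, R₁, …, S_{T−1}, A_{T−1}, R_T following π(· | ·, θ)
    For each step t = 0, 1, …, T−1:
      G ← Σ_{k=t+1}^{T} γ^{k−t−1} Rₖ
      θ ← θ + α γᵗ G ∇ln π(Aₜ | Sₜ, θ)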

Taken from Sutton & Barto, 2017

Notice that this algorithm is a Monte Carlo method: we do not update the parameters at every timestep, but instead sample a whole trajectory under a fixed parameter θ and then perform the updates episodically. In the pseudocode above, taken from the textbook (Sutton & Barto, 2017), the update line is slightly awkward in that the parameter being updated is the same one used to compute the policy log probabilities during the episode. In our implementation, we can simply accumulate the gradient over the episode and apply the update all at once at the end. Now let’s get to the implementation!

Implementing REINFORCE

For my implementation of REINFORCE, I decided to use PyTorch instead of Keras, as I wanted to explore other machine learning libraries. For the policy parameterization, I used a neural network so that autograd can compute the gradients for me. Note that I add a negative sign in front of the objective so that the optimizer's gradient descent performs gradient ascent. Here’s my network:

import torch
import torch.nn as nn
import torch.nn.functional as F

#Using a neural network to learn our policy parameters
class PolicyNetwork(nn.Module):

    #Takes in observations and outputs actions
    def __init__(self, observation_space, action_space):
        super(PolicyNetwork, self).__init__()
        self.input_layer = nn.Linear(observation_space, 128)
        self.output_layer = nn.Linear(128, action_space)

    #forward pass
    def forward(self, x):
        #input states
        x = self.input_layer(x)

        #relu activation
        x = F.relu(x)

        #action logits
        actions = self.output_layer(x)

        #get softmax for a probability distribution
        action_probs = F.softmax(actions, dim=1)

        return action_probs
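For context, here is a minimal sketch of how the episode loop and update could look with this network. The environment name, hyperparameters, and helper structure are my own assumptions for illustration and are not necessarily identical to the notebook linked at the end (it assumes the older Gym reset/step API):

import gym
import torch

#hypothetical setup for illustration
env = gym.make('CartPole-v0')
network = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
optimizer = torch.optim.Adam(network.parameters(), lr=0.01)
GAMMA = 0.99
NUM_EPISODES = 1000

for episode in range(NUM_EPISODES):
    state = env.reset()
    log_probs, rewards = [], []
    done = False

    while not done:
        #sample an action from the policy's probability distribution
        probs = network(torch.from_numpy(state).float().unsqueeze(0))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))

        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    #compute the discounted return G_t for every timestep
    returns, G = [], 0
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    #negative sign so that gradient descent on this loss performs gradient ascent on J(θ)
    #(the γ^t factor from the pseudocode is often dropped in practice, as done here)
    loss = -(torch.cat(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()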

After implementing the algorithm, I tested my results with the following parameters:

  • γ (discount factor): 0.99
  • NUM_EPISODES: 1000
  • MAX_STEPS: 10000

First, let's look at how our agent does with a random policy over 150 episodes:

Unsurprisingly, these were some subpar results, with an average score of 21.48. Now, let’s look at the training history of our algorithm over 1000 episodes.

Cumulative reward each episode

Looking at our training history, we can see that the agent slowly learns over time. However, compared to other learning methods such as the DQN from my previous post, the learning is slow and has high variance. This is to be expected, as the naive REINFORCE algorithm is much simpler and its Monte Carlo updates have inherently high variance. Nonetheless, we were able to achieve an average score of 79.6 over 50 episodes using the learned policy, which is better than the human benchmark (me) of 30.8.

Overall, this was a great introduction to the class of policy gradient methods, which I had a harder time understanding than the Q-learning I learned in my CS4100 lectures. It was interesting to work through the derivations and the reasoning behind each equation following Sutton & Barto’s textbook, and it gave me a stronger understanding of the topic. Moving forward, I definitely want to learn more about policy gradient methods and about PyTorch, which I played around with a bit while implementing REINFORCE.

My code: https://github.com/chengxi600/RLStuff/blob/master/Policy%20Gradients/REINFORCE.ipynb

References:

Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd ed., in progress). MIT Press.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Weng, L. (2018). Policy Gradient Algorithms. Lil’Log.