Policy Gradients: REINFORCE with Baseline

Cheng Xi Tsou · Published in Nerd For Tech · Jul 17, 2021

After an introduction to the REINFORCE algorithm, I wanted to explore this simple algorithm, derived from the policy gradient theorem, a little further. In this post, I will implement REINFORCE with baseline, along with some small modifications, and test it on the CartPole environment. At the end of the post, I go over some bugs I encountered while using the PyTorch library.

Taken from Sutton & Barto 2017

Recall the policy gradient theorem we derived. We established that the expectation of a sample gradient is equal to the actual gradient, so we can write an update rule that uses a single sampled state and action:
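In the notation of Sutton & Barto, with Gₜ the discounted return from timestep t, that sampled update can be written as:

θ ← θ + α γᵗ Gₜ ∇θ ln π(Aₜ | Sₜ, θ)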

In our implementation of REINFORCE, we used a neural network to parameterize the policy and performed gradient descent on the negated objective, which is just a workaround for performing gradient ascent. However, after training for 1000 episodes, we only managed an average score of 79.6 over 50 episodes of the agent playing with the learned policy. Looking at the training history, we can see high variance in training scores, with sudden spikes over 400 that drop right back down. The slow learning and high variance of REINFORCE lead us to an improved variation: REINFORCE with baseline.

We can generalize the policy gradient theorem by subtracting an arbitrary baseline function b(s) from the state-action value function.

Taken from Sutton & Barto 2017

The previously derived theorem is just the special case b(s) = 0. We can substitute any function, or even a constant, for b(s), as long as it does not vary with the action a. This works because the subtracted quantity is equal to zero, as shown below, so the theorem remains valid.

Taken from Sutton & Barto 2017
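Written out, the subtracted term vanishes because the policy's action probabilities sum to one in every state:

Σₐ b(s) ∇θ π(a | s, θ) = b(s) ∇θ Σₐ π(a | s, θ) = b(s) ∇θ 1 = 0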

Our update equation also changes to include the baseline function and now looks like this:
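With an arbitrary baseline b(Sₜ) subtracted from the return, the sampled update becomes:

θ ← θ + α γᵗ (Gₜ − b(Sₜ)) ∇θ ln π(Aₜ | Sₜ, θ)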

Now for the most important question: what should the baseline function be? A very simple choice is to apply a whitening transformation to the returns, standardizing them to zero mean and unit variance. If you've looked at other implementations of naive REINFORCE, most of them use this technique to reduce the variance of the returns, and it can also be viewed as a simple baseline modifying the update.
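Concretely, the whitening step is just a standardization of the episode's discounted returns. Here is a minimal sketch, where the function name and the small epsilon added for numerical stability are my own choices rather than anything from a specific library:

import torch

def whiten(returns, eps=1e-8):
    #standardize a 1-D tensor of discounted returns to zero mean and unit std
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return (returns - returns.mean()) / (returns.std() + eps)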

A more complex baseline is a learned state-value function v̂(Sₜ, w). Since the learning for this algorithm is episodic, we can learn the state-value function episodically as well. Gₜ − v̂(Sₜ) is then the difference between the discounted return actually observed from timestep t and what our state-value function believes it should be, which also gives us an easy way to adjust the value function's parameters. Here is the pseudo-code for REINFORCE with baseline, taken from Sutton & Barto's textbook:
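In outline (paraphrasing the textbook's algorithm box rather than reproducing it verbatim), each iteration generates a full episode with the current policy and then, for every timestep t, computes the return Gₜ and applies two updates, writing α for the policy step size and β for the value step size to match the learning rates used below:

δ = Gₜ − v̂(Sₜ, w)
w ← w + β δ ∇w v̂(Sₜ, w)
θ ← θ + α γᵗ δ ∇θ ln π(Aₜ | Sₜ, θ)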

Implementation and Results

For my implementation, I used my previous code as a base and cleaned it up a bit. I used a policy network to learn the parameters θ and a separate network to learn the parameters w. For updating the state-value function's parameters, I used the mean squared error between Gₜ and v̂(Sₜ) as the loss instead of the textbook's direct update. Here's a quick peek at my networks:

import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):

    #Takes in observations and outputs a probability distribution over actions
    def __init__(self, observation_space, action_space):
        super(PolicyNetwork, self).__init__()
        self.input_layer = nn.Linear(observation_space, 128)
        self.output_layer = nn.Linear(128, action_space)

    #forward pass
    def forward(self, x):
        #input states
        x = self.input_layer(x)

        #relu activation
        x = F.relu(x)

        #action scores (logits)
        actions = self.output_layer(x)

        #softmax for a probability distribution over actions
        action_probs = F.softmax(actions, dim=1)

        return action_probs

State Value Network:

class StateValueNetwork(nn.Module):

    #Takes in a state and outputs its estimated value
    def __init__(self, observation_space):
        super(StateValueNetwork, self).__init__()

        self.input_layer = nn.Linear(observation_space, 128)
        self.output_layer = nn.Linear(128, 1)

    #forward pass
    def forward(self, x):
        #input layer
        x = self.input_layer(x)

        #relu activation
        x = F.relu(x)

        #get state value
        state_value = self.output_layer(x)

        return state_value

Here are the hyperparameters I used for this algorithm:

  • γ (discount factor): 0.99
  • NUM_EPISODES: 1000
  • MAX_STEPS: 10000
  • α (Policy LR): 0.01
  • β (Value LR): 0.1
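Putting these pieces together, here is a minimal sketch of how one episode's update can be wired up. It is not the exact code from my notebook; the helper name update_networks, its argument layout, and the assumption that you have already created standard optimizers (e.g. torch.optim.Adam) for each network are all illustrative:

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

#one REINFORCE-with-baseline update from a finished episode
#states: array-like of shape (T, obs_dim); actions: list of action indices;
#rewards: list of per-step rewards; networks and optimizers are assumed to exist
def update_networks(states, actions, rewards, policy_net, value_net,
                    policy_opt, value_opt, gamma=0.99):
    #discounted returns G_t, computed backwards through the episode
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    states_t = torch.tensor(states, dtype=torch.float32)
    values = value_net(states_t).squeeze(-1)   #v̂(S_t, w), shape (T,)

    #value network: mean squared error between G_t and v̂(S_t, w)
    value_loss = F.mse_loss(values, returns)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    #policy network: descend the negative of sum_t γ^t (G_t − v̂(S_t)) ln π(A_t|S_t, θ);
    #the baseline is detached so this step only updates θ
    deltas = returns - values.detach()
    action_probs = policy_net(states_t)
    log_probs = Categorical(action_probs).log_prob(torch.tensor(actions))
    discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
    policy_loss = -(discounts * deltas * log_probs).sum()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

For the whitening-only variant, the same structure applies with the whitened returns in place of deltas and no value network at all.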

Let’s first look at the results of using a simple baseline of whitening rewards:

Our agent was able to achieve an average score of 234.4 over 50 episodes when playing with the learned policy. This is better than the score of 79.6 with the naive REINFORCE algorithm. However, whitening the returns alone still gives us high variance in training scores. In fact, there was one trial where the model parameters got stuck at a suboptimal local maximum:

Let’s see our results of using a learned state-value function as our baseline:

Our agent achieved an average score of 500 over 50 episodes, which is the max score using our learned policy! This is far better than an average score of 79.6 without the baseline. Here is a quick demo of our agent in action:

Looking at the training history, it is clear that there is an upward trend in the score. Although there is still some instability, it is far less than in the previous training history, with its constant up-and-down fluctuations. We can also observe a pattern in the training history. The agent reaches a score plateau of 500 around 100 episodes in, before dropping down and “relearning” until around episode 400, where it drops again. This pattern repeats itself throughout the 1000 episodes. This behavior can be explained by catastrophic forgetting, a phenomenon that also occurs in Q-learning, where we addressed it with experience replay. The network adjusts to fit new inputs and “forgets” the old inputs, which can lead to performance collapse. A way we can fix this could be to implement a condition for the agent to stop training, to prevent overfitting.

According to the CartPole documentation, the environment is considered solved when the agent has an average score of 195 over the last 100 episodes. If we look at our training history, our network easily scored over 195 in under 100 episodes of training. Let’s increase the solved score to 450 instead, with some leeway for variance.
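A simple way to express that stopping condition is a rolling average over the last 100 episode scores. The helper name check_solved and the bookkeeping below are illustrative, not the notebook's exact code:

from collections import deque

SOLVED_SCORE = 450                 #raised from CartPole's official 195, with leeway below the max of 500
recent_scores = deque(maxlen=100)  #rolling window of the last 100 episode scores

def check_solved(episode_score):
    #call once per episode with that episode's score; returns True when training can stop
    recent_scores.append(episode_score)
    return (len(recent_scores) == recent_scores.maxlen
            and sum(recent_scores) / len(recent_scores) >= SOLVED_SCORE)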

We were able to reach an optimal policy in just 584 episodes, roughly 40% fewer than the full 1000-episode run! Testing this learned policy over 50 episodes gave us an average score of 500, the maximum score.

Looking at the training history without early stopping, we can observe a huge discrepancy in the pattern of learning. With this run, the agent achieves top scores with high variance before stabilizing around 500 episodes. In the previous training history, there were three instances of the agent learning the optimal policy at episodes 100, 350, 700, before collapsing. The poor reproducibility of results is due to the stochastic nature of REINFORCE and is still a huge problem in the field of Reinforcement Learning. I ran the experiment four more times and achieved an average stopping point of 400.6 episodes over 5 trials with all trials having an average score of 500 over 50 episodes when playing with the learned policy.

Final Thoughts

Although I did not expect to spend a long time on this, I ran into a lot of bugs with PyTorch due to my lack of understanding of the library. When I was trying to backpropagate through my policy network after updating the value network, I kept running into an error: “RuntimeError: Trying to backward through the graph a second time…”. This happened because I did not build my policy loss out of tensor operations before calling loss.backward().

So what went wrong here? In PyTorch, a tensor is not only used to perform operations like a NumPy array; it also tracks the input tensors and the functions applied to them, forming a directed acyclic graph whose roots are the final output tensors and whose leaves are the input tensors. When we call loss.backward(), PyTorch backpropagates by traversing from the loss (a root node) to the leaf nodes, applying the chain rule along the way. Because my policy loss was not built from tensor operations, autograd had no graph to traverse for it and instead assumed I was trying to backpropagate through the value network's graph a second time.
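As a toy illustration of the difference (the numbers and names here are made up, not taken from my notebook):

import torch

#a log-probability that depends on a policy parameter, and a detached advantage
theta = torch.tensor([0.3], requires_grad=True)
log_prob = torch.log_softmax(torch.cat([theta, torch.zeros(1)]), dim=0)[0]
delta = torch.tensor(1.5)   #stands in for G_t − v̂(S_t), detached from the value graph

#broken: .item() turns tensors into plain Python floats, severing the autograd graph,
#so there is nothing for backward() to traverse back to theta
#bad_loss = torch.tensor(-log_prob.item() * delta.item(), requires_grad=True)

#working: build the loss from tensor operations so backward() can walk the DAG
#from the loss back to theta
policy_loss = -log_prob * delta
policy_loss.backward()
print(theta.grad)   #gradient flows back to the policy parameter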

Overall, I learned a lot from what I thought was going to be a short continuation of REINFORCE and hopefully can slowly build up to more state-of-the-art policy gradient methods that can deal with a continuous action space.

My code: https://github.com/chengxi600/RLStuff/blob/master/Policy%20Gradients/REINFORCE-Baseline.ipynb

References:

Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction (2nd ed.). MIT Press.