Understanding Baseline Techniques for REINFORCE

17 min read · Oct 17, 2019


By Phillip Lippe, Rick Halm, Nithin Holla and Lotta Meijerink

Ever since DeepMind published its work on AlphaGo, reinforcement learning has become one of the ‘coolest’ domains in artificial intelligence. The capability of training machines to play games better than the best human players is indeed a landmark achievement. Amongst all the approaches in reinforcement learning, policy gradient methods have received a lot of attention, as it is often easier to learn the policy directly, without the overhead of first learning a value function and then deriving a policy from it. With advancements in deep learning, these algorithms have proved very successful, using powerful neural networks as function approximators.

One of the earliest policy gradient methods for episodic tasks was REINFORCE, which presented an analytical expression for the gradient of the objective function and enabled learning with gradient-based optimization methods. The algorithm involves generating a complete episode and using the obtained return (sum of rewards) to calculate the gradient. However, the method suffers from high variance in the gradients, which results in slow, unstable learning and a lot of frustration…

It was soon discovered that subtracting a ‘baseline’ from the return leads to a reduction in variance and allows faster learning. Several such baselines have been proposed, each with its own set of advantages and disadvantages. The easiest option is to scale the returns using their mean and standard deviation. This technique, called whitening, is often necessary for good optimization, especially in the deep learning setting.

However, more sophisticated baselines are possible. We could learn to predict the value of a state, i.e., the expected return from that state, alongside the policy, and then use this value as the baseline. Starting from the state, we could also make the agent greedy, by making it take only the actions with maximum probability, and then use the resulting return as the baseline. This approach, called self-critic, was first proposed in Rennie et al.¹ and was also shown to give good results in Kool et al.² Another promising direction is to grant the agent a special power: the ability to play to the end of the game from the current state, go back to that state, and play more games following alternative decision paths. The average of the returns from these plays could then serve as a baseline. This method, which we call the self-critic with sampled rollout, was described in Kool et al.³ The greedy rollout is actually just a special case of the sampled rollout, where a single sample is taken by always choosing the greedy action.

While most papers use these baselines in specific settings, we are interested in comparing their performance on the same task. This is what we will do in this blog by experimenting with the following baselines for REINFORCE:

  • Whitening the returns of the episodes
  • Learned value function
  • Self-critic with sampled rollout

We will go into detail for each of these methods later in the blog, but here is already a sneak peek at the models we test.

Visualization of the three methods. 1. Regular REINFORCE. 2. REINFORCE with learned baseline: an external function takes a state and outputs its value as the baseline. 3. REINFORCE with sampled baseline: the average return over a few samples is taken to serve as the baseline.

We focus on the speed of learning, not only in terms of the number of iterations taken for successful learning but also the number of interactions with the environment, to account for the hidden cost of obtaining the baseline. Also, while most comparative studies focus on deterministic environments, we go one step further and analyze the relative strengths of the methods as we add stochasticity to the environment.

The outline of the blog is as follows: we first describe the environment and the shared model architecture. Then we will show results for all different baselines on the deterministic environment. Finally, we will compare these models after adding more stochasticity to the environment.

Experimental setup

CartPole

The environment we focus on in this blog is the CartPole environment from OpenAI’s Gym toolkit, shown in the GIF below. The environment consists of a pole attached by a joint to a cart, forming an inverted pendulum. This system is unstable, which causes the pole to fall over. The goal is to keep the pole upright by applying a force of -1 or +1 (left or right) to the cart. A reward of +1 is provided for every time step that the pole remains upright. The state is described by a vector of size 4, containing the position and velocity of the cart as well as the angle and angular velocity of the pole. The episode ends when the pole falls over or when 500 time steps have passed.

We work with this particular environment because it is easy to manipulate and analyze, and fast to train. It is also a classic example in the reinforcement learning literature.

Illustration of CartPole. Actually one of our models is playing right now, which we will see later.

Model

We want to learn a policy, meaning we need to learn a function that maps states to a probability distribution over actions. In all our experiments, we use the same neural network architecture, to ensure a fair comparison. The network takes the state representation as input and has 3 hidden layers, all of them with a size of 128 neurons. We use ELU activation and layer normalization between the hidden layers. We output log probabilities of the actions by using the LogSoftmax as the final activation function.
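
To make the architecture concrete, here is a minimal sketch of such a policy network in PyTorch (assuming the 4-dimensional CartPole state and 2 actions; the exact ordering of LayerNorm and ELU between the hidden layers is our interpretation of the description above):

```python
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a CartPole state (4 values) to log-probabilities over the 2 actions."""
    def __init__(self, state_dim=4, hidden_dim=128, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, num_actions),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, state):
        # state: tensor of shape (batch, 4); returns log pi(a | state) of shape (batch, 2)
        return self.net(state)
```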

Experiments

We optimize hyperparameters for the different approaches by running a grid search over the learning rate and approach-specific hyperparameters. We use the same seeds for each grid search to ensure a fair comparison. We always use the Adam optimizer (default settings). After hyperparameter tuning, we evaluate how fast each method learns a good policy. We compare the performance against:

  • number of update steps (1 iteration = 1 episode + gradient update step)
  • number of interactions (1 interaction = 1 action taken in the environment)

The number of iterations needed to learn is a standard evaluation measure. The number of interactions is (usually) closely related to the actual wall-clock time learning takes. In our case, analyzing both is important because the self-critic with sampled baseline uses more interactions (per iteration) than the other methods.

Baseline techniques

REINFORCE

For an episodic problem, the Policy Gradient Theorem provides an analytical expression for the gradient of the objective function that needs to be optimized with respect to the parameters θ of the network.
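
Up to a constant of proportionality, and in standard notation, the theorem states:

\nabla_\theta J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q(s, a)\, \nabla_\theta \pi(a \mid s, \theta)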

where π(a|s, θ) denotes the policy parameterized by θ, q(s, a) denotes the true value of the state-action pair, and μ(s) denotes the distribution over states. The REINFORCE algorithm takes a Monte Carlo approach to estimate the above gradient elegantly. Using samples from trajectories generated according to the current parameterized policy, we can estimate the true gradient as an expectation over the policy (written here in the standard log-derivative form):
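
\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[\, G_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \,\right]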

Here, Gt is the discounted cumulative reward at time step t. Writing the gradient as an expectation over the policy/trajectory allows us to update the parameters similarly to stochastic gradient ascent:
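
\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)

where α is the learning rate (the per-step discount factor γ^t that some formulations include is omitted here for simplicity).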

As with any Monte Carlo based approach, the gradients of the REINFORCE algorithm suffer from high variance, as the returns exhibit high variability between episodes - some episodes can end well with high returns whereas others can be very bad with low returns. High-variance gradients lead to unstable learning updates, slow convergence and thus slow learning of the optimal policy.

To tackle the problem of high variance in the vanilla REINFORCE algorithm, a baseline is subtracted from the obtained return while calculating the gradient.
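
With a baseline b(S_t) that may depend on the state but not on the chosen action, the update becomes:

\theta \leftarrow \theta + \alpha\, \big(G_t - b(S_t)\big)\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)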

It can be shown that introduction of the baseline still leads to an unbiased estimate (see for example this blog). But most importantly, this baseline results in lower variance, hence better learning of the optimal policy.

A simple baseline, which resembles a trick commonly used in the optimization literature, is to normalize the return of each step of the episode by subtracting the mean and dividing by the standard deviation of the returns at all time steps within the episode. This is called whitening. Note that although this is a very common technique, the resulting gradient is no longer unbiased.
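
As a minimal sketch, assuming returns is a tensor holding G_t for every time step of one episode (the small eps is only an assumption to avoid division by zero):

```python
import torch

def whiten(returns, eps=1e-8):
    """Normalize per-episode returns to zero mean and unit variance (whitening).

    returns: 1-D tensor with the discounted return G_t of each time step of one
    episode; the whitened values replace G_t in the REINFORCE update.
    """
    return (returns - returns.mean()) / (returns.std() + eps)
```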

The results on the CartPole environment are shown in the following figure. The optimal learning rate found by a grid search over 5 different rates is 1e-4. The algorithm does get better over time, as seen by the longer episode lengths. However, it does not solve the game (reach an episode length of 500). Also, the algorithm is quite unstable, as the blue shaded areas (25th and 75th percentiles) show that in the final iteration, the episode lengths vary from less than 250 to 500. Note that the plot shows the moving average (width 25). This is also applied to all other plots in this blog.

Moving average episode length (width 50) of REINFORCE (with whitening) over 32 random seeds, with 25th and 75th percentile spread, against the number of iterations and the number of interactions with the environment.

Technically, any baseline would be appropriate as long as it does not depend on the actions taken. However, the most suitable baseline is the true value of a state for the current policy. In this way, if the obtained return is much better than the expected return, the gradients are stronger and vice-versa.

The problem however is that the true value of a state can only be obtained by using an infinite number of samples. The following methods show two ways to estimate this expected return of the state under the current policy.

REINFORCE with learned baseline

If we are learning a policy, why not learn a value function simultaneously? If we learn a value function that (approximately) maps a state to its value, it can be used as a baseline. This is what is done in state-of-the-art policy gradient methods like A3C.

We have implemented the simplest case of learning a value function with weights w. A common way to do this is to use the observed return Gt as the ‘target’ of the learned value function. Because Gt is a sample of the true value function for the current policy, it is a reasonable target.

Using the learned value as baseline, and Gt as target for the value function, leads us to two loss terms:

  • The regular REINFORCE loss, with the learned value as a baseline
  • The mean squared error between the learned value and the observed discounted return Gt.

Taking the gradients of these losses results in the following update rules for the policy parameters θ and the value function parameters w, where α and β are the two learning rates:
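
\theta \leftarrow \theta + \alpha\, \big(G_t - \hat{v}(S_t, w)\big)\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)

w \leftarrow w + \beta\, \big(G_t - \hat{v}(S_t, w)\big)\, \nabla_w \hat{v}(S_t, w)

Here \hat{v}(s, w) denotes the learned value function; these are the standard REINFORCE-with-baseline updates (the value update is the negative gradient of the mean squared error, up to a factor of 2).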

Implementation-wise, we simply added one more output value to our existing network. This output is used as the baseline and represents the learned value. This means that most of the parameters of the network are shared. We do one gradient update with the weighted sum of both losses, where the weights correspond to the learning rates α and β, which we tuned as hyperparameters.
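
A minimal sketch of this combined objective, assuming log_probs, values and returns are tensors collected over one episode (treating the α/β weighting of the losses as directly corresponding to the two learning rates is a simplification when used with Adam):

```python
def combined_loss(log_probs, values, returns, alpha, beta):
    """Weighted sum of the REINFORCE loss (with learned baseline) and the value loss.

    log_probs: log pi(A_t | S_t, theta) of the actions that were taken
    values:    v_hat(S_t, w) predicted by the extra output of the shared network
    returns:   observed discounted returns G_t
    """
    advantage = (returns - values).detach()        # baseline: no policy gradient through v_hat
    policy_loss = -(advantage * log_probs).sum()   # minimizing this maximizes expected return
    value_loss = (returns - values).pow(2).sum()   # MSE between learned value and G_t
    return alpha * policy_loss + beta * value_loss
```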

Note that if we hit the maximum episode length of 500, we bootstrap on the learned value function. This means that the cumulative reward of the last step is its reward plus the discounted value estimate of the final state, similar to what is done in A3C. In this way, we avoid punishing the network for the final steps even though the episode was actually successful.
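
A sketch of the return computation with this bootstrap, assuming rewards is the list of rewards of one episode and bootstrap_value is the learned value of the final state when the episode was cut off at 500 steps (and 0 otherwise):

```python
def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """Compute G_t for every time step, optionally bootstrapping from the final state."""
    returns = []
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns
```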

The results that we obtain with our best model are shown in the graphs below. Hyperparameter tuning leads to optimal learning rates of α=2e-4 and β=2e-5. We again plot the average episode length over 32 seeds, against the number of iterations as well as the number of interactions. As before, we also plot the 25th and 75th percentiles. We see that the learned baseline reduces the variance by a great deal, and the optimal policy is learned much faster.

REINFORCE with learned baseline compared to REINFORCE (whitened) showing the moving average and the 25th and 75th percentile spread (over 32 seeds). Models are compared with respect to the iterations and interactions with the environment.

What is interesting to note is that the mean is sometimes lower than the 25th percentile. In our case this usually means that in more than 75% of the cases the episode length was optimal (500), but that there was a small set of cases where the episode length was sub-optimal. As a result, the average episode length ends up below 500.

REINFORCE with sampled baseline

As mentioned before, the optimal baseline is the value function of the current policy. The issue with a learned value function is that it follows a moving target: as soon as we change the policy even slightly, the value function is outdated and hence biased. To always have an unbiased, up-to-date estimate of the value function, we could instead sample our returns, either from the current stochastic policy or from its greedy version, as:
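
b(S_t) = \frac{1}{N} \sum_{n=1}^{N} G_t^{(n)}

where G_t^{(n)} is the return of the n-th rollout started from S_t (for the greedy variant, a single rollout that always picks the most probable action).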

So, to get a baseline for each state in our trajectory, we need to perform N rollouts, also called beams, starting from each of these states, as shown in the visualization below. However, in environments such as CartPole, a trajectory can be quite long, up to 500 steps. This would require 500·N rollouts, which is extremely inefficient. Nevertheless, by assuming that nearby states have similar values, since not too much can change in a single frame, we can reuse the sampled baseline for the next couple of states.

Visualization of sampling rollouts every few steps. We use the value estimates of the beams as the baseline.
Illustration of the sampled baseline. Every 40 frames, we stop our current trajectory (darkest CartPole), and sample 4 new trajectories (lighter CartPoles). The average return of these 4 rollouts forms the baseline of the main trajectory.
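
As a rough sketch of how such a rollout-based estimate could be obtained: the code below assumes the environment can be duplicated (e.g. with copy.deepcopy of a Gym CartPole instance) and uses a hypothetical policy.sample_action helper; both are assumptions, not the exact implementation.

```python
import copy

def sampled_baseline(env, policy, n_rollouts, gamma=1.0):
    """Estimate the value of the env's current state by averaging the returns
    of n_rollouts trajectories played out from copies of the environment."""
    rollout_returns = []
    for _ in range(n_rollouts):
        beam_env = copy.deepcopy(env)          # assumes the env state can be cloned this way
        obs = beam_env.unwrapped.state         # CartPole keeps its state here (old Gym API)
        done, g, discount = False, 0.0, 1.0
        while not done:
            action = policy.sample_action(obs)             # hypothetical helper
            obs, reward, done, _ = beam_env.step(action)   # old 4-tuple Gym step API
            g += discount * reward
            discount *= gamma
        rollout_returns.append(g)
    return sum(rollout_returns) / n_rollouts
```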

The number of rollouts you sample and the number of steps between rollouts are both hyperparameters and should be chosen carefully for the specific problem. Simply sampling rollouts every K frames makes the expected number of environment steps grow quadratically with the trajectory length, since every rollout itself runs until the end of the episode. Moreover, in environments such as CartPole, the last steps determine success or failure, and hence the state values fluctuate most in these final stages. Thus, we want to sample more frequently the closer we get to the end. To implement this, we use a log scale, meaning that we sample from the states at 2, 4, 8, etc. frames before the terminating step T (i.e. at T-2, T-4, T-8, …). Using these value estimates as baselines, the parameters of the model are updated as shown in the following equation.
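
\theta \leftarrow \theta + \alpha\, \Big(G_t - \frac{1}{N} \sum_{n=1}^{N} G_t^{(n)}\Big)\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)

where, following the scheme above, the rollout returns G_t^{(n)} used for a given S_t come from the most recent (log-spaced) sampling point.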

Interestingly, by sampling multiple rollouts, we can also update the parameters on the basis of the j-th rollout itself. In that case, the estimated baseline is the average of the other rollouts together with the main trajectory (i.e. excluding the j-th rollout). For example, assume we take a single beam. We can then train on the states of the main trajectory with the beam as the baseline, and at the same time use the states of the beam as training points, with the main trajectory serving as the baseline. This method makes more efficient use of the information obtained from the interactions with the environment⁴.
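
In equation form, a sketch of this variant (with G_t^{\text{main}} denoting the return of the main trajectory) is:

b^{(j)}(S_t) = \frac{1}{N} \Big( G_t^{\text{main}} + \sum_{n \ne j} G_t^{(n)} \Big)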

Applying this concept to CartPole, we have the following hyperparameters to tune: the number of beams for estimating the state value (1, 2, and 4), the log base of the sampling interval (2, 3, and 4), and the learning rate (1e-4, 4e-4, 1e-3, 2e-3, 4e-3). Performing a grid search over these parameters, we found the optimal learning rate to be 2e-3. This is considerably higher than for the previous two methods, suggesting that the sampled baseline gives a much lower variance for the CartPole environment. Moreover, the log base did not seem to have a strong impact, but the most stable results were achieved with base 2.

The results with different numbers of rollouts (beams) are shown in the next figure. As expected, the more beams we take, the less noisy the estimate and the quicker we learn the optimal policy. However, note that more rollouts per iteration also means many more interactions with the environment, so more rollouts are not necessarily more efficient. The figure shows that in terms of the number of interactions, sampling a single rollout is the most efficient way to reach the optimal policy, although taking more rollouts leads to more stable learning. We also performed the experiments with a single greedy rollout. The results were slightly worse than for the sampled one, which suggests that exploration is crucial in this environment.

Comparing REINFORCE with sampled baseline over different number of beams, on iterations (i.e. update steps) and number of interactions with the environment (averaged over 32 seeds). Note that the y-axis is cropped at 400 due to the very fast learning of the models. We omitted the greedy experiment for better clarity of the plots.

As our main objective is to compare the data efficiency of the different baseline estimates, we choose the parameter setting with a single beam as the best model.

Comparing all baseline methods together, we see a strong preference for REINFORCE with the sampled baseline, as it learns the optimal policy in fewer than 200 iterations. However, when we look at the number of interactions with the environment, REINFORCE with a learned baseline and with a sampled baseline have similar performance. This indicates that both methods provide a proper baseline for stable learning.

Comparison of all three baseline estimates over number of iterations and interactions with the environment.

Nevertheless, there is a subtle difference between the two methods once the optimum has been reached (i.e. an episode length of 500). In the case of the sampled baseline, all rollouts reach 500 steps, so the baseline matches the value of the current trajectory, resulting in zero gradients (no learning) and hence a policy that stays stable at the optimum. On the other hand, the learned baseline has not converged by the time the policy reaches the optimum, because the value estimate still lags behind. The gradients therefore remain non-zero and can push the policy out of the optimum, which we can see in the plot above.

Altogether, this suggests that for a (mostly) deterministic environment, a sampled baseline reduces the variance of REINFORCE the best. This can even be achieved with a single sampled rollout. While the learned baseline already gives a considerable improvement over plain REINFORCE, it can still unlearn an optimal policy. However, all these conclusions only hold for the deterministic setting, which is often not the case in practice.

Experiments with added stochasticity

So far, we have tested our different baselines on a deterministic environment: if we take some action in some state, we always end up in the same next state. However, this is not realistic, because in real-world scenarios external factors can lead to different next states or perturb the rewards. In a stochastic environment, the sampled baseline would thus be noisier. Therefore, we expect its performance to degrade as we increase the stochasticity.

We test this by adding stochasticity over the actions in the CartPole environment: p% of the time, a random action is chosen instead of the action that the network suggests. Note that since we only have two actions, this means that in p/2% of the cases we actually take a different action than the suggested one. This is similar to adding randomness to the next state we end up in: we sometimes end up in another state than expected for a certain action.
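
A minimal sketch of how this perturbation can be injected around the environment step (p and the two CartPole actions as described above):

```python
import random

def perturbed_action(suggested_action, p, num_actions=2):
    """With probability p, ignore the policy's choice and act uniformly at random."""
    if random.random() < p:
        return random.randrange(num_actions)
    return suggested_action
```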

To find out when the stochasticity makes a difference, we test choosing random actions with 10%, 20% and 40% probability. At 10%, all methods achieve performance similar to the deterministic setting, but at 40%, none of our methods is able to reach a stable performance of 500 steps. The 20% setting turned out to be a tipping point. The results for our best models from above on this environment are shown below.

Mean episode length of the best models of REINFORCE with different baselines on the CartPole environment with 20% added stochasticity. Again, we plot the mean over 32 random seeds, with 25th and 75th percentile, against the number of iterations and the number of interactions.

We see that the sampled baseline no longer gives the best results. Instead, the model with the learned baseline performs best. In terms of number of iterations, the sampled baseline is only slightly better than regular REINFORCE. In terms of number of interactions, they are equally bad.

The learned baseline apparently suffers less from the introduced stochasticity. We can explain this by the fact that the learned value function can learn to give an expected/averaged value in certain states. Thus, the learned baseline is only indirectly affected by the stochasticity, whereas a single sampled baseline will always be noisy. However, we can also increase the number of rollouts to reduce the noise. The following figure shows the result when we use 4 samples instead of 1 as before.

Comparing mean episode length for 4 beams. We use the same figure settings as the plot before, but just replaced the sampled baseline with 4 beams instead of 1.

Now, by sampling more, the effect of the stochasticity on the estimate is reduced and hence we are able to reach performance similar to that of the learned baseline. Nevertheless, this improvement comes at the cost of an increased number of interactions with the environment. This shows that although we can stabilize the sampled baseline for a stochastic environment, it becomes less efficient than a learned baseline.

What about other environments?

We would have liked to test on more environments. However, the fact that we want to test the sampled baseline restricts our choice. One of the restrictions is that the environment needs to be duplicable, because we need to sample different trajectories starting from the same state. Atari games and Box2D environments in OpenAI Gym do not allow that. We could circumvent this problem and reproduce the same state by rerunning from the start with the same seed. However, the time required for the sampled baseline would then become infeasible for hyperparameter tuning. For example, for the LunarLander environment, a single run for the sampled baseline takes over 1 hour.

Another problem is that the sampled baseline does not work for environments in which we rarely reach a goal (for example, the MountainCar problem). If the current policy cannot reach the goal, the rollouts will not reach it either. And if none of the rollouts reach the goal, all returns will be the same, and thus the gradient will be zero. Without any gradients, we will not be able to update our parameters before actually seeing a successful trial. The other methods suffer less from this issue because their gradients are mostly non-zero, and this noise provides better exploration for finding the goal. This is why we were unfortunately only able to test our methods on the CartPole environment.

Conclusion

We have seen that using a baseline greatly increases the stability and speed of policy learning with REINFORCE. In the deterministic CartPole environment, using a sampled self-critic baseline gives good results, even when using only one sample. It learned the optimal policy with the fewest interactions and the least variation between seeds. Also, the optimal policy is not unlearned in later iterations, which regularly happens when using the learned value estimate as the baseline. However, the difference in performance between the sampled self-critic baseline and the learned value function is small.

Furthermore, in the environment with added stochasticity, we observed that the learned value function clearly outperformed the sampled baseline. Stochasticity seems to make the sampled beams too noisy to serve as a good baseline. Another limitation of using the sampled baseline is that you need to be able to make multiple instances of the environment at the same (internal) state and many OpenAI environments do not allow this.

A not yet explored benefit of the sampled baseline might be for partially observable environments. For example, assume we have a two-dimensional state space where only the second dimension can be observed. In the case of a learned value function, the value estimate for s=(a1,b) is the same as for s=(a2,b), so it learns an average over the hidden dimension. In contrast, the sampled baseline takes the hidden part of the state into account, as its rollouts actually start from s=(a1,b). This can be a big advantage, as we still have unbiased estimates even though part of the state space is not observable.

To conclude, in a simple, (relatively) deterministic environment we definitely expect the sampled baseline to be a good choice. In the case of a stochastic environment, however, using a learned value function would probably be preferable.
