Policy Gradient Reinforcement Learning with Keras

A step-by-step approach to understanding REINFORCE

Gili Karni
The Startup
5 min read · Apr 22, 2020

Monte-Carlo Landscape

This tutorial will:

  • Provide a theoretical review of the REINFORCE algorithm
  • Explain a Python implementation of deep REINFORCE using Keras
  • Serve as one of the initial steps toward ensemble learning (scroll to the end to find out more!).

You can find the full version of the code here.

Who this is for: This tutorial is for anyone who has had some exposure to reinforcement learning. This post assumes no knowledge of REINFORCE, but to implement it, you should be comfortable with Python and Keras. I hope you find it easy to follow regardless of your background.

The REINFORCE Algorithm in Theory

REINFORCE is a policy gradient method and, as such, a model-free reinforcement learning algorithm. Its objective is to learn a policy that maximizes the cumulative future reward. We define a policy as a probability distribution over actions, such that, for a given observed state, actions associated with higher expected rewards receive higher probabilities.

Note that a policy could also be deterministic, mapping each state directly to an action. Here, however, we focus on stochastic environments and, hence, stochastic policies. To read more about why policy methods are useful, see this post.

As the name implies, we optimize the policy via gradient ascent. More specifically, we take the gradient of the objective, J, with respect to the policy parameters, θ, and move the parameters in that direction.
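In symbols, and as a minimal restatement of the idea above, each update is a gradient ascent step on J with step size 𝛼:

$$\theta_{t+1} = \theta_t + \alpha \, \nabla_\theta J(\theta_t)$$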

REINFORCE is a Monte-Carlo variant of policy gradient methods: it updates the policy using complete episodes collected by interacting with the environment. Let’s take a look at the algorithm below.

  1. It generates an episode by sampling each action from the current policy, 𝝅.
  2. For each step of the episode, it computes the return, G(t), by iterating over the episode’s states and rewards.
  3. Lastly, it updates the policy. The policy update includes the discounted cumulative future reward, the log probabilities of the actions, and the learning rate (𝛼); the update rule itself is written out below.
From Sutton & Barto, Reinforcement Learning: An Introduction
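The heart of step 3 is the per-step parameter update from Chapter 13 of Sutton & Barto (many practical implementations, including the one below, drop the 𝛾^t factor):

$$\theta \leftarrow \theta + \alpha \, \gamma^t \, G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$$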

In REINFORCE, the expectation of the sample gradient is equal to the actual gradient (see the figure below), which gives the method good theoretical convergence properties. However, being a Monte-Carlo method, it may suffer from high variance.

From Sutton & Barto, Reinforcement Learning: An Introduction
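Written out, the identity being referenced is the sample form of the policy gradient theorem (again following Chapter 13 of Sutton & Barto):

$$\nabla_\theta J(\theta) \propto \mathbb{E}_\pi\left[ G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \right]$$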

Implementing the REINFORCE Algorithm

The implementation here is a deep REINFORCE agent. I use Keras to build the model and an environment from OpenAI Gym; specifically, I chose to demonstrate the agent in the CartPole environment.

Let’s take a look at the code.

Setting the environment configuration

First, we set the configuration parameters. Importantly, this part ensures the reproducibility of the code below by fixing a random seed.
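A minimal sketch of this step, assuming the CartPole-v1 environment, the pre-0.26 Gym API, and illustrative names (ENV_NAME, SEED), might look like this:

```python
import random

import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

ENV_NAME = "CartPole-v1"   # OpenAI Gym environment used throughout this post
SEED = 42                  # fixed seed for reproducibility

# Seed every source of randomness so repeated runs produce the same episodes.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

env = gym.make(ENV_NAME)
env.seed(SEED)             # older Gym API (pre-0.26)
```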

At initiation, the REINFORCE object sets a few parameters. First is the environment in which the model learns, along with its properties. Second are the parameters of the REINFORCE algorithm: gamma (𝛾) and alpha (𝛼). Gamma is the discount factor applied to future rewards, and alpha is the learning rate by which the gradients scale the policy update. Lastly, it sets the learning rate for the neural network. In addition, this snippet allocates the lists that record the observations during the game.
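A sketch of the constructor, following the description above (the attribute names are illustrative, and the network builder is defined in the next section):

```python
class REINFORCE:
    """Sketch of the REINFORCE agent described in this post."""

    def __init__(self, env, gamma=0.99, alpha=1e-4, learning_rate=1e-3):
        # The environment the agent learns in, plus its dimensions.
        self.env = env
        self.state_shape = env.observation_space.shape  # (4,) for CartPole
        self.action_shape = env.action_space.n          # 2 discrete actions

        # REINFORCE parameters: discount factor and policy-update scale.
        self.gamma = gamma
        self.alpha = alpha

        # Learning rate of the neural-network optimizer.
        self.learning_rate = learning_rate

        # The policy network (built in the next section).
        self.model = self._build_policy_network()

        # Saved space for recording the observations during the game.
        self.states, self.actions, self.probs, self.rewards = [], [], [], []
```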

The REINFORCE agent uses a couple of utility methods. The first, hot_encode_action, encodes actions in a one-hot format (read more about what one-hot encoding is and why it is a good idea here). The second, remember, records the observations of each step.
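A sketch of these two helpers, continuing the class above (action_prob is assumed to be the probability vector the network produced for that step):

```python
    # Continuing the REINFORCE class sketched above.

    def hot_encode_action(self, action):
        """Encode the chosen action index as a one-hot vector."""
        encoded = np.zeros(self.action_shape, dtype=np.float32)
        encoded[action] = 1.0
        return encoded

    def remember(self, state, action, action_prob, reward):
        """Record one step of the episode for the policy update later on."""
        self.states.append(state)
        self.actions.append(self.hot_encode_action(action))
        self.probs.append(action_prob)
        self.rewards.append(reward)
```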

Creating a Neural Network Model

I chose to use a neural network to implement this agent. The network is a simple feedforward network with a few hidden layers. The output layer uses a softmax activation: it takes in the logit scores and outputs a vector representing a probability distribution over the possible actions.
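A sketch of the policy network, continuing the class above (the layer sizes, 24 and 12, are illustrative choices rather than the article's exact architecture):

```python
    def _build_policy_network(self):
        """A simple feedforward network whose softmax output is the policy."""
        model = models.Sequential([
            layers.Dense(24, activation="relu", input_shape=self.state_shape),
            layers.Dense(12, activation="relu"),
            # Softmax turns the final logits into action probabilities.
            layers.Dense(self.action_shape, activation="softmax"),
        ])
        model.compile(
            loss="categorical_crossentropy",
            optimizer=optimizers.Adam(learning_rate=self.learning_rate),
        )
        return model
```

The categorical cross-entropy loss pairs naturally with the softmax output; the policy-gradient scaling happens through the targets built in the update step below.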

Action Selection

The get_action method guides the agent's action choice. It uses the neural network to generate a probability distribution over actions for a given state and then samples the next action from this distribution.
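A sketch of get_action under the same assumptions:

```python
    def get_action(self, state):
        """Sample the next action from the network's probability distribution."""
        state = np.asarray(state).reshape(1, -1)             # add a batch dimension
        probabilities = self.model.predict(state, verbose=0)[0]
        probabilities = probabilities / probabilities.sum()  # guard against numerical drift
        action = np.random.choice(self.action_shape, p=probabilities)
        return action, probabilities
```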

Constructing the Reward

The REINFORCE model includes a discounting parameter, 𝛾, that governs the long-term reward calculation. For each step, the model sums the rewards that follow it, discounting rewards that arrive further in the future. This calculation ensures that state-action pairs from longer episodes earn a greater return than those from shorter ones. The function returns the normalized vector of discounted rewards.
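A sketch of the discounting helper (get_discounted_rewards is an assumed name):

```python
    def get_discounted_rewards(self, rewards):
        """Compute the discounted return at every step, then normalize it."""
        discounted, cumulative = [], 0.0
        # Walk backwards so each step accumulates its gamma-weighted future rewards.
        for reward in reversed(rewards):
            cumulative = reward + self.gamma * cumulative
            discounted.append(cumulative)
        discounted = np.array(discounted[::-1], dtype=np.float32)

        # Normalizing the returns reduces the variance of the policy update.
        discounted -= discounted.mean()
        discounted /= (discounted.std() + 1e-8)
        return discounted
```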

Updating the Policy

Following each Monte-Carlo episode, the model uses the collected data to update the policy parameters. Recall the last step shown in the pseudocode above: training the neural network is what updates the policy. The network fits the vector of states to targets built from the gradients, scaled by the discounted rewards and the learning rate. This step performs the stochastic gradient ascent optimization.
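One common way to realize this step in Keras, and the sketch followed here, is to build training targets from the old probabilities plus 𝛼 times the score-function gradient scaled by the discounted returns:

```python
    def update_policy(self):
        """Fit the network on the episode so the policy moves up the gradient."""
        states = np.vstack(self.states)
        actions = np.vstack(self.actions)            # one-hot encoded actions
        probs = np.vstack(self.probs)
        discounted = self.get_discounted_rewards(self.rewards)

        # Score-function "gradient" for each step, scaled by the discounted return.
        gradients = (actions - probs) * discounted[:, None]
        targets = probs + self.alpha * gradients

        # Training with cross-entropy on these targets nudges the policy towards
        # actions that led to higher returns (stochastic gradient ascent).
        self.model.train_on_batch(states, targets)

        # Clear the episode memory for the next rollout.
        self.states, self.actions, self.probs, self.rewards = [], [], [], []
```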

Training the model

This method runs the training loop. Iterating through a set number of episodes, it uses the model to sample actions and play them out. When an episode ends, the model uses the recorded observations to update the policy.
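A sketch of the training loop, tying the pieces above together (the episode count and the older Gym step/reset API are assumptions):

```python
    def train(self, episodes=500):
        """Run the training loop and return the total reward of every episode."""
        episode_rewards = []
        for episode in range(episodes):
            state = self.env.reset()      # older Gym API: reset() returns the state
            done, total_reward = False, 0.0

            while not done:
                action, probs = self.get_action(state)
                next_state, reward, done, _ = self.env.step(action)
                self.remember(state, action, probs, reward)
                state = next_state
                total_reward += reward

            # The episode ended: use the recorded observations to update the policy.
            self.update_policy()
            episode_rewards.append(total_reward)
        return episode_rewards


# Illustrative usage:
# agent = REINFORCE(env)
# rewards = agent.train(episodes=500)
```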

Results

REINFORCE performs well in the CartPole environment. The figure below demonstrates the average reward as a function of the episode index. Here, the average reward is a direct representation of the episode length.

Final Thoughts

REINFORCE is a powerful algorithm with clear benefits, especially for learning in stochastic environments, as explained and shown above. However, it is not a flawless learning algorithm. One shortcoming, mentioned above, is its high variance. Thus, it is reasonable to expect REINFORCE to perform worse in less suitable (or more complex) environments.

This issue is not unique to REINFORCE since most RL algorithms demonstrate some tradeoffs that dictate their strengths and weaknesses. Therefore, very complex environments may present a challenge to many of the commonly used RL algorithms. A possible solution for that is ensemble learning.

Ensemble learning is a method of combining multiple learning models to produce a single, more robust learner. The following post, which builds on the REINFORCE implementation presented here, this SARSA implementation, and this Q-learning implementation, explores the potential of ensemble learning. See the complete repository here.

References

Ghimire, B., et al. (2012). An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS Journal of Photogrammetry and Remote Sensing. Retrieved April 4, 2020, from https://doi.org/10.1016/j.isprsjprs.2011.11.002

Silver, D. (2015). Lecture 7: Policy Gradient. UCL Course on RL.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
