Deep Reinforcement Learning: Playing CartPole through Asynchronous Advantage Actor Critic (A3C) with tf.keras and eager execution
By Raymond Yuan, Software Engineering Intern
In this tutorial we will learn how to train a model that can win at the simple game CartPole using deep reinforcement learning. We’ll use tf.keras and OpenAI’s gym to train an agent using a technique known as Asynchronous Advantage Actor Critic (A3C). Reinforcement learning has been receiving an enormous amount of attention, but what is it exactly? Reinforcement learning is an area of machine learning concerned with agents that take actions in an environment in order to maximize some cumulative reward.
In the process, we’ll build practical experience and develop intuition around the following concepts:
- Eager execution — Eager execution is an imperative, define-by-run interface where operations are executed immediately as they are called from Python. This makes it easier to get started with TensorFlow, and can make research and development more intuitive.
- Model subclassing — Model subclassing allows one to make fully-customizable models by subclassing tf.keras.Model and defining your own forward pass. Model subclassing is particularly useful when eager execution is enabled since the forward pass can be written imperatively.
- Custom training loops
We’ll follow the basic workflow:
- Build our master agent supervisor
- Build our worker agents
- Implement the A3C algorithm
- Train our agents
- Visualize our performance
Audience: This tutorial is targeted towards anybody interested in reinforcement learning. While we won’t go into too much depth into the basics of machine learning, we’ll cover topics such as policy and value networks at a high level. In addition, I suggest reading the paper, Asynchronous Methods for Deep Reinforcement Learning by Volodymyr Mnih et al., which is a great read and provides much more detail on the algorithm.
What is CartPole?
CartPole is a game in which a pole is attached by an un-actuated joint to a cart that moves along a frictionless track. The starting state (cart position, cart velocity, pole angle, and pole velocity at the tip) is randomly initialized in the range ±0.05. The system is controlled by applying a force of +1 or −1 to the cart (pushing it left or right). The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
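As a small concrete example, the termination conditions described above can be expressed directly in Python (a sketch using the thresholds from the description, not gym’s internal code):

```python
import math

# Thresholds from the CartPole description above: 15 degrees, 2.4 units.
POLE_ANGLE_LIMIT = 15 * math.pi / 180  # ~0.26 radians
CART_POSITION_LIMIT = 2.4

def episode_done(cart_position, pole_angle):
    """Return True when the pole has fallen too far or the cart left the track."""
    return (abs(pole_angle) > POLE_ANGLE_LIMIT
            or abs(cart_position) > CART_POSITION_LIMIT)
```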
Establishing a baseline
It’s often very useful to establish a baseline in order to properly judge your model’s actual performance and the metrics by which you evaluate it. For example, your model may seem to be performing well when you see high scores being returned, but in reality the high scores may not reflect a good algorithm; they may simply be the result of random actions. In a classification setting, we can establish baseline performance by analyzing the class distribution and always predicting the most common class. But how do we establish a baseline for reinforcement learning? We will create a random agent that simply performs random actions in our environment.
For the game CartPole, a random agent averages a score of ~20 across 4000 episodes. To run the random agent, run the provided py file:
python a3c_cartpole.py --algorithm=random --max-eps=4000
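The random baseline itself amounts to sampling actions uniformly. Here is a hedged sketch written against the gym-style interface (`env.reset()`, `env.step(action)`, `env.action_space.n`), usable with any environment that follows it:

```python
import random

def random_baseline(env, n_episodes):
    """Average episode reward when actions are sampled uniformly at random.

    `env` is assumed to follow the gym interface: reset() -> state,
    step(action) -> (state, reward, done, info).
    """
    total = 0.0
    for _ in range(n_episodes):
        env.reset()
        done = False
        while not done:
            action = random.randrange(env.action_space.n)
            _, reward, done, _ = env.step(action)
            total += reward
    return total / n_episodes
```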
What is the Asynchronous Advantage Actor Critic algorithm?
Asynchronous Advantage Actor Critic is quite a mouthful! Let’s start by breaking down the name, and then the mechanics behind the algorithm itself.
- Asynchronous: The algorithm is an asynchronous algorithm where multiple worker agents are trained in parallel, each with their own copy of the model and environment. This allows our algorithm to not only train faster as more workers train in parallel, but also to attain more diverse training experience, as each worker’s experience is independent.
- Advantage: The advantage measures how much better an action turned out than the value function predicted. This allows the algorithm to focus on where the network’s predictions were lacking. Intuitively, it measures the advantage of taking action, a, over following the policy π at the given time step.
- Actor-Critic: The Actor-Critic aspect of the algorithm uses an architecture that shares layers between the policy and value function.
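In its simplest form, the advantage we will use is just the gap between the observed discounted return and the critic’s estimate. A one-line numpy sketch:

```python
import numpy as np

def advantage(discounted_returns, value_estimates):
    """A(s) = R - V(s): how much better things went than the critic expected."""
    return np.asarray(discounted_returns) - np.asarray(value_estimates)
```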
But how does it work?
At a high level, the A3C algorithm uses an asynchronous updating scheme that operates on fixed-length time steps of experience. It will use these segments to compute estimators of the rewards and the advantage function. Each worker performs the following workflow cycle:
- Fetch the global network parameters
- Interact with the environment by following the local policy for min(t_max, steps to terminal state) number of steps.
- Calculate value and policy loss
- Get gradients from losses
- Update the global network with the gradients
With this training configuration, we expect to see a near-linear speedup with the number of agents. However, the number of agents your machine can support is bounded by the number of CPU cores available; adding more than that may prove detrimental to speed and performance. A3C can even scale to more than one machine, and some newer research (such as IMPALA) supports scaling it even further. Check out the paper for more in-depth information!
Refresher for policy and value functions
If you are already comfortable with policy gradients, then I would recommend skipping this section. Otherwise, if you don’t know what policies/values are or just want a quick refresher, read on!
The idea of a policy is to parameterize the action probability distribution given some input state. We do so by creating a network that takes the state of the game and decides what we should do. While the agent is playing the game, whenever it sees a certain state (or similar states), it computes the probability of each available action given that state, and then samples an action according to this probability distribution. To put this more formally, policy gradients are a special case of the more general score function gradient estimator. The general case is expressed as the expectation E_{x∼p(x|θ)}[f(x)]; in our case, the expectation of our reward (or advantage) function, f, under the policy network, p, parameterized by θ. Then, using the log derivative trick, we work out how to update the network’s parameters so that sampled actions obtain higher rewards, and arrive at ∇_θ E_x[f(x)] = E_x[f(x) ∇_θ log p(x|θ)]. In English, this equation says that shifting θ in the direction of this gradient will increase our scores according to our reward function f.
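To make the estimator concrete, here is a small self-contained numeric demo of the score-function update for a two-action softmax policy (a REINFORCE-style toy, not the A3C code; all numbers and names are illustrative):

```python
import numpy as np

# Score-function estimator: grad E_x[f(x)] = E_x[f(x) * grad log p(x)].
# Here p is a softmax policy over two actions, parameterized by logits theta.
rng = np.random.default_rng(0)
theta = np.zeros(2)  # policy parameters (logits)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_prob(theta, action):
    # d/dtheta of log softmax(theta)[action]
    onehot = np.eye(len(theta))[action]
    return onehot - softmax(theta)

# Pretend action 1 yields reward 1 and action 0 yields reward 0.
rewards = np.array([0.0, 1.0])
for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    theta += 0.1 * rewards[a] * grad_log_prob(theta, a)  # gradient ascent

# After these updates, the policy should strongly prefer the rewarding action.
```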
Value functions essentially judge how good a certain state is to be in. Formally, the value function defines the expected sum of rewards when starting in state s and following policy π. This is where the “Critic” part of the name becomes relevant: the agent uses the value estimate (the critic) to update the policy (the actor).
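The target the value function is trained toward is the discounted sum of rewards. A short sketch (gamma and the bootstrap convention are standard choices, not tied to the original code):

```python
def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    """Discounted sum of rewards, the target the value function is trained toward.

    `bootstrap` is the value estimate of the state following the last reward
    (0 for a terminal state).
    """
    returns = []
    running = bootstrap
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))
```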
Let’s first define what kind of model we’ll be using. The master agent will have the global network and each local worker agent will have a copy of this network in their own process. We’ll instantiate the model using Model Subclassing. Model subclassing gives us maximum flexibility at the cost of higher verbosity.
As you can see from our forward pass, our model will take inputs and return policy probability logits and values.
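For concreteness, here is a minimal sketch of such a subclassed model, consistent with the description above (the layer sizes and names are illustrative, not necessarily those of the original code):

```python
import numpy as np
import tensorflow as tf

class ActorCriticModel(tf.keras.Model):
    """A shared input feeds two heads: policy logits (actor) and a value (critic)."""
    def __init__(self, state_size, action_size):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(100, activation="relu")
        self.policy_logits = tf.keras.layers.Dense(action_size)
        self.dense2 = tf.keras.layers.Dense(100, activation="relu")
        self.values = tf.keras.layers.Dense(1)

    def call(self, inputs):
        # The forward pass is written imperatively, as eager execution allows.
        x = self.dense1(inputs)
        logits = self.policy_logits(x)
        v = self.dense2(inputs)
        values = self.values(v)
        return logits, values
```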
Master Agent — Main thread
Let’s get to the brain of the operation. The master agent holds a shared optimizer that updates its global network. This agent instantiates the global network that each worker agent will update as well as the optimizer that we will use to update it. A3C is shown to be quite resilient to a spread of learning rates, but we’ll use the AdamOptimizer with a learning rate of 5e-4 for the game CartPole.
The master agent will run the train function to instantiate and start each of the worker agents. The master agent handles the coordination and supervision of each agent, and each of these agents runs asynchronously. (Strictly speaking, this is not true parallelism: because of Python’s Global Interpreter Lock (GIL), a single Python process cannot run threads in parallel across multiple cores. It can, however, run them concurrently, context-switching during I/O-bound operations. We implement the workers with threads for the sake of simplicity and clarity of example.)
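The coordination pattern can be sketched with Python’s threading module. In this simplified illustration (not the actual training code), a shared episode counter stands in for the global model and the lock guards the shared update:

```python
import threading

class Worker(threading.Thread):
    """Each worker claims episodes from a shared counter until the budget is spent."""
    global_episode = 0
    lock = threading.Lock()

    def __init__(self, max_eps):
        super().__init__()
        self.max_eps = max_eps

    def run(self):
        while True:
            with Worker.lock:
                if Worker.global_episode >= self.max_eps:
                    return
                Worker.global_episode += 1
            # ... a real worker would interact with its own environment here
            # and push gradients to the global network.

# The master thread starts the workers and waits for them to finish.
workers = [Worker(max_eps=100) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```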
Memory Class — Holding our experience
In addition, to make keeping track of things easier, we’ll also implement a Memory class. This class simply keeps track of the actions, rewards, and states that occur at each step.
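A minimal version of such a Memory class might look like this (a sketch consistent with the description; the method names are illustrative):

```python
class Memory:
    """Per-step buffer of states, actions, and rewards for one update window."""
    def __init__(self):
        self.states, self.actions, self.rewards = [], [], []

    def store(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def clear(self):
        self.states, self.actions, self.rewards = [], [], []
```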
Now we get to the crux of the algorithm: the worker agent. The worker agent inherits from threading.Thread, and we override its run method. This will allow us to achieve the first A in A3C: asynchronous. First, we’ll instantiate a local model and set up the specific training parameters.
Running the algorithm
The next step is to implement the run function, which actually runs our algorithm. We’ll run all the threads for a given global maximum number of episodes. This is where the third A of A3C, actor, comes into play. Our agent will “act” according to our policy function, becoming the actor, while the action is judged by the “critic,” our value function. While this section of the code may seem dense, it really isn’t doing much. Within each episode, the code simply does the following:
- Get our policy (action probability distribution) based on the current frame
- Step with action chosen according to policy
- If the agent has taken a set number of steps (args.update_freq) or has reached a terminal state (has died), then:
a. Update the global model with gradients computed from local model
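The episode loop above can be sketched as the following skeleton, with `choose_action` and `update_global` standing in for the policy sample and the gradient push (all names here are hypothetical):

```python
def worker_run(env, choose_action, update_global, update_freq, max_steps):
    """Skeleton of a worker's episode loop: act, record, and push an update
    every `update_freq` steps or at a terminal state."""
    memory = []
    state = env.reset()
    for step in range(1, max_steps + 1):
        action = choose_action(state)                    # sample from pi(a|s)
        next_state, reward, done, _ = env.step(action)
        memory.append((state, action, reward))
        if step % update_freq == 0 or done:
            update_global(memory, done)                  # losses -> gradients -> global net
            memory = []
        if done:
            break
        state = next_state
    return step
```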
How are the losses computed?
The worker agent calculates the losses in order to obtain gradients with respect to all of its own network parameters; this is where the last A of A3C, the advantage, comes into play. The gradients are then applied to the global network. The losses are calculated as:
- Value Loss: L = Σ(R − V(s))²
- Policy Loss: L = −log(π(s)) · A(s)
Where R is the discounted reward, V our value function (of an input state), π our policy function (also of an input state), and A our advantage function. We use the discounted reward to estimate the Q value, since we don’t compute the Q value directly in A3C.
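A minimal numpy sketch of these two losses, assuming the discounted returns have already been computed. (In a TF implementation the advantage is typically wrapped in tf.stop_gradient for the policy loss so it does not backpropagate into the critic; that detail is moot in plain numpy.)

```python
import numpy as np

def a3c_losses(discounted_returns, values, action_probs):
    """Value and policy losses as defined above.

    `action_probs` are the policy's probabilities for the actions actually taken.
    """
    returns = np.asarray(discounted_returns, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = returns - values                              # A(s) = R - V(s)
    value_loss = np.sum(advantages ** 2)                       # L = sum (R - V(s))^2
    policy_loss = np.sum(-np.log(action_probs) * advantages)   # L = -log pi * A
    return value_loss, policy_loss
```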
That’s it! The worker agent will repeat the process of resetting the network parameters to all of those in the global network, and repeating the process of interacting with its environment, computing the loss, and then applying the gradients to the global network. Train your algorithm by running the command:
python a3c_cartpole.py --train
Testing the algorithm
Let’s test the algorithm by spinning up a new environment and simply following our policy output by our trained model. This will render our environment and sample from the policy distribution from our model.
You can run this with the following command:
python a3c_cartpole.py once you’ve trained your model.
To examine our moving average of scores:
We should see the moving average converge to scores above 200. The game is considered “solved” when the agent attains an average reward of at least 195.0 over 100 consecutive trials.
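A quick sketch of the moving average and the “solved” check (the helper names are illustrative):

```python
def moving_average(scores, window=100):
    """Running mean of the last `window` episode scores, as plotted above."""
    averages = []
    for i in range(len(scores)):
        recent = scores[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages

def solved(scores, threshold=195.0, window=100):
    """CartPole's 'solved' criterion: average reward of at least `threshold`
    over `window` consecutive trials."""
    return (len(scores) >= window
            and sum(scores[-window:]) / window >= threshold)
```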
And an example performance on a new environment:
What we covered:
- We solved CartPole by implementing A3C!
- We did so by using Eager Execution, Model Subclassing, and Custom Training loops.
- Eager is an easy way to develop training loops that makes coding easier and clearer since we’re able to print and debug tensors directly.
- We learned the basics of reinforcement learning with policy and value networks, and then we tied them together to implement A3C.
- We iteratively updated our global network by computing gradients with tf.GradientTape and applying our optimizer’s update rules.