Learning with Deep SARSA & OpenAI Gym

A guide through implementing the deep learning SARSA algorithm in OpenAI Gym using Keras-RL.

Gelana Tostaeva
The Startup
6 min read · Apr 25, 2020


This tutorial will:

  • provide a brief overview of the SARSA algorithm in its general form;
  • motivate the deep learning approach to SARSA and guide you through an example using OpenAI Gym’s Cartpole game and Keras-RL;
  • serve as one of the initial steps to using Ensemble learning (scroll to the end to find out more!).

You can find the full version of the code used here.

Who this is for: As with our previous post, this tutorial is for anyone who has had some prior, even if limited, exposure to reinforcement learning (RL). This post assumes no knowledge of SARSA, but to implement it, you should be comfortable with Python and Keras. We will not go in depth on OpenAI Gym, but it should be easy to follow regardless of your background.

SARSA: the basics

In a previous post, we looked at Q-learning, a method for learning a good policy by estimating the values of state-action pairs in an unknown environment. SARSA presents another way of doing the same thing.

Unlike Q-learning, SARSA (State-Action-Reward-State-Action) is an on-policy method: its update uses the value of the next state and the action the current policy actually takes there. Herein lies the on-policy assumption: the agent follows its current policy and estimates the state-action values accordingly. Recall that Q-learning uses the greedy action instead; the two algorithms behave the same way when the policy a SARSA agent follows is itself greedy.

Here’s a simple way to think about the on-policy vs. off-policy distinction. Any RL agent has two policies: a behavior policy and a learning policy. The behavior policy generates the actions the agent uses to interact with, and sample, its environment; the learning policy is the one the agent improves from those interactions. In SARSA the two are the same; in off-policy methods, like Q-learning, they are not.
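To make the distinction concrete, here is a toy tabular sketch (the states, rewards, and hyperparameter values are purely illustrative, not the deep version we build later) contrasting the two update targets:

```python
import numpy as np

# Toy Q-table for a made-up 5-state, 2-action problem (illustrative numbers only).
q = np.zeros((5, 2))
gamma, alpha = 0.99, 0.1   # discount factor, learning rate

s, a = 0, 1                # current state and the action just taken
r, s_next = 1.0, 2         # observed reward and next state
a_next = 0                 # action the behavior policy actually picks in s_next

# SARSA (on-policy): bootstrap from the action the current policy takes next.
sarsa_target = r + gamma * q[s_next, a_next]

# Q-learning (off-policy): bootstrap from the greedy (highest-value) action instead.
q_learning_target = r + gamma * q[s_next].max()

# Either way, the update nudges Q(s, a) toward the chosen target.
q[s, a] += alpha * (sarsa_target - q[s, a])
```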

Knowing how this learning policy is updated is key to understanding SARSA. Formally, this update involves updating the Q-value estimates of the state-action pairs, i.e., Q(s,a).

Suppose we have two states. Here’s what a SARSA agent would do:

  • state one: the agent starts here, takes action one, and gets a reward;
  • state two: after transitioning here, it picks action two with its current policy and uses that pair’s value, together with the reward, to update the value of action one back in state one.

For comparison, Q-learning does something slightly less straightforward: a Q-learning agent updates the value of its first action in state one using the highest Q-value available in the second state, regardless of which action it actually takes there. If this is not clear, check out our first RL tutorial on this topic; we will not be discussing Q-learning in detail here.

This process of updating the Q-value, Q(s,a), from the reward and the next state-action pair is laid out in the SARSA algorithm:

Figure 1: The SARSA algorithm in its most general form.

The specific update step is underlined: we adjust our estimate of Q(s,a) using the observed reward, r, and the value of the next state-action pair, where the next action is the one the behavior policy actually selects. This is likely the only thing you need to know about SARSA before trying to implement it; we go over how exactly after introducing the OpenAI Gym environment we are working with.
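In symbols, with learning rate α and discount factor γ, the underlined step is the standard SARSA update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]$$

where a_{t+1} is the action the current policy selects in the next state s_{t+1}.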

OpenAI Gym: the environment*

*this section is included in the Ensemble RL tutorial series introduction post — skip it if you’ve already read it!

We will use OpenAI Gym’s Cartpole environment for our implementation. It is essentially an Inverted Pendulum problem: the pole wants to swing down, but it can be kept balanced upright by applying appropriate horizontal forces to the cart (the pivot point) it sits on.

Figure 2: A randomly swinging cartpole — the pendulum in our problem. The purple-colored pivot point is moved horizontally.

As seen in Figure 2, there are only two possible actions the Cartpole agent can take: move left or move right. The state consists of the position of the cart (the pivot point), its velocity, the pole’s angle, and the pole’s angular velocity; without going into the physics, the thing to note is that our state space is continuous and four-dimensional.

We get a reward of +1 for every timestep the pole stays up. Once it falls, the game episode is over. To win, we need to find a behavior policy that keeps our pole upright, in other words one that keeps earning a consistently high total reward (the maximum is 500 in CartPole version 1, which we are using). Let’s see if we can do that using Keras-RL.
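To get a feel for the environment, a few lines of Gym are enough. This is a minimal sketch of a random agent using the classic Gym API from the time of writing (later Gym/Gymnasium releases changed the reset/step signatures):

```python
import gym

env = gym.make('CartPole-v1')
state = env.reset()   # 4 continuous values: cart position, cart velocity, pole angle, angular velocity
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()             # push left (0) or right (1) at random
    state, reward, done, info = env.step(action)   # +1 reward for every step the pole stays up
    total_reward += reward

print('Random policy total reward:', total_reward)
```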

SARSA: the Neural Network take

For our SARSA agent class, we will be using the original Keras-RL implementation which you can find here.

As briefly mentioned, our SARSA model serves as one of the initial steps toward Ensemble learning (further explored in this post). To get there, we needed the agent to output a vector of probabilities of taking each action. Thus, we had to make a minor modification to the Keras-RL agent, specifically:

  • add a q_values attribute to the SARSA class;
  • add a way to save the probabilities in a normalized vector during training (a rough sketch follows below).
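Here is a rough sketch of what such a modification can look like. The class and attribute names are hypothetical, and it assumes the compute_q_values() helper exposed by the keras-rl SARSA agent; the actual change lives in the linked code.

```python
import numpy as np
from rl.agents import SARSAAgent

class SARSAWithProbs(SARSAAgent):
    """Hypothetical subclass that records a normalized action-probability
    vector every time an action is selected."""

    def __init__(self, *args, **kwargs):
        super(SARSAWithProbs, self).__init__(*args, **kwargs)
        self.q_values = []   # history of normalized Q-value vectors

    def forward(self, observation):
        # Let keras-rl choose the action exactly as before.
        action = super(SARSAWithProbs, self).forward(observation)
        # Softmax-normalize the raw Q-values so they can be read as probabilities.
        q = self.compute_q_values([observation])
        exp_q = np.exp(q - np.max(q))
        self.q_values.append(exp_q / exp_q.sum())
        return action
```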

To see how this was done in Python, please see the highlighted parts in the full code here. We will focus our tutorial on actually using a simple neural network SARSA agent to solve the Cartpole problem.

The first step is to initialize our environment. Before doing so, make sure you import the Keras, Gym, and Numpy libraries. We also fix the seed for reproducibility.
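A minimal version of that setup might look like this (the seed value is arbitrary):

```python
import numpy as np
import gym

ENV_NAME = 'CartPole-v1'
SEED = 123                    # arbitrary value, fixed for reproducibility

env = gym.make(ENV_NAME)
np.random.seed(SEED)
env.seed(SEED)                # classic Gym API; newer versions seed via reset(seed=...)

nb_actions = env.action_space.n          # 2: push left or push right
obs_shape = env.observation_space.shape  # (4,): the continuous state values
print(nb_actions, obs_shape)
```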

As described, our game has 2 possible actions and a 4-dimensional state. We will need these numbers to create our Neural Network model.

For simplicity (and this is a simple environment!), we build a feedforward Keras model with just two hidden Dense layers. For the input, we use a Flatten layer over a shape of (1, 4): Keras-RL adds a window dimension to each observation, and Flatten collapses it back to our four state values. The output layer has two neurons, one per possible action, each returning that action’s estimated Q-value; we use a linear activation so these estimates are unconstrained.
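A sketch of such a model is below. The hidden layer sizes are a reasonable guess rather than the exact values from the linked notebook, and the imports target the standalone Keras that the original keras-rl works with (swap in tensorflow.keras if you use keras-rl2):

```python
from keras.models import Sequential
from keras.layers import Dense, Flatten

model = Sequential()
# Keras-RL feeds each observation with an extra window dimension,
# so the input shape is (1,) + (4,); Flatten collapses it to 4 features.
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
# One linear output per action: the estimated Q-value of moving left or right.
model.add(Dense(nb_actions, activation='linear'))
model.summary()
```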

Because we assumed you have used Keras before, we will not go into more detail about the network; if you need one, here’s a good refresher.

Once we have our Neural Network, we initialize our Keras-RL SARSA agent with the EpsGreedyQPolicy() policy, under which the agent takes a random action with probability epsilon (the Eps part) and the greedy, highest-value action otherwise.
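Putting it together with keras-rl might look as follows; the hyperparameters (epsilon, learning rate, number of training steps and test episodes) are illustrative rather than the exact values from the linked notebook:

```python
from keras.optimizers import Adam
from rl.agents import SARSAAgent
from rl.policy import EpsGreedyQPolicy

policy = EpsGreedyQPolicy(eps=0.1)   # random action 10% of the time, greedy otherwise
sarsa = SARSAAgent(model=model, nb_actions=nb_actions, policy=policy, nb_steps_warmup=10)
sarsa.compile(Adam(lr=1e-3), metrics=['mae'])

# Train, then evaluate the learned behavior over a number of test episodes.
sarsa.fit(env, nb_steps=50000, visualize=False, verbose=1)
scores = sarsa.test(env, nb_episodes=300, visualize=False)
```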

And that is it! Let’s plot our rewards to see how our model did.
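One way to do that, assuming scores is the History object returned by sarsa.test() in the sketch above:

```python
import matplotlib.pyplot as plt

# `scores` is the History object returned by sarsa.test() in the sketch above.
episode_rewards = scores.history['episode_reward']
plt.plot(episode_rewards)
plt.xlabel('Testing episode')
plt.ylabel('Total reward')
plt.title('Deep SARSA on CartPole-v1: testing rewards')
plt.show()
```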

Figure 3: Total rewards in testing over all game episodes.

Interestingly, our agent behaves rather strangely: it gets a high reward fairly consistently except for the occasional episode where it fails to keep the pole up. This happens around testing episode 270, when our agent lasts only 38 steps (compared to the usual 140–160 range) before the environment resets; the SARSA agent appears to have settled into a suboptimal, locally optimal policy. Regardless, our model still stays within the positive reward range, so all is not completely lost, and the cartpole is mostly kept balanced.

We hope this tutorial has been helpful in guiding you through a Neural Network implementation of SARSA. Although the need for a deep learning agent in such a simple environment is debatable, it is still a straightforward way to practice your RL skills. We encourage you to play around with the network model and try to improve its performance.

Ensemble learning is another potential way to improve performance, especially in more complex environments. It works by combining several models to create a more effective learner. Check out our Ensemble learning tutorial series to see how it can be done. Building on the Deep SARSA implementation presented here, this one of Q-learning, and this one of Deep REINFORCE, the final post of our series will guide you through a more advanced RL approach.
