Ensemble Reinforcement Learning in OpenAI Gym

A four-part series on implementing Ensemble Learning in Cartpole using Q-learning, Deep SARSA, and Deep REINFORCE.

Gelana Tostaeva
3 min read · Apr 25, 2020

This post serves as a table of contents for our tutorial series. Here, we briefly go over the idea behind Ensemble RL and review the Cartpole environment. Each post can act as a standalone tutorial, but we promise the sum (the Ensemble RL) is more fun than its parts!

Our tutorial series will:

  • introduce ensemble reinforcement learning;
  • provide an overview and implementation of Deep SARSA (first post), Q-learning (second post), and Deep REINFORCE (third post) algorithms using Python and Keras;
  • walk you through comparing and combining the three algorithms to solve OpenAI Gym’s simple Cartpole problem (final post).

You can find the full version of the code used here.

Who this is for: This series is for anyone who has had some prior, even if limited, exposure to reinforcement learning (RL). Our posts assume no knowledge of any of the algorithms, but to implement the models, you should be comfortable with Python and Keras. We will not go in-depth on OpenAI Gym, but it should be easy to follow regardless of your background.

the Ensemble Learning approach

Very complex environments can pose a challenge to many of the commonly used RL algorithms. One possible solution is ensemble learning.

Ensemble learning is a method of combining multiple learning models to produce a single, more robust learner. We will go into more detail on what this is in our final post. For now, we will start with a simple environment and focus on understanding the learning models — the three algorithms — we combine into our Ensemble.

the environment

We will use OpenAI Gym’s Cartpole environment for our implementations. It is essentially an inverted pendulum problem: a pole is attached to a cart by a pivot, and our goal is to keep the pole balanced upright by applying appropriate horizontal forces to the cart.

Figure 1: A randomly swinging cartpole — the pendulum in our problem. The purple-colored pivot point is moved horizontally.

As seen in Figure 1, there are only two possible actions the Cartpole agent can take: move left or move right. The state in this environment consists of the cart’s position, the cart’s velocity, the pole’s angle, and the pole’s angular velocity; without going into the physics, the key point is that our state space is continuous and four-dimensional.

We receive a reward of +1 for every timestep the pole stays upright. Once it falls over (or the cart strays too far from the center), the episode is over. To win, we need to find a behavior policy that keeps the pole upright, in other words, one that consistently earns a high episode reward (the maximum is 500 in CartPole-v1, the version we are using). Let’s see if we can do that using Ensemble RL.
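If you want to poke around the environment before diving into the tutorials, the minimal sketch below shows the four-dimensional state, the two discrete actions, and the per-step reward collected by a purely random policy. It assumes the classic gym API (versions before 0.26), where step returns four values.

```python
import gym

# Create the environment (classic Gym API; Gymnasium's reset/step signatures differ slightly)
env = gym.make('CartPole-v1')

print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

# Run one episode with random actions to see how quickly the pole falls
state = env.reset()
total_reward, done = 0, False
while not done:
    action = env.action_space.sample()            # pick left/right at random
    state, reward, done, info = env.step(action)  # reward is +1 per surviving step
    total_reward += reward
print('Random policy episode reward:', total_reward)
```

A random policy rarely survives more than a few dozen steps, which is why we need the learning agents below.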

the algorithms

As mentioned, we will be using Q-learning, Deep SARSA, and Deep REINFORCE to solve the Cartpole environment. Check out the relevant tutorials on what these are and how they can be implemented in Python and/or Keras:

  • Deep SARSA: this tutorial will show you how to use Keras-RL to build a Neural Network-based version of SARSA;
  • Q-learning: this tutorial will review what Q-learning is and how to implement it in Python;
  • Deep REINFORCE: this tutorial will introduce the policy gradient method and guide you through a Python implementation in Keras (a rough sketch of the kind of network the deep agents use follows this list).
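Both deep agents are built around small fully connected networks that map the four-dimensional CartPole state to the two actions. As a rough, hypothetical sketch only (the layer sizes and hyperparameters here are our own assumptions, not necessarily those used in the tutorials), such a Keras model looks like this:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

def build_model(state_size=4, action_size=2):
    """A small fully connected network for CartPole-sized inputs."""
    model = Sequential()
    model.add(Dense(24, input_dim=state_size, activation='relu'))
    model.add(Dense(24, activation='relu'))
    # 'linear' output for action values (SARSA); a policy network (REINFORCE) would use 'softmax'
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model

model = build_model()
model.summary()
```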

Figure 2: The Ensemble model combining the three trained agents: SARSA, Q-learning, and REINFORCE.

Finally, we combine the models from the three tutorials above in an Ensemble, as shown in Figure 2; find our step-by-step guide on how to do that here. We go over some of the reasons behind using Ensemble learning and provide a comparative analysis of the individual algorithms and the Ensemble.
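As a small preview of that final post, one straightforward way to combine trained agents is majority voting over the action each one picks at every step. The helper below is a hypothetical sketch (the act(state) interface and the agent names are assumptions for illustration, not the tutorials' actual API):

```python
import numpy as np

def ensemble_action(state, agents):
    """Majority vote: each trained agent picks an action for the current state,
    and the ensemble plays the most popular choice."""
    votes = [agent.act(state) for agent in agents]  # e.g. [0, 1, 1]
    return np.bincount(votes).argmax()              # most common action wins

# Hypothetical usage with the three agents from the tutorials:
# action = ensemble_action(state, [sarsa_agent, q_agent, reinforce_agent])
```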

We hope you find our tutorial series useful. This was a collaborative effort with my classmates at Minerva: Pepe wrote about Q-learning, I did SARSA, Gili built a Deep REINFORCE model, and Ash put it all together in an Ensemble.


Gelana Tostaeva

a [wannabe] computational neuroscience student hoping & trying to make learning effective and personalized while traveling the world with Minerva. @gelana_t