The Inverted Pendulum Problem with Deep Reinforcement Learning

A look into Keras-RL and OpenAI libraries

Saif Uddin Mahmud
Dabbler in Destress
10 min read · Mar 12, 2019


A professor of mine introduced me to the rather simple inverted pendulum problem: balance a stick on a moving platform, a hand, let's say. Intuition built on the physics of the "Game Engine" in our head tells us: if the stick is leaning to the left, move your hand to the left; if the stick is leaning to the right, move your hand to the right. We humans are exceptional at learning novel tasks like these from very few sample points. Could a simple Arduino balance it?

Control Theory is the obvious way to go, and having had prior experience tinkering with PID, my overconfident self walked straight into Steve Brunton's excellent Control Bootcamp…only to return disillusioned. The task was not trivial and involved heavy math. Well, if I couldn't learn it, I'd let the machines learn it themselves. I tried again with David Silver's Reinforcement Learning class. Now, Q-Learning and policy methods based on Markov Decision Processes are cool and all, but they still seemed unwieldy for continuous state spaces like the inverted pendulum's, especially to a lazy programmer like me. How could I get this inverted pendulum up and running as soon as possible without all this pain? Enter Deep Reinforcement Learning, which is basically letting a neural network learn to approximate the functions used in Reinforcement Learning. Third time's the charm. Let's get started.

Shout out to Thomas Simonini’s amazing series on Medium which taught me most of the basics demonstrated here.

Reinforcement Learning

The idea behind Reinforcement Learning is to model how human beings learn. We try actions on the current state of our environment and receive a reward. After a few trials, we begin predicting which state we will land on next given the current state and the chosen action. With all this information reinforced, given a state we know which action maximizes our immediate reward plus the future reward, since we know where we'll end up. RL tries things again and again to build a lookup table with this information, and later uses it to make decisions once training is over.

For our particular inverted pendulum, the possible actions are [go_right, go_left], the environment is the simulation, and the state is [angleOfStickWithVertical, angularVelocityOfStick, positionOfPlatform, velocityOfPlatform]. The system rewards you with 1 point for each "moment" (time step) you keep the pendulum upright, until it topples over.
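If you want to poke at the environment yourself, the interaction loop with the classic (pre-0.26) gym API looks roughly like this; the random policy here is just a stand-in:

```python
import gym

env = gym.make('CartPole-v1')   # gym's version of the inverted pendulum on a cart

state = env.reset()             # 4 numbers: cart position, cart velocity, pole angle, pole angular velocity
done, score = False, 0
while not done:
    action = env.action_space.sample()            # 0 = push cart left, 1 = push cart right (random for now)
    state, reward, done, info = env.step(action)  # +1 reward for every step the pole stays up
    score += reward
print('Episode score:', score)
```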

Deep Q Network

For my experiments, I used OpenAI gym’s Cartpole Environment and Keras-rl. OpenAI gym is a library of simulations and Atari games made for RL tinkerers like us. Keras-rl is a deep RL library that does all the heavy-implementation-lifting for you and lets you focus on tuning your Neural Network. Perfect!

Edit (28/08/2022):
Admittedly, I've lost touch with the field of Reinforcement Learning since I wrote this article. It seems the number of tools available to enthusiasts and practitioners has grown quite a bit.
Here's a site that lists such tools: https://neptune.ai/blog/the-best-tools-for-reinforcement-learning-in-python

For our first approach, we’ll use a Deep Q-Network, or DQN for short. I used a relatively simple network with 3 hidden layers of 16 relu neurons each. As stated before, we have 4 inputs and 2 outputs.
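A sketch of that network with the Keras Sequential API, reusing the env from above (only the layer sizes come from the text; the rest is standard keras-rl boilerplate):

```python
from keras.models import Sequential
from keras.layers import Dense, Flatten

nb_actions = env.action_space.n   # 2: go_left, go_right

model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))  # keras-rl passes a window of observations
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(nb_actions, activation='linear'))                   # one Q-value per action, so no activation
```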

Image Courtesy of https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682

The original CartPole environment has a maximum episode length of 200 steps. However, as I trained the first neural nets, I noticed that the agent scored full points even though it was unstable; had it been simulated for 10 more steps, the pendulum surely would have dropped. I increased max_episode_steps to 500 to see if it would stop "cheating".
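One way to do that is to poke at the TimeLimit wrapper that gym puts around the environment. This worked on the gym version I had, but it is an internal attribute, so treat it as a sketch:

```python
env = gym.make('CartPole-v1')
env._max_episode_steps = 500   # raise the per-episode step cap enforced by gym's TimeLimit wrapper
```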

Here we set up the "experience replay" memory. The agent plays the game and stores every transition as an object in this memory variable. When it needs to train, it samples mini_batch_size replays from this memory at random.
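In keras-rl this is a SequentialMemory; the buffer size below is illustrative:

```python
from rl.memory import SequentialMemory

# Stores (state, action, reward, next_state, terminal) transitions; the oldest are evicted past the limit.
memory = SequentialMemory(limit=50000, window_length=1)
```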

Here we specify the policy, i.e. the rule according to which the agent acts. Think of this as our lookup table. The EpsGreedyQPolicy() tells our agent to take the action with the greatest Q-value (potential reward) at a given state. This is the exploitation part. Linearly annealing means that our agent will start off curious, exploring the different actions it can take, and slowly reduce the exploration rate to 0.1 over 10,000 steps. There's a nice analogy in the book Algorithms to Live By: when you move to a new city, you're likely to try out a lot of places (exploration); but when you're about to move away, you'll stick to your trusted old favorites (exploitation).
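In keras-rl terms, that is a LinearAnnealedPolicy wrapped around an EpsGreedyQPolicy; apart from the 0.1 floor and the 10,000-step schedule mentioned above, the values are illustrative:

```python
from rl.policy import EpsGreedyQPolicy, LinearAnnealedPolicy

# Epsilon starts at 1.0 (explore everything) and is annealed linearly down to 0.1 over 10,000 steps.
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps',
                              value_max=1.0, value_min=0.1,
                              value_test=0.05, nb_steps=10000)
```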

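Putting the pieces together, the agent itself is a single constructor call; the hyperparameter values shown here are illustrative rather than my exact settings:

```python
from rl.agents.dqn import DQNAgent

dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, policy=policy,
               nb_steps_warmup=10, target_model_update=1e-2, gamma=0.99,
               enable_double_dqn=False)   # plain DQN for now; we'll flip this flag later
```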
Lots happening on this one line of code. Let’s break it down:
nb_steps_warmup asks our DQNAgent to play the specified number of steps with the randomly initialized policy and gather some experience for the training.

For target_model_update we’ll need to look at the loss function, and some of the original Q-Learning math. The Bellman equation says that the Q-value is the expected future reward given a state and an action:
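Spelled out in standard notation (my reconstruction of the equation from the linked article), it is roughly:

```latex
Q(s_t, a_t) = \mathbb{E}\left[\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \,\middle|\, s_t, a_t \right]
```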

Image courtesy of https://medium.freecodecamp.org/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

In Q-Learning, the following equation dictates how we update iteratively to converge to the actual Q-table:
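In standard notation (again my reconstruction), with learning rate \alpha and discount factor \gamma:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```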

Image courtesy of https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8

For our DQN loss, the error is the difference between Q_target (computed with the target_model) and Q_value (computed with the current model):
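Roughly, with \theta the weights of the current model and \theta^{-} the weights of the target_model:

```latex
L = \left( \underbrace{r + \gamma \max_{a'} Q(s', a'; \theta^{-})}_{Q\_target} - \underbrace{Q(s, a; \theta)}_{Q\_value} \right)^2
```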

Image courtesy of https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8

If we update our target_model after each episode, our neural network will be trying to hit a target that is always changing, which is not good for convergence. Instead, with the target_model_update parameter, we make sure the target is more stable. There are two ways of doing it: update target_model only after a certain number of steps (if target_model_update >= 1), or nudge target_model a little each time by taking a weighted average of the old model and the new model (if target_model_update < 1). We go with the second option since it seemed to work better in our trials.

gamma is the discounting factor used in calculating our reward. From state at time t, we can predict the reward at t+1 and new state at t+1, and thus we can predict the reward at t+2 as well and so on till infinity. However, we can’t be certain of the future rewards if our environment is stochastic (consists of unmodeled/random uncertainties) instead of deterministic. Thus we care about instant gratification more than expected future returns. Now how much you value future rewards in relation to immediate reward is controlled by gamma: the higher the gamma, the more the future is valued.

We compile with Adam at a learning rate of 0.001 and train our model for 100,000 steps (a variable number of episodes). Time for the results! Let's see how our agent does.
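The corresponding keras-rl calls look roughly like this (the number of test episodes is my own choice); the test run at the end is how we watch the agent in action:

```python
from keras.optimizers import Adam

dqn.compile(Adam(lr=1e-3), metrics=['mae'])                 # Adam with learning_rate = 0.001
dqn.fit(env, nb_steps=100000, visualize=False, verbose=2)   # train for 100,000 environment steps
dqn.test(env, nb_episodes=5, visualize=True)                # watch the trained agent balance the pole
```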

Not bad at all. It took some time to tune the hyperparameters to handle 500 steps, but the end result after 10 minutes of training was an agent that consistently scored the maximum of 500. The agent does slowly drift to the right on every simulation, and it also makes a weird, seemingly unnecessary jerky movement in the middle.

The Forgetting Problem

Not all is fine, however. I showed you the performance of the agent that performed well, not the performance of the agent after the training ended. Something weird happens in between. Take a look for yourself:

reward (y-axis) vs #episodesPlayed (x-axis)

The agent learns and forgets repeatedly. Thankfully, I wasn't the first one to stumble onto this: there is a comprehensive 3-part article by Adam Green that tests various hypotheses to try and solve it. None of the tricks worked for me, though. I agree with Adam's conclusion that some form of early_stopping may solve the problem.

Dueling DQN

The Q-value represents the value of taking an action, a, given a state, s. We can thus break down the Q-value as the sum of the value of being in the current state, V(s), and the advantage of taking an action from there, A(s, a). Instead of letting our DQN estimate both jointly, we can split the second-to-last layer into a V(s) stream and an A(s, a) stream to decouple the two effects, which potentially makes learning easier. We could do this explicitly with the functional Model API in Keras, but Keras-RL takes care of it with a single argument, enable_dueling_network, set to True. If you look at the source code on their GitHub repository, you'll see that it basically splits the second-to-last layer into (numNeurons - numActions) and (numActions) parts and then adds an aggregating layer before the output.

Image courtesy of https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682

Let’s change the DQNAgent line and compare its performance.
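Something along these lines; only the two dueling-related arguments change, the rest stays as before:

```python
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, policy=policy,
               nb_steps_warmup=10, target_model_update=1e-2, gamma=0.99,
               enable_dueling_network=True, dueling_type='avg')  # keras-rl splits the net into V(s) and A(s, a) streams for us
```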

The dueling DQN seemed to perform worse during the learning phase and suffered from the forgetting problem too, but it somehow (perhaps luckily) smoothed out at the last moment. It also solves the environment a bit differently, drifting to the left this time and jerking at the start.

Double DQN

If you've been following along, you'll notice that we both choose an action and evaluate that action's value using the same network. This tends to overestimate the value of a particular action and, through positive feedback, blow it up. Double DQN trains 2 networks, let's call them Q1 and Q2. When we choose an action using Q1, we evaluate it using Q2, and vice versa.
I'll let the author of the Double Q-Learning paper explain, as he did on a Reddit thread:

More precisely, we update one value function, say Q1, towards the sum of the immediate reward and the value of the next state. To determine the value of the next state, we first find the best action according to Q1, but then we use the second value function, Q2, to determine the value of this action.

Similarly, and symmetrically, when we update Q2 we use Q2 to determine the best action in the next state but we use Q1 to estimate the value of this action.

The goal is to decorrelate the selection of the best action from the evaluation of this action. You don’t need two symmetrically updated value functions to do this. In our follow-up work on Double DQN ( https://arxiv.org/abs/1509.06461 ) we instead used a slow moving copy to evaluate the best action according to the main Q network. This turns out to decorrelate the estimates sufficiently as well.

We just set enable_double_dqn=True when instantiating our DQNAgent, and remove the dueling arguments. Will it fix our forgetting problem? Spoiler alert: it doesn’t. The performance seems to be smoother though. Take a look for yourself.
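As a sketch, the only change from the first agent is the enable_double_dqn flag:

```python
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, policy=policy,
               nb_steps_warmup=10, target_model_update=1e-2, gamma=0.99,
               enable_double_dqn=True)   # select the action with the online network, evaluate it with the target network
```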

Policy Gradient

So far, we've been looking at value-based RL, where the policy is derived indirectly from the value we think each action has at a given state. Why not get rid of the value-based learning and focus on policy-based RL? That's exactly what Policy Gradient is.
Policy-based RL works better in stochastic environments (Partially Observable Markov Decision Processes) and in continuous action spaces, has no need for hand-coded exploration/exploitation, and can cope with perceptual aliasing (two different states that look the same). We can train a policy-based neural network using gradient ascent.

Policy Gradient suffers from the same problem gradient descent does: getting stuck in local optima. A bigger problem with policy-based methods is that they learn only at the end of each episode. That makes them slower, but the real problem lies elsewhere. Suppose that in an episode you take 50 actions, and 5 of them were terrible. If the episode went well overall, all 50 actions, including the 5 terrible ones, will have their policy score increased. This means that policy-based methods need a lot more samples to learn properly.
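You can see this in the simplest, baseline-free form of the REINFORCE gradient, where every log-probability in the episode is scaled by the same total return R, good actions and bad alike:

```latex
\nabla_{\theta} J(\theta) \approx \left( \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right) R
```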

Actor-Critic / A2C

Actor-Critic networks are hybrids between Policy Gradients and DQNs and aim to take the best of both: a value-based critic network evaluates the actions, while a policy-based actor network decides which actions to take. This cute, graphical introduction to A2C will reinforce ;) the concepts.

Unfortunately, Keras-RL's A2C outputs aren't compatible with CartPole, and making them so would require a bit of change to the source code. Let's instead implement this on the new TensorFlow with tf-agents. Keep an eye out here for the next article!

In this article, we’ve seen a broad overview of reinforcement learning and used some of the state-of-the-art techniques without the need to implement the complex math from scratch. We’ve also stumbled upon the forgetting problem, but haven’t been able to fix it…yet.

Future Work

In the future, I want to:
1. Implement PG and A2C on tf-agents
2. Deploy the models on physical hardware and retrain to see how well transfer learning translates from simulation to the real world.
3. Explore more complex problems with RL (will require considerable compute power)
