The OpenAI Gym provides many standard environments for people to test their reinforcement algorithms. These environments include classic games like Atari Breakout and Doom, and simulated physical environments like this one where you have to keep a humanoid standing:
I decided to start from the Cart-Pole balancing problem. This is essentially the classic inverted pendulum problem which you could find in a typical undergraduate control course. However, instead of applying control theories, the goal here is to solve it using controlled trial-and-error, also known as reinforcement learning.
Modeling the Cart-Pole simulated environment
In here, we represent the world as a graph of states connected by transitions (or actions). It means that to predict your future state, you will only need to consider your current state and the action that you choose to perform. The key here is that you don’t need to consider your previous states. This is what people call a Markov Model. Although your past does have influences on your future, this model works because you can always encode information about the past in your current state.
In human, our state includes our memory which encodes everything that we have done (or we think that we have done) in the past. Not all states are fully observable, however. We often can only guess what the true states are from the observable states. For instance, I don’t know what you are thinking and can only infer it through your action. This is what people called a Hidden Markov Model.
The Cart-Pole world consists of a cart that moves along the horizontal axis and a pole that is anchored on the cart. At every time step, you can observe its position (x), velocity (x_dot), angle (theta), and angular velocity (theta_dot). These are the observable states of this world. At any state, the cart only has two possible actions: move to the left or move to the right.
In other words, the state-space of the Cart-Pole has four dimensions of continuous values and the action-space has one dimension of two discrete values.
Finding the optimal policy
In this environment, you get a reward as long as the pole is still somewhat upright and the cart is still within the bound. An episode is over as soon as the pole falls beyond a certain angle or the cart strays too far off to the left or right. The problem is considered “solved’ when it stays upright for over 195 time steps, 100 times consecutively.
A policy is simply the action that you take at a given state. Here, you want to find the policy that can maximize your reward. In other words, you want to find the optimal policy for each possible state.
Q-Learning is a method of finding these optimal policies. You can read more about it on this page. Essentially, through trials-and-errors, you find a Q-value for each state-action pair. This Q-value represents the desirability of an action given the current state. Over time, if the world is static (i.e. the physics or the cause-and-effects don’t change), the Q-values would converge and the optimal policy of a given state would be the action with the largest Q-value.
To use Q-Learning, you would have to discretize the continuous dimensions to a number of buckets. In general, you want to have fewer buckets and keep the state-space as small as possible. Having fewer optimal polices to find means faster training. However, discretizing the state-space too coarsely might prevent convergence as important information might be discretized away.
Q-Learning using only theta
My first thought was that I would only need theta, which is the angle between the pole and the vertical axis. After all, the goal is to keep the pole up right. Yes, the cart does move around, but the bound is sufficiently wide that it probably won’t go out before the pole falls over. I didn’t see how the linear position nor either of the velocities could be useful. Moreover, I thought that having a one-dimensional state-space would make populating the Q-table much quicker.
However, I was wrong. I wasn’t able to keep the pole upright for more than 60 time steps. I tried tweaking the learning rate, the number of buckets, exploration rate, etc. Nothing worked.
Then I thought; I do know the optimal policy. When the pole is tilted to the left, move to the left. If it is tilted to the right, move to the right. So I manually inputted this optimal policy, it didn’t work any better either! At that point, I realized that I needed the other state components as well.
Q-Learning using x, x_dot, theta, and theta_dot
I quickly looked up an MatLab implementation of the solution to gauge what were the number of buckets that I would need for each component. It turned out that it only used 3 buckets for x, x_dot, and theta_dot, and 6 buckets for theta!
With that I was able to solve that Cart-Pole problem within 681 episodes! It was a still a far cry from the best solution. Without changing the structure of the algorithm, I started wondering what I could do to solve the problem quicker.
I tried setting the decay of exploration rate faster. I thought it might be exploring for too long without exploiting. It didn’t really work. It was using sub-optimal policy too quickly.
When I was staring at the animation, I noticed that the cart didn’t really move out of bound that often. Also, I only needed to balance the pole for 200 time steps. It typically didn’t drift that far while balancing the pole. Therefore, I removed the x and x_dot components and leaving only the theta, and theta_dot components. With this significantly reduced state-space, I was able to solve the problem much quicker in only 136 episodes !
I could see that the cart was inching to the left or right every now and then. However, the movement was so slow that an episode was usually completed (reached 200 time steps) before it could get out of the bound. Dimensionality reduction was once again the key to faster learning!
Below is the code for this implementation. Also check out my next post on replacing the Q-table with a Q-network!