Using Q-Learning to solve the CartPole balancing problem

Jose Nieves Flores Maynez
7 min read · Apr 26, 2020


This tutorial will:

  • provide a brief overview of the Q-learning algorithm;
  • explain, step by step, a Python implementation of the Q-learning algorithm;
  • serve as one of the initial steps to using ensemble learning (scroll to the end to find out more!).

You can find the full version of the code used here.

Who this is for: This tutorial is for anyone with some prior exposure to reinforcement learning. While no knowledge of the Q-learning algorithm is needed, you should be somewhat comfortable using Python. There is no in-depth explanation of what OpenAI Gym is, but it should be easy to follow regardless of your background.

How does a Q-Learning algorithm learn?

Q-learning is an algorithm that relies on updating its action-value function. This means that with Q-learning, every pair of state and action has an assigned value. By consulting this function at each step, the agent can decide which action to take, and every time a step is taken, the value for the visited pair is updated. A useful property of Q-learning is that it is guaranteed to converge to the optimal action values with probability 1, as long as we keep exploring and updating every state-action pair (and decay the learning rate appropriately).

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ max_a Q(S_t+1, a) − Q(S_t, A_t) ]

The Q-learning update rule, from the Sutton & Barto book, Reinforcement Learning: An Introduction.

At every step while it is learning, the algorithm updates the action-value function as shown above. We only update the value of the state we visited and the action we took there. The size of the update depends on several factors. First, α is the learning rate, which determines how much weight we give to new information. R_t+1 is the reward received after taking the chosen action in the current state and landing in the next state. γ is the discount factor, which must always have a value between 0 and 1; it tells us how much we value the rewards of future states. We can tune Q-learning by changing its learning rate and its discount factor. Note that Q-learning always values the next state by its best possible action: the update bootstraps from max_a Q(S_t+1, a), regardless of which action is actually taken next.
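For concreteness, here is a minimal sketch of that update for a single state-action pair, written with NumPy; the names (update_q, q_table, alpha, gamma) are illustrative rather than taken directly from the tutorial's code.

```python
import numpy as np

def update_q(q_table, state, action, reward, new_state, alpha, gamma):
    """Apply the Q-learning update for one (state, action) transition."""
    # Q-learning bootstraps from the best action available in the next state.
    best_next = np.max(q_table[new_state])
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
```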

Q-learning pseudocode, from the Sutton & Barto book, Reinforcement Learning: An Introduction. In outline, it reads:

1. Initialize Q(s, a) arbitrarily for every state-action pair, with Q(terminal state, ·) = 0.
2. Loop for each episode:
3. Initialize S.
4. Loop for each step of the episode (until S is terminal):
5. Choose A from S using a policy derived from Q (e.g., epsilon-greedy).
6. Take action A; observe R and the next state S'.
7. Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S', a) − Q(S, A) ].
8. S ← S'.

The pseudocode above shows how the Q-learning algorithm works: on each episode, and at every step, an action is chosen based on the current state and a policy derived from the action-value function Q. After taking the action, we receive a reward and arrive at the next state. This information is used to update the action-value function, and then the next state becomes our current state; we follow through until we reach the final state of the final episode. This is implemented in Python for the CartPole-v0 problem, and each of the steps is explained below.
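If you have never used Gym before, the basic interaction it provides looks like the sketch below (random actions only, using the pre-0.26 Gym API that was current when this was written, where step returns four values).

```python
import gym

# Minimal Gym interaction loop with random actions, just to show the
# reset/step cycle that the Q-learning pseudocode is built around.
env = gym.make("CartPole-v0")
obs = env.reset()                               # initial observation (S)
done = False
while not done:
    action = env.action_space.sample()          # pick an action (A)
    obs, reward, done, info = env.step(action)  # receive R and the next S
env.close()
```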

Gym’s CartPole environment trying to keep the pole in an upright position.

Implementation

Since this algorithm relies on updating a value for each existing pair of state and action, environments with a large state space become problematic. We approximate the true value of a state-action pair better the more often we visit it; with many states or many actions, our visits are spread across more pairs and it takes much longer to converge to the true values. The CartPole environment describes the state with the position of the cart, its velocity, the angle of the pole, and the velocity at the tip of the pole. However, all of these are continuous variables. To solve this problem with a table, we need to discretize the states; otherwise, even though they are bounded, it would take forever to gather values for every possible combination. The solution is to group several values of each variable into the same “bucket” and treat them as the same state. The agent implemented for this problem uses 3, 3, 6, and 6 buckets, respectively.

The discretize_state(obs) function takes an observation from the environment and transforms it into something more manageable. By combining states that are very similar and treating them as the same, we reduce the state space and make the Q-table smaller and easier to fill.
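A possible implementation is sketched below. The bucket counts (3, 3, 6, 6) come from the text above, but the value ranges are hand-picked assumptions (Gym reports the two velocities as unbounded, so they are clipped to an arbitrary range here).

```python
# Assumed value ranges for (cart position, cart velocity, pole angle, pole tip velocity).
# The velocity bounds are arbitrary clips, since Gym reports them as unbounded.
lower_bounds = [-4.8, -3.0, -0.418, -3.5]
upper_bounds = [4.8, 3.0, 0.418, 3.5]
buckets = (3, 3, 6, 6)  # bucket counts used by this agent

def discretize_state(obs):
    """Map a continuous observation to a tuple of bucket indices."""
    ratios = [(obs[i] - lower_bounds[i]) / (upper_bounds[i] - lower_bounds[i])
              for i in range(len(obs))]
    indices = [int(round((buckets[i] - 1) * ratios[i])) for i in range(len(obs))]
    # Clip to valid bucket indices in case an observation falls outside the assumed range.
    return tuple(min(buckets[i] - 1, max(0, indices[i])) for i in range(len(obs)))
```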

On line 20 of the image above, step 1 of the pseudocode is completed: the Q(s, a) table is created and the values for all pairs of states and actions are initialized to 0.
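In NumPy, that initialization amounts to a single line; the sketch below uses illustrative names and the bucket counts from above.

```python
import numpy as np

buckets = (3, 3, 6, 6)  # discretized state dimensions
n_actions = 2           # CartPole has two actions: push left or push right

# Q(s, a) = 0 for every discretized state and both actions.
q_table = np.zeros(buckets + (n_actions,))
```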

The rest of the implementation of the Q-learning pseudocode is found in the train function. We start on line 142 by looping over all episodes, as line 2 of the pseudocode says. The state is then initialized on line 144, corresponding to the third step. The fourth line of the pseudocode asks us to open another loop for each step in the episode, and we do that on line 151. Note that this inner loop keeps running only while the done variable is False; it does not refer explicitly to a number of steps, as the outer loop over episodes does. This will be explained a bit later on.

The 5th line of the pseudocode tells us to use a policy derived from the Q-function to choose an action. This is accomplished on line 154 of our code, and in this case the policy being used is epsilon-greedy. The epsilon-greedy strategy consists of taking the action with the highest value at the current state, but with probability epsilon the agent instead acts randomly. In other words, at every step there is a chance of size epsilon that the agent takes an exploratory action rather than exploiting the information already collected.
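A minimal epsilon-greedy selection could look like the following sketch; choose_action and its arguments are illustrative names, not the tutorial's exact code.

```python
import numpy as np

def choose_action(q_table, state, epsilon):
    """Epsilon-greedy policy: explore with probability epsilon, otherwise exploit."""
    if np.random.random() < epsilon:
        return np.random.randint(q_table.shape[-1])  # random (exploratory) action
    return int(np.argmax(q_table[state]))            # greedy action for this state
```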

The action is taken on line 156 and we obtain a reward and a new state, as the 6th step of the pseudocode describes. However, this new state hasn’t been processed yet, so we need to discretize it. Here is where the done variable comes into play: the loop takes a single action and then checks whether done is still False, because when we reach the new state, the Gym environment itself tells us whether we hit a terminal state, either by dropping the pole, moving out of bounds, or reaching 200 steps.

Finally, lines 159 and 160 correspond to steps 7 and 8, respectively. Line 159 calls a function that performs the update described at the beginning of the post, while line 160 simply makes the new state the current state so that we can move on to the next step.
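Putting the walkthrough together, the training loop could be sketched as below. It assumes the q_table, discretize_state, choose_action, and update_q pieces sketched earlier in this post are defined in the same file, and the episode count and hyperparameters are placeholder values, not the ones used for the results that follow.

```python
import gym

env = gym.make("CartPole-v0")
n_episodes, alpha, gamma, epsilon = 1000, 0.1, 0.99, 0.1  # placeholder hyperparameters

for episode in range(n_episodes):                  # loop over episodes (pseudocode line 2)
    current_state = discretize_state(env.reset())  # initialize S (line 3)
    done = False
    while not done:                                # loop until the episode ends (line 4)
        action = choose_action(q_table, current_state, epsilon)  # epsilon-greedy (line 5)
        obs, reward, done, _ = env.step(action)                  # take A, observe R and S' (line 6)
        new_state = discretize_state(obs)
        update_q(q_table, current_state, action, reward, new_state, alpha, gamma)  # line 7
        current_state = new_state                                # S <- S' (line 8)
env.close()
```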

Results

During its training, this instance of our agent was able to complete 261 episodes, meaning it reached 200 steps without dropping the pole in each of those. We can see how in the first episodes the rewards stay low, as the agent is still exploring the state space and learning the values for each state-action pair. However, as we complete more episodes, the agent’s performance keeps improving and more episodes are successfully completed.

Conclusion

While this agent already seems to be performing well, its performance can be further improved. As mentioned previously, one of the limitations of this method is that we are forced to reduce the amount of information we feed our agent when we discretize the states. There are many things that could change the performance of our agent, and not all of them are related to chance. The number of buckets for each state variable, the rate at which the learning rate and epsilon decay, and the minimum values that epsilon and the learning rate can reach are all parameters we can change, and they can drastically alter the agent’s performance.
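One common way to implement such decaying schedules (only a sketch, not necessarily the exact schedule used for the results above) is to shrink epsilon and the learning rate logarithmically with the episode number, down to some floor value.

```python
import math

def decayed_rate(episode, min_rate=0.1, decay_denominator=25.0):
    """Decay from 1.0 toward min_rate as the episode number grows (illustrative schedule)."""
    return max(min_rate, min(1.0, 1.0 - math.log10((episode + 1) / decay_denominator)))

# Example: reuse the same schedule for both the exploration rate and the learning rate.
# epsilon = decayed_rate(episode)
# alpha = decayed_rate(episode)
```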

There are many other algorithms that are also suitable for solving these problems, each with its own strengths and weaknesses. You can find an implementation solving this same problem using SARSA here, and one using REINFORCE in this other post.

Once you understand how these algorithms differ, you can take a look at this post on ensemble learning, a method that combines multiple learning models to produce a single, more effective learner.

References

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
