Introduction to Q-learning with OpenAI Gym

A step-by-step guide to using Q-learning to solve the simple Taxi-v3 environment in OpenAI Gym

Gelana Tostaeva
The Startup
9 min read · Apr 25, 2020


This tutorial will:

  • introduce Q-learning and explain what it means in intuitive terms;
  • walk you through an example of using Q-learning to solve a reinforcement learning problem in a simple OpenAI Gym environment.

You can find the full version of the code used here.

Who this is for: Anyone who wants to see how Q-learning can be used with OpenAI Gym! You do not need any experience with Gym. We do, however, assume that this is not your first reading on reinforcement learning, i.e., you have read about Temporal Difference learning more generally and want to learn about Q-learning more specifically. For the implementation, you should be comfortable with Python and have Python 3 installed if you want to follow along.

Q-learning: the intuition

As you have probably read elsewhere, Q-learning is an off-policy algorithm meant to determine the best action given the current state. What does this mean exactly?

  • off-policy means that a Q-learning agent does not have to follow the policy it is learning (its rules of behavior) while acting; it can, for example, take random exploratory actions and still learn the value of the best ones;
  • best action assumes that the action will result in the highest reward;
  • current state is the present situation the agent is in.

What Q-learning does is measure how good a state-action combination is in terms of rewards. It does so by keeping track of a Q-table, a reference matrix, or simply a look-up table, that gets updated after every action the agent takes, with its rows corresponding to states and its columns to actions. An episode ends after a set of actions is completed. In the end, the Q-table suggests the optimal policy.

Let’s go through this step-by-step, as illustrated in Figure 1 below.

  1. We first initialize our Q-table by assigning a value of 0 to all entries, so every state-action pair starts with the same, uniform value.
  2. Next comes the important step of choosing an action. We have two options. First, we can choose one randomly; it is a good idea to do so if we are just starting to train our agent, as it ensures that the agent explores all available actions with equal probability. However, if we already have some information about the value of an action in a given state, it may be more optimal to exploit that information and take the best action. This presents the second option — choose the highest-value action. If you are already thinking that there is more to this, you are right: it is the exploration-exploitation trade-off! We will revisit this after we are done going through the main steps, so hold onto that thought.

3 & 4. Once determined, we take the action and observe the reward it yields.

5. Finally, we update our Q-table using a special mathematical formula.

Figure 1: A simple Q-learning algorithm

Pretty straightforward, right? Now onto a few (less straightforward) details.

As briefly hinted at, with Q-learning we face the exploration-exploitation trade-off. If you are not familiar with this concept, you can think of it as a decision between doing what you know has worked in the past (exploit) and doing something new with the hope of gaining more (explore). It is a particularly relevant consideration when the agent lacks complete information about its world. Q-learning is a model-free approach — the agent has no model of the environment, so it cannot know the value of an action before it tries it — which makes the trade-off a special concern.

The way we resolve this in Q-learning is by introducing the epsilon-greedy algorithm: with probability epsilon, our agent chooses a random action (and explores); with probability 1 - epsilon, it exploits the best known action (which is the greedy choice).
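
In code, a minimal sketch of this choice might look like the following; the epsilon value and the Q-values for the current state are made up purely for illustration:

```python
import numpy as np

epsilon = 0.1  # probability of exploring (illustrative value)
q_values_for_state = np.array([0.0, 1.5, -0.3, 0.7, 0.0, 0.2])  # hypothetical Q-table row

if np.random.uniform(0, 1) < epsilon:
    action = np.random.randint(len(q_values_for_state))  # explore: pick a random action
else:
    action = int(np.argmax(q_values_for_state))          # exploit: pick the best known action
```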

Before we see how this is implemented with an example, there is one more thing we should address. The special mathematical formula used for the update in Step 5 is known as the Bellman equation; in its Q-learning form, it reads:

Q(s_t, a_t) ← (1 - α) · Q(s_t, a_t) + α · (r_t + γ · max_a Q(s_{t+1}, a))

where α (alpha) is the learning rate and γ (gamma) is the discount factor; s, a, r refer to state, action, and reward, respectively.

If you have understood everything thus far, the only missing piece to getting the Bellman equation is the discount factor. Rewards lose their values over time, and the discount factor reflects that. It is the time value of money if you want a real-world analogy.

We use the equation to determine the value of the maximum reward expected for each state in the Q-table. We do so by taking the received reward r and the future state s’.

  • The first term, Q(s_t, a_t), is our current estimate of the value of taking action a_t in state s_t; it is weighted by (1 - α).
  • The second term combines the reward we just received, r_t, with the discounted value of the best action available in the next state, γ · max_a Q(s_{t+1}, a). This term is weighted by α, the learning rate, which sets how much importance we put on the new information compared to the current estimate (hence the (1 - α) multiplying the first term). Equivalently, you can read the update as nudging Q(s_t, a_t) toward r_t + γ · max_a Q(s_{t+1}, a) by a step of size α.
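
To make this concrete, here is a quick worked update with made-up numbers: suppose α = 0.1, γ = 0.9, the current estimate Q(s_t, a_t) = 0, the reward r_t = -1, and the best value in the next state max_a Q(s_{t+1}, a) = 2. The updated entry is (1 - 0.1) × 0 + 0.1 × (-1 + 0.9 × 2) = 0.08, so the estimate shifts a little toward the new information.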

And with that, we are ready to proceed to see how we can implement this using a step-by-step example. We will first briefly describe the OpenAI Gym environment for our problem and then use Python to implement the simple Q-learning algorithm in our environment.

OpenAI Gym: the environment

To make sure we are all on the same page, an environment in OpenAI Gym is basically a test problem — it provides the bare minimum needed to have an agent interacting with a world. The primary purpose is to test the agent and learning algorithms without having to worry about simulating the environment. For the sake of simplicity, this tutorial uses the Taxi-v3 problem, a fairly basic, discrete environment with a limited number of possible actions and a relatively small state space.

The goal is simple: our agent is a taxi (driver or self-driving car — you choose), and it needs to deliver a passenger to their destination. To see what we are working with, we create the environment and render it:
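
Here is a minimal sketch of how this looks in code, assuming the classic Gym API that was current when this tutorial was written (newer Gym/Gymnasium versions change the reset and render interfaces):

```python
import gym

# Create the Taxi-v3 environment
env = gym.make("Taxi-v3")

env.reset()   # start an episode in a random initial state
env.render()  # print the grid world to the console
```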

The grid world above is what we are dealing with. Several things to note:

  • the color blue indicates the passenger;
  • magenta — the destination;
  • yellow — our empty taxi;
  • green — taxi when full;
  • letters R, G, Y, B — locations.

From visualization alone, we can infer what actions and states are possible: our agent needs to move in its environment to pick up and drop off the passengers. More formally, there are 6 discrete & deterministic actions:

  • 0: move south;
  • 1: move north;
  • 2: move east;
  • 3: move west;
  • 4: pickup the passenger;
  • 5: dropoff the passenger.

The 4 letters corresponding to the locations and the case when the passenger is in the taxi add up to 5 possible passenger locations. Taken together with the 4 possible destinations and the 25 positions of the taxi in the 5×5 grid, this gives 5 × 4 × 25 = 500 discrete states. Running the following lines will give you this information without the math:
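
One way to do this is with Gym's built-in space attributes:

```python
print(env.action_space)       # Discrete(6)
print(env.observation_space)  # Discrete(500)
```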

The last important thing to consider before we can dive into the Q-learning implementation for this problem is the reward. Each step costs 1 point, meaning a reward of -1; successfully delivering the passenger yields +20; an illegal pickup or dropoff costs 10 points, i.e., a reward of -10. These are the default rewards defined by the environment, and we use them as they are.

Q-learning & the Taxi problem

Finally, let’s put everything together by applying the Q-learning algorithm to help the taxi agent do its job. The full code is available here.

We go ahead and set all of our hyperparameters before we start. The exact values are up to you:
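
One possible set of values is sketched below; the 2,000 training episodes match the run described later, while the specific learning rate, discount factor, and epsilon-decay settings are illustrative choices rather than the only reasonable ones:

```python
train_episodes = 2000   # number of training episodes
max_steps = 100         # cap on steps per episode (illustrative)

alpha = 0.7             # learning rate (illustrative)
gamma = 0.618           # discount factor (illustrative)

epsilon = 1.0           # initial probability of choosing a random action
max_epsilon = 1.0       # upper bound on epsilon
min_epsilon = 0.01      # lower bound on epsilon (the exploration rate we settle on)
decay = 0.01            # decay rate for epsilon (illustrative)
```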

For clarity, we will follow the step-by-step procedure we have already discussed, as Figure 1 dictates.

Step 1: Initialize the Q-table

We first need to create our Q-table which we will use to keep track of states, actions, and rewards. The number of states and actions in the Taxi environment determines the size of our table.
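
A minimal way to do this with NumPy, using the environment's space sizes (500 states by 6 actions for Taxi-v3):

```python
import numpy as np

# One row per state, one column per action, all entries initialized to 0
Q = np.zeros((env.observation_space.n, env.action_space.n))
print(Q.shape)  # (500, 6)
```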

Steps 2 – 5: Training the Agent

Note that, because of the recursive nature of our update, we implement Steps 2-5 in one cell, which you will see if you are following the full code on GitHub. For tutorial purposes, we are going to dissect that one cell into our steps. Keep in mind that we train for all of our 2,000 episodes while keeping track of the reward and epsilon values.

We first initialize our training by resetting the environment using the env.reset() method (this is how OpenAI Gym returns a random initial state):
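
A sketch of how the training loop might start; the bookkeeping list names are our own choices, not necessarily those in the original notebook:

```python
training_rewards = []   # total reward collected in each episode
epsilons = []           # epsilon value used in each episode

for episode in range(train_episodes):
    state = env.reset()            # random initial state (classic Gym API)
    total_training_rewards = 0     # reward accumulated in this episode
    # ... Steps 2-5 happen inside this loop; see the sketch below ...
```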

We continue with our steps within the same for loop going through each of the train_episodes:

  • Step 2: Choosing an action. As you see below, there are two options, exploit and explore:
  • Steps 3 & 4: Performing the action & measuring the reward. Once we choose an action, we carry on with it and measure the associated reward. This is done using the built-in env.step(action) method which makes a one timestep move. It returns the next state, the reward from the previous action; done indicates whether our agent has reached the goal (which would mean resetting the environment); info is just performance diagnostics used for debugging.
  • Step 5: Update. Finally, we can use the collected information to update the Q-table using the Bellman equation. Remember that alpha represents the learning rate here.
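
Putting Steps 2-5 together, the body of the episode loop might look like the following sketch; it continues the loop started above and assumes the hyperparameters and Q-table defined earlier, along with the classic Gym step interface:

```python
    for step in range(max_steps):
        # Step 2: epsilon-greedy action selection
        if np.random.uniform(0, 1) > epsilon:
            action = np.argmax(Q[state, :])      # exploit the best known action
        else:
            action = env.action_space.sample()   # explore with a random action

        # Steps 3 & 4: take the action, observe the next state and reward
        new_state, reward, done, info = env.step(action)

        # Step 5: Bellman update of the Q-table
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (
            reward + gamma * np.max(Q[new_state, :]))

        total_training_rewards += reward
        state = new_state

        if done:
            break  # the passenger was delivered; end the episode
```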

We finish each episode by recording the total reward collected during training. To balance exploration and exploitation — remember the trade-off — we also decrease the probability of taking a random action, which, as you remember, is epsilon. By using decay, we make sure that our agent, after starting out with mostly random exploratory actions, gradually settles toward a minimum exploration rate and relies more and more on exploitation. This part takes place after we have completed all the steps in a given episode, i.e., outside the for step loop:
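
Still inside the episode loop, but after the step loop, the bookkeeping and decay might look like this; the exponential decay schedule is one common choice, not the only one:

```python
    # Record this episode's results
    training_rewards.append(total_training_rewards)
    epsilons.append(epsilon)

    # Decay epsilon toward its minimum so the agent shifts from exploring to exploiting
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay * episode)
```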

And our training is complete! Let’s see how our agent did:

Figure 2: Training results visualized. The left chart shows a Q-learning agent and its collected rewards; the right — the decreasing probability epsilon of choosing a random action.

As Figure 2 suggests, our agent gets on the right track — meaning out of the very negative rewards — fairly quickly, but then it gets stuck for some reason (left chart). At the same time, we also see that our exploration rate, represented by the decreasing epsilon values, does diminish with learning (right chart). Our agent does learn to perform good-for-rewards actions, as opposed to random behavior. This is only a very simple training setup — perhaps implementing a more sophisticated algorithm, like double Q-learning, would bring some positive cumulative rewards.

If you are not convinced that Q-learning improves performance, consider a case where our agent does not use the algorithm to determine the best action given the current state and, instead, acts randomly. To see this, we can play around with our hyperparameters. Keeping the learning rate the same as in the original scenario, let's look at a case where the probability of taking a random action — epsilon — is 1 and never decays, i.e., there is no exploitation at all (Figure 3, left chart). We see rather confused driving behavior that never converges to a non-negative reward. Even if we allow epsilon to decay but still follow the random-action strategy, our taxi agent does not do any better (Figure 3, right chart). If anything, we only see sharper differences between the “best” (closer to 0) and “worst” (closer to -500) rewards; we can imagine that our agent does balance between exploration and exploitation, but it cannot settle at the “best” reward since there is no strategy learning, just random actions. Clearly, training with Q-learning improves one’s taxi-driving skills!

Figure 3: Training of non-Q-learning agents that act randomly. The left chart shows an agent that always chooses a random action and only ever explores; the right — an agent whose epsilon decays but that still only takes random actions.

We get these nice visualizations in Figures 2 and 3 by simply plotting the total reward and epsilon values we kept track of during training:
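
A sketch of the plotting, assuming the training_rewards and epsilons lists from the training loop and matplotlib:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(training_rewards)
ax1.set_xlabel("Episode")
ax1.set_ylabel("Total reward")
ax1.set_title("Rewards over training")

ax2.plot(epsilons)
ax2.set_xlabel("Episode")
ax2.set_ylabel("Epsilon")
ax2.set_title("Exploration rate over training")

plt.show()
```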

Hopefully, this tutorial was a helpful introduction to Q-learning and its implementation in OpenAI Gym. At the very least, you now understand what Q-learning is all about!

For your next steps, consider looking into extensions, like double Q-learning or deep learning approaches to Q-learning (DQN). You can also check out a tutorial series by my classmates and me, which not only uses Q-learning but also builds on SARSA and REINFORCE for an interesting Ensemble Learning approach.
