Understanding Q-Learning, the Cliff Walking problem

Lucas Vazquez
Apr 7, 2018


In the last post we introduced the Cliff Walking problem and left off with a scary algorithm that made no sense. This time we'll uncover the secrets of that gray box and see that it isn't scary at all.

Recap

We concluded that by maximizing the sum of future rewards we are also finding the quickest route to the goal, so our objective now is to actually find a way to do this!

Enter Q-Learning

  • Start by constructing a table that measures how good it is to take a specific action in any state (we measure this with a simple scalar value: the greater the value, the better the action)
  • This table will have one row for each state and one column for each action. Our grid world has 48 (4 x 12) states and 4 allowed actions, so the table is 48 x 4 (a quick code sketch follows this list).
Q-values of the first five states.
  • The values stored in this table are called ‘Q-values’.
  • These are estimates of the sum of future rewards. In other words, they estimate how much more reward we can collect until the end of the game if we are at state S and take action A.
  • We initialize it with random values (or some constant, e.g. all zeroes).
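As a minimal sketch of what this looks like in code (the state numbering here is just an illustrative assumption):

```python
import numpy as np

n_states, n_actions = 4 * 12, 4            # 48 squares, 4 possible actions
q_table = np.zeros((n_states, n_actions))  # one row per state, one column per action

# Assuming states are numbered row by row, square [2, 11] is state 2 * 12 + 11 = 35
state = 2 * 12 + 11
best_action = np.argmax(q_table[state])    # action with the highest Q-value at that state
```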

The optimal Q-table has values that allow us to take the best action at every state, giving us the best path to victory!
Problem Solved, Welcome Robot Overlords. (If only solving such a simple problem created an AGI.)

Q-Learning

  • Q-learning is an algorithm that ‘learns’ these values.
  • At every step we gain more information about the world.
  • This information is used to update the values in the table.

For example, suppose we are one step away from the goal (square [2, 11]). If we choose to go down, we'll receive a reward of 0 instead of -1.
We can use this information to update the value of this state-action pair in our table, so that the next time we visit it we already know that going down gives us a reward of 0.

Now we can propagate this information even further! Since we now know the path to the goal from square [2, 11], any action that takes us to square [2, 11] is also good, so we update the Q-value of the state-action pair that leads us to [2, 11] to be closer to 0.
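Concretely, using the reward scheme above (and leaving the discount factor for the next section), those two updates push the values toward:

    Q([2, 11], down)  →  0            (a reward of 0, and the game ends there)
    Q([2, 10], right) →  -1 + 0 = -1  (a step cost of -1, plus the 0 we now expect from [2, 11])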

And this, ladies and gentlemen, is the essence of Q-learning!

Note that every time we reach the goal we extend our “map” of how to reach the goal by one square, so after a sufficient number of iterations we'll have a complete map that shows us how to get to the goal from every state.

Path generated by taking the best action at each state. The green tonality represents the value of the best action: stronger tones represent higher values. The text shows the value of each action (Up, Down, Right, Left).

Bellman Equation

Before we talk code, let’s talk math: the main concept behind Q-learning, the Bellman equation.
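Written out (with Q as our table, R the reward for the move, S′ the next state, and A′ the actions available there), it reads:

    Q(S, A) = R + γ · max_A′ Q(S′, A′)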

Below is a math-to-English translation:

  • First let’s forget γ in this equation
  • The equation states that the Q-value for a certain state-action pair should be the reward gained by moving to the new state (by executing that action), added to the value of the best action in the next state.

In other words, we are propagating the information about the value of the actions one step at a time!

But we might think that receiving a reward right now is more valuable than receiving a reward in the future, and that's why we have γ, a number between 0 and 1 (generally between 0.9 and 0.99) that multiplies the future reward, discounting it.

So considering γ = 0.9 and applying this to some states of our grid world we have:
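Walking backwards from the goal along the optimal path (a reward of 0 for the step that reaches the goal and -1 for every other step), the values of the best actions come out as:

    Q([2, 11], down)  = 0
    Q([2, 10], right) = -1 + 0.9 · 0      = -1
    Q([2, 9],  right) = -1 + 0.9 · (-1)   = -1.9
    Q([2, 8],  right) = -1 + 0.9 · (-1.9) = -2.71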

We can compare these values with those of the GIF above and see that they are the same!

Implementation

Now that we have an intuition of how Q-learning works, we can start thinking about implementing all of this. We're going to use the Q-learning pseudo-code from Sutton's book as our guide.

Translator 2: Pseudo code to Python Code.

Q-learning pseudo-code, stolen from Sutton’s book.
  • First it says “For all states and actions initialize Q(s,a) arbitrarily”. This means we can create our Q-values table with any values we like: they can be random, they can be some constant, it doesn't matter. We can see that in line 2 we're creating a table full of zeroes.

It also says “Q value for terminal states equals zero”: we can't take any action at a terminal state, so we consider the value of all actions in that state to be zero.

  • For each episode we should “initialize S”. That's just a fancy way of saying “reset the game”; in our case it means putting the player back in the initial position. Our grid world has a method that does just that, and we call it at line 6.
  • For each time step (each time we need to act) we should choose an action derived from Q.
    Remember when I said “we take the action that has the highest value at every state”?
    When we do this we are using our Q-values to create a policy; in this case it's a greedy policy, because we always take the action we believe is best at every state, so we say we are acting greedily.

Hiccup

But there’s a problem with this approach. Imagine we are in a maze with two rewards, one of +1 and another of +100 (and the game ends whenever we find one of them). Because we always take the action we believe is best, we’ll get stuck with the first reward we find, always going back to it, so if we find the +1 reward first we’ll miss out on the bigger +100 reward.

Solution

We need to make sure we sufficiently explore our world (this is a surprisingly difficult task). Here is where the ε in ε-greedy comes in: it means we act greedily, BUT take a random action a fraction ε of the time, so with an infinite number of tries we should explore all states.

The action is chosen following this strategy at line 12, with epsilon=0.1, meaning we explore 10% of the time. The implementation of the policy is done as follows:
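A minimal sketch of such an ε-greedy choice, assuming a NumPy Q-table indexed by state (names here are illustrative, not necessarily those of the original gist):

```python
import numpy as np

def egreedy_policy(q_values, state, epsilon=0.1):
    # With probability epsilon take a random action (explore),
    # otherwise take the action with the highest Q-value (exploit).
    if np.random.random() < epsilon:
        return np.random.randint(q_values.shape[1])
    return np.argmax(q_values[state])
```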

In line 14 we simply call the step method to execute the action; the world returns to us the next state, the reward, and whether the game is over.

Back to our math translator:

We have a long equation, let’s think about it:
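    Q(S, A) ← Q(S, A) + α · [R + γ · max_A′ Q(S′, A′) − Q(S, A)]

(α is the learning rate; the other symbols are the same as before.)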

If we consider α = 1:
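    Q(S, A) ← R + γ · max_A′ Q(S′, A′)

(with α = 1 the old estimate is completely replaced by the target)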

Which is exactly the same as the Bellman equation we saw a few paragraphs ago! So we already know that this is the line responsible for propagating the information about the value of the states.

But generally α (better known as the learning rate) is much smaller than 1. Its main purpose is to avoid big changes in a single update, so instead of flying directly to the target, we move towards it slowly. In our tabular approach setting α = 1 doesn’t cause any problems, but when dealing with neural networks (more depth on this in the next posts) things can easily get out of hand.

Looking at the code, we see that in line 16 we define the TD target. This is the value we need to move closer to, but instead of jumping directly to it, in line 17 we calculate the TD error, and we use this value together with the learning rate to slowly move towards the target.
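Sketched in code (with illustrative names, not necessarily those of the gist), those two lines plus the update look like this:

```python
import numpy as np

def td_update(q_values, state, action, reward, next_state,
              gamma=0.9, learning_rate=0.5):
    # TD target: the value we want Q(state, action) to move toward
    td_target = reward + gamma * np.max(q_values[next_state])
    # TD error: how far the current estimate is from that target
    td_error = td_target - q_values[state, action]
    # Move only a fraction of the way there, controlled by the learning rate
    q_values[state, action] += learning_rate * td_error
```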

Remember, this line is the essence of Q-Learning.

Now we just need to update our state and we’re done; this is line 20. We repeat this process until we reach the end of the episode, either by dying in The Cliff or by reaching the goal.
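Putting the pieces together (and assuming a grid-world object with reset and step methods behaving as described above, plus the egreedy_policy and td_update sketches from earlier), the whole loop could be sketched as:

```python
import numpy as np

def q_learning(env, num_episodes=500, epsilon=0.1, gamma=0.9, learning_rate=0.5):
    # env.num_states and env.num_actions are assumed attributes of the grid world
    q_values = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        state = env.reset()                              # "initialize S"
        done = False
        while not done:
            action = egreedy_policy(q_values, state, epsilon)
            next_state, reward, done = env.step(action)  # act and observe the world
            td_update(q_values, state, action, reward, next_state, gamma, learning_rate)
            state = next_state                           # the "update our state" step
    return q_values
```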

Conclusion

Now we intuitively understand and know how to code Q-Learning (at least the tabular variation). Be sure to check out all the code used for this post, available on my GitHub.

Check out the visualization of the learning process:

Notice that all actions start with a value higher than their final values; this is a trick to encourage exploration.

More depth on this in the upcoming posts.
