Reinforcement learning: concepts of Q-learning

Today we focus on developing the concept of Q-learning to solve an MDP. We will talk about the pseudo-code and Python implementation of Q-learning in our next story. In previous stories, we implemented both a model-based ADP learner and a model-free MC learner. Now it’s time to combine the benefits of both and move on to Q-learning.

Table of Contents:

  1. Concepts of learning in reinforcement learning
  2. Sample-average vs constant-step-size
  3. From game-by-game to step-by-step
  4. MC learner vs Q-learner

Concepts of learning in reinforcement learning

To develop a Q-learner, let’s review some characteristics of an MC learner first. As we discussed in a previous story, for any state-action pair (s, a), we get an estimation of its value whenever this pair appears in a game. As we keep playing the game repeatedly, we get an estimation G₁ from the first game in which this pair appears, G₂ from the second game in which this pair appears, …, Gₖ from the k-th game in which this pair appears.

Using the principle of the Monte Carlo method, we take the expected Q-value to be the average of the sampled G-values. In other words, the Q-value is the average of all the G-values we have on hand.

For instance, if we only focus on the games in which a certain state-action pair appears, then after the first game, we estimate Q as

Q₁ = G₁

After the second game, we estimate Q as

Q₂ = (G₁ + G₂) / 2

Similarly, after k games, we estimate Q as

Qₖ = (G₁ + G₂ + … + Gₖ) / k

As a result, our estimate of the true Q-value is updated after each game. Hopefully, it moves closer and closer to the true value with these updates.
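
To make the sample-average idea concrete, here is a minimal Python sketch. The G-values and the function name are made up for illustration; it simply reports the Q estimate for one state-action pair as the running mean of its G samples.

```python
# Minimal sketch: Q for one state-action pair as the running average of
# its G samples. The G-values used below are made up for illustration.
def sample_average(g_samples):
    """Return the Q estimate after each game, i.e. the mean of G_1..G_k."""
    estimates = []
    total = 0.0
    for k, g in enumerate(g_samples, start=1):
        total += g
        estimates.append(total / k)  # Q after k games
    return estimates

print(sample_average([4.0, 6.0, 5.0]))  # [4.0, 5.0, 5.0]
```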

In our previous story, we explained that the update rule can be rewritten as

Qₖ = (G₁ + G₂ + … + Gₖ) / k = [(k − 1) Qₖ₋₁ + Gₖ] / k

As a result, we have

Qₖ = Qₖ₋₁ + (1/k) × (Gₖ − Qₖ₋₁)

We can generalize this as

NewEstimate ← OldEstimate + StepSize × [Target − OldEstimate]

What does this mean?

Before the k-th game, we have a Q-value from the previous k−1 games. This is our old estimate, or old knowledge, which equals the average of the G-values from these k−1 games.

From our new game, which is the k-th game, we get a new estimation Gₖ. This is called the target because it carries information about the true value and indicates the direction in which we should move. We can also call it our new information.

Now we have some old knowledge and some new information. Combining our old knowledge and new information, we get our new knowledge. This is how we learn something.

In this update, the term [Target − OldEstimate] is usually called the error, which is the difference between our old knowledge and the new information. We need to update our knowledge in the direction that reduces this difference. However, if we simply add this error to our knowledge, we get

NewEstimate = OldEstimate + [Target − OldEstimate] = Target

This means we are throwing away all our old knowledge and believing only in our new information, which is apparently not a good learning strategy. A better approach is to use a step-size, just as we showed above:

NewEstimate ← OldEstimate + StepSize × [Target − OldEstimate]

By using a step-size smaller than 1, we combine some of our old knowledge and some of the new information. We usually use 𝛼 to denote step-size.
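
Here is a small Python sketch of that general rule; the numbers are made up, but they show how a step-size of 1 simply replaces the old estimate with the target, while a smaller step-size blends the two.

```python
def update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    error = target - old_estimate          # new information vs old knowledge
    return old_estimate + step_size * error

# Step-size of 1: we keep only the new information.
print(update(5.0, 9.0, 1.0))   # 9.0, identical to the target
# Step-size of 0.3: we blend old knowledge with new information.
print(update(5.0, 9.0, 0.3))   # 6.2
```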

Sample-average vs constant-step-size

In our MC learner, we have

Qₖ = (G₁ + G₂ + … + Gₖ) / k

Using the concept of step-size we discussed above, we can rewrite this equation as

Qₖ = Qₖ₋₁ + 𝛼(k) × (Gₖ − Qₖ₋₁)

where 𝛼 is the step-size as a function of k, and k is the number of games played so far.

In an MC learner, Q is taken as the average of all G samples we have at that time. If we have played k games, Q is then the average from G₁ to Gₖ. Therefore, this approach is called sample-average and it has a step-size of 1/k.

However, this is not the only step-size we can use. We can also use a constant value as step-size regardless of how many games we have played. For instance, we can use 0.3 as 𝛼 in every step. This approach is called constant-𝛼.

How could this even work? Let’s see a simple example.

Assume we played 10 games. For a particular state-action pair, we have an estimation from each game.

To visualize the sample-average approach, let’s use a table to list the results.

As we proved above, using 1/k as 𝛼 is equivalent to sample-averaging, so each row of the table can be reproduced with the incremental update rule.

Now let’s try a constant-𝛼 approach. For instance, let’s try 0.3.

After 10 games, the sample-average approach gives us a Q-value of 5.5, which is exactly the average of all 10 G-values. The constant-𝛼 approach gives us a Q-value of 5.46, which is close enough. Let’s put the results from these two approaches into the same diagram to see how they behave.

As shown in this diagram, the results of the constant-𝛼 approach become closer and closer to the results of sample-average as we play more and more games.
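
As a rough sketch of how the two step-size schedules compare, the code below runs both on a list of hypothetical G-values (not the values from the table above): with 𝛼 = 1/k it reproduces the sample average exactly, while a constant 𝛼 such as 0.3 ends up close to it.

```python
def incremental_estimates(g_samples, alpha=None):
    """Q after each game; alpha=None means the sample-average step-size 1/k."""
    q = g_samples[0]                 # initialise with the first sample G_1
    history = [q]
    for k, g in enumerate(g_samples[1:], start=2):
        step = (1.0 / k) if alpha is None else alpha
        q += step * (g - q)          # Q <- Q + step * (G - Q)
        history.append(q)
    return history

# Hypothetical G-values for one state-action pair over 10 games.
g_values = [3.0, 8.0, 5.0, 6.0, 4.0, 7.0, 5.0, 6.0, 5.0, 6.0]
print(incremental_estimates(g_values))             # sample-average (alpha = 1/k)
print(incremental_estimates(g_values, alpha=0.3))  # constant alpha = 0.3
```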

Moreover, adjusting 𝛼 gives us the opportunity to tune the behavior of our learner using the equation

Qₖ = Qₖ₋₁ + 𝛼 × (Gₖ − Qₖ₋₁)

A smaller 𝛼 emphasizes the old estimate, while a larger 𝛼 emphasizes the newly learned target. For instance, let’s try two more extreme values of 𝛼: 0.1 and 0.9.

As shown in the diagram above, a learner with 𝛼 = 0.1 is a conservative learner who always takes new information with a grain of salt. Because this learner barely believes the new information, the changes in Q-values between two adjacent episodes are small but steady. As a result, this learner captures the overall trend but converges to the sample-average results too slowly.

In contrast, a learner with 𝛼 = 0.9 is a radical learner who tends to believe everything in the new information. This learner largely ignores what happened in the past and heavily emphasizes the most recent information. As a result, it fails to capture the average of the samples; instead, the estimated Q-values follow the G-values closely, because this learner derives its Q-values almost solely from the most recent G-value.

Of course, these are two extreme examples. We can use different values of 𝛼 to achieve behaviors anywhere in between. In summary, using constant-𝛼 rather than sample-average gives us more options and may result in a learner with better performance.
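
One way to see where this trade-off comes from is to expand the constant-𝛼 update: each past sample Gᵢ ends up with weight 𝛼(1 − 𝛼)^(k−i), plus a residual weight on the initial estimate, so older samples decay geometrically. A quick sketch:

```python
# Weight that the i-th sample carries in Q_k under a constant step-size.
# (The initial estimate keeps the leftover weight (1 - alpha) ** k.)
def sample_weights(alpha, k):
    return [alpha * (1 - alpha) ** (k - i) for i in range(1, k + 1)]

print(sample_weights(0.1, 5))  # nearly uniform -> the conservative learner
print(sample_weights(0.9, 5))  # dominated by the last sample -> the radical learner
```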

From game-by-game to step-by-step

In an MC learner, the Q-values are only updated between games. During a game, the Q-values are not changed. For each game, we start with empty G-values and add new samples into the table of G-values. Only when the game ends do we update the Q-values using the newly learned G-values.

Let’s use an extremely simple example to demonstrate this process. Assume an MDP problem with four states s₁ to s₄ and only one action a.

After k−1 games, we have an estimate for each state-action pair; in particular, Q(s₂, a) = 40.

Now we play our k-th game; the series of (s, a, r) is the following:

(s₁, a, r = 2) → (s₂, a, r = 10) → (s₃, a, r = 1) → …

From this game, we get our new information Gₖ for each pair:

Gₖ(s₁, a) = 2 + 10 + 1 + … , Gₖ(s₂, a) = 10 + 1 + … , Gₖ(s₃, a) = 1 + …

Notice that we need to wait until the end of the game to calculate the G-value of any state-action pair. This is usually called a deep backup, because the reward from the last step of the game is backed up all the way to the first state-action pair.

After we get G-values for each pair, it’s time to update the Q-values. Assuming we use a constant 𝛼 of 0.2, we update each pair as follows:

Q(s, a) ← Q(s, a) + 0.2 × [Gₖ(s, a) − Q(s, a)]

One important observation is that the Q-values from the previous k−1 games are not used anywhere in the calculation of Gₖ. In other words, Gₖ is determined only by what happened in the k-th game. When evaluating G-values, an MC learner treats each new game as a fresh start: each state-action pair is evaluated only by its behavior in the current game, regardless of what happened in previous games.
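
Here is a minimal Python sketch of this game-by-game update, assuming γ = 1 so that G is just the sum of the remaining rewards (as in “2 + 10 + 1 + …”); the old Q-values other than Q(s₂, a) = 40 are hypothetical.

```python
# Sketch of the MC learner's game-by-game update. With a single action,
# the state alone identifies the state-action pair.
def mc_update(q, episode, alpha=0.2):
    """episode: list of (state, reward) pairs observed during one game."""
    g = 0.0
    # Walk backwards so each state's G is the sum of all later rewards (gamma = 1).
    for state, reward in reversed(episode):
        g += reward
        q[state] += alpha * (g - q[state])   # Q <- Q + alpha * (G - Q)
    return q

# Hypothetical old estimates, except Q(s2, a) = 40 from the example.
q = {"s1": 30.0, "s2": 40.0, "s3": 20.0, "s4": 0.0}
mc_update(q, [("s1", 2), ("s2", 10), ("s3", 1)])  # only possible after the game ends
print(q)
```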

This gives us another opportunity to improve our learner. When we estimate the utility value of a state-action pair, what if we consider its behavior not only in the current game but also in previous games? How can we achieve that? Well, what if we update Q-values step-by-step instead of game-by-game?

Let’s use the same example. After k−1 games, we still have the same Q-values as before, including Q(s₂, a) = 40.

Our k-th game is still the same series:

(s₁, a, r = 2) → (s₂, a, r = 10) → (s₃, a, r = 1) → …

After the first step, we have

(s = s₁, a, r = 2, s′ = s₂)

Instead of waiting for the end of the game to calculate the G-value of (s₁, a), we take advantage of another piece of information from this step, which is “after this step, our next state is s₂”. For any MDP, the utility relationship between the current state s and the next state s′ is

U(s) = r + γ U(s′)

As we explained in our previous story, the utility value of a state can be derived from Q-values:

U(s) = maxₐ Q(s, a)

For our example, we only have one action for each state, so

U(s) = Q(s, a)

Therefore, we can estimate Q(s, a) from Q(s′, a). In this case, our new information is no longer G(s, a). In other words, what we are saying is that Q(s′, a) gives us a rough estimation of what will happen in the future steps of this game, so we do not need to wait for the game to end to see what happens. The target we learned is simply

Target = r + γ Q(s′, a)

We already have an old estimate of Q(s′, a) from the previous k−1 games; in this example, Q(s′ = s₂, a) = 40. Assume we use 0.9 as γ. Let’s put this into the update: the target for (s₁, a) is 2 + 0.9 × 40 = 38.

Then we can use the same constant step-size of 𝛼 = 0.2 to update the Q-value of the first state-action pair. Compared to the table of the MC learner, we just replace Gₖ with r + γ Q(s′):

Q(s₁, a) ← Q(s₁, a) + 0.2 × [38 − Q(s₁, a)]

Without waiting for the end of the game, we have updated the Q-value of the first state-action pair we encountered. Now we move on to the second step, in which we receive a reward of 10:

Q(s₂, a) ← Q(s₂, a) + 0.2 × [10 + 0.9 × Q(s₃, a) − Q(s₂, a)]

By the same token, after the third step, in which we receive a reward of 1, we can update the third state-action pair:

Q(s₃, a) ← Q(s₃, a) + 0.2 × [1 + 0.9 × Q(s₄, a) − Q(s₃, a)]
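
Here is the corresponding step-by-step sketch in Python, using the same hypothetical Q-table (only Q(s₂, a) = 40 comes from the example), γ = 0.9 and 𝛼 = 0.2. Because there is only one action per state, the usual maxₐ′ Q(s′, a′) reduces to Q(s′).

```python
GAMMA = 0.9
ALPHA = 0.2

def q_update(q, state, reward, next_state):
    """One step: Q(s) <- Q(s) + alpha * (r + gamma * Q(s') - Q(s))."""
    target = reward + GAMMA * q[next_state]   # new information, no waiting
    q[state] += ALPHA * (target - q[state])
    return q

# Hypothetical old estimates, except Q(s2, a) = 40 from the example.
q = {"s1": 30.0, "s2": 40.0, "s3": 20.0, "s4": 0.0}
q_update(q, "s1", 2, "s2")    # target = 2 + 0.9 * 40 = 38, applied immediately
q_update(q, "s2", 10, "s3")   # uses the current estimate of Q(s3)
q_update(q, "s3", 1, "s4")
print(q)
```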

MC learner vs Q-learner

Comparing the MC learner to our new Q-learner, both follow the same update principle:

NewEstimate ← OldEstimate + 𝛼 × [Target − OldEstimate]

Nevertheless, they have several major differences.

First of all, they have different targets. In an MC learner, the newly learned Target is Gₖ, which depends only on the series of rewards r received in this game, regardless of any previous estimates of Q-values. In a Q-learner, the newly learned Target is r + γ Q(s′), which involves the estimated Q-values of other states rather than the rewards from future steps.

Second, a Q-learner uses bootstrapping while an MC learner does not. In a Q-learner, our Target, which is the new guess of a true Q-value, is calculated as r + γ Q(s′). In some sense, we are guessing one state-action pair’s Q-value based on another state-action pair’s Q-value, which is itself a guess. This “guess something from other guesses” is reminiscent of “pulling yourself up by your bootstraps”, so it is usually called bootstrapping.

Third, an MC learner must wait until the game ends to update Q-values, while a Q-learner does not. In an MC learner, the newly learned information Gₖ cannot be determined until the end of the game, because Gₖ is the sum of the rewards of every step from the current one all the way to the end of the game. In contrast, the newly learned information of a Q-learner is r + γ Q(s′), which can be determined immediately at the current step. Therefore, an MC learner updates Q-values on a game-by-game basis, while a Q-learner updates Q-values on a step-by-step basis without waiting for the end of the game.
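
To summarize the difference in code, here is a small sketch of the two targets side by side (the function names are just for illustration): the MC target needs the full list of remaining rewards, while the Q-learning target needs only the next reward and an existing estimate.

```python
def mc_target(rewards_to_go, gamma=1.0):
    """MC: (discounted) sum of all rewards from here to the end of the game."""
    g = 0.0
    for r in reversed(rewards_to_go):
        g = r + gamma * g
    return g

def q_learning_target(reward, q_next, gamma=0.9):
    """Q-learning: one real reward plus the current estimate of the rest."""
    return reward + gamma * q_next

print(mc_target([2, 10, 1]))       # needs the whole game: 13.0
print(q_learning_target(2, 40))    # available right away: 38.0
```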

In short, as shown in this visual comparison, the new information learned by an MC learner is the sum of the rewards received in this game all the way to the end. In this example, that is 2 + 10 + 1 + … all the way to the end of the game. That is why we need to wait until the game ends to determine it.

In contrast, a Q-learner uses the existing Q-value of the next state to estimate what will happen in the future. As shown in the visual example, we are simply saying that the existing Q-value of (s₂, a), which is 40, roughly equals (10 + 1 + …) all the way to the end of the game. Therefore, we do not need to wait until the game ends to get an estimation for (s₁, a).

Some may argue that a Q-learner is less accurate than an MC learner because 40 is only a rough estimation of (10 + 1 + …) rather than an exact calculation. From a statistical perspective, however, 40 might be more accurate. Why is that? Because the Q(s₂, a) of 40 used in a Q-learner represents what happened in the previous k−1 games after we visited s₂, while the (10 + 1 + …) in an MC learner represents only what happened in the current single game after we visited s₂. Therefore, a Q-learner tends to be more efficient and more effective.

Now we have developed the concepts of Q-learning. We will discuss the pseudo-code and Python implementation of Q-learning in the next story. Stay tuned!
