Reinforcement learning: Model-free MC learner with code implementation

Nan

Today we focus on building a Monte Carlo (MC) agent to learn an MDP. In a previous story, we implemented a model-based ADP learner which estimates a model of the reward function r(s) and the transition probabilities p(s′| s, a). This model-based approach may work efficiently in some cases. However, if the transition model is difficult to estimate, a model-free approach tends to be a better choice. Monte Carlo (MC), which is our topic today, is one example of such model-free approaches.

The code in this story is part of our MAD from scratch project where MAD stands for machine learning, artificial intelligence, and data science. The complete code used in this story can be found in this repo: https://github.com/clumsyhandyman/mad-from-scratch

Table of Contents:

  1. Values of states vs values of state-action pairs
  2. Q-values of state-action pairs
  3. Estimate Q-values using Monte Carlo method
  4. Visualizing MC method with state-action table
  5. First-visit vs every-visit
  6. Pseudo-code for model-free MC learner
  7. Implement MC learner in Python
  8. How good is our MC learner?
  9. Improve MC learner by training more
  10. Improve MC learner by tailoring reward
  11. From Monte Carlo to Temporal Difference

Values of states vs values of state-action pairs

In model-based methods, our policy is derived from the utility values of states. Following a certain policy π, the utility value v(s) of a certain state s is defined as

v(s) = r(s) + γ ∑_{s′} p(s′ | s, π(s)) v(s′)

For the optimal policy, the associated utility value v(s) is then

v(s) = r(s) + γ maxₐ ∑_{s′} p(s′ | s, a) v(s′)

This approach is called a model-based method because we need p(s′| s, a), which is the transition model, in the calculation. Now, if we would like to eliminate the need to know p(s′| s, a), what can we do? One possible method is to assign values to state-action pairs rather than to states.

In a model-based approach, we rely on values of states. When we are in a certain state, we check the possible next states that this state is connected to. For instance, from our current state, we can reach s₁ which has a utility of 10, s₂ which has a utility of 5, and so on. We then check the effects of choosing certain actions. For instance, if action a₄ tends to lead us to s₁ while action a₆ tends to lead us to s₂, then we would prefer a₄ over a₆.

In contrast, in a model-free approach, we rely on values of state-action pairs. Rather than assigning values to states, we assign values to state-action pairs. When we are at our current state, for example s₁, we check possible actions at this state. For instance, maybe for action a₄, the utility value of the state-action pair (s₁, a₄) is 19, while for action a₆, the utility value of the state-action pair (s₁, a₆) is 35. In this case we would prefer a₆ over a₄.

Q-values of state-action pairs

The utility value of a state-action pair (s, a) is usually denoted by Q(s, a) and is therefore usually called a Q-value. By definition, the Q-value of (s, a) is the expected total reward after choosing a certain action a at a certain state s.

As we mentioned before, for the optimal policy, the utility of a state can be expressed as

v(s) = r(s) + γ maxₐ ∑_{s′} p(s′ | s, a) v(s′)

In this equation, for a given state s, the reward r(s) and the discount factor γ are constant values that are independent of the action a. Therefore, we can put these two inside the max function:

v(s) = maxₐ [ r(s) + γ ∑_{s′} p(s′ | s, a) v(s′) ] = maxₐ Q(s, a)

where the Q-value of a state-action pair is

Q(s, a) = r(s) + γ ∑_{s′} p(s′ | s, a) v(s′)

What is this equation telling us? If we know, or are able to estimate, the Q-values, then we can choose our actions solely based on Q-values without having to deal with s′ or p(s′| s, a).

Previously, we determined the optimal policy from the state values:

π(s) = argmaxₐ ∑_{s′} p(s′ | s, a) v(s′)

Again, because r(s) and γ are constant values which do not affect the argmax, we now have

π(s) = argmaxₐ [ r(s) + γ ∑_{s′} p(s′ | s, a) v(s′) ] = argmaxₐ Q(s, a)

As a result, we do not need to know or estimate the transition model anymore. We can directly choose actions by their relative attractiveness represented by Q-values.
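
As a tiny illustration of this last point, once we have a table of Q-values, choosing an action is just a row-wise argmax. The sketch below assumes the Q-values are stored in a NumPy array indexed by state and action; the sizes and names are illustrative, not taken from the original code.

```python
import numpy as np

n_states, n_actions = 12, 4          # e.g., a 3x4 grid world with 4 possible moves
Q = np.zeros((n_states, n_actions))  # table of Q-values, one row per state

def greedy_action(Q, s):
    """Pick the action with the highest Q-value at state s."""
    return int(np.argmax(Q[s]))
```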

Estimate Q-values using Monte Carlo method

Now the question is: how do we estimate Q-values? One way is to simply try out each state-action pair and see how it goes. In other words, we play this game repeatedly for a large number of times. In each game play, we record the series of states, actions, and rewards like the following:

s₁, a₁, r₁, s₂, a₂, r₂, …, sₙ, aₙ, rₙ

where rₜ = r(sₜ) is the reward received at step t.

As we mentioned in previous stories, in reinforcement learning, this series of s, a, r, s, a, r… is all we know about the game. Here we assume our game ends in a finite number of steps. We use n to denote the total number of steps in this particular game play. Since we have n steps in this game, we travel through n states: from step t = 1 to step t = n. These are not necessarily n distinct states, because some states may appear more than once. Nevertheless, if we list the states according to the time sequence in which we visited them, we have a series of n states from t = 1 to t = n.

At each of these states, we made a choice of action, which gives the series of actions from step t = 1 to step t = n. As a result, we experienced a series of state-action pairs:

(s₁, a₁), (s₂, a₂), …, (sₙ, aₙ)

For the first state-action pair, its outcome can be simply evaluated as the total reward after we visited this state-action pair:

G(s₁, a₁) = r₂ + r₃ + ⋯ + rₙ

Similarly, the outcome of the second state-action pair is evaluated as the total reward after that pair:

G(s₂, a₂) = r₃ + r₄ + ⋯ + rₙ

More generally, for the state-action pair at step i, the outcome of that pair can be evaluated as

G(sᵢ, aᵢ) = rᵢ₊₁ + rᵢ₊₂ + ⋯ + rₙ

Therefore, by playing one game, we get a rough estimation of the values of the state-action pairs (s, a) we encountered in that game. If we play this game repeatedly forever, we inevitably encounter every state-action pair (s, a) again and again. Each time we obtain an estimation G(s, a) for a pair, we get one more sample of the value of that pair, and over many games we accumulate a list of these samples.

Using the principle of Monte Carlo methods, if we average this list of G(s, a) samples, we get an estimate of its expected value, which is exactly the Q-value:

Q(s, a) ≈ (1/N) ∑ₖ Gₖ(s, a)

where Gₖ(s, a) is the sample from the k-th game in which the pair (s, a) appeared and N is the number of such games.

Visualizing MC method with state-action table

In this MC method, we need to estimate the value of each state-action pair. In a typical MDP, the number of states and the number of actions are usually known. Therefore, we can build a big table with rows as states and columns as actions.

In one game play, if we encounter a certain pair (sᵢ, aⱼ), we put G(sᵢ, aⱼ) into the table cell corresponding to the row of sᵢ and the column of aⱼ, which is the (i, j) element of the table. As we repeatedly play this game, we build such a table for each game play. At the beginning of each game, the table is empty. During the game, we put G(sᵢ, aⱼ) into the (i, j) element for each (sᵢ, aⱼ) we encounter.

As a result, we get one table of G(s, a) from one game, and we can collect a lot of these tables by playing this game repeatedly many times.

Finally, the average of all these tables is the table of Q-values. When we calculate the average values for a certain state-action pair, we only count the game plays where this particular pair appears. For example, if we played this game repeatedly for 1000 times and if (s₂, a₂) appears in 670 of these 1000 games, we get Q(s₂, a₂) from averaging G(s₂, a₂) of these 670 games.
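
To make this bookkeeping concrete, here is a minimal sketch of the averaging step, assuming each game produces a table of G-values plus a boolean mask of which pairs appeared in that game (the function and array names are illustrative):

```python
import numpy as np

def average_q(tables):
    """Average per-game G-tables into a Q-table.

    tables: list of (G, visited) pairs, one per game, where G holds the G(s, a)
            values of that game and visited marks which pairs appeared in it.
    Only the games in which a pair actually appeared count towards its average.
    """
    G_sum = np.zeros_like(tables[0][0])
    counts = np.zeros_like(tables[0][0])
    for G, visited in tables:
        G_sum[visited] += G[visited]
        counts[visited] += 1
    return np.where(counts > 0, G_sum / np.maximum(counts, 1), 0.0)
```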

First-visit vs every-visit

In one game play, in its series of [s, a, s, a, s, a, s, a…], a certain pair (s, a) may appear more than once. As we discussed above, the outcome G(s, a) of a certain state-action pair is defined as the sum of the rewards after encountering that pair. In case a particular pair appears multiple times, this definition seems a bit ambiguous.

If we use the sum of rewards after the first appearance of that pair in the game play, the method is called the first-visit MC method. In this method, it does not matter if we encounter a certain pair repeatedly; we only care about the first occurrence of that pair in a game play. For example, suppose that in one game the pair (s₁, a₃) first appears at step i and appears again at a later step j. We then estimate the state-action pair (s₁, a₃) only from its first appearance:

G(s₁, a₃) = rᵢ₊₁ + rᵢ₊₂ + ⋯ + rₙ

We then append this G(s₁, a₃) value to the list R(s₁, a₃).

In contrast, we can also use the every-visit MC method. In this method, whenever we encounter a particular state-action pair, we sum the rewards from that position all the way to the end of the game and use this as one entry of R(s, a). Using the example above, since the pair (s₁, a₃) appears twice, we have two estimations of G(s₁, a₃):

G(s₁, a₃) = rᵢ₊₁ + rᵢ₊₂ + ⋯ + rₙ and G(s₁, a₃) = rⱼ₊₁ + rⱼ₊₂ + ⋯ + rₙ

As a result, we append these two G(s₁, a₃) values to the list R(s₁, a₃).

Here in our story today, we focus on the first-visit MC method in the following code implementation.
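
As a small illustration of the first-visit rule, the G-values of a single episode can be computed directly from the recorded (state, action, reward) sequence. This helper is a sketch for illustration, not part of the original implementation:

```python
def first_visit_returns(episode):
    """Compute first-visit G(s, a) values from one episode.

    episode: list of (state, action, reward) tuples in time order, where
             reward is the reward r(s) received at that step.
    Returns a dict mapping (s, a) to the sum of rewards after its first visit.
    """
    first_step = {}
    for t, (s, a, _) in enumerate(episode):
        first_step.setdefault((s, a), t)   # remember only the first occurrence

    return {
        (s, a): sum(r for _, _, r in episode[t + 1:])  # rewards strictly after it
        for (s, a), t in first_step.items()
    }
```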

Pseudo-code for model-free MC learner

Similar to the API of our previous model-based ADP learner, our model-free MC learner also has percept and actuate functions.

In the percept function, we add the reward r of the current step to the G-values of all the (s, a) pairs that have been visited previously in this game. By doing this, we ensure that the G-value of any (s, a) pair is the sum of the rewards received after we first visited this pair, all the way to the end of the game.

Pseudo-code for percept function of MC learner
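
As a rough Python counterpart of this pseudo-code, a percept step could look like the sketch below. This is a hypothetical MCLearner class for illustration; the attribute names and details are assumptions, not the code from the repo.

```python
import numpy as np

class MCLearner:
    """Minimal sketch of a first-visit MC learner (illustrative, not the repo code)."""

    def __init__(self, n_states, n_actions, xi=0.99):
        self.n_states, self.n_actions = n_states, n_actions
        self.Q = np.zeros((n_states, n_actions))         # running Q-value estimates
        self.n_visits = np.zeros((n_states, n_actions))  # games in which (s, a) appeared
        self.G = np.zeros((n_states, n_actions))         # per-episode G-values
        self.visited = np.zeros((n_states, n_actions), dtype=bool)
        self.policy = np.zeros(n_states, dtype=int)      # current greedy policy
        self.explore_prob = 1.0                          # decays geometrically by xi
        self.xi = xi
        self.current_state = 0

    def percept(self, s, r):
        """Observe state s and its reward r; credit r to every pair already visited."""
        self.G[self.visited] += r
        self.current_state = s
```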

In the actuate function, we apply the same geometric decay as in the model-based ADP learner to control the transition from exploration to exploitation.

Pseudo-code for actuate function of MC learner
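
Continuing the hypothetical MCLearner sketch from above, an actuate step could look like this; the exact exploration schedule (a probability that decays geometrically after each episode) is an assumption about what the pseudo-code does.

```python
import random

def actuate(self):
    """Choose an action at the current state and mark the pair as visited."""
    s = self.current_state
    if random.random() < self.explore_prob:
        a = random.randrange(self.n_actions)   # explore: pick a random action
    else:
        a = int(self.policy[s])                # exploit: follow the current policy
    self.visited[s, a] = True                  # first-visit bookkeeping (no reset)
    return a

MCLearner.actuate = actuate                    # attach to the sketch class above
```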

After each game, we need to update our policy to reflect what we have learnt in this game. By doing this update, we ensure that we become better and better by playing more and more games. After the update, we re-initialize the tables of G-values and visited flags to get ready for the next game.

Pseudo-code for policy update function of MC learner

Using the principle of Monte Carlo sampling, the Q-value after n − 1 games is the average of the G-values from those n − 1 games. Similarly, the Q-value after n games would be the average of the G-values from these n games.

During the policy update, if a certain pair (s, a) is visited in the current nth game, we have its estimation Gₙ from this game and its Q-value Qₙ₋₁ from the previous n − 1 games. Writing the average after n games in terms of the average after n − 1 games gives

Qₙ = [(n − 1) Qₙ₋₁ + Gₙ] / n = Qₙ₋₁ + (Gₙ − Qₙ₋₁) / n

As shown in the pseudo-code above, the Q-values are updated using this rule.
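
In the same spirit, here is a hedged sketch of the policy update, using a per-pair visit counter for the incremental average derived above (again, names are illustrative):

```python
import numpy as np

def update_policy(self):
    """End-of-episode update: fold this game's G-values into Q, refresh the
    greedy policy, decay exploration, and reset the per-episode tables."""
    v = self.visited
    self.n_visits[v] += 1
    # incremental average: Q <- Q + (G - Q) / n
    self.Q[v] += (self.G[v] - self.Q[v]) / self.n_visits[v]

    self.policy = np.argmax(self.Q, axis=1)    # greedy policy w.r.t. updated Q
    self.explore_prob *= self.xi               # geometric decay of exploration
    self.G.fill(0.0)                           # reset per-episode tables
    self.visited.fill(False)

MCLearner.update_policy = update_policy        # attach to the sketch class above
```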

To train our learner, we use the following pseudo-code to repeatedly play the game N times, where N can be an arbitrarily large number. Each of these N games can also be called an episode or an epoch. In this case, we keep track of the total reward received in each game as well as whether we win or lose each game.

Pseudo-code for training MC learner
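
A sketch of this training loop is shown below. The environment interface (reset, reward, step, and so on) is an assumption made for illustration and does not match the actual repo's API.

```python
def train(learner, env, n_episodes=1000):
    """Play n_episodes games, updating the learner's policy after each one."""
    rewards_per_episode, wins = [], []
    for _ in range(n_episodes):
        s = env.reset()
        total_reward = 0.0
        while True:
            r = env.reward(s)             # reward r(s) of the current state
            total_reward += r
            learner.percept(s, r)         # credit earlier (s, a) pairs with r
            if env.is_terminal(s):
                break
            a = learner.actuate()         # choose an action, mark (s, a) visited
            s = env.step(a)
        learner.update_policy()           # end-of-episode Q and policy update
        rewards_per_episode.append(total_reward)
        wins.append(env.is_win(s))        # hypothetical helper to track win/lose
    return rewards_per_episode, wins
```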

Implement MC learner in Python

Similar to our previous ADP learner, the MC learner is also implemented as a class with percept, actuate, and policy update functions.

To solve our grid problem with an MC learner, we implement a solver following the pseudo-code above; the complete code can be found in the repo linked at the beginning of this story. In this case, we train our MC learner for 1,000 episodes on our grid world example.
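
For orientation, a driver could look roughly like the sketch below, reusing the train helper sketched earlier. The GridWorld class, its constructor arguments, and the MCLearner signature here are all assumptions for illustration; the actual classes live in the mad-from-scratch repo and may differ.

```python
# Hypothetical driver: class names and arguments are illustrative only.
env = GridWorld(win_reward=10.0, lose_reward=-2.5, step_reward=-0.04)
learner = MCLearner(n_states=env.n_states, n_actions=env.n_actions, xi=0.99)

rewards, wins = train(learner, env, n_episodes=1000)
print(f"win rate over the last 100 games: {sum(wins[-100:]) / 100:.0%}")
```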

How good is our MC learner?

Let’s see how well our MC learner works! We apply our MC learner to the same grid problem as the ADP learner. Again, we start from the black circle, with a winning reward of 10.0 for the green grid, a losing reward of −2.5 for the red grids, and a reward of −0.04 for any other passable grid.

Using a similar approach to balance exploration and exploitation as in the ADP learner, we first try ξ = 0.99.

At the beginning, we know nothing about the game, and we rely on random guessing. Therefore, at the very beginning, we get total rewards of around −1000, −900, −800… As we play more and more, we learn more and more, and the reward we get from a single game tends to get better. However, in terms of the sum of total rewards received over 1,000 games, the MC learner performs much worse than the ADP learner in our previous story.

Let’s see what happens if we change ξ to 0.999.

Although the general trend is similar to that of ξ = 0.99, when ξ is 0.999 we spend more time in exploration, which results in a worse total reward.

We can also compare the performance of the ADP learner and the MC learner using the winning percentage during the training episodes. Since both learners are trained for 1,000 games, we can compare the winning percentages of their first 100 games, second 100 games, and so on.

As shown in this diagram of winning percentages, our ADP learner with ξ of 0.99 performs the best: after just 200 games, it can win almost 100% of the games. The ADP learner with ξ of 0.999 manages to win almost 100% after around 700 games. For the MC learner with ξ of 0.99, learning improves its winning percentage from about 10% to almost 80%. Nevertheless, it struggles to achieve a winning percentage over 80%, not to mention the near 100% achieved by the ADP learner.

In summary, with the same MDP problem, rewards, number of training episodes, and decay factors, our MC learner performs worse than our previous ADP learner.

Improve MC learner by training more

How can we improve the performance of our MC learner? To do that, we first need to investigate why the MC learner tends to deliver worse results than the ADP learner. Our MC learner, which is model-free, does not learn the transition model. As a result, it extracts less information from each game than an ADP learner does. Therefore, intuitively, if the MC learner and the ADP learner are trained with the same number of episodes, the MC learner learns less and thus performs worse.

How can we improve on that? Well, one simple way is to spend more training episodes. Since the MC learner learns less per episode, if it is trained with more episodes, it may be able to perform similarly to, or even better than, the ADP learner. Let’s see what happens if we train our MC learner for 10,000 episodes, 10 times as many as before.

Given the opportunity to train for 10,000 games, our MC learner reaches a winning percentage of about 82%, which is only slightly better than before. As shown in the winning percentage diagram above, after about 5,000 games the winning percentage tends to stop increasing. In summary, simply adding training episodes only marginally improves the results of an MC learner.

Improve MC learner by tailoring reward

Another possible way to improve the performance of the MC learner is to tailor the rewards. For example, in our problem, the winning reward is 10 and the losing reward is −2.5. What if we change 10 to some other value?

Some may argue that this reward is an internal part of the MDP problem. How can we solve a problem by arbitrarily changing the original problem? Is this cheating?

Well, when we say we try to solve this MDP problem, what do we really mean by solve? In our problem, the rewards are not the ultimate goal. Our goal is actually to “reach the green grid and avoid the red grids”. To achieve this goal, we assign a positive reward to the green grid and negative rewards to the red grids to let our learner know which grids are preferable. In this sense, the rewards assigned to these grids can be somewhat arbitrary. Therefore, we can tailor or tweak these reward values to help our learner achieve that ultimate goal. For instance, what if the losing reward is still −2.5 but the winning reward is increased from 10 to 1,000? Can this help our MC learner understand that we really, really, really want it to reach that green grid?

This time, with the winning reward adjusted to 1,000 instead of 10, our MC learner finally achieves a 100% winning percentage within 1,000 games. This shows that tweaking rewards is a valid way to improve an MC learner.

If we compare the results before and after tweaking the rewards, the ADP learner does not change that much, while the MC learner gets significantly better. Because an ADP learner learns the transition model as well as the rewards, the change in rewards does not make that much of a difference. In contrast, the MC learner only learns from the rewards. Therefore, adjusting the rewards can considerably change the performance of an MC learner.

From Monte Carlo to Temporal Difference

Comparing our MC learner to the previous ADP learner, the benefit is that we get rid of estimating the transition model. However, we pay a heavy cost for this benefit: the MC learner requires many more training episodes. Even after being trained with many episodes, the performance of the MC learner is still not that great. Can we do better than this? Can we have another model-free learner that performs better than the ADP learner without the need to tweak rewards?

The answer turns out to be yes. One such method is called temporal difference (TD). In reinforcement learning, a popular type of temporal difference method is called Q-learning, in which the Q-values are learnt directly.

To begin with, in the MC learner, we update the Q-values only after each game. In order to update the Q-values many, many times, we need to play the game many, many times. In Q-learning, however, we are able to update the Q-values after each step within a game. As a result, we can update Q-values more frequently, and more accurately, with far fewer games. We will talk about Q-learning in our next story. Stay tuned and see you next time.
