Key Takeaways 3 (MC Methods) From Sutton and Barto RL textbook

What you should know about Monte Carlo Methods

James Koh, PhD
MITB For All
Sep 4, 2023


In Chapter 5, we learn about first-visit vs. every-visit MC, on/off-policy prediction and control with/without exploring starts, and importance sampling.


Background

This is the third part of my series on the Reinforcement Learning (RL) textbook by Sutton and Barto. The first part (corresponding to Chapters 2 and 3) covers fundamental concepts/notations along with multi-armed bandits and Markov decision processes. The second part (corresponding to Chapter 4) covers dynamic programming and policy iteration.

Limitations of earlier approaches

Up till now, we have assumed complete knowledge of the environment, i.e. that p(s',r|s,a) is known to us.

Recall that the return G is the sum of all future rewards, discounted by an appropriate γ. So far, we have learnt how to solve for the value function of Example 3.8 via two approaches.

  1. By utilizing the Bellman equation, we know that each ‘point’ in v(s), i.e. each value corresponding to a particular state, is interdependent with the others. Therefore, we can solve for v(s) across the state space simultaneously with linear algebra, as shown in Key Takeaways 1 (see the sketch after this list).
  2. We can also apply dynamic programming, and iteratively update the value of each state. This works because the return G at a terminal state is 0; we do not have any future reward thereafter. By having all state values initialized to 0, these terminal states are already correct to begin with. Under a deterministic policy, states which precede these terminal states will have the correct value after the first sweep: the return is simply r, given that v(s')=0 when s' is terminal. In the second sweep, states which are two ‘steps’ away become correct, and so on. In general, both the policy and the transitions may be stochastic, hence we take a weighted average across the probabilities π(a|s) and p(s',r|s,a).
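To make item 1 concrete, here is a minimal sketch of the linear-algebra approach, using a made-up two-state example (the transition matrix and rewards below are assumptions for illustration, not the gridworld of Example 3.8):

import numpy as np

# Under a fixed policy, v = r + gamma * P v, so v = (I - gamma * P)^{-1} r.
# P[s, s'] is the probability of moving from s to s' under the policy (hypothetical numbers).
gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
r = np.array([1.0, -1.0])  # expected immediate reward from each state (hypothetical)

v = np.linalg.solve(np.eye(2) - gamma * P, r)
print(v)  # all state values solved simultaneously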

In reality, we usually do not have full knowledge of the environment. This means the above would not work.

Why Monte Carlo?

The image of a casino is probably what comes to mind when the words ‘Monte Carlo’ pop up. As a data science practitioner, you should think of simulations and statistics.

Imagine that you enter a casino and count that, in the past 10,000 spins of roulette, the results were 4,870 red, 4,870 black and 260 ending in ‘0’. Even without seeing the roulette wheel, it is fair to estimate that the probability of getting red is 0.487, and the same is true for black.
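Here is a minimal simulation sketch of that idea; the wheel probabilities below are an assumption (a single-zero wheel), and the point is that simply counting outcomes recovers them without ever inspecting the wheel:

import numpy as np

rng = np.random.default_rng(0)
# hypothetical single-zero wheel: 18 red, 18 black, 1 zero pocket
spins = rng.choice(['red', 'black', 'zero'], size=10_000, p=[18/37, 18/37, 1/37])
print(np.mean(spins == 'red'))  # close to 18/37 ≈ 0.486, estimated purely from observed spins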

What’s next?

More often than not, the agent only gets to sense the environment, but does not know the underlying transition dynamics. This is not a big problem, because we can simply sample from the environment.

We apply statistics to the observed experience. Here, experience is a technical word. It is the history of actions taken in past states, with the corresponding next state and reward obtained, both of which depend on the environment and could be stochastic.

Our focus is on model-free RL. It means we will not attempt to create a model of the environment. Do not be confused and associate the word ‘model-free’ with having no parameterized function or learning involved. We still learn a value function and/or policy.

Monte Carlo Methods (Chapter 5)

You would notice that Section 5.1 is called ‘MC Prediction’ while Section 5.3 is called ‘MC Control’. Questions may come to mind. Aren’t we making some sort of prediction precisely for the purpose of control? Why is there a distinction between what is seemingly the same thing, Monte Carlo?

MC Prediction

In this context, prediction refers to learning the state-value function, v(s), of a given policy. It is about predicting the expected return G from a given state, assuming the agent follows the policy π.

Rather than compute a weighted average of expected returns using the probability distributions, we simply observe how a trajectory ends up. All the states visited in a particular trajectory can be updated, because the subsequent rewards are known.

Sutton and Barto described three advantages of using MC over DP:

  1. the ease of generating sample episodes (actual or simulated experience) to work with, which can be a “significant advantage even when one has complete knowledge of the environment’s dynamics”.
  2. the independence of the estimates for each state, given that we do not bootstrap.
  3. the ability to focus only on states of interest and ignore other states which would not be visited anyway.

The distinction between first-visit MC and every-visit MC is also made in Section 5.1.

The textbook gives the blackjack example in Example 5.1. Instead, I will use a simple gridworld environment to illustrate how first-visit and every-visit MC work (and will reuse it for the rest of this article).

Gridworld example

Suppose we have the environment described in Example 3.8 (textbook page 72). Let’s modify it such that the bottom-right cell is a terminal state. This gives us an episodic task. In addition, we prescribe that the special bonus of +10 (at A) and +5 (at B) can only be obtained once, to prevent the agent from wanting to remain in an episode forever.

The environment is defined as follows. It has the option of exploring starts (more on this later) built in.

import numpy as np


class Env:
    ## Modifying example 3.8 of the textbook (page 72)
    def __init__(self):
        self.min_row, self.max_row = 0, 4
        self.min_col, self.max_col = 0, 4
        self.terminal = [[self.max_row, self.max_col]]  # only one terminal state
        self.reset()

    def reset(self, random=True):
        self.flag01, self.flag03 = True, True
        if random:
            # note the need to deal with the possibility of starting at terminal states in exploring starts
            while True:
                self.cur_state = [np.random.randint(self.max_row + 1), np.random.randint(self.max_col + 1)]
                if self.cur_state not in self.terminal:
                    break
        else:
            self.cur_state = [0, 0]
        return self.cur_state

    def _get_state_dim(self):
        return (self.max_row + 1, self.max_col + 1)

    def transition(self, state, action):
        reward = 0
        if state == [0, 1]:  # special state A: jump to A'; the +10 bonus is given only once per episode
            state = [4, 1]
            if self.flag01:
                reward = 10
                self.flag01 = False
        elif state == [0, 3]:  # special state B: jump to B'; the +5 bonus is given only once per episode
            state = [2, 3]
            if self.flag03:
                reward = 5
                self.flag03 = False
        else:
            if action == 0:
                state[1] += 1  # move right one column
            elif action == 1:
                state[0] += 1  # move down one row
            elif action == 2:
                state[1] -= 1  # move left one column
            elif action == 3:
                state[0] -= 1  # move up one row
            else:
                assert False, "Invalid action"

        if (state[0] < self.min_row) or (state[1] < self.min_col) \
                or (state[0] > self.max_row) or (state[1] > self.max_col):
            reward = -1

        next_state = np.clip(
            state, [self.min_row, self.min_col], [self.max_row, self.max_col]
        ).tolist()
        done = next_state in self.terminal
        return reward, next_state, done

Note that, strictly speaking, a state in an MDP is memoryless; the transition and reward depend only on the current state and action, not on the history. However, I am deliberately introducing a differentiating factor (the one-time bonus flags) simply to exaggerate the difference between the first-visit and every-visit approaches. In class, we will explore a more practical scenario that may be encountered in reality.

I wrote a simple class that can be used to compute the state-value function via first-visit or every-visit MC. Just set the first_visit argument in the predict_v() method.

from copy import deepcopy

import numpy as np
from tqdm import tqdm


class MonteCarlo:
    def __init__(self, Env):
        self.Env = Env
        self.V_s = np.zeros((self.Env.max_row + 1, self.Env.max_col + 1))
        self.visits = dict(
            [((i, j), 0) for i in range(self.Env.max_row + 1) for j in range(self.Env.max_col + 1)]
        )
        self.gamma = 0.95

    def policy(self, state):
        return np.random.choice(4)  # 25% probability each for 0/1/2/3

    def sample_episode(self):
        episode = []
        state = self.Env.reset(random=True)

        done = False
        while not done:
            action = self.policy(state)
            reward, next_state, done = self.Env.transition(deepcopy(state), action)
            episode.append((state, action, reward))
            state = next_state

        return episode

    def predict_v(self, num_episodes, first_visit=True):
        for _ in tqdm(range(num_episodes)):
            episode = self.sample_episode()

            # backward pass: compute the return G_t at every time step of the episode
            G = 0
            future_rewards = np.zeros(len(episode))
            for i, (state, action, reward) in enumerate(reversed(episode)):
                G = reward + self.gamma * G
                future_rewards[-(i + 1)] = G

            # forward pass: under first-visit MC, only the first occurrence of each state is used
            visited_states = set()  # reset on each new episode
            for i, (state, action, reward) in enumerate(episode):
                G = future_rewards[i]
                s_tuple = tuple(state)

                if first_visit:
                    if s_tuple in visited_states:
                        continue
                    visited_states.add(s_tuple)

                self.visits[s_tuple] += 1
                self.V_s[s_tuple] += 1 / (self.visits[s_tuple]) * (G - self.V_s[s_tuple])

To run, you can simply plug-and-play as follows.

import matplotlib.pyplot as plt
import seaborn as sns

env = Env()
agent = MonteCarlo(env)
agent.predict_v(5000, first_visit=False)
sns.heatmap(
    agent.V_s, annot=True, cmap='coolwarm', fmt='.1f'
)
plt.show()

(Like my code? Follow me and you’ll have lots more code to take!)

Moving on to Section 5.2 of the textbook, we see that this concept of sampling from experience can be used to compute the action-value function, q(s,a), directly. (In Figures 5.1 and 5.2 of the textbook, as well as in my code above, it is the state-value function v(s) that has been computed.)

Having the action value q(s,a) is actually more helpful for finding a good policy, because we can directly determine which action in a particular state leads to the highest expected future cumulative reward.

To do so, the code above can easily be modified such that the keys of the dictionary are state-action pairs. Just modify it like this (only the relevant portions shown):

class MonteCarlo:
    def __init__(self, Env):
        self.Q_sa = np.zeros((self.Env.max_row + 1, self.Env.max_col + 1, 4))
        self.visits_sa = dict(
            [((i, j, k), 0) for i in range(self.Env.max_row + 1)
             for j in range(self.Env.max_col + 1) for k in range(4)]
        )

    def predict_q(self, num_episodes, first_visit=True):
        for _ in tqdm(range(num_episodes)):
            episode = self.sample_episode()
            G = 0
            visited_sa_pairs = set()  # reset on each new episode

            future_rewards = np.zeros(len(episode))
            for i, (state, action, reward) in enumerate(reversed(episode)):
                G = reward + self.gamma * G
                future_rewards[-(i + 1)] = G

            for i, (state, action, reward) in enumerate(episode):
                G = future_rewards[i]
                sa_tuple = tuple(state + [action])

                if first_visit:
                    if sa_tuple in visited_sa_pairs:
                        continue
                    visited_sa_pairs.add(sa_tuple)

                self.visits_sa[sa_tuple] += 1
                self.Q_sa[sa_tuple] += 1 / (self.visits_sa[sa_tuple]) * (G - self.Q_sa[sa_tuple])

MC Control

Notice that the first figure in Section 5.3 is very similar to Figure 4.7. That is because the concepts are similar! This time, we iterate between action-value evaluation and policy improvement. We now learn q_{\pi} instead of v_{\pi}. With this q_{\pi}, we can improve the policy by acting greedily.

Here, the strategy of exploring starts is also introduced. It means starting each episode from a randomly selected state-action pair (in my code above, simply a random non-terminal state). This makes it much more likely for the agent to experience a good state, and avoids the scenario of being trapped in some local optimum and having to rely on pure chance (which is minuscule) to ever reach that state.

After all, if the agent doesn’t even get a glimpse of the desirable state-action pairs, how could we expect it to learn its way towards them?

With the code shared above that allows us to compute the action-value function from sampled observations, we can implement this iterative process. Again, I will just show the relevant portion.

class MonteCarlo:
    def get_policy(self, Q_sa=None):
        Q_sa = Q_sa if Q_sa is not None else self.Q_sa
        greedy_policy = np.zeros((self.Env.max_row + 1, self.Env.max_col + 1))

        for i in range(self.Env.max_row + 1):
            for j in range(self.Env.max_col + 1):
                greedy_policy[i, j] = np.argmax(Q_sa[i, j, :])

        return greedy_policy

    def iterate(self):
        # for full control, self.policy should also act (epsilon-)greedily w.r.t. self.Q_sa,
        # so that later episodes are generated by the improved policy
        while True:
            old_Q = deepcopy(self.Q_sa)
            self.predict_q(num_episodes=2)  # the textbook alternates after every single episode
            new_policy = self.get_policy()
            if np.max(np.abs(old_Q - self.Q_sa)) < 1e-3:
                break
        return new_policy

Sutton and Barto also discussed the convergence guarantees (towards the optimal policy) that this MC method brings, subject to the assumptions of exploring starts and an infinite number of episodes.

As a practitioner, I am of the view that we should not be too caught up with the presence or absence of such guarantees — your boss doesn’t care. You may have an approach with such guarantees, but if the results are not promising, do you wait until infinity? And on the other side of the coin, you could get a well-functioning agent that serves the business objectives, even without such convergence guarantees.

Without Exploring Starts

In the code above, the environment has a reset method defined as

def reset(self, random=False):
    if random:
        # simplified for readability; did not deal with terminal states here
        self.cur_state = [np.random.randint(self.max_row + 1), np.random.randint(self.max_col + 1)]
    else:
        self.cur_state = [0, 0]

Setting random to False removes the exploring starts. To ensure that all actions still get selected infinitely often, an ε-soft policy can be used. You have already heard about the ε-greedy policy in Chapter 2, where a random action is taken with probability ε, and the greedy action is taken with probability (1−ε).

The ε-soft policy is the general form: a random action is taken with probability ε, so every action keeps a probability of at least ε/|A(s)|. For the remaining (1−ε), you could have, say, a softmax over the Q values instead of the argmax.
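As a sketch of what an ε-greedy member of this family could look like in the code above, the policy() method of the MonteCarlo class (the predict_q variant, which has self.Q_sa) might be replaced with something like this; the epsilon value of 0.1 is an arbitrary assumption:

def policy(self, state, epsilon=0.1):
    # epsilon-soft: every action keeps probability at least epsilon/4
    if np.random.rand() < epsilon:
        return np.random.choice(4)  # explore uniformly at random
    return int(np.argmax(self.Q_sa[state[0], state[1], :]))  # otherwise act greedily w.r.t. Q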

The math equations there are just to show that policy iteration still works for ε-soft policies. Don’t sweat over them; knowing how to implement this is more important than deriving the proof here.

Importance Sampling

In Section 5.5, we have our first encounter with the distinction between ‘on-policy’ and ‘off-policy’. But first, there is a concept that must be understood: importance sampling.

There are plenty of math equations here. If you are lost, try to get the intuition first. Consider the following example.

Suppose, in some Gridworld environment, we have dist_A (blue) as well as dist_B (orange), representing the probability distributions of policy_A and policy_B, respectively. Following policy_A, we know that the expected return at state Sₚ is 0.1(4) + 0.2(5) + 0.3(7) + 0.4(9) = 7.1. Now, if we were to estimate the return based on observations sampled from policy_B, adjustments are required to ensure fairness.

Probability distributions of the two hypothetical policies. Created using matplotlib by author.

(For simplicity, we assume that the action probabilities at all other states are identical, and that they differ only at state Sₚ.)

By sampling from policy_B, we can expect that out of every 100 trials, 40 of the actions would be ‘left’, 30 would be ‘down’, 20 would be ‘right’ and 10 would be ‘up’.

Each time the ‘left’ action is sampled from policy_B, it should be given less weight, because we rarely choose this action under policy_A. Meanwhile, each time the ‘up’ action is sampled, it should be given more weight, because it would actually be frequent under policy_A. Adjusting by the appropriate ratios, we would have 0.4((0.1/0.4)*4) + 0.3((0.2/0.3)*5) + 0.2((0.3/0.2)*7) + 0.1((0.4/0.1)*9) = 7.1, which recovers the expected return under policy_A.
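Here is a minimal sketch of that adjustment as a simulation; dist_A, dist_B and the four returns are just the hypothetical numbers from this example:

import numpy as np

rng = np.random.default_rng(0)
dist_A = np.array([0.1, 0.2, 0.3, 0.4])   # target policy probabilities at state S_p
dist_B = np.array([0.4, 0.3, 0.2, 0.1])   # behavior policy probabilities at state S_p
returns = np.array([4.0, 5.0, 7.0, 9.0])  # return following each action

actions = rng.choice(4, size=100_000, p=dist_B)  # sample actions under policy_B
ratios = dist_A[actions] / dist_B[actions]       # importance-sampling ratios
print(np.mean(ratios * returns[actions]))        # approaches 7.1, the value under policy_A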

Note that equations (5.4) and (5.5) of the textbook show the ordinary (simple average) and the weighted importance-sampling estimates over Gₜ, respectively. Just apply the same line of thought.
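For reference, paraphrasing the textbook's notation (my rendering, not a verbatim quote): with 𝒯(s) the set of time steps at which s is visited, T(t) the end of the episode containing t, and ρ the importance-sampling ratio for the remainder of that episode,

V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|} \quad \text{(ordinary importance sampling)}

V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}} \quad \text{(weighted importance sampling)}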

In Section 5.6, Sutton and Barto talked about an incremental implementation, on an episode-by-episode basis. I already did that in the code above (under Gridworld example). If you are unsure, it is simply this portion.

self.Q_sa[sa_tuple] += 1/(self.visits_sa[sa_tuple]) * (G - self.Q_sa[sa_tuple])
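That line is just the incremental form of a sample average. If V_{k−1} is the average of the first k−1 returns observed for that entry, then after the k-th return G_k,

V_k = V_{k-1} + \frac{1}{k}\left(G_k - V_{k-1}\right)

which is exactly what the visits counter and the update above compute, without having to store a list of past returns.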

Off-Policy

In Section 5.7, we look at the difference between on-policy and off-policy learning. This is not to be confused with online vs offline learning.

We need the definitions of a target policy (π) and a behavior policy (μ or b; the in-progress version uses μ while the newer edition uses b). The target policy is the policy that is evaluated and improved, while the behavior policy is the one used to generate the experience we learn from. π could be greedy with respect to the estimated action-value function, while μ can be any ε-soft policy (to ensure all state-action pairs are sampled).

You may refer here for a detailed example with results.

Sutton and Barto pointed out a drawback that “this method learns only from the tails of episodes”. This refers specifically to the use of a greedy target policy and an ε-soft behavior policy with weighted importance sampling. The greedy policy has zero probability of taking an exploratory action, i.e. π(a|s)=0 if a does not have the highest value, so the return receives zero weight for every time step at or before the episode's last exploratory action, and only the tail after it contributes.

Importance sampling can lead to high variance, especially when the policies are dissimilar and the product in the numerator and denominator of the ratio varies by many orders of magnitude.
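As a quick illustration of why (with made-up per-step probabilities, not numbers from the gridworld code), the ratio for a whole trajectory is a product of per-step ratios, so mismatches compound multiplicatively:

import numpy as np

rng = np.random.default_rng(0)
T = 50  # a hypothetical 50-step trajectory
pi_probs = rng.uniform(0.05, 1.0, size=T)  # target-policy probability of each action taken
b_probs = rng.uniform(0.05, 1.0, size=T)   # behavior-policy probability of each action taken

rho = np.prod(pi_probs / b_probs)  # the trajectory's importance-sampling ratio
print(rho)  # can come out vanishingly small or astronomically large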

The truncated returns in Section 5.8 attempt to mitigate this by considering the ratio of probabilities only up till some horizon h. While this may introduce bias, the reduction in variance can be worth the trade-off.

Conclusion

Great! You now understand Chapter 5 of the most popular RL textbook!

In my fourth article for this series, I will be writing about Chapter 6, which covers temporal difference learning.

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
