Reinforcement Learning Chapter 5: Monte Carlo Methods (Part 1 — Monte Carlo Prediction)
Chapter 5 Series:
- Part 1 — Monte Carlo Prediction
- Part 2 — Monte Carlo Control
- Part 3 — MC without Exploring Starts
- Part 4 — Off-policy via Importance Sampling
Code: https://github.com/nums11/rl
The previous few articles covered Dynamic Programming methods as the first set of solutions to the full reinforcement learning problem. In this and the next few articles, we’ll learn about a new set of methods called Monte Carlo methods.
Monte Carlo
Monte Carlo (MC) methods are a class of solutions to the full RL problem based on averaging sample returns.
They can be thought of as similar to the sample-averaging methods from Chapter 2, except instead of averaging the immediate rewards for each action, they average the returns (discounted, cumulative rewards) for each state (or, later, each state-action pair).
Monte Carlo vs. Dynamic Programming
MC methods differ from DP methods in three distinct ways:
Firstly, they do not assume complete knowledge of the environment
- DP methods require a complete world model that is often unavailable in practice. MC methods have no such requirement.
Secondly, they only require experience (simulated or actual)
- DP methods can use their perfect world model to calculate optimal value functions without ever interacting with the actual environment. MC methods, on the other hand, learn value functions from experience.
- The experience can be either simulated or gathered from the actual environment. Often it is possible to generate simulated experience by sampling from some distribution even when obtaining the distribution in explicit form is infeasible (e.g., in Blackjack, the probability of winning from a specific hand combination is difficult to compute, but simulating a game is easy).
Thirdly, they do not bootstrap
- Estimates for each state are independent of other states
- This also means that the computational expense of estimating the value of a single state does not depend on the total number of states
Monte Carlo Prediction
Recall the prediction problem where you have some policy π and you wish to estimate its state value function vπ. How can we do this with MC methods?
- One way would be to run many episodes in the environment while following the policy π, then estimate the value of each state as the average of the returns observed following the first visit to that state in each episode. We call this first-visit MC.
- Alternatively, we could average the returns following every visit to the state, which is called every-visit MC (a short numerical sketch of the difference follows this list).
- As we increase the number of sample episodes, both approaches will converge to the true vπ.
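To make the difference concrete, here is a tiny numerical sketch; the episode, rewards, and visit times are made up purely for illustration:

```python
gamma = 1.0
rewards = [1, 0, 2]     # R1, R2, R3 for a hypothetical 3-step episode
visits_to_s = [0, 2]    # suppose state s is visited at time steps 0 and 2

def return_from(t):
    """Discounted return G_t following time step t."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# First-visit MC averages only the return from the first visit (t=0): G = 3.0
first_visit_estimate = return_from(visits_to_s[0])
# Every-visit MC averages the returns from both visits: (3.0 + 2.0) / 2 = 2.5
every_visit_estimate = sum(return_from(t) for t in visits_to_s) / len(visits_to_s)
print(first_visit_estimate, every_visit_estimate)   # 3.0 2.5
```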
Below is the algorithm for first-visit MC prediction:
- At each step, working backwards through the episode, G is updated as G ← γG + Rₜ₊₁, so it accumulates the discounted return from that time step onwards
- The ‘Unless’ statement checks that this is the first visit to the state Sₜ in the episode; if it is, we record the return G and update the running average for that state
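In code, the backward pass over a single episode looks roughly like the sketch below. The toy episode, gamma, and the names V and returns are mine, chosen to mirror the symbols in the pseudocode:

```python
from collections import defaultdict

# Each episode entry is a (S_t, R_{t+1}) pair; this episode is hypothetical.
episode = [("s1", 0), ("s2", 1), ("s1", 2)]
gamma = 0.9
V = defaultdict(float)
returns = defaultdict(list)

G = 0
for t in reversed(range(len(episode))):
    state, reward = episode[t]
    G = gamma * G + reward                     # G <- gamma*G + R_{t+1}
    # the "Unless" check: only record G on the first visit to `state`
    if state not in [s for s, _ in episode[:t]]:
        returns[state].append(G)
        V[state] = sum(returns[state]) / len(returns[state])
```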
Implementation
Below is a Python implementation of first-visit MC prediction for the simple 4x4 grid world defined in the previous chapter. Implementations of the helper functions can be found on GitHub.
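A minimal sketch of such an implementation is shown below, assuming a grid-world environment object with reset() and step(state, action) -> (next_state, reward, done) methods; these names, and random_policy, are placeholders for the repo's actual helpers rather than their real identifiers:

```python
import random
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes=10_000, gamma=1.0):
    """Estimate V(s) under `policy` by averaging the returns that follow
    the first visit to s in each sampled episode."""
    V = defaultdict(float)
    returns = defaultdict(list)

    for _ in range(num_episodes):
        # Generate one episode by following the policy in the environment
        episode = []                                   # (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(state, action)
            episode.append((state, reward))
            state = next_state

        # Backward pass: same first-visit update as the loop shown above
        G = 0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if s not in [x for x, _ in episode[:t]]:   # first visit to s?
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])

    return V

def random_policy(state):
    """Equiprobable random policy over the four grid-world moves."""
    return random.choice(["up", "down", "left", "right"])
```

Keeping a separate list of returns per state reflects the no-bootstrapping property discussed above: each V[s] is just a sample average of that state's own observed returns, independent of the estimates for other states.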
When run on the 4x4 grid world over 10k episodes, it generates the following state values:
In this article, we learned about the basics of Monte Carlo methods, how they differ from Dynamic Programming methods, and how they can be used to solve the prediction problem. In the next article, we’ll look at how MC methods can solve the control problem.