Reinforcement Learning Chapter 5: Monte Carlo Methods (Part 1 — Monte Carlo Prediction)
Chapter 5 Series:
- Part 1 — Monte Carlo Prediction
- Part 2 — Monte Carlo Control
- Part 3 — MC without Exploring Starts
- Part 4 — Off-policy via Importance Sampling
Code: https://github.com/nums11/rl
The previous few articles covered Dynamic Programming methods as the first set of solutions to the full reinforcement learning problem. In this and the next few articles, we’ll learn about a new set of methods called Monte Carlo methods.
Monte Carlo
Monte Carlo (MC) methods are a class of solutions to the full RL problem based on averaging sample returns.
They can be thought of as similar to the sample-averaging methods from Chapter 2, except instead of averaging the immediate rewards for each action, they average the returns (discounted, cumulative rewards) for each state (or, later, each state-action pair).
Monte Carlo vs. Dynamic Programming
MC methods differ from DP methods in three distinct ways:
Firstly, they do not assume complete knowledge of the environment
- DP methods require a complete world model that is often unavailable in practice. MC methods have no such requirement.
Secondly, they only require experience (simulated or actual)
- DP methods can use their perfect world model to calculate optimal value functions without ever interacting with the actual environment. MC methods, on the other hand, learn value functions from experience.
- The experience can be either simulated or gathered from the actual environment. Often it is possible to generate simulated experience by sampling from some distribution even when obtaining the distribution in explicit form is infeasible (e.g., in Blackjack, the probability of winning from a specific hand combination is difficult to compute, but simulating a game is easy).
Thirdly, they do not bootstrap
- Estimates for each state are independent of other states
- This also means that the computational expense of estimating the value of a single state does not depend on the total number of states
Monte Carlo Prediction
Recall the prediction problem where you have some policy π and you wish to estimate its state value function vπ. How can we do this with MC methods?
- One way would be to run many episodes in the environment while following the policy π, then estimate the value of each state as the average of the returns observed following the first visit to that state in each episode. We call this first-visit MC.
- Alternatively, we could average the returns following every visit to the state, which is called every-visit MC (a short numerical sketch of the difference follows this list).
- As we increase the number of sample episodes, both approaches will converge to the true vπ.
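To make the difference concrete, here is a tiny numerical sketch; the episode, rewards, and visit times are made up purely for illustration:

```python
gamma = 1.0
rewards = [1, 0, 2]     # R1, R2, R3 for a hypothetical 3-step episode
visits_to_s = [0, 2]    # suppose state s is visited at time steps 0 and 2

def return_from(t):
    """Discounted return G_t following time step t."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# First-visit MC averages only the return from the first visit (t=0): G = 3.0
first_visit_estimate = return_from(visits_to_s[0])
# Every-visit MC averages the returns from both visits: (3.0 + 2.0) / 2 = 2.5
every_visit_estimate = sum(return_from(t) for t in visits_to_s) / len(visits_to_s)
print(first_visit_estimate, every_visit_estimate)   # 3.0 2.5
```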
Below is the algorithm for first-visit MC prediction:
- At each step, working backwards through the episode, G is updated as G ← γG + Rₜ₊₁, so it accumulates the discounted return from that time step onwards
- The ‘Unless’ statement checks that this is the first visit to the state Sₜ in the episode; if it is, we record the return G and update the running average for that state
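In code, the backward pass over a single episode looks roughly like the sketch below. The toy episode, gamma, and the names V and returns are mine, chosen to mirror the symbols in the pseudocode:

```python
from collections import defaultdict

# Each episode entry is a (S_t, R_{t+1}) pair; this episode is hypothetical.
episode = [("s1", 0), ("s2", 1), ("s1", 2)]
gamma = 0.9
V = defaultdict(float)
returns = defaultdict(list)

G = 0
for t in reversed(range(len(episode))):
    state, reward = episode[t]
    G = gamma * G + reward                     # G <- gamma*G + R_{t+1}
    # the "Unless" check: only record G on the first visit to `state`
    if state not in [s for s, _ in episode[:t]]:
        returns[state].append(G)
        V[state] = sum(returns[state]) / len(returns[state])
```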
Implementation
Below is a Python implementation of first-visit MC prediction for the simple 4x4 grid world defined in the previous chapter. Implementations of the helper functions can be found on GitHub.
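A minimal sketch of such an implementation is shown below, assuming a grid-world environment object with reset() and step(state, action) -> (next_state, reward, done) methods; these names, and random_policy, are placeholders for the repo's actual helpers rather than their real identifiers:

```python
import random
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes=10_000, gamma=1.0):
    """Estimate V(s) under `policy` by averaging the returns that follow
    the first visit to s in each sampled episode."""
    V = defaultdict(float)
    returns = defaultdict(list)

    for _ in range(num_episodes):
        # Generate one episode by following the policy in the environment
        episode = []                                   # (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(state, action)
            episode.append((state, reward))
            state = next_state

        # Backward pass: same first-visit update as the loop shown above
        G = 0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if s not in [x for x, _ in episode[:t]]:   # first visit to s?
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])

    return V

def random_policy(state):
    """Equiprobable random policy over the four grid-world moves."""
    return random.choice(["up", "down", "left", "right"])
```

Keeping a separate list of returns per state reflects the no-bootstrapping property discussed above: each V[s] is just a sample average of that state's own observed returns, independent of the estimates for other states.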
When run on the 4x4 grid world over 10k episodes, it generates the following state values:
In this article, we learned about the basics of Monte Carlo methods, how they differ from Dynamic Programming methods, and how they can be used to solve the prediction problem. In the next article, we’ll look at how MC methods can solve the control problem.