Reinforcement Learning Chapter 5 — Monte Carlo Methods (Part 3: MC without Exploring Starts)
Chapter 5 Series:
- Part 1 — Monte Carlo Prediction
- Part 2 — Monte Carlo Control
- Part 3 — MC without Exploring Starts
- Part 4 — Off-policy via Importance Sampling
Code: https://github.com/nums11/rl
In the previous article, we learned about an MC approach to control that requires the assumption of exploring starts in order to maintain exploration. In this article, we'll learn about a different solution to that problem, one that allows us to remove this assumption.
On-Policy vs. Off-Policy Methods
Before we look into how we can remove the assumption of exploring starts we must first understand the difference between on-policy and off-policy methods.
On-policy methods
- Evaluate and improve the same policy that is used to generate the data
- Generally simpler
Off-policy methods
- Evaluate and improve a policy different from the one used to generate data
- Require additional notation
- Greater variance and slower convergence
- More powerful and more general
Epsilon Greedy
We’ll dive more into off-policy methods later, but for now, understand that the previous algorithm we defined for control — MC Control with Exploring Starts — is an on-policy method.
We can remove the assumption of exploring starts and stay on policy by changing our policy to an epsilon-greedy one.
- Recall that an epsilon-greedy policy is one that exploits (chooses the greedy action) with probability 1 - ε and explores (chooses an action uniformly at random) with probability ε (see the sketch after this list).
- This solves the problem of maintaining exploration since it ensures that all state-action pairs have a nonzero probability of being visited.
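For concreteness, here is a minimal sketch of epsilon-greedy action selection in Python. The function name and the assumption that Q is a dict keyed by (state, action) pairs are illustrative choices, not the exact code from the repo:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Choose an action for `state` epsilon-greedily from action values Q.

    Q is assumed to be a dict mapping (state, action) pairs to estimated returns.
    """
    if random.random() < epsilon:
        # Explore: pick uniformly at random among all actions
        return random.choice(actions)
    # Exploit: pick the action with the highest current estimate
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Note that the exploration branch can also land on the greedy action, so the greedy action is actually taken with probability 1 - ε + ε/|A(s)|, which is how the probabilities appear in the pseudo-code below.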
Below is pseudo-code for the On-policy First-Visit MC Control algorithm.
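Here is a sketch of that algorithm, paraphrasing the standard ε-soft, first-visit formulation (the notation may differ slightly from other presentations):

```
On-policy first-visit MC control (for ε-soft policies), estimating π ≈ π*

Algorithm parameter: small ε > 0
Initialize:
    π ← an arbitrary ε-soft policy
    Q(s, a) ← arbitrary value, for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)

Repeat forever (for each episode):
    Generate an episode following π: S0, A0, R1, ..., S(T-1), A(T-1), R(T)
    G ← 0
    Loop over the steps of the episode, t = T-1, T-2, ..., 0:
        G ← γG + R(t+1)
        Unless the pair (St, At) appears in S0, A0, S1, A1, ..., S(t-1), A(t-1):
            Append G to Returns(St, At)
            Q(St, At) ← average(Returns(St, At))
            A* ← argmax_a Q(St, a)            (ties broken arbitrarily)
            For all a ∈ A(St):
                π(a | St) ← 1 - ε + ε/|A(St)|   if a = A*
                            ε/|A(St)|            if a ≠ A*
```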
- The last few lines simply update the policy to be epsilon-greedy with respect to the current action values, so that actions keep being selected epsilon-greedily
Implementation
Below is a Python implementation of MC control with an epsilon-greedy policy and without exploring starts for the simple 4x4 grid world defined in the previous chapter. Implementations of the helper functions can be found on GitHub.
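Since the helper functions live in the repo, here is a self-contained sketch of what the core loop might look like. The environment interface (`reset`/`step`), the `generate_episode` helper, and the function names are assumptions made for this sketch, not the repo's exact code:

```python
import random
from collections import defaultdict

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Select an action for `state` epsilon-greedily from the action values Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def generate_episode(env, Q, actions, epsilon):
    """Roll out one episode following the current epsilon-greedy policy.

    Assumes `env` exposes reset() -> state and step(action) -> (next_state,
    reward, done); the repo's grid world helpers may look different.
    """
    episode = []
    state = env.reset()
    done = False
    while not done:
        action = epsilon_greedy_action(Q, state, actions, epsilon)
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode

def mc_control_epsilon_greedy(env, actions, num_episodes=100_000,
                              epsilon=0.1, gamma=1.0):
    """On-policy first-visit MC control without exploring starts."""
    Q = defaultdict(float)            # action-value estimates Q(s, a)
    returns_count = defaultdict(int)  # number of returns averaged per (s, a)

    for _ in range(num_episodes):
        episode = generate_episode(env, Q, actions, epsilon)

        # Record the first time step at which each (state, action) pair occurs
        first_visit = {}
        for t, (state, action, _reward) in enumerate(episode):
            first_visit.setdefault((state, action), t)

        # Walk the episode backwards, accumulating the return G
        G = 0.0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            if first_visit[(state, action)] == t:
                # First-visit update: incremental average of observed returns
                returns_count[(state, action)] += 1
                n = returns_count[(state, action)]
                Q[(state, action)] += (G - Q[(state, action)]) / n

    # Extract the greedy policy from the learned action values
    states = {state for (state, _action) in Q}
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return policy, Q
```

One deliberate simplification in this sketch: instead of storing the full Returns(s, a) lists and re-averaging them, it keeps an incremental mean per state-action pair, which is equivalent but uses less memory.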
When run on the 4x4 grid world with an epsilon value of 0.1 for 100k episodes, it generates the following policy. Unlike the exploring-starts version, it learns the optimal policy:
In this article, we learned the difference between on-policy and off-policy methods, and how we can remove the assumption of exploring starts while staying on-policy by using an epsilon-greedy policy. In the next article, we will look at off-policy solutions.