Reinforcement Learning Chapter 5 — Monte Carlo Methods (Part 3: MC without Exploring Starts)
Chapter 5 Series:
- Part 1 — Monte Carlo Prediction
- Part 2 — Monte Carlo Control
- Part 3 — MC without Exploring Starts
- Part 4 — Off-policy via Importance Sampling
Code: https://github.com/nums11/rl
In the previous article, we learned about an MC approach to control that requires the assumption of exploring starts in order to maintain exploration. In this article, we'll learn about a different solution to that problem, one that allows us to remove this assumption.
On-Policy vs. Off-Policy Methods
Before we look into how we can remove the assumption of exploring starts we must first understand the difference between on-policy and off-policy methods.
On-policy methods
- Evaluate and improve the same policy that is used to generate the data
- Generally simpler
Off-policy methods
- Evaluate and improve a policy different from the one used to generate data
- Require additional notation
- Greater variance and slower convergence
- More powerful and more general
Epsilon Greedy
We’ll dive more into off-policy methods later, but for now, understand that the previous algorithm we defined for control — MC Control with Exploring Starts — is an on-policy method.
We can remove the assumption of exploring starts and stay on policy by changing our policy to an epsilon-greedy one.
- Recall that an epsilon-greedy policy is one that exploits (chooses the greedy action) with probability 1 - ε and explores (chooses an action uniformly at random) with probability ε (see the sketch after this list).
- This solves the problem of maintaining exploration since it ensures that all state-action pairs have a nonzero probability of being visited.
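For concreteness, here is a minimal sketch of epsilon-greedy action selection in Python. The function name and the assumption that Q is a dict keyed by (state, action) pairs are illustrative choices, not the exact code from the repo:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Choose an action for `state` epsilon-greedily from action values Q.

    Q is assumed to be a dict mapping (state, action) pairs to estimated returns.
    """
    if random.random() < epsilon:
        # Explore: pick uniformly at random among all actions
        return random.choice(actions)
    # Exploit: pick the action with the highest current estimate
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Note that the exploration branch can also land on the greedy action, so the greedy action is actually taken with probability 1 - ε + ε/|A(s)|, which is how the probabilities appear in the pseudo-code below.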
Below is pseudo-code for the On-policy First-Visit MC Control algorithm.
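Here is a sketch of that algorithm, paraphrasing the standard ε-soft, first-visit formulation (the notation may differ slightly from other presentations):

```
On-policy first-visit MC control (for ε-soft policies), estimating π ≈ π*

Algorithm parameter: small ε > 0
Initialize:
    π ← an arbitrary ε-soft policy
    Q(s, a) ← arbitrary value, for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)

Repeat forever (for each episode):
    Generate an episode following π: S0, A0, R1, ..., S(T-1), A(T-1), R(T)
    G ← 0
    Loop over the steps of the episode, t = T-1, T-2, ..., 0:
        G ← γG + R(t+1)
        Unless the pair (St, At) appears in S0, A0, S1, A1, ..., S(t-1), A(t-1):
            Append G to Returns(St, At)
            Q(St, At) ← average(Returns(St, At))
            A* ← argmax_a Q(St, a)            (ties broken arbitrarily)
            For all a ∈ A(St):
                π(a | St) ← 1 - ε + ε/|A(St)|   if a = A*
                            ε/|A(St)|            if a ≠ A*
```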
- The last few lines simply update the policy to be epsilon-greedy with respect to the current action values, so that actions keep being selected epsilon-greedily
Implementation
Below is a Python implementation of MC control with an epsilon-greedy policy and without exploring starts for the simple 4x4 grid world defined in the previous chapter. Implementations of the helper functions can be found on GitHub.
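Since the helper functions live in the repo, here is a self-contained sketch of what the core loop might look like. The environment interface (`reset`/`step`), the `generate_episode` helper, and the function names are assumptions made for this sketch, not the repo's exact code:

```python
import random
from collections import defaultdict

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Select an action for `state` epsilon-greedily from the action values Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def generate_episode(env, Q, actions, epsilon):
    """Roll out one episode following the current epsilon-greedy policy.

    Assumes `env` exposes reset() -> state and step(action) -> (next_state,
    reward, done); the repo's grid world helpers may look different.
    """
    episode = []
    state = env.reset()
    done = False
    while not done:
        action = epsilon_greedy_action(Q, state, actions, epsilon)
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode

def mc_control_epsilon_greedy(env, actions, num_episodes=100_000,
                              epsilon=0.1, gamma=1.0):
    """On-policy first-visit MC control without exploring starts."""
    Q = defaultdict(float)            # action-value estimates Q(s, a)
    returns_count = defaultdict(int)  # number of returns averaged per (s, a)

    for _ in range(num_episodes):
        episode = generate_episode(env, Q, actions, epsilon)

        # Record the first time step at which each (state, action) pair occurs
        first_visit = {}
        for t, (state, action, _reward) in enumerate(episode):
            first_visit.setdefault((state, action), t)

        # Walk the episode backwards, accumulating the return G
        G = 0.0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            if first_visit[(state, action)] == t:
                # First-visit update: incremental average of observed returns
                returns_count[(state, action)] += 1
                n = returns_count[(state, action)]
                Q[(state, action)] += (G - Q[(state, action)]) / n

    # Extract the greedy policy from the learned action values
    states = {state for (state, _action) in Q}
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return policy, Q
```

One deliberate simplification in this sketch: instead of storing the full Returns(s, a) lists and re-averaging them, it keeps an incremental mean per state-action pair, which is equivalent but uses less memory.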
When run on the 4x4 grid world with an epsilon value of 0.1 for 100k episodes, it generates the following policy. Unlike the exploring-starts version, it learns the optimal policy:
In this article, we learned the difference between on-policy and off-policy methods, and how we can remove the assumption of exploring starts while staying on-policy by using an epsilon-greedy policy. In the next article, we will look at off-policy solutions.