Learning From Samples: Monte Carlo

Sai Sasank
Published in The Startup
Jan 14, 2021 · 4 min read

The code presented here is meant to help understand the ideas discussed, so I may have removed some implementation details. For the complete code, check my GitHub repository.

Learning from samples. The DP methods discussed earlier leverage a distribution model to calculate the optimal value function and an optimal policy. In this post, we will do away with such models. In many applications, it is easier to obtain samples of agent-environment interactions than a precise model that captures the environment’s dynamics. To this end, we will look at Monte Carlo (MC) methods in particular; here is how an MC agent is typically initialized.

A typical MC agent initialization.
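Here is a minimal sketch of what such an initialization might look like for a tabular agent; the attribute names are illustrative and not necessarily those used in the repository.

```python
from collections import defaultdict


class MCAgent:
    """Minimal sketch of a tabular Monte Carlo agent (attribute names are illustrative)."""

    def __init__(self, actions, epsilon=0.9, gamma=1.0):
        self.actions = actions                 # actions available in the environment
        self.epsilon = epsilon                 # probability of acting greedily (see the epsilon-greedy section)
        self.gamma = gamma                     # discount factor
        self.q_values = defaultdict(float)     # action-value estimates Q(s, a)
        self.visit_counts = defaultdict(int)   # first-visit counts per (s, a), used for averaging returns
```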

Monte Carlo methods. The idea of Monte Carlo methods is to compute complete sample returns for each state (or state-action pair) and average these returns over several episodes. This is in contrast to computing expected returns from a model, and because complete returns are only available at the end of an episode, it requires tasks to be episodic.

Let us now look at the first-visit MC prediction algorithm, which is used to estimate value functions. An episode is generated using the policy to be evaluated. The return for a state (or state-action pair) is obtained by summing the rewards from that point to the end of the episode. We only consider the first occurrence of the state (or state-action pair) in each episode, which is why this is termed the first-visit algorithm. This process is repeated by generating more episodes, and each state’s (or state-action pair’s) value is updated as the average of its sample returns. This ability to learn from samples can be a significant advantage, as the dynamics function can be hard to calculate for many environments.

We will focus on state-action values (action values for short) since they are found to be more useful when a model is lacking. Below is an implementation for first-visit MC prediction.

Implementation of first-visit MC prediction algorithm for an individual episode.
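In place of the screenshot, here is a hedged sketch of a first-visit update over a single episode, continuing the MCAgent sketch above; the episode format (a list of (state, action, reward) tuples) and the function name are my own assumptions.

```python
def first_visit_mc_update(agent, episode):
    """First-visit MC prediction for action values over one episode.

    `episode` is assumed to be a list of (state, action, reward) tuples
    generated by the agent's policy.
    """
    # Time step of the first occurrence of each (state, action) pair.
    first_visit = {}
    for t, (state, action, _) in enumerate(episode):
        first_visit.setdefault((state, action), t)

    # Walk backwards through the episode, accumulating the return G,
    # and update the estimate only at first visits.
    G = 0.0
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        G = agent.gamma * G + reward
        if first_visit[(state, action)] == t:
            agent.visit_counts[(state, action)] += 1
            n = agent.visit_counts[(state, action)]
            # Incremental average of the sample returns observed so far.
            agent.q_values[(state, action)] += (G - agent.q_values[(state, action)]) / n
```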

Maintaining sufficient exploration. In the prediction algorithm, the episodes are generated using the agent’s policy. If we use a deterministic policy, then many state-action pairs may never be visited and action value estimates of such state-action pairs may never improve. This is an exploration problem we need to address.

One way to solve this is to use exploring starts. Each episode starts with a randomly chosen state-action pair (all pairs being equally likely), and the rest of the episode is generated using the policy.
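As a rough illustration, an exploring start could look like the sketch below; the helpers env.all_states, env.reset_to, env.step and agent.select_action are hypothetical and only serve to show the idea.

```python
import random


def episode_with_exploring_start(env, agent):
    """Generate one episode whose first state-action pair is chosen uniformly
    at random; the environment and agent helpers used here are hypothetical."""
    state = random.choice(env.all_states())
    action = random.choice(agent.actions)
    env.reset_to(state)                           # force the episode to start at `state`

    episode, done = [], False
    while not done:
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if not done:
            action = agent.select_action(state)   # the policy takes over after the first step
    return episode
```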

Another way to solve the exploration problem is to use a stochastic policy where every action in every state has some non-zero probability. This idea gives rise to two different classes of methods: on-policy and off-policy methods. On-policy methods improve the same policy that generates the experience, whereas off-policy methods improve a policy different from the one used to generate the experience.

On-policy learning using epsilon-greedy. We will use on-policy learning, so we effectively have one policy that generates episodes, and the same policy is improved using the generated episodes. The policy we use will be epsilon-greedy so that the exploration aspect is taken care of. In the implementation, we need not maintain an explicit policy; an epsilon-greedy policy is defined implicitly by the current action values. Below is the implementation.

The epsilon-greedy policy to select an action using the action values.
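Again in place of the screenshot, below is a minimal sketch of such an implicit epsilon-greedy policy. To stay consistent with the experiments later in the post, epsilon is treated here as the probability of acting greedily (the reverse of the more common convention); the function name is my own.

```python
import random


def epsilon_greedy_action(q_values, state, actions, epsilon):
    """Select an action for `state` from the current action values.

    Consistent with this post, `epsilon` is the probability of acting greedily;
    with probability 1 - epsilon a uniformly random action is taken instead.
    """
    if random.random() < epsilon:
        # Greedy choice, breaking ties randomly among the best-valued actions.
        best_value = max(q_values[(state, a)] for a in actions)
        best_actions = [a for a in actions if q_values[(state, a)] == best_value]
        return random.choice(best_actions)
    return random.choice(actions)
```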

The Maze Environment. For our experiment, we will use a maze as the environment where the agent needs to reach the goal position. Every step has a reward of -0.004 and reaching the goal has a reward of +1. The movement in the maze is deterministic and each episode ends when the goal is reached or when 2000 steps are taken. Below is the initial state of the maze.

Initial state of the environment: the agent starts at the blue cell, while the red cell is the goal.
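A minimal sketch of such an environment is shown below; only the reward scheme (-0.004 per step, +1 at the goal) and the 2000-step cap come from the description above, while the class name, grid representation, and movement details are assumptions.

```python
class MazeEnv:
    """Sketch of a deterministic maze environment (layout details are assumptions)."""

    STEP_REWARD = -0.004
    GOAL_REWARD = 1.0
    MAX_STEPS = 2000

    def __init__(self, grid, start, goal):
        self.grid = grid      # 2D list: 0 for a free cell, 1 for a wall
        self.start = start    # (row, col) of the blue cell
        self.goal = goal      # (row, col) of the red cell
        self.reset()

    def reset(self):
        self.position = self.start
        self.steps = 0
        return self.position

    def step(self, action):
        """`action` is a (row_delta, col_delta) pair; movement is deterministic."""
        row = self.position[0] + action[0]
        col = self.position[1] + action[1]
        if 0 <= row < len(self.grid) and 0 <= col < len(self.grid[0]) and self.grid[row][col] == 0:
            self.position = (row, col)        # blocked moves leave the agent in place
        self.steps += 1
        done = self.position == self.goal or self.steps >= self.MAX_STEPS
        reward = self.GOAL_REWARD if self.position == self.goal else self.STEP_REWARD
        return self.position, reward, done
```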

Experiment. There are two main parameters I would like to discuss: epsilon, which controls the degree of exploration, and the amount of experience.

Epsilon is part of the agent’s policy, which is used both to generate the experience the agent learns from and to act in the environment. Here, epsilon is the probability of acting greedily, so a low epsilon causes high exploration, resulting in diverse experience to learn from. However, even once the action values have converged, the policy remains highly explorative, resulting in sub-optimal returns. On the other hand, a high epsilon causes low exploration and requires more experience to converge; once the action values converge, the policy is only slightly explorative, resulting in near-optimal returns.

To evaluate the effect of epsilon and the amount of experience, I have trained 25 different agents with varying epsilon and experience size: epsilon varies from 0.1 to 0.9, and experience size varies from 10 to 1000. Finally, each agent is evaluated over 200 episodes in the environment, and its total return is calculated.
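A rough sketch of how such a sweep might be organized is shown below; the intermediate grid values and the helpers train_agent and evaluate_return are hypothetical placeholders, and only the endpoints (0.1 to 0.9, 10 to 1000) and the 200 evaluation episodes come from the text.

```python
# Illustrative sweep; only the endpoints and the 200 evaluation episodes come
# from the post, the intermediate values and the helpers are placeholders.
epsilons = [0.1, 0.3, 0.5, 0.7, 0.9]
experience_sizes = [10, 50, 100, 500, 1000]   # number of training episodes (assumed unit)

results = {}
for epsilon in epsilons:
    for num_episodes in experience_sizes:
        agent = train_agent(epsilon, num_episodes)                 # hypothetical training helper
        returns = [evaluate_return(agent) for _ in range(200)]     # hypothetical evaluation helper
        results[(epsilon, num_episodes)] = sum(returns) / len(returns)
```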

Average returns for different MC agents. Each cell shows the average return over 1000 runs.

This result is expected: low-epsilon agents converge faster but achieve lower returns due to high exploration, while high-epsilon agents converge more slowly but achieve higher returns.

Below we see two Monte Carlo agents solving the maze. The one on the left has low epsilon and less experience (0.2, 70). The one on the right has high epsilon and more experience (0.9, 1000).

Comparison of two Monte Carlo agents. One on the left has low epsilon and less experience. One on the right has high epsilon and more experience.

If you are interested in reproducing the results or doing your own experiments, the code is available here.
