Monte Carlo Methods — Learning from experience

The Perceptive Agent · Published in Analytics Vidhya · 5 min read · Apr 25, 2020

We have finally reached the start of the exciting stuff. The previous article gave a very brief look at learning the optimal policy when the agent knows the model of the environment. In practice, however, it is rarely possible to have complete information about the environment’s dynamics.
Monte Carlo (MC) methods provide a way to learn from raw experience.

Working of MC methods

MC methods do not require any knowledge of the environment. They require only experience: a sequence of states, actions and rewards obtained by interacting with the environment. These methods require that the experience is divided into episodes (a concept we already know) and that each episode eventually terminates.
Monte Carlo methods use a simple idea: to estimate the expected return, they average the returns actually observed for each state (or state-action pair) in the experience. It will become clearer in a moment. Keep reading 😊.
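To make this concrete, here is a minimal sketch (my own illustration, not from the article) of how an episode of experience might be stored and how the return for each time step can be computed from it. The (state, action, reward) tuple layout, the helper name compute_returns and the discount factor gamma are assumptions made for this sketch.

def compute_returns(episode, gamma=1.0):
    """episode is a list of (state, action, reward) tuples in the order visited,
    where each reward is the one received after taking that action in that state.
    Returns a list with the return G_t for every time step, computed backwards."""
    returns = []
    G = 0.0
    for _, _, reward in reversed(episode):
        G = reward + gamma * G      # G_t = R_(t+1) + gamma * G_(t+1)
        returns.append(G)
    returns.reverse()               # put the returns back in time order
    return returns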

Monte Carlo Prediction

Monte Carlo prediction is learning the state-value function for a given policy. To reiterate, the state-value function gives the expected return from a state, provided that the same policy is followed from that state onwards.
An obvious way to estimate it is to average the returns observed after visits to that state. These visits come from the experience recorded by interacting with the environment. As more returns are observed, the estimates converge to the expected value.
In particular, suppose we wish to estimate Vπ(s), the value of a state s under policy π, given a set of episodes obtained by following π and passing through s. Each occurrence of state s in an episode is called a visit to s. Of course, s may be visited multiple times in the same episode; let us call the first time it is visited in an episode the first visit to s. The first-visit MC method estimates Vπ(s) as the average of the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s.

The mean return of a state:

Vπ(s) = ( Σ_t 1[St = s] · Gt ) / ( Σ_t 1[St = s] )

where 1[St = s] is a binary indicator function and Gt is the return following time step t. We may count a visit to state s every time it occurs, so that a state can be counted multiple times within one episode (“every-visit”), or count only the first time it is encountered in an episode (“first-visit”). The same approximation applies just as easily to Q-values:

Qπ(s, a) = ( Σ_t 1[St = s, At = a] · Gt ) / ( Σ_t 1[St = s, At = a] )
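Putting the pieces together, here is a sketch of first-visit MC prediction. It is one plausible implementation under stated assumptions, not the article’s own code: it reuses the hypothetical compute_returns helper above and assumes a generate_episode(policy) function that returns a list of (state, action, reward) tuples.

from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)    # sum of returns following first visits to s
    returns_count = defaultdict(int)    # number of first visits to s observed so far
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        returns = compute_returns(episode, gamma)
        visited = set()
        for t, (state, _, _) in enumerate(episode):
            if state in visited:        # first-visit: ignore later occurrences
                continue
            visited.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V

For the every-visit variant, simply drop the visited check and average the returns following every occurrence of the state.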

MC Estimation of Q-values — An important point to remember.

If a model is not available, it is better to estimate Q-values (action values) rather than state values. With a model, one can simply look ahead one step and choose whichever action leads to the best combination of reward and next state; without a model, state values alone are not enough to pick an action, so action values must be estimated directly.
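The contrast can be sketched in a few lines. Both functions below are illustrative only: model(state, action) returning a (reward, next_state) pair and the V/Q dictionaries are assumptions, not part of the article.

def greedy_with_model(state, actions, V, model, gamma=1.0):
    # Requires a model: look one step ahead through each action.
    best_action, best_value = None, float("-inf")
    for a in actions:
        reward, next_state = model(state, a)          # assumed deterministic model
        value = reward + gamma * V[next_state]
        if value > best_value:
            best_action, best_value = a, value
    return best_action

def greedy_with_q(state, actions, Q):
    # Needs no model: the Q-values already summarise the one-step lookahead.
    return max(actions, key=lambda a: Q[(state, a)])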

The problem with acting greedily all the time

While collecting experience, many state-action pairs may never be visited. This is especially common when the state space is very large. A deterministic policy π returns only one action for each state, so the other actions are never tried. Without returns to average, the estimates for those state-action pairs never improve! This is not a good way to learn, because we should be considering all the possible actions, not only the one we currently favour.

Exploring starts: One way around this is to start each episode from a state-action pair, making sure that every state-action pair has a non-zero probability of being selected as the start. Then, theoretically, over the course of infinitely many episodes, every pair will be visited an infinite number of times.
This doesn’t seem practical, and a more practical approach is described later in this article.

Monte-Carlo Control — Learning the optimal policy

To learn the optimal policy, we alternate steps of policy evaluation and policy improvement, beginning with an arbitrary policy and ending with the optimal policy and the optimal action-value function.

A typical Monte Carlo Control process

Here is a general idea of how this takes place:
1. Policy improvement is done by making the policy greedy with respect to the current value function. In this case we have an action-value function, so no model is needed to construct the greedy policy: for each state, simply pick the action with the highest Q-value.

2. Policy evaluation is done by experiencing an episode with the new policy. The returns observed in this episode are then averaged into the Q-value estimates, exactly as in MC prediction above.

The above two steps alternate in pursuit of an optimal policy; a condensed sketch of the loop follows.
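The sketch below assumes exploring starts and the hypothetical helpers used earlier (compute_returns, plus a generate_episode_from(state, action, policy) function that starts an episode from a given state-action pair and then follows the policy). It is one plausible first-visit implementation, not the only one.

from collections import defaultdict
import random

def mc_control_exploring_starts(states, actions, generate_episode_from,
                                num_episodes, gamma=1.0):
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}      # arbitrary initial policy

    for _ in range(num_episodes):
        # Exploring start: a random state-action pair begins the episode.
        start_state, start_action = random.choice(states), random.choice(actions)
        episode = generate_episode_from(start_state, start_action, policy)
        returns = compute_returns(episode, gamma)

        visited = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in visited:                              # first-visit version
                continue
            visited.add((s, a))
            # Policy evaluation: average the newly observed return into Q(s, a).
            returns_sum[(s, a)] += returns[t]
            returns_count[(s, a)] += 1
            Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
            # Policy improvement: make the policy greedy w.r.t. the updated Q.
            policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q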

Monte Carlo Control without exploring starts

Exploring starts assumes that episodes can begin from any state-action pair, and guaranteeing that every pair is visited this way requires an unrealistic number of episodes. This is certainly not practical. To circumvent this, ϵ-greedy policies are used.

In ϵ-greedy policies, the action with the highest Q-value is chosen most of the time, but with probability ϵ an action is selected at random. In other words, every time the agent has to choose an action, there is a probability ϵ that a random action will be taken and a probability 1 − ϵ that the highest-valued action is taken.

ϵ-greedy is a popular way of selecting actions in many RL algorithms. And it is really simple too.
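A minimal sketch of ϵ-greedy action selection, assuming the same (state, action)-keyed Q dictionary as in the earlier sketches:

import random

def epsilon_greedy(state, actions, Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                  # explore: random action
    return max(actions, key=lambda a: Q[(state, a)])   # exploit: highest-valued action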

On-Policy and Off-Policy methods

I will keep this short:
1. On-policy methods attempt to evaluate or improve the same policy that is used to make decisions. For example, if a policy π is used to generate experience, then that same policy is improved, and the improved policy generates the next episodes. A single policy both generates experience and gets improved.
2. Off-policy: All learning control methods face a dilemma: they seek to learn an optimal policy but have to behave non-optimally in order to explore. The on-policy approach learns action values not for the optimal policy but for a near-optimal policy that still explores. A more direct approach is to use two policies. One is learned about and is aimed at becoming optimal; this is called the target policy. The other is exploratory and is used to generate behaviour; this is the behaviour policy. An advantage of this separation is that the target policy may be deterministic (e.g. greedy), while the behaviour policy can continue to sample all possible actions (see the sketch below).
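The separation is easy to picture in code. The sketch below is illustrative only: the function names are mine, and it simply reuses the hypothetical epsilon_greedy helper from above.

def behaviour_policy(state, actions, Q, epsilon=0.1):
    # Generates experience; keeps sampling all possible actions.
    return epsilon_greedy(state, actions, Q, epsilon)

def target_policy(state, actions, Q):
    # The policy being learned; deterministic and greedy.
    return max(actions, key=lambda a: Q[(state, a)])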

In the next article, I will be writing about Temporal Difference methods and a few RL algorithms. The excitement is just beginning to surge, and there is a lot more ahead.
