Monte Carlo for Reinforcement Learning with example

Mehul Gupta
Data Science in your pocket
4 min read · Mar 4, 2020


The first thing that comes to our mind when we hear MONTE CARLO is

Clothing!!

But we won’t be discussing how good the Monte Carlo brand is, its price range or quality, etc.


Things are a bit more serious than expected !!

If we aren’t talking about MONTE CARLO the brand, then which Monte Carlo are we talking about here?

My last couple of posts covered the basics of RL along with formulating a Reinforcement Learning problem. This time, we will be discussing

MONTE CARLO method for Reinforcement Learning

What is meant by Monte Carlo?

The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.

Monte Carlo methods require only experience — sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. Learning from actual experience is striking because it requires no prior knowledge of the environment’s dynamics, yet one can still attain optimal behavior. I will be simplifying stuff below!!

What is this Monte thing used for in RL?

It is a method for estimating the action-value function (Value | State, Action) or the value function (Value | State) using sample episodes drawn from the environment whose value function we are estimating.

For a better understanding, do refer here.

We will be discussing Monte Carlo for episodic RL problems (ones with terminal states) and not for continuing (no terminal state) problems.

Do you remember how we solved Multi-Armed Bandit problems?

Just revisit the above-mentioned post & you will have an idea.

What we did was follow the e-greedy policy and keep averaging the reward for the chosen bandit every time we picked one. We assumed that averaging the rewards for each bandit helps us estimate its value accurately, as the running average saturates after some iterations.
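
As a quick refresher, here is a minimal sketch of that reward-averaging idea, assuming a toy 3-armed bandit with made-up true mean rewards and e = 0.1 (all numbers are purely illustrative):

```python
import random

# Toy 3-armed bandit recap: e-greedy action selection + running reward averages.
# The true mean rewards below are made up purely for illustration.
true_means = [1.0, 2.0, 1.5]
epsilon = 0.1                      # exploration rate of the e-greedy policy
counts = [0, 0, 0]                 # how many times each bandit was pulled
value_estimates = [0.0, 0.0, 0.0]  # running average reward per bandit

for step in range(10_000):
    # e-greedy: explore with probability epsilon, otherwise exploit the best estimate
    if random.random() < epsilon:
        arm = random.randrange(3)
    else:
        arm = max(range(3), key=lambda a: value_estimates[a])

    reward = random.gauss(true_means[arm], 1.0)  # noisy reward from the chosen bandit

    # Incremental averaging: the estimate saturates towards the bandit's true mean
    counts[arm] += 1
    value_estimates[arm] += (reward - value_estimates[arm]) / counts[arm]

print(value_estimates)  # roughly [1.0, 2.0, 1.5] after enough pulls
```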

Can averaging the rewards received per state be a solution for multi-state problems as well?

Yes!!! This is what lies at the heart of the Monte Carlo method.

In Monte Carlo, we are given some example episodes, like the two below:

  • Let us consider a situation where we have a system of 3 states: A, B & terminate.
  • We are given two example episodes (we can generate them using random walks in any environment):
    Episode 1: A+3 → A+2 → B-4 → A+4 → B-3 → terminate
    Episode 2: B-2 → A+3 → B-3 → terminate
  • A+3 → A+2 means a transition from state A → A with reward = 3 for this transition.

Now, we know that averaging rewards can get us the value function for multi-state RL problems as well.

But things aren't this easy, as the value function depends not just on the immediate reward but on all the future rewards of the episode (the return) as well.
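
Before any averaging can happen, each visit to a state needs a return: the sum of all rewards from that step until the episode terminates. A minimal sketch of computing these returns, assuming no discounting (gamma = 1) and representing Episode 1 above as (state, reward) pairs:

```python
# Episode 1 from above, written as (state, reward-received-on-that-step) pairs.
episode_1 = [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)]

def returns_per_step(episode, gamma=1):
    """Return G_t (sum of rewards from step t to the end) for every step t."""
    g = 0
    returns = []
    for _, reward in reversed(episode):   # accumulate backwards from the terminal state
        g = reward + gamma * g
        returns.append(g)
    return list(reversed(returns))        # re-align with the forward order of the episode

print(returns_per_step(episode_1))  # [2, -1, -3, 1, -3], e.g. 3+2-4+4-3 = 2 from the first step
```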

Hence we have 2 types of Monte Carlo learning, differing in how these returns are averaged:

  • First Visit Monte Carlo: estimates (Value | State: S1) as the average of the returns following the first visit to state S1 in each episode.
  • Every Visit Monte Carlo: estimates (Value | State: S1) as the average of the returns following every visit to state S1.

CONFUSED !!!!

Examples make life easy.

Let's turn back to the environment mentioned above.

We will be calculating V(A) & V(B) using the above-mentioned Monte Carlo methods.

First Visit Monte Carlo:

Calculating V(A)

As we have been given 2 different episodes, we will be summing all the rewards coming after the first visit to 'A' in each episode (including the reward of 'A' itself).

Therefore, we can't have more than one summation term per episode for a state.

Hence,

  • For 1st episode = 3 + 2 - 4 + 4 - 3 = 2
  • For 2nd episode = 3 - 3 = 0

As we have got two terms, we will be averaging these two values, i.e.

V(A) = (2 + 0) / 2 = 1

Note: If an episode doesn't have an occurrence of 'A', it won't contribute to the average for V(A).

Hence, if a 3rd episode like B-3 → B-3 → terminate existed, V(A) using First Visit would still have been 1.

Calculating V(B)

Drawing reference from the above example:

  • 1st episode = -4 + 4 - 3 = -3
  • 2nd episode = -2 + 3 - 3 = -2

Averaging, V(B) = (-3 + (-2)) / 2 = -2.5

Every Visit Monte Carlo:

Calculating V(A)

Here, we create a new summation term for every occurrence of 'A', adding all rewards coming after that occurrence (including the reward of 'A' itself).

  • From 1st episode = (3 + 2 - 4 + 4 - 3) + (2 - 4 + 4 - 3) + (4 - 3) = 2 + (-1) + 1
  • From 2nd episode = (3 - 3) = 0

As we have got 4 summation terms, we will be averaging using N = 4, i.e.

V(A) = (2 + (-1) + 1 + 0) / 4 = 0.5

Calculating V(B)

  • From 1st episode = (-4 + 4 - 3) + (-3) = -3 + (-3)
  • From 2nd episode = (-2 + 3 - 3) + (-3) = -2 + (-3)

As we have 4 summation terms, averaging using N=4,

V(B) = (-3 + (-3) + (-2) + (-3)) / 4 = -2.75
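
To tie everything together, here is a minimal sketch that runs both First Visit and Every Visit Monte Carlo on the two example episodes and reproduces the values computed by hand above (again assuming gamma = 1 and the (state, reward) representation):

```python
from collections import defaultdict

# The two example episodes, written as (state, reward) pairs in the A+3 -> A+2 notation.
episodes = [
    [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)],  # Episode 1
    [("B", -2), ("A", 3), ("B", -3)],                      # Episode 2
]

def mc_prediction(episodes, first_visit=True):
    """Average the returns following (first or every) visits to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        # Return from step t = sum of rewards from step t to the end (gamma = 1)
        g, step_returns = 0, []
        for _, reward in reversed(episode):
            g += reward
            step_returns.append(g)
        step_returns.reverse()

        seen = set()
        for (state, _), g_t in zip(episode, step_returns):
            if first_visit and state in seen:
                continue                  # first visit: count only the first occurrence per episode
            seen.add(state)
            returns[state].append(g_t)

    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(mc_prediction(episodes, first_visit=True))   # {'A': 1.0, 'B': -2.5}
print(mc_prediction(episodes, first_visit=False))  # {'A': 0.5, 'B': -2.75}
```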

Hanging up my boots!!
