SARSA & Q Learning in Temporal Difference for Reinforcement Learning with example

Mehul Gupta
Data Science in your pocket
4 min read · Mar 12, 2020


In continuation of my previous posts, this time I will be focusing on Temporal Difference learning & its different types (SARSA & Q-Learning).

But what is this Temporal Difference Learning?

As per the Reinforcement Learning bible (Sutton & Barto):

TD learning is a combination of Monte Carlo and Dynamic Programming.

The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning.

What does this mean though?

Do go through Formulating RL problems & Monte Carlo for RL before moving on!!

Recollecting from my previous post, below is the problem we solved last time using Monte Carlo

One issue we can easily observe is that Monte Carlo always needs a termination state!!

If such is the case, what will happen to continuing RL problems that don’t have a termination state?!

Also, why should we wait till the end of the episode to update the Value-Action-Function for all states? Can it be done before the episode ends? Waiting can be painful when we have 1000s of states.

Here comes Temporal Difference Learning which

  • Doesn’t require a model of the environment (like Monte Carlo)
  • Updates estimates based in part on other learned estimates, without waiting for a final outcome (it bootstraps, like DP).

Hence,

Temporal Difference= Monte Carlo + Dynamic Programming.

In Temporal Difference learning, we also decide how many future steps we want to reference when updating the current Value-Action-Function.

What does this mean?

It means we can update our present Value-Action-Function using as many future rewards as we want. It can be just one future reward from the immediate next state, i.e. TD(0) (one-step TD), or 5 future rewards from the next 5 future states, i.e. 5-step TD. The choice is entirely ours, though I will be using TD(0) in the examples below.
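As a toy illustration (all numbers, state names, and value estimates below are made up purely for this sketch), here is how the TD(0) target differs from a 5-step target:

```python
# Toy comparison of update targets; every number here is hypothetical.
gamma = 0.9                                   # discount factor
V = {"s1": 0.5, "s5": 0.2}                    # hypothetical value estimates

# TD(0): one real reward + the bootstrapped value of the immediate next state
td0_target = 1.0 + gamma * V["s1"]

# 5-step TD: five real rewards + the bootstrapped value of the state 5 steps ahead
rewards = [1.0, 0.0, 0.0, 1.0, 0.0]           # hypothetical R(t+1) ... R(t+5)
five_step_target = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**5 * V["s5"]
```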

Let’s go through the algorithm for Temporal Difference Learning:

Going step by step:

  • Input π i.e. the policy (can be e-greedy, greedy, etc.)
  • Initialize the Value-Action-Function for every state (s belonging to S) in the environment
  • for e → E (episodes/epochs we want to train):

___1. Take the initial state of the system

___2. For each step in the episode

_______A. Choose an action according to the policy π.

_______B. Update the Value-Action-Function for the current state using the equation below and move to the next state S’.

It’s time to demystify the ghostly update equation now:

V(S) ← V(S) + α[R + γV(S′) − V(S)]

Here,

V(S)/V(S,A) = Value/Value-Action-Function for the current state (and action)

α = Step size (a constant learning rate)

R = Reward for the present action

γ = Discount Factor

V(S′)/V(S′,A′) = Value/Value-Action-Function for the next state S′ reached when action A is taken in state S
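To make the loop above concrete, here is a minimal TD(0) prediction sketch in Python. The `env` object (with `reset()` and `step(action)` returning the next state, reward, and a done flag) and the `policy(state)` function are hypothetical placeholders, not any specific library’s API:

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                        # Value-Function, initialised to 0
    for _ in range(episodes):
        state = env.reset()                       # 1. initial state of the system
        done = False
        while not done:                           # 2. for each step in the episode
            action = policy(state)                # A. choose action from policy pi
            next_state, reward, done = env.step(action)
            # B. TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state                    # move to the next state S'
    return V
```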

We have discussed the equation for the Value-Action-Function update.

But which Value-Action-Function should we choose for the next state (S’)?

Depending on how the Value-Action-Function for the next state is chosen, we have 2 different types of Temporal Difference Learning:

SARSA (State–action–reward–state–action):

It is an on-policy Temporal Difference Learning method where we follow the same policy π for choosing the action to be taken in both the present & future states.

Recollecting the maze game,

Suppose we are on block (900,900). If, by following the e-greedy policy, we choose to move Left to (700,900) and get Reward = R, then to update the Value-Action-Function for (900,900), we consider the Value-Action-Function of (700,900) with the action again chosen using the e-greedy policy (hence the policy remains the same when deciding values for the future & present). Say the action chosen for (700,900) using e-greedy is Up, then

V((900,900), Left) ← V((900,900), Left) + α[R + γV((700,900), Up) − V((900,900), Left)]

Do remember that we can choose any policy, but the same policy has to be used to select the Value-Action-Function for the present state & all future states as well. Hence it is called ‘On-Policy’ Temporal Difference, as at no point do we swap the chosen policy for a different one while learning.
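Here is a minimal SARSA sketch under the same hypothetical `env` interface as before, with a small e-greedy helper added purely for illustration. Note how the next action is drawn from the very same e-greedy policy that will actually be executed, which is what makes it on-policy:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                                # Value-Action-Function Q(S, A)
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, eps)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            # the NEXT action is also chosen by e-greedy -> on-policy
            next_action = epsilon_greedy(Q, next_state, actions, eps)
            Q[(state, action)] += alpha * (
                reward + gamma * Q[(next_state, next_action)] - Q[(state, action)]
            )
            state, action = next_state, next_action
    return Q
```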

Q Learning

It is an off-policy Temporal Difference Learning method where we follow different policies for choosing the actions to be taken in the present & future states.

Consider we have 2 policies: P1 = e-greedy (for the present state), P2 = greedy (for the future state, choose the action with the highest value).

Suppose we are on block (900,900). If, by following the e-greedy policy, we choose to move Left to (700,900) and get Reward = R, then to update the Value-Action-Function for (900,900), we consider the Value-Action-Function of (700,900) with the action chosen using the greedy policy, i.e. the action for which the value function is maximum. Hence, if the Value-Action-Function is highest for moving Right back to (900,900), then

V((900,900), Left) ← V((900,900), Left) + α[R + γV((700,900), Right) − V((900,900), Left)]
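And a minimal Q-Learning sketch under the same hypothetical assumptions. The behaviour policy (P1, e-greedy) picks the action we actually execute, while the update target uses the greedy policy (P2), i.e. the max over Q(S′, a):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                     # behaviour policy explores
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                                # Value-Action-Function Q(S, A)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # behaviour policy P1: e-greedy (used to actually act)
            action = epsilon_greedy(Q, state, actions, eps)
            next_state, reward, done = env.step(action)
            # target policy P2: greedy -> max over Q(S', a)  (off-policy)
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (
                reward + gamma * best_next - Q[(state, action)]
            )
            state = next_state
    return Q
```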

& finally, dropping off for now!!
