In the first part on Temporal Difference (TD) Learning, we investigated the prediction problem for TD learning, as well as the TD error, the advantages of TD prediction compared to Monte Carlo methods, and the optimality of TD(0).

With that in mind, we now want to solve the control problem with TD learning methods. To that end, I will introduce three algorithms:

  • on-policy SARSA
  • off-policy Q-learning
  • Expected SARSA

As in the previous articles, the introduced algorithms are implemented for a Gridworld example and can be viewed on GitHub.

So let's start…

TD Control: SARSA

As with Monte Carlo (MC) control methods, for TD control we differentiate between on-policy and off-policy methods. Moreover, as with MC methods, a tradeoff between exploration and exploitation has to be made.
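
A common way to make this tradeoff is an ε-greedy behaviour policy: with probability ε we pick a random action, otherwise the currently greedy one. Here is a minimal sketch in Python; the function name, the table layout, and the Gridworld dimensions are my own illustrative assumptions and are not taken from the article's GitHub code:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: uniformly random action
    return int(np.argmax(Q[state]))           # exploit: current greedy action

# Tiny usage example with a 5x5 Gridworld-sized table (25 states, 4 actions)
Q = np.zeros((25, 4))
action = epsilon_greedy(Q, state=0, epsilon=0.1)
```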

The first TD control method we look at is the on-policy SARSA algorithm. For the control problem, SARSA learns the action-value function instead of the state-value function used in the TD prediction problem.

To refresh what the difference between these functions is, I refer to an older article…
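
In short, the state-value function V(s) estimates the expected return from a state under a policy, whereas the action-value function Q(s, a) estimates the expected return for taking action a in state s and following the policy afterwards. SARSA's update is driven by the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}), which also gives the algorithm its name. Below is a minimal sketch of a single tabular SARSA update step in Python; all variable names and parameter values are my own illustrative assumptions, not taken from the article's GitHub code:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy SARSA step: move Q(s, a) toward the TD target r + gamma * Q(s', a')."""
    td_target = r + gamma * Q[s_next, a_next]   # bootstraps on the action actually chosen next
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

# Tiny usage example with a 5x5 Gridworld-sized table (25 states, 4 actions)
Q = np.zeros((25, 4))
Q = sarsa_update(Q, s=0, a=1, r=-1.0, s_next=5, a_next=2)
```

Because the target uses the next action A_{t+1} actually selected by the behaviour policy (e.g. ε-greedy), SARSA evaluates and improves the same policy it follows, which is what makes it on-policy.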
