Temporal Difference Learning in Reinforcement Learning
Introduction
Temporal Difference Learning (TD learning) is one of the central ideas in reinforcement learning. It lies between Monte Carlo methods and dynamic programming on the spectrum of reinforcement learning methods: like Monte Carlo methods, it learns directly from raw experience without a model of the environment, and like dynamic programming, it updates estimates based on other learned estimates (it bootstraps).
In this article, we will walk through temporal difference learning in depth and see why it has proved to be one of the most fundamental ideas in reinforcement learning.
We will start with the prediction problem (policy evaluation) and then move on to the control problem (finding an optimal policy).
TD Prediction
As we saw with Monte Carlo methods, prediction refers to the problem of estimating the values of states under a given policy. The value of a state indicates how good it is for the agent to be in that state in the given environment: the higher the value, the better it is to be in that state.
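To make this precise, we can state the standard definition of the state-value function: the value of a state s under a policy is the expected return (the discounted sum of future rewards, with discount factor gamma) when starting in s and following that policy thereafter:

$$ v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right] $$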
Monte Carlo and temporal difference learning are similar in that both use real experience to evaluate a given policy. However, Monte Carlo methods must wait until the end of the episode, when the return following the visit is known, before updating the value of a state. TD methods, in contrast, only need to wait until the next time step: at time t+1 they immediately form a target from the observed reward and the current estimate of the next state's value, and make a useful update.
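To make the contrast concrete, here are the two update rules for estimating the state-value function, written with the usual step-size parameter alpha and discount factor gamma. The Monte Carlo update uses the actual return G_t observed after the episode ends, while the simplest TD method, TD(0), uses the observed reward plus the current estimate of the next state's value:

$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \quad \text{(Monte Carlo)} $$

$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \right] \quad \text{(TD(0))} $$

The bracketed quantity in the TD(0) rule is the TD error. Below is a minimal sketch of tabular TD(0) prediction in Python; the `env` object with a Gym-style `reset()`/`step(action)` interface, the `policy` function, and the hyperparameter values are assumptions made for illustration, not something specified in this article.

```python
import collections

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation (a sketch under assumed interfaces).

    Assumes a Gym-style environment where env.reset() returns a state and
    env.step(action) returns (next_state, reward, done, info), and that
    `policy` maps a state to an action.
    """
    V = collections.defaultdict(float)  # state-value estimates, default 0.0

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD(0) target: observed reward plus discounted estimate of the
            # next state's value; terminal states are treated as having value 0.
            target = reward + (0.0 if done else gamma * V[next_state])
            # Move V(state) a fraction alpha toward the target (the TD error).
            V[state] += alpha * (target - V[state])
            state = next_state

    return V
```

Note that each update uses only the observed reward and the current estimate V[next_state], so the value table can be updated at every step rather than once per episode.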