4. Temporal-Difference Learning

Javier Abellán Abenza · Neurosapiens · Jan 5, 2019

This is lesson 4 of the reinforcement learning course.


In Monte Carlo learning, once an episode has concluded, every state visited during that episode updates its value towards the final return.

The problem is that every state updates towards that final return, even if the individual state had little or no impact on how much reward was collected by the end of the episode. To reduce this variance, we can use a different method of prediction.
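As a concrete sketch of the Monte Carlo update, here is a minimal every-visit version in Python. The env object (with reset() and step() returning (next_state, reward, done)) and the policy function are illustrative assumptions, not part of the original lesson:

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=0.99, alpha=0.1):
    """Every-visit Monte Carlo prediction: each visited state is
    updated towards the full return observed at the end of the episode."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        # Generate one complete episode as a list of (state, reward) pairs.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Work backwards, accumulating the return G from the end of the episode.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])  # update towards the final return
    return V
```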

Temporal-difference learning, or TD learning, updates the value of each state based on a prediction of the return rather than the return itself. For example, suppose Monte Carlo learning takes 100 actions and then updates all of the visited states based on the final return. TD learning instead takes a single step and immediately updates the value of the previous state using the observed reward plus the estimated value of the current state. Because TD learning updates from more recent information, it captures more of the effect of each individual state.
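A minimal TD(0) sketch under the same assumed environment interface as above; the only change is that the update happens after every single step, using the bootstrapped target reward + gamma * V[next_state]:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, gamma=0.99, alpha=0.1):
    """TD(0) prediction: after every step, update the previous state's value
    towards reward + gamma * V[next_state], instead of waiting for the
    episode's final return."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The TD target bootstraps from the current estimate of the next state;
            # a terminal state contributes no future value.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```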

TD has lower variance than Monte Carlo, since each update depends on fewer factors. However, Monte Carlo is unbiased, because values are updated directly towards the actual final return, while TD introduces some bias, because values are updated towards a prediction that is itself an estimate.

TD and Monte Carlo are actually opposite ends of a spectrum between one-step lookahead and full lookahead. Any number of steps can be taken before the return is passed back to a state: one step gives the TD learning discussed above, and looking ahead all the way to the end of the episode gives Monte Carlo.
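A rough sketch of what an n-step return could look like, assuming rewards[k] is the reward received after step k and values[k] is the current value estimate for the state at step k (both names are illustrative):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return from time t: sum up to n discounted rewards, then
    bootstrap from the value estimate n steps ahead.
    n = 1 recovers the TD(0) target; n >= episode length recovers Monte Carlo."""
    G = 0.0
    T = len(rewards)                # number of steps in the episode
    steps = min(n, T - t)           # don't look past the end of the episode
    for k in range(steps):
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                   # bootstrap only if the episode hasn't ended
        G += (gamma ** n) * values[t + n]
    return G
```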
