Key takeaways 5 (Eligibility Traces) — Sutton and Barto RL textbook

What you should know about the credit assignment problem, the forward-view λ-return, and the backward-view TD(λ)

James Koh, PhD
MITB For All
8 min read · Sep 27, 2023



When I was learning this for the first time years ago, I was somewhat apprehensive. In my mind, I was wondering why there were so many different notations, and why we were still on Temporal Difference.

“Haven’t we already learnt DP, MC and TD? Aren’t we already equipped to perform online learning, with our choice of on- or off-policy action selection, with or without knowledge of the transitions? Why is there a never-ending stream of extensions, and if the earlier methods are incomplete, why was so much time spent on them?” the younger me asked back then.

Believe me when I say I understand how you feel.

Now that I have to teach this myself, I understand why it is necessary to start with DP, MC and TD(0). I also see why it is important to learn about eligibility traces.

My objective here is to convince you of the relevance, and to help you appreciate these contents with as little resistance as possible.

n-step TD

I have already described this in one of my favorite articles, titled ‘A Cornerstone of RL — TD(λ) and 3 Big Names’. Am I just repeating the same stuff?

Of course not! (And therefore, I will not paste any of the equations or code here.)

That post published in Towards Data Science was meant to be separate from the textbook. It gives the motivation for moving from the vanilla TD(0) to n-step TD, and then to TD(λ). It includes a mathematical derivation showing how λ=0 or λ=1 gives us the special cases we are familiar with, as well as the code for training and visualization.

I did not talk about the details of λ-returns vs TD(λ), in order to cater to practitioners rather than students. However, the concepts here should not be overlooked.

And so, here you go!

TD(λ)

Forward View (λ-return)

Note that I refer to the λ-return as Lₜ here, to be consistent with the in-progress version of the textbook, as explained at the beginning of this series.

The difference between MC, TD(0), n-step TD, and the λ-return is simply the ‘target’ towards which we update our value function.

Equation by author.
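
In place of the original equation image, these targets can be written out as follows (a transcription of the standard definitions in the notation of this series, with G_t^{(n)} denoting the n-step return):

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})

L_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

TD(0) updates towards G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}), MC updates towards the full return G_t, n-step TD updates towards G_t^{(n)}, and the λ-return L_t is the (1-λ)-weighted mixture of all of them.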

Let us consider a specific segment of an episode as follows, with S₁₀₄ being a terminal state.

Example scenario. Image by author.

In this case, the λ-return at S₁₀₀ would be as follows, where γ is the discount factor as per the usual notations.

Updates via λ-return. Equation by author.
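
In place of the original equation image, the λ-return at S₁₀₀ (with the episode terminating at S₁₀₄) can be written as:

L_{100} = (1-\lambda)\left[ G_{100}^{(1)} + \lambda G_{100}^{(2)} + \lambda^{2} G_{100}^{(3)} \right] + \lambda^{3} G_{100}

where G_{100} = R_{101} + \gamma R_{102} + \gamma^{2} R_{103} + \gamma^{3} R_{104} is the full return.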

Take note of the last coefficient above; it is not a typo. As stated by Sutton and Barto in their textbook, all subsequent n-step returns are equal to Gₜ once the terminal state is reached. The coefficient is therefore the sum of an infinite geometric series with common ratio λ, which works out to λ³ here.

It is only once we have completed the full episode that we are able to compute the λ-return at S₁₀₀. Meanwhile, the λ-return at S₁₀₁ would be a weighted average of three n-step returns, while that at S₁₀₂ involves two n-step returns, and so on.

Backward View — TD(λ)

In order to perform online updates, we use eligibility traces so that the value of each state can be updated as soon as it has been visited, without waiting for the episode to terminate.

The eligibility trace is essentially how much credit (or blame) will be assigned to previously visited states. If the eligibility trace is small, it means that the state was visited quite some time ago, and hence has a weaker association with the reward currently received.

Updates for eligibility trace. Equation by author.
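
In place of the original equation image, the standard accumulating trace used in the textbook can be written as:

E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}[S_t = s]

i.e. every state’s trace decays by a factor of γλ at each step, and the trace of the state just visited is incremented by 1.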

Going back to the earlier example, this approach results in S₁₀₀ being updated a total of four times (the episode terminates at S₁₀₄). There are two things you should pay attention to:

  1. Each time, the ‘error’ is with respect to a different reference, though in all cases it is the value of S₁₀₀ that is being updated. In the first update, the error is between R₁₀₁ + γV(S₁₀₁), i.e. the actual reward observed when transitioning from S₁₀₀ to S₁₀₁ plus the discounted predicted value at S₁₀₁, and V(S₁₀₀), i.e. our predicted value at S₁₀₀. In the second update, it is between R₁₀₂ + γV(S₁₀₂) and V(S₁₀₁). Each time, the error reflects the most recently observed information.
  2. The weights decay by a factor of γλ on each successive step. This can be explained intuitively: not only do the weights associated with each n-step return decay cumulatively by a factor of λ, the rewards after n steps (and any errors associated with them) are also discounted by a factor of γⁿ.
Updates via TD(λ). Equation by author.
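
In place of the original equation image, the backward-view update at each time step t can be written as:

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

V(s) \leftarrow V(s) + \alpha \delta_t E_t(s) \quad \text{for all } s

so that V(S₁₀₀) receives the increments α δ₁₀₀, α(γλ) δ₁₀₁, α(γλ)² δ₁₀₂ and α(γλ)³ δ₁₀₃ over the four remaining steps.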

Note that while not shown above, the TD(λ) method would result in S₁₀₁ being updated thrice, S₁₀₂ updated twice, and S₁₀₃ updated once.

Comparison of Forward and Backward views

The λ-return and TD(λ) work by the same principles; they both take a weighted average of n-step TD returns, where the weights decrease as n increases.

Things will become clear once you run the following and see for yourself. You can change the values of R_101 to R_104, the initial estimates V_100 to V_103 (but V_104 must be zero), and of course the hyperparameters.

R_101 = 1
R_102 = 3
R_103 = -2
R_104 = 50

# Just some initialized values
V_104 = 0
V_103 = R_104
V_102 = R_103+R_104
V_101 = R_102+R_103+R_104
V_100 = R_101+R_102+R_103+R_104

y = 0.9 # gamma; discount
alpha = 0.1 # learning rate
lambd = 0.5 # lambda in TD(lambda)

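# 1-step TD (i.e. TD(0)) target: one reward plus the discounted estimate at the next state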
onestep = R_101 + y*V_101
print("1-step TD: %.2f" % onestep)
print("After update towards %.2f, new V(s) is %.2f" % (onestep, V_100 + alpha*(onestep-V_100)) )

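# Monte Carlo target: the full discounted return of the episode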
MC = R_101 + y*R_102 + (y**2)*R_103 + (y**3)*R_104
print("Monte Carlo: %.2f" % MC)
print("After update towards %.2f, new V(s) is %.2f" % (MC, V_100 + alpha*(MC-V_100)) )

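# Forward view: the lambda-return at S_100. The coefficient of the last
# (terminating) n-step return is lambd**3, since the weights
# (1-lambd)*lambd**(n-1) for n >= 4 sum to lambd**3 (see the text above).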
L100 = (1-lambd) *(R_101 + y*V_101) \
+ (1-lambd)*lambd *(R_101 + y*R_102 + (y**2)*V_102) \
+ (1-lambd)*lambd**2 *(R_101 + y*R_102 + (y**2)*R_103 + (y**3)*V_103) \
+ lambd**3 *(R_101 + y*R_102 + (y**2)*R_103 + (y**3)*R_104)
print("𝜆-return L100: %.2f" % L100)
print("After update, new V(s) is %.2f" % (V_100 + alpha*(L100-V_100)) )

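# Backward view: apply the (gamma*lambd)-discounted TD errors one at a time.
# The TD errors below use the original estimates V_101 to V_103, which is why
# the final result matches the lambda-return update above exactly.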
# V0, V1, V2, V3, and V4 are all referring to V_{100}, at different iterations
V0 = V_100
V1 = V0 + alpha * ( 1 * (R_101 + y*V_101 - V_100) )
V2 = V1 + alpha * ( (lambd*y) * (R_102 + y*V_102 - V_101) )
V3 = V2 + alpha * ( (lambd*y)**2 * (R_103 + y*V_103 - V_102) )
V4 = V3 + alpha * ( (lambd*y)**3 * (R_104 + y*V_104 - V_103) )

print("After update with eligibility traces, new V(s) is %.2f" % V4)

Take note that there is no free lunch when it comes to practical implementation. Even though using a weighted average sounds like the best bet, it comes at the expense of increased computation. For the same compute budget, we could have performed a larger number of iterations with TD(0) or MC.

Sarsa(λ)

This is effectively just a nice name for what you have just learnt in the preceding section. Earlier, we were dealing with the state-value function V(s). You can apply the exact same concept to Q(s,a), and learn the action-value function.

The difference is that you store an eligibility trace for each unique state-action pair, that is, you will have E(s,a) instead of E(s).
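
To make this concrete, here is a minimal sketch of tabular Sarsa(λ) with accumulating traces. It is written against a hypothetical Gym-style environment whose reset() returns a state index and whose step(action) returns (next_state, reward, done); this interface and the epsilon-greedy helper are illustrative assumptions, not code from the textbook or my earlier article.

import numpy as np

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.9, lambd=0.5, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating eligibility traces.

    Assumes a hypothetical Gym-style `env` with reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        E = np.zeros_like(Q)              # one trace per (state, action) pair
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)
            # TD error; the value of the terminal state is treated as zero
            target = r if done else r + gamma * Q[s_next, a_next]
            delta = target - Q[s, a]
            E[s, a] += 1.0                # accumulate the trace of the visited pair
            Q += alpha * delta * E        # credit all recently visited pairs
            E *= gamma * lambd            # decay every trace by gamma*lambda
            s, a = s_next, a_next
    return Q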

Others

In Section 7.6, you will see Watkins’s Q(λ), which is not covered in the course. The horizon here differs from TD(λ) and Sarsa(λ), which span all the way to the end of the episode. Instead, Watkins’s Q(λ) looks only up to the next exploratory action. Suppose all actions up to Aₜ are greedy. Watkins’s Q(λ) updates Qₜ(Sₜ, Aₜ) based on the best possible action (which was not taken) in state S_{t+1}, and goes no further; all subsequent actions have no influence.
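
For comparison, below is a minimal sketch of Watkins’s Q(λ) under the same hypothetical environment interface as the Sarsa(λ) sketch above. The two substantive differences are that the TD error always bootstraps from the greedy action in the next state, and that all traces are zeroed as soon as a non-greedy action is taken.

import numpy as np

def watkins_q_lambda(env, n_states, n_actions, episodes=500,
                     alpha=0.1, gamma=0.9, lambd=0.5, epsilon=0.1):
    """Tabular Watkins's Q(lambda): traces are cut at exploratory actions."""
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        E = np.zeros_like(Q)
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)          # behaviour action
            a_star = int(np.argmax(Q[s_next]))       # greedy action
            # TD error always bootstraps from the greedy action in s_next
            target = r if done else r + gamma * Q[s_next, a_star]
            delta = target - Q[s, a]
            E[s, a] += 1.0
            Q += alpha * delta * E
            if a_next == a_star:
                E *= gamma * lambd                   # still acting greedily: decay
            else:
                E[:] = 0.0                           # exploratory action: cut all traces
            s, a = s_next, a_next
    return Q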

In Section 7.7, the authors raise two limitations associated with this. The first is that the binary treatment (either a policy is followed, or it is not and the eligibility traces get cut off) may not be the best approach, given that actions actually come from probability distributions. The second is the lack of decoupling between bootstrapping and off-policy learning.

Implementation issues are discussed in Section 7.8, namely whether the use of eligibility traces adds too much computational overhead. In a single sentence: we can live with it.

Finally, the authors briefly touched on the idea of having a non-constant λ, in Section 7.9. For example, we can have λₜ = λ(t) or λₜ = λ(Sₜ). The premise is that it makes sense for λ to be small when a state is known with certainty, and large otherwise. However, Sutton and Barto wrote that such an approach is not used in practical applications. I personally have no plans to implement this anytime soon; I for one do not believe in adding bells and whistles just for the sake of it.

Conclusion

In this chapter, we saw how eligibility traces can be incorporated into TD, and how the idea extends to Sarsa as well.

This is the 5th and final article in my series of commentary, for now. The reason is that the rest of the Introduction to Reinforcement Learning course covers topics outside this textbook. Note that there are other chapters in this very good textbook; however, I only talked about the materials I used as class reference. Maybe after marking the assignments, projects and exams, I will find some time to continue this series.

This is certainly not to say that the remaining materials by Sutton and Barto are not worth covering. Rather, it is because I want to teach students other technical skills which would be helpful at their workplace. Not every problem is meant to be approached through an RL lens.

The introductory concepts here should prepare you well for diving further into the more advanced aspects of Reinforcement Learning. Again, I want to emphasize that this is meant to be a supplement to the textbook. Also, it certainly does not replace attending my course in person!

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
