Reinforcement Learning: Implementing TD(λ) with function approximation

A unicorn of an algorithm combining the benefits of Monte Carlo, single-step Temporal Difference methods, and the n-step methods in between

Ruo Xi Yong
MITB For All
9 min read · Dec 20, 2023


This article was co-authored with Shuxian, and the content within this article was submitted in fulfilment of the requirements for the SMU MITB’s Introduction to Reinforcement Learning module. Permission was granted to the authors to reproduce the assignment context and our submissions for this article.

Breaking out of Grid World with TD(λ). Source: Unicorn icons created by Freepik — Flaticon

It’s game night, and you’re teaming up to play a game you’ve never seen before. Would you pick a partner who remembers every move? Or someone who is safe and consistent? Or a bold opportunist?

Well, why not all?

Enter TD(λ), a general reinforcement learning approach that covers a broad spectrum of methods, from one-step TD (think SARSA and Q-Learning) all the way to Monte Carlo. The implementation of TD(λ) for discrete state spaces has been well covered, so we're going to tackle how to get it working in a continuous state space instead.

In this article, we will be

  • Revisiting TD(λ) and the concept of eligibility traces
  • Implementing TD(λ) with function approximation
  • Conquering GridWorld with TD(λ)

TD(λ) in a jiffy

We're going to assume familiarity with the Monte Carlo algorithm, single-step TD and n-step TD methods, and focus only on the essentials needed to understand our implementation later on. There are plenty of Medium articles (ahem, we recommend our Prof's A Cornerstone of RL — TD(λ) and 3 Big Names and Key Takeaways 4 (Temporal Difference Learning) — Sutton and Barto RL textbook) and, of course, the holy grail itself, the Sutton & Barto textbook [1], that do a great job of explaining the prerequisite knowledge.

In a forward view manner, TD(λ) uses the weighted average (Lₜ) of n-step TD returns as the target to update the value of the current state Sₜ.

All equations in this article typed by the authors in CodeCogs and saved as images.
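In standard notation (we are transcribing the textbook form here, following Sutton & Barto; the image above shows the authors' version), the λ-return and the forward-view update are:

    L_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

    V(S_t) \leftarrow V(S_t) + \alpha \big( L_t - V(S_t) \big)

where G_t^{(n)} is the n-step return.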

Given that the n-step returns look ahead of the current time step t, we have to wait until an episode completes before we can update the value of the current state Sₜ. That's not ideal in a continuous state space, since it might take forever for the agent to finish an episode, which means it won't learn anything until then.

Instead, we can learn online by looking backwards. At the current state Sₜ, we update the values of all previously visited states using eligibility traces. An eligibility trace assigns credit for the current reward to a previously visited state, and therefore determines how much the value of that visited state should be updated by the current TD error. States visited a long time ago should have less influence on the current reward and hence be updated only marginally.

The state-value update equation can also be expressed as

In this form, we see that the TD error is simply the difference between the target output at time step t and the predicted state-value.
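Written out (again in Sutton & Barto's notation), the TD error, the accumulating eligibility trace, and the backward-view update for every state s are:

    \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

    E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}[S_t = s]

    V(s) \leftarrow V(s) + \alpha \, \delta_t \, E_t(s)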

The pseudocode below sums up nicely how we might implement on-line TD(λ) in tabular form for a discrete state space. But that's not what we're interested in here, right? Read on!

Pseudocode for implementing on-line tabular TD(λ). Source: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.) (pp. 178). The MIT Press. [1]
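For completeness, here is a minimal Python sketch of that tabular pseudocode, assuming a toy environment with env.reset() / env.step() methods and a policy function (both placeholders, not part of the article's code):

    from collections import defaultdict

    def tabular_td_lambda(env, policy, num_episodes, alpha=0.1, gamma=0.99, lam=0.9):
        """On-line tabular TD(lambda) with accumulating traces."""
        V = defaultdict(float)                          # state-value estimates
        for _ in range(num_episodes):
            E = defaultdict(float)                      # eligibility traces, reset every episode
            state, done = env.reset(), False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                delta = reward + gamma * V[next_state] * (not done) - V[state]
                E[state] += 1.0                         # bump the trace of the state just visited
                for s in list(E):                       # update every visited state in proportion to its trace
                    V[s] += alpha * delta * E[s]
                    E[s] *= gamma * lam                 # then decay all traces
                state = next_state
        return V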

Implementing TD(λ) with function approximation

In a continuous state space, the number of states is infinite, which means we'll never get the exact state-value for every single state: visiting them all is an impossible task. The best we can do is to estimate the values of states that are similar to each other. Each state can be described as a combination of features; as a result, the action-value function can be redefined as a weighted sum of features.
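In the simplest (linear) case, with a d-dimensional feature vector x(s, a) and weights θ, that weighted sum reads:

    \hat{Q}(s, a; \theta) = \theta^{\top} x(s, a) = \sum_{i=1}^{d} \theta_i \, x_i(s, a)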

Here, we can either define the features manually and heuristically (e.g., in a grid world, the Manhattan distance of the agent from the goal), or represent the action-value function as a neural network and let the network do the feature engineering itself.

Neural network illustration for d-dimensional state input and action-value function as output. Image by author using PowerPoint.

We’re going to use a neural network to approximate the action-value function. The gradient descent equation for neural network parameters is given by

Let's rewrite the TD(λ) state-value update equation from the previous section in terms of the action-value function, since it is the action-value function we need to approximate in order to determine a policy.

Our objective is to improve the estimate of Q(s, a) by minimizing the difference between the target and predicted action-values. We can use the mean squared error between these two values as the loss function J(θ).

The gradient descent equation for TD(λ) would hence be
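Spelled out (our own transcription; the article's images may use slightly different symbols), the chain from the generic gradient descent rule to the TD(λ) update is:

    \theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)

    J(\theta) = \tfrac{1}{2} \big( Q_t - Q(S_t, A_t; \theta) \big)^2

    \theta \leftarrow \theta + \alpha \big( Q_t - Q(S_t, A_t; \theta) \big) \nabla_\theta Q(S_t, A_t; \theta)

where the target Qₜ is treated as a fixed quantity during differentiation (a semi-gradient update), which is exactly why a separate target network appears in the implementation below.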

Enough with the theory; let's move on to the actual stuff.

TD(λ) in action in a continuous GridWorld

Given a 2D “world” (the environment) spanning from coordinates (0,0) to (10,10), our task is to train an agent to reach the goal while maximizing its rewards. Here are the ground rules:

The continuous GridWorld environment
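The exact ground rules are in the table above. Purely as an illustration of the interface such an environment exposes (the goal position, goal radius, unit step size and per-step reward below are our own assumptions, not the article's actual rules), a continuous GridWorld could look like:

    import numpy as np

    class ContinuousGridWorld:
        """Hypothetical 10x10 continuous world; reward and movement details are assumed."""
        ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0)]          # up, down, left, right (unit steps)

        def __init__(self, goal=(10.0, 10.0), goal_radius=0.5):
            self.goal = np.array(goal)
            self.goal_radius = goal_radius

        def reset(self):
            self.pos = np.random.uniform(0.0, 10.0, size=2)   # start anywhere in the world
            return self.pos.copy()

        def step(self, action_idx):
            move = np.array(self.ACTIONS[action_idx], dtype=float)
            self.pos = np.clip(self.pos + move, 0.0, 10.0)    # stay within the (0,0)-(10,10) bounds
            done = np.linalg.norm(self.pos - self.goal) <= self.goal_radius
            reward = 10.0 if done else -1.0                   # assumed: small step penalty, bonus at the goal
            return self.pos.copy(), reward, done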

And now, the agent. We will implement it in 3 steps

  1. Step 1: Build the neural network
  2. Step 2: Define action selection strategy
  3. Step 3: The all-in-one TD(λ) algorithm

Here's the full Python code on GitHub.

Step 1: Build the neural network

This part is fairly simple. We use PyTorch to build a neural network with 3 hidden layers and ReLU activations, which is sufficient for the continuous GridWorld problem. The architecture is entirely up to you and will depend on your use case.
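A minimal sketch of such a network (the hidden width of 64, the 2-dimensional (x, y) state input and the 4 discrete actions are illustrative assumptions, not necessarily what the GitHub code uses):

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one Q-value per discrete action."""
        def __init__(self, state_dim=2, n_actions=4, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),      # hidden layer 1
                nn.Linear(hidden, hidden), nn.ReLU(),         # hidden layer 2
                nn.Linear(hidden, hidden), nn.ReLU(),         # hidden layer 3
                nn.Linear(hidden, n_actions),                 # one Q-value per action
            )

        def forward(self, state):
            return self.net(state)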

Next, we create a TemporalDifference class for the actual TD(λ) algorithm and initialize two identical neural networks, Q_main and Q_target, with an Adam optimiser.

Hang on… why do we need two neural networks?

While Qₜ refers to the target output obtained at time step t, bear in mind that this target comprises the immediate reward plus the discounted estimate of the future returns obtainable from the next state until the agent reaches the goal. That estimate is itself a prediction, just like Q(Sₜ, Aₜ; θ). Technically, we could obtain both Qₜ and Q(Sₜ, Aₜ; θ) from the same network, BUT since the network weights are updated at every step, Qₜ shifts whenever Q(Sₜ, Aₜ; θ) improves (and not necessarily for the better), and we would literally be chasing a moving target with no end.

Having 2 neural networks allows us to decouple Q(Sₜ, Aₜ; θ) and Qₜ to some extent for the calculation of mean squared error so that the training becomes more stable — an idea borrowed from Deep Q-Learning (DQN). We have

  1. Q_main, the main network that we will train to get Q(Sₜ, Aₜ; θ). As training occurs at every step, we can expect fluctuations here.
  2. Q_target, the secondary network which we use to obtain Qₜ. This network is not trained directly; instead, it is updated periodically (a soft update) with the learnt weights from Q_main, so that it improves too, but with smaller fluctuations than Q_main.

That’s the TL;DR, and we highly recommend Reinforcement Learning Explained Visually (Part 5): Deep Q Networks for a more in-depth explanation.
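With that in mind, here's a sketch of how the TemporalDifference class might wire up the two networks (building on the QNetwork sketch above; the hyperparameter values and the soft-update coefficient tau are placeholders):

    import copy
    import torch

    class TemporalDifference:
        def __init__(self, state_dim=2, n_actions=4, lr=1e-3, gamma=0.99, lam=0.9, tau=0.01):
            self.Q_main = QNetwork(state_dim, n_actions)       # trained at every step
            self.Q_target = copy.deepcopy(self.Q_main)         # frozen copy, soft-updated towards Q_main
            self.optimizer = torch.optim.Adam(self.Q_main.parameters(), lr=lr)
            self.gamma, self.lam, self.tau = gamma, lam, tau

        def soft_update(self):
            """theta_target <- tau * theta_main + (1 - tau) * theta_target."""
            with torch.no_grad():
                for p_t, p_m in zip(self.Q_target.parameters(), self.Q_main.parameters()):
                    p_t.mul_(1.0 - self.tau).add_(self.tau * p_m)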

Step 2: Define action selection strategy

We'll start with something really simple: the ε-greedy policy. This lets us exploit the best-known actions the majority of the time, while reserving a small fraction (ε) of the time for exploring other actions in the hope of finding a better one.

Again, the action selection strategy is entirely up to you and depends on the nature of the problem.
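A typical ε-greedy selection, written here as a standalone helper around Q_main (the default ε of 0.1 and the 4 actions are assumptions):

    import torch

    def select_action(q_network, state, epsilon=0.1, n_actions=4):
        """Explore with probability epsilon; otherwise act greedily w.r.t. the given Q-network."""
        if torch.rand(1).item() < epsilon:
            return torch.randint(n_actions, (1,)).item()               # random exploratory action
        with torch.no_grad():
            q_values = q_network(torch.as_tensor(state, dtype=torch.float32))
        return int(q_values.argmax().item())                           # best known action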

Step 3: The all-in-one TD(λ) algorithm

Here’s where we create the unicorn and give it wings. Broadly, the key parts are:

Part 1: Training the agent and collecting samples

  • We set a step_limit so that the agent only commits “good” solutions to memory (i.e. episodes where the agent reaches the goal in fewer than step_limit steps); otherwise the agent might wander forever without reaching the goal, given the infinite nature of the continuous state space.
  • We use a memory buffer to collect the outputs of each step (current state and action, reward obtained, next state and action), from which we randomly sample a batch to train the neural network later. Random sampling helps to break temporal dependencies.
  • Eligibility traces are implemented here. We use a dictionary to keep track of the trace values for the visited states and actions (note: tabular methods can’t be used here because of the continuous state space). On each visit, we increment the trace of the current state-action pair by 1 and decay the traces of pairs visited in the past (see the sketch after this list). This is where the magic happens, because with just a few tweaks to the parameters, we get a whole spectrum of methods!
Parameters for various TD algorithms
  • Eligibility traces are reset every episode since episodes are independent of each other.
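Here's a sketch of that trace bookkeeping (keying the dictionary on states rounded to one decimal place is our own simplification; the article's code may key it differently):

    from collections import defaultdict

    def update_traces(traces, state, action, gamma, lam, precision=1):
        """Decay every existing (state, action) trace, then bump the trace of the current pair."""
        key = (tuple(round(float(x), precision) for x in state), action)
        for k in traces:
            traces[k] *= gamma * lam           # older visits matter less and less
        traces[key] += 1.0                     # accumulating trace for the current (state, action)
        return traces

    # traces = defaultdict(float)   # reset at the start of every episode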

Part 2: Replaying the batch of samples to train the neural networks

Here, we update the neural network using the batch of samples gathered earlier. This batch update is preferred because it is generally more stable than single-sample updates (the latter have high variance and may lead to slow convergence, or none at all).

This is also where we compute Qₜ, where

  • If the agent is acting on policy, then Q(Sₜ₊₁, Aₜ₊₁) is obtained from the Q_target network using the next action taken (by the policy), i.e.
  • If the agent is NOT acting on policy, then Q(Sₜ₊₁, a) is obtained from the Q_target network using the action that gives the best Q-value, i.e.

We get Q(Sₜ, Aₜ; θ) from the Q_main network, and multiply both Qₜ and Q(Sₜ, Aₜ; θ) by the eligibility trace Eₜ(s, a) to compute the mean squared error as the loss function J(θ).

Finally, we update the Q_main and Q_target network weights.
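Putting Part 2 together, one batch update might look like the sketch below. It reuses the TemporalDifference agent sketched earlier and assumes the sampled batch has already been converted to tensors (actions as int64, dones as floats); the on_policy flag and the per-sample trace_vals are our illustrative way of switching between the two targets above:

    import torch
    import torch.nn.functional as F

    def replay(agent, batch, on_policy=True):
        """One training step: trace-weighted MSE between target and predicted Q-values."""
        states, actions, rewards, next_states, next_actions, dones, trace_vals = batch
        q_pred = agent.Q_main(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            if on_policy:      # SARSA-style: evaluate the action the policy actually took next
                q_next = agent.Q_target(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            else:              # Q-learning-style: evaluate the best action in the next state
                q_next = agent.Q_target(next_states).max(dim=1).values
            q_target = rewards + agent.gamma * q_next * (1 - dones)
        loss = F.mse_loss(trace_vals * q_pred, trace_vals * q_target)   # both sides weighted by E_t(s, a)
        agent.optimizer.zero_grad()
        loss.backward()
        agent.optimizer.step()
        agent.soft_update()    # nudge Q_target towards the freshly trained Q_main
        return loss.item()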

And… we are done! By varying λ, we can create different agents and train each one for 10,000 episodes.
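As a usage sketch, that might boil down to something like this (the train() driver that glues Parts 1 and 2 together is assumed, not shown):

    agents = {}
    for lam in [0.0, 0.5, 0.9, 1.0]:              # each lambda gives a different flavour of TD
        agent = TemporalDifference(lam=lam)
        # train(agent, env, num_episodes=10_000)  # hypothetical training loop over Parts 1 and 2
        agents[lam] = agent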

Results

Here’s a sample trajectory!

Conclusion

In this article, we covered how neural networks can be used as a function approximator for TD(λ), which is particularly useful for tackling continuous state spaces. We also saw the versatility of TD(λ) — by varying λ and implementing eligibility traces, we can produce a whole spectrum of agents.

We hope you’ll find this useful in implementing your own TD(λ) agent!

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. (2018), The MIT Press

All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
