Key takeaways 4 (Temporal Difference Learning) — Sutton and Barto RL textbook

What you should know about TD(0), SARSA and Q-learning

James Koh, PhD
MITB For All
8 min read · Sep 17, 2023


Photo by Filip Andrejevic on Unsplash

We now have all the foundations needed to learn about Temporal Difference learning. This is the go-to approach when formulating target returns with which to train RL agents. It also paves the way for TD(λ), a general form where you can adjust λ such that the Monte Carlo approach learnt previously is simply a special case.

General is definitely good (and no, I am not talking about an army general). Why? Because we can use just a single set of code and simply change the arguments to have it work in different ways!

Temporal Difference Learning (Chapter 6)

We distinguish between prediction and control in the same way as in Chapter 5. When we say prediction (in this context), we are trying to determine the value function corresponding to a policy. Meanwhile, control is about learning a policy.

TD Learning can be thought of as an improvement over Monte Carlo (MC) methods and dynamic programming (DP). Like MC methods, TD can “learn directly from raw experience without a model of the environment’s dynamics.” And, like DP, they “update estimates based in part on other learned estimates, without waiting for a final outcome”.

I have chosen to quote Sutton and Barto directly above, because their definition is clear and succinct. The ‘updating estimates based on estimates’ is what bootstrapping is about (that’s why the feature image has army boots).

Quick recap

Pardon me. I like to repeat important points to make sure everyone is on the same page. v_{\pi}(s) is the true value function corresponding to policy π. Without full knowledge of the environment, we can at best get an estimate V_{\pi}(s) of the value function.

In code, you can think of it as a hashmap or neural network, which takes a particular state as input and returns a predicted value. Visually, it is a curve, surface, or hypersurface, depending on the dimensionality of s. Each point on V_{\pi}(s) is our prediction of the expected cumulative total rewards when the agent acts in accordance with policy π. Note that this already accounts for probability distributions and the inherent stochasticity.
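As a tiny illustration (my own toy sketch, not code from the textbook or from this article's class), a tabular value function for a small gridworld can be nothing more than an array indexed by state:

import numpy as np

# Toy sketch: a tabular value function for a 4x4 gridworld (illustrative only)
V = np.zeros((4, 4))   # V[row, col] = estimated return from state (row, col)

state = (2, 3)
print(V[state])        # the current prediction for this state, initially 0.0
V[state] = -1.5        # estimates get refined as the agent gathers experience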

By default, 'value function' refers to the state-value function. We only spell out the qualifier when referring to the action-value function q(s,a).

The objective of ‘prediction’ in RL is to make V_{\pi}(s) close to v_{\pi}(s).

TD Predictions

Let's start by using TD just to predict the value function of a particular policy. Whether we use a tabular approach or a neural network, 'targets' are needed; they are what we want to move our estimates towards. In the tabular case, we have:

V(Sₜ) ← V(Sₜ) + α [ target - V(Sₜ) ]

The update rule for MC and TD. Equations typed by author.

where the values are updated for each state Sₜ encountered. The target is Gₜ for MC methods. For the TD(0) method, the target is

Rₜ₊₁ + γ V(Sₜ₊₁)

The target for TD(0).

The value will be adjusted upwards if the target is larger than the current V(Sₜ), and downwards if it is smaller; either way, V(Sₜ) moves closer to the target. For neural networks, we would minimize a loss function instead, and the math is not identical, but the concept remains the same: we move closer to the targets.
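As a quick sanity check with toy numbers of my own (not taken from the book), here is a single tabular TD(0) update in isolation:

# Toy numbers, chosen purely for illustration
alpha, gamma = 0.1, 0.95
V_s, V_s_next, reward = 2.0, 3.0, 1.0

target = reward + gamma * V_s_next     # 1.0 + 0.95 * 3.0 = 3.85
V_s += alpha * (target - V_s)          # 2.0 + 0.1 * (3.85 - 2.0) = 2.185
print(V_s)                             # nudged upwards, towards the target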

You can use one of the gridworld environments that I shared earlier, for example in On-Policy vs. Off-Policy Monte Carlo; that's the whole idea of having modular code, after all! To predict the value function using TD(0), we can simply use the following class:

import numpy as np
from copy import deepcopy
from tqdm import tqdm

class TemporalDifference:
    def __init__(self, Env, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.Env = Env
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        self.state_dim = self.Env._get_state_dim()
        self.action_dim = 4     # four actions: up, down, left, right
        self.V_s = np.zeros(self.state_dim)
        self.Q_sa = np.zeros((self.state_dim[0], self.state_dim[1], self.action_dim))

    def random_policy(self, state):
        # Equiprobable random policy: every action has the same probability
        return np.random.choice(self.action_dim)

    def td_prediction(self, num_episodes):
        # TD(0) prediction of V for the equiprobable random policy
        for _ in tqdm(range(num_episodes)):
            state = self.Env.reset()
            self.Env.done = False

            while not self.Env.done:
                action = self.random_policy(state)
                reward, next_state, done = self.Env.transition(deepcopy(state), action)
                # Move V(s) towards the TD target: R + gamma * V(s')
                td_error = reward + self.gamma * self.V_s[next_state[0], next_state[1]] \
                    - self.V_s[state[0], state[1]]
                self.V_s[state[0], state[1]] += self.alpha * td_error
                state = next_state
                self.Env.done = done

Similarly, you can train and visualize the results as follows.

import seaborn as sns

agentTD = TemporalDifference(env)
agentTD.td_prediction(10000)
sns.heatmap(
    agentTD.V_s, annot=True, cmap='coolwarm', fmt='.1f'
)

Note that all that’s done here is to evaluate the value of the equiprobable random policy. There is no aspect of policy improvement, which is why the values are generally low.

Example 6.1 (driving home) is a case where we want to predict the value function v(s) rather than the action value q(s,a), because the choice of actions is very limited. Here, we can treat the remaining minutes to reach home as a negative reward. For example, if I estimate that it takes 30 minutes to get home from state Sₜ, I can say that V(Sₜ) = -30. (Note: be careful with the usage of v and V. The lowercase letter refers to the true value, and the capital letter refers to my estimate.)

It may be fair to say that there is not much choice of action when driving home. Our transition from one state to the next depends mostly on the environment (the traffic conditions); we just have to queue up along the road and drive, assuming that taking a detour would not help because every road is jammed during peak hour.

Comparing TD vs MC

Moving on to Section 6.2, we look at the advantages of TD(0).
1. It does not require knowledge of the environment, p(s',r|s,a), unlike DP which does.
2. Updates can be done online after each step, unlike MC which has to complete the episode.
3. Though not explicitly stated in this section, TD targets have lower variance than MC targets, at the expense of higher bias.

The authors wrote that for any policy π, TD converges to v_{\pi} given a sufficiently small constant step-size parameter. Put this together with the concept of generalized policy iteration, and we know the optimal policy can be obtained. In addition, TD methods usually converge faster than constant-α MC methods in practice.

In Section 6.3, Sutton and Barto talk about batch updating. Here, the context is that a finite dataset is used repeatedly until the value function converges. Do not confuse this with mini-batch gradient descent. In TD batch updating, the increments towards the targets are accumulated over the whole dataset, and the value function is changed only once per sweep. This is why it converges deterministically; there is only one way the math would turn out.
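To make this concrete, here is a rough sketch of one sweep of batch TD(0) updating, under my own assumption that the fixed experience is stored as a list of (state, reward, next_state, done) tuples with states usable as array indices; it is not code from the textbook:

import numpy as np

def batch_td0_sweep(V, experience, alpha=0.1, gamma=0.95):
    # Accumulate the increments over the entire fixed dataset first...
    increments = np.zeros_like(V)
    for state, reward, next_state, done in experience:
        target = reward if done else reward + gamma * V[next_state]
        increments[state] += alpha * (target - V[state])
    # ...then apply them to the value function in a single update
    return V + increments

Repeating such sweeps on the same finite dataset until the increments become negligible gives the deterministic convergence described above.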

You will also see mention of the certainty-equivalence estimate. The concept of 'certainty equivalence' is broad and goes well beyond Reinforcement Learning: when there are unknowns, you take the best estimates of the uncertain variables and proceed as though they were certain. Don't worry, you are not expected to produce any mathematical proofs or derivations whatsoever. Just hear it, nod your head, and move on. Sutton and Barto acknowledge that “it is almost never feasible to compute (the certainty-equivalence estimate) directly”.

SARSA and Q-learning

In Section 6.4, you are introduced to SARSA, which derives its name from State — Action — Reward — (next) State — (next) Action. It is an on-policy TD control method, since learning is done based on the actual actions taken.

The TemporalDifference class above can be extended with the following methods.

class TemporalDifference:
    # ... __init__, random_policy and td_prediction as defined above ...

    def epsilon_greedy_policy(self, state):
        # Explore with probability epsilon; otherwise act greedily w.r.t. Q,
        # breaking ties between equally-valued actions at random
        if np.random.rand() < self.epsilon:
            return np.random.choice(self.action_dim)
        else:
            max_q_value = np.max(self.Q_sa[tuple(state)])
            max_indices = np.where(self.Q_sa[tuple(state)] == max_q_value)[0]
            return np.random.choice(max_indices)

    def sarsa_control(self, num_episodes):
        for _ in tqdm(range(num_episodes)):
            state = self.Env.reset()
            self.Env.done = False
            action = self.epsilon_greedy_policy(state)

            while not self.Env.done:
                reward, next_state, done = self.Env.transition(deepcopy(state), action)
                next_action = self.epsilon_greedy_policy(next_state)

                # On-policy target: uses the action actually selected in the next state
                td_error = reward + self.gamma * self.Q_sa[next_state[0], next_state[1], next_action] \
                    - self.Q_sa[state[0], state[1], action]
                self.Q_sa[state[0], state[1], action] += self.alpha * td_error

                state, action = next_state, next_action
                self.Env.done = done

    def q_learning_control(self, num_episodes):
        for _ in tqdm(range(num_episodes)):
            state = self.Env.reset()
            self.Env.done = False

            while not self.Env.done:
                action = self.epsilon_greedy_policy(state)
                reward, next_state, done = self.Env.transition(deepcopy(state), action)

                # Off-policy target: uses the greedy action at the next state,
                # regardless of the action the behaviour policy will actually take
                best_next_action = np.argmax(self.Q_sa[next_state[0], next_state[1], :])
                td_error = reward + self.gamma * self.Q_sa[next_state[0], next_state[1], best_next_action] \
                    - self.Q_sa[state[0], state[1], action]
                self.Q_sa[state[0], state[1], action] += self.alpha * td_error

                state = next_state
                self.Env.done = done

The off-policy TD control method, Q-learning, is discussed in Section 6.5. Here, the learnt action-value function, Q, is independent of the policy being followed. The value of the next state, which contributes to the target, is derived from the best possible action that could be taken, based on the current prediction of Q.

It is actually easier to see this in code than in words.

For SARSA, the future component of the target is self.Q_sa[next_state[0], next_state[1], self.epsilon_greedy_policy(next_state)]. On the other hand, the future component of the target when using Q-learning is self.Q_sa[next_state[0], next_state[1], np.argmax(self.Q_sa[next_state[0], next_state[1], :])].

We can train an agent to learn via SARSA or Q-learning with just a one-liner .sarsa_control() or .q_learning_control().

agent1 = TemporalDifference(env)
agent1.sarsa_control(10000)
sns.heatmap(
    np.max(agent1.Q_sa, axis=2), annot=True, cmap='coolwarm', fmt='.1f'
)

agent2 = TemporalDifference(env)
agent2.q_learning_control(10000)
sns.heatmap(
    np.max(agent2.Q_sa, axis=2), annot=True, cmap='coolwarm', fmt='.1f'
)

To apply Expected SARSA, you only need a slight modification to the code above: replace self.Q_sa[next_state[0], next_state[1], self.epsilon_greedy_policy(next_state)] with a weighted average of Q_sa over the actions, weighted by the action probabilities of the policy being followed. Notice that this leads to better stability, because the impact of a single exploratory action with an outsized outcome is averaged out.
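If it helps, here is a rough sketch of how that expected target could be computed for the ε-greedy policy defined earlier; the method name expected_sarsa_target is my own and not part of the original class:

    def expected_sarsa_target(self, reward, next_state):
        # Action values at the next state
        q_next = self.Q_sa[next_state[0], next_state[1], :]

        # Action probabilities under the epsilon-greedy policy being followed:
        # epsilon is spread uniformly, and (1 - epsilon) is shared by the greedy action(s)
        probs = np.full(self.action_dim, self.epsilon / self.action_dim)
        greedy_actions = np.flatnonzero(q_next == q_next.max())
        probs[greedy_actions] += (1.0 - self.epsilon) / len(greedy_actions)

        # Weighted average of Q over the policy's action distribution
        return reward + self.gamma * np.dot(probs, q_next)

The returned value would then take the place of the SARSA target inside the update loop.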

I will briefly comment on Section 6.6 of the textbook, without dedicating a subtitle to it. The authors discuss how some tasks or games, such as tic-tac-toe, are more appropriately treated in a special manner. In these tasks, the immediate outcome of taking an action is known with full certainty, and it is more efficient to learn the values of 'afterstates'. (Note that this is different from saying the full environment transition dynamics are known.) This is because different position-move pairs can produce the same afterstate, and thus should have the same value.
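To make the afterstate idea concrete, here is a small hypothetical sketch for tic-tac-toe; the helper play(board, move), which returns the board after our move, is assumed for illustration and is not from the textbook:

# Hypothetical sketch: value estimates keyed on the board *after* our move,
# so different (position, move) pairs that produce the same board share one value
afterstate_values = {}   # board (as a tuple of 9 cells) -> estimated value

def afterstate_value(board_after_move):
    return afterstate_values.get(tuple(board_after_move), 0.5)  # default guess

def greedy_move(board, legal_moves, play):
    # `play(board, move)` is an assumed helper returning the resulting board
    return max(legal_moves, key=lambda move: afterstate_value(play(board, move)))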

If you want to try out Examples 6.5 and 6.6 of the textbook yourself, I have provided a generalized environment here. Just modify self.grid and self.wind accordingly.

Conclusion

In Chapter 6 of the textbook, you learned how to implement Temporal Difference learning. You can now perform on-policy or off-policy updates in an online manner, without needing knowledge of the environment transitions. In the next article, we will see how TD can be further generalized!

The concepts here can be extended to solve a wide variety of RL problems, and many powerful algorithms inevitably have a component within them that uses TD.

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
