Reinforcement Learning Foundation Stone — Temporal Difference Learning

Saurabh Kishore Tiwari
The Thought Mill
6 min read · May 21, 2024

--

Temporal Difference Learning (TD Learning) is one of the main ideas behind reinforcement learning.

In this article, we’ll first discuss Temporal Difference Learning and then explore how SARSA is different from Q-Learning.

Bellman Equation and Markov Decision Process

To understand TD learning, let’s first understand the Bellman equation.

The Bellman equation is a fundamental concept in reinforcement learning and dynamic programming. It provides a recursive decomposition of the value function, breaking down the value of a state into the immediate reward and the value of the subsequent states.

This is denoted by

V(s) = max_a [ R(s, a) + γ V(s’) ]

where
s is the current state
s’ is the new state
V(s) is the value of the current state
R(s, a) is the reward for transitioning from state s using action a
V(s’) is the value of the new state
gamma (γ) is the discount factor

But there is something missing: randomness.

We know that there are two kinds of environments, deterministic and non-deterministic. If you look carefully at the equation above, you’ll find that it is deterministic: the state you land in after taking an action is predetermined.

A Markov Decision Process (MDP) brings in this non-deterministic nature. It is a mathematical framework for modelling decision making in situations where outcomes are partly random and partly under the control of the decision maker.

With transition probabilities, the formula above becomes

V(s) = max_a [ R(s, a) + γ ∑ P(s, a, s’) V(s’) ]

where the sum runs over all possible next states s’, and P(s, a, s’) is the probability of moving to state s’ from state s using action a.

The deterministic Bellman equation is the special case of the equation above in which the transition is certain, i.e. P(s, a, s’) = 1 for exactly one next state s’.
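
To make the backup concrete, here is a minimal Python sketch of a single Bellman backup under the MDP form of the equation. The states, rewards, and transition probabilities are made-up toy values, purely for illustration:

```python
# One Bellman backup for a single state:
# V(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s, a, s') * V(s') ]
# All states, rewards, and probabilities below are made-up toy values.

gamma = 0.9

# Current value estimates for each state
V = {"s1": 0.0, "s2": 1.0, "s3": 2.0}

# R[(s, a)]: immediate reward for taking action a in state s
R = {("s1", "left"): 0.0, ("s1", "right"): 1.0}

# P[(s, a)]: list of (next_state, probability) pairs
P = {
    ("s1", "left"):  [("s2", 1.0)],
    ("s1", "right"): [("s2", 0.4), ("s3", 0.6)],
}

def bellman_backup(s, actions):
    """Return the backed-up value of state s: the best action's expected return."""
    return max(
        R[(s, a)] + gamma * sum(p * V[s_next] for s_next, p in P[(s, a)])
        for a in actions
    )

print(bellman_backup("s1", ["left", "right"]))  # 1.0 + 0.9 * (0.4*1.0 + 0.6*2.0) = 2.44
```

Repeating such backups over all states until the values stop changing is exactly what value iteration does.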

Q-Learning

From Wikipedia:

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence “model-free”), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given finite Markov decision processes, given infinite exploration time and a partly random policy. “Q” refers to the function that the algorithm computes — the expected rewards for an action taken in a given state.

So, if you look closely, there are similarities between the Bellman equation with the MDP and the Q-learning formula. The value of a state is the value of its best action, V(s) = max_a Q(s, a), and the Q-value of a state-action pair is

Q(s, a) = R(s, a) + γ ∑ P(s, a, s’) V(s’)

If you replace V(s’) using the relation above, as an end result you’ll get

Q(s, a) = R(s, a) + γ ∑ P(s, a, s’) max_a’ Q(s’, a’)

Temporal Difference Learning

TD learning methods are used to estimate value functions for states or state-action pairs and to learn optimal policies by iteratively improving these value estimates based on the experience gathered from interacting with the environment.

The core TD update derives a new Q-value from the old one:

Q_new(s, a) = Q_old(s, a) + α [ R(s, a) + γ max_a’ Q(s’, a’) - Q_old(s, a) ]

In the formula above, you can see how the new Q-value is derived using the old Q-value. alpha (α) is called the learning rate: the rate at which your algorithm learns. The term in square brackets is the temporal difference, the gap between the freshly observed target R(s, a) + γ max_a’ Q(s’, a’) and the current estimate Q_old(s, a). So the formula can equivalently be written as

Q_new(s, a) = Q_old(s, a) + α · TD(s, a)

So you can now think of the Bellman equation in terms of Q-values: each update nudges the old estimate towards the Bellman target.
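
To make this concrete, here is a minimal Python sketch of one such tabular update. The Q-table entries and the sampled transition are made-up toy values, purely for illustration:

```python
# One tabular TD update:
# Q_new(s, a) = Q_old(s, a) + alpha * (R + gamma * max_a' Q(s', a') - Q_old(s, a))
# All numbers below are made-up toy values.

alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

# Q[s][a]: current action-value estimates
Q = {
    "s1": {"left": 0.0, "right": 0.5},
    "s2": {"left": 1.0, "right": 2.0},
}

def td_update(s, a, reward, s_next):
    """Apply one temporal-difference update to Q[s][a]."""
    td_target = reward + gamma * max(Q[s_next].values())
    td_error = td_target - Q[s][a]   # the "temporal difference"
    Q[s][a] += alpha * td_error

# Suppose the agent took "right" in s1, received reward 1.0, and landed in s2.
td_update("s1", "right", 1.0, "s2")
print(Q["s1"]["right"])  # 0.5 + 0.1 * (1.0 + 0.9*2.0 - 0.5) = 0.73
```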

TD Control

TD control (Temporal Difference control) refers to a set of algorithms in reinforcement learning that aim to find an optimal policy for a Markov Decision Process (MDP) using temporal difference methods. These algorithms improve the policy iteratively by learning the value of actions (the action-value function Q(s, a)) while the agent interacts with the environment.

Key Concepts in TD Control

1. Policy:
A policy π(a | s) is a mapping from states (s) to probabilities of selecting each possible action (a).
In TD control, we typically start with a given policy and improve it over time.

2. Action-Value Function:
The action-value function Q(s, a) represents the expected return (cumulative future reward) of taking action (a) in state (s) and following the policy thereafter.

3. Exploration vs. Exploitation:
To learn an optimal policy, the agent must balance exploration (trying new actions to discover their effects) and exploitation (choosing actions that are known to yield high rewards).
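
The first two concepts map directly onto simple Python structures. Here is a minimal sketch, with made-up states, actions, and numbers:

```python
import random

# A stochastic policy pi(a | s): for each state, a probability for every action.
policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.5, "right": 0.5},
}

# A tabular action-value function Q(s, a).
Q = {
    "s1": {"left": 0.1, "right": 0.4},
    "s2": {"left": 0.0, "right": 0.3},
}

def sample_action(policy, state):
    """Sample an action according to the policy's probabilities for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s1"))  # "left" roughly 80% of the time
```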

Two Main TD Control Algorithms

1. SARSA (State-Action-Reward-State-Action): an on-policy TD control algorithm. It updates the action-value function Q(s, a) based on the action actually taken by the current policy.
The update rule for SARSA is:

Q(s, a) ← Q(s, a) + α [ R + γ Q(s’, a’) - Q(s, a) ]

where a’ is the action the current policy actually selects in the next state s’ (a code sketch of both updates follows this list).

2. Q-Learning: an off-policy TD control algorithm. It updates the action-value function Q(s, a) using the maximum estimated value of the next state, regardless of which action the policy actually takes next.
The update rule for Q-Learning is:

Q(s, a) ← Q(s, a) + α [ R + γ max_a’ Q(s’, a’) - Q(s, a) ]
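
To make the contrast concrete, here is a minimal Python sketch of both update rules side by side, assuming a tabular Q-function stored as a dict of dicts (Q[state][action]); the function names and defaults are illustrative choices, not a full agent:

```python
# Side-by-side sketch of the two TD control updates for a tabular Q-function.
# Q is assumed to be a dict of dicts: Q[state][action] -> value.

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next actually chosen by the current policy."""
    td_target = reward + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])

def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy (maximum) value of the next state."""
    td_target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
```

The only difference is the bootstrap term: SARSA needs to know which action the policy actually takes next, while Q-Learning simply assumes the greedy one.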

Comparison of SARSA and Q-Learning
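
Both algorithms share the same TD error structure; they differ in how they estimate the value of the next state:

1. SARSA is on-policy: it bootstraps from Q(s’, a’), where a’ is the action its own (typically ε-greedy) policy actually selects, so its estimates include the effect of exploration.
2. Q-Learning is off-policy: it bootstraps from max_a’ Q(s’, a’), so it learns the value of the greedy policy even while the agent behaves exploratorily.
3. As a result, SARSA tends to learn more conservative behaviour near costly mistakes, while Q-Learning heads straight for the optimal greedy policy.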

Practical Use of TD Control

Exploration Strategies:
Common exploration strategies include ε-greedy (choosing the best-known action with probability 1 - ε and a random action with probability ε) and softmax (selecting actions based on a probability distribution related to their value estimates).
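
As a quick illustration, an ε-greedy action selector for the same kind of dict-of-dicts Q-table could look like this minimal Python sketch:

```python
import random

def epsilon_greedy(Q, state, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the best-known action."""
    actions = list(Q[state].keys())
    if random.random() < epsilon:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[state][a])  # exploit

# e.g. epsilon_greedy({"s1": {"left": 0.2, "right": 0.7}}, "s1") usually returns "right"
```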

Convergence:
Both SARSA and Q-Learning converge to the optimal action-value function under certain conditions, such as sufficiently decreasing learning rates and adequate exploration.

Conclusion

In summary, Temporal Difference (TD) Learning stands as a cornerstone of reinforcement learning, combining the strengths of dynamic programming and Monte Carlo methods to enable efficient, online learning of value functions and policies. The Bellman Equation and Markov Decision Process (MDP) framework provide the theoretical underpinning for TD learning by defining how value functions can be decomposed into immediate rewards and future values, and how probabilistic transitions between states can be managed.

TD learning methods, including SARSA and Q-Learning, play a pivotal role in reinforcement learning. SARSA, an on-policy algorithm, updates value estimates based on the actions actually taken, making it sensitive to the policy being followed, including exploration strategies. In contrast, Q-Learning, an off-policy algorithm, aims directly at learning the optimal policy by updating values using the maximum possible future rewards, irrespective of the current policy’s actions.

Ultimately, the integration of TD learning with other reinforcement learning techniques and its applications in various domains underscore its foundational significance in the field. As we continue to develop more sophisticated algorithms and models, the principles of TD learning will remain central to advancing our understanding and capabilities in artificial intelligence and machine learning.
