Temporal Difference Learning: SARSA vs Q-Learning

One of the most exciting fields in today's machine learning landscape is reinforcement learning. This article explains the difference between the two best-known temporal difference learning algorithms: SARSA and Q-Learning.

Sjoerd Vink
CodeX
5 min read · Feb 16, 2022


Photo by Tara Winstead from Pexels

A basic RL setup consists of an environment and an agent. The agent can perform certain actions in the environment, resulting in some type of reward. The agent is not told which actions to take, but must instead discover which actions yield the most reward by trying them.

When the environment dynamics are already known, the problem becomes fairly easy. The agent can derive the optimal policy with some simple dynamic programming, because no exploration is needed. This is known as the model-based setting, where the agent bases its actions on a model of the environment. However, when the environment dynamics are unknown, the problem becomes much harder. This is the model-free setting, where the agent has to balance exploration and exploitation (more on this later). This is where temporal difference learning comes in.

Temporal difference learning

Temporal difference learning (TD) is a class of model-free RL methods that learn by bootstrapping from the current estimate of the value function. In order to understand how to solve such problems, and what the difference is between SARSA and Q-Learning, it is important to first have some background knowledge about a few key concepts.

Q-table

The agent must base its actions on previously experienced rewards. It does this by keeping a table of Q-values, one for each state-action pair. How the agent updates these Q-values is where the difference between SARSA and Q-Learning lies. Depending on the action selection strategy, it can then choose the action with the highest expected reward.
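
For a small, discrete environment, such a Q-table can simply be a 2-D array indexed by state and action. A minimal sketch (the state and action counts below are made up purely for illustration):

```python
import numpy as np

n_states, n_actions = 16, 4                  # e.g. a tiny grid world (illustrative numbers)
q_table = np.zeros((n_states, n_actions))    # one Q-value per state-action pair, initialised to 0

state = 3
best_action = int(np.argmax(q_table[state]))  # greedy lookup: action with the highest Q-value
```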

Exploration & exploitation

In order to develop an optimal policy, the agent needs to make a trade-off between exploitation and exploration. The goal of an agent is to take actions that maximise the long-term reward. However, the agent doesn't know which actions maximise the reward, because the environment dynamics are unknown. To discover these actions, it must try actions it has never selected before. So there is a trade-off between exploring new actions and exploiting familiar, rewarding ones. This is where action selection strategies come into play.

Action selection

Action selection is the strategy on which the agent bases its choice of actions. The most basic strategy is the greedy strategy, which always goes for the highest reward. In other words, it always exploits the action with the highest estimated reward. However, this strategy risks overlooking potentially better actions.

Another strategy is ε-greedy (pronounced epsilon-greedy). ε-greedy balances exploration and exploitation based on the epsilon parameter. Instead of always selecting the action with the best estimated value from the Q-table, with a small probability (the epsilon) it selects an action uniformly at random. This makes sure the agent doesn't overlook actions that could potentially result in higher rewards.
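
As a rough sketch of ε-greedy selection on top of a Q-table (the function name and signature are illustrative, not part of any particular library):

```python
import numpy as np

def epsilon_greedy(q_table, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = q_table.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)    # explore: uniform random action
    return int(np.argmax(q_table[state]))      # exploit: highest estimated Q-value
```

Using the q_table from the earlier sketch, epsilon_greedy(q_table, state, epsilon=0.1) would exploit most of the time but explore roughly 10% of the time.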

On-policy & off-policy

TD algorithms come in two flavours: on-policy and off-policy. On-policy methods use the same strategy for the behaviour policy (which selects actions) and the target policy (which is used in the update). Off-policy methods use a different strategy for the behaviour and target policy. It is a bit tricky to wrap your head around, but it will become clearer later on.

Parameters

Both algorithms need the following parameters (illustrative values are sketched after the list):

  • The epsilon (ε): the probability that the agent takes an exploration step
  • The learning rate (α): the rate at which the agent learns from new observations
  • The discount factor (γ): the factor by which future rewards are discounted, because immediate rewards are worth more than future rewards
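
The values below are purely illustrative defaults; reasonable choices depend on the environment:

```python
epsilon = 0.1   # probability of taking an exploration step
alpha = 0.1     # learning rate
gamma = 0.99    # discount factor applied to future rewards
```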

SARSA

SARSA is an on-policy learning method, as it uses the ε-greedy strategy for all steps. It updates the Q-value of an action based on the reward obtained from taking that action plus the discounted Q-value of the next state-action pair, where the next action is the one the policy actually goes on to take. The formula that updates the Q-value is as follows:

Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)]

SARSA update formula

Intuitively, the current Q-value is moved towards a target made up of the immediate reward r plus the discounted Q-value of the next state-action pair (s′, a′); the learning rate α controls the size of that step. After each time step, the next action a′ is also the action the agent actually performs next (it keeps following the policy).

When implementing such an agent from scratch in Python, it might look something like this:

SARSA agent
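
A minimal sketch of such an agent, assuming a tabular environment with integer states and actions (the class and method names are illustrative, not the original implementation):

```python
import numpy as np

class SarsaAgent:
    """Tabular SARSA: epsilon-greedy behaviour, on-policy update target."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))  # Q-value per state-action pair
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration probability
        self.n_actions = n_actions

    def select_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.q_table[state]))

    def update(self, state, action, reward, next_state, next_action):
        # On-policy target: bootstrap from the action the policy actually selected next.
        target = reward + self.gamma * self.q_table[next_state, next_action]
        self.q_table[state, action] += self.alpha * (target - self.q_table[state, action])
```

In the training loop, the action returned by select_action for the next state is both passed to update and executed in the environment, which is exactly what makes SARSA on-policy.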

Q-Learning

Q-Learning is an off-policy learning method. It updates the Q-value of an action based on the reward obtained from taking that action plus the maximum Q-value over the actions available in the next state. It is off-policy because it uses an ε-greedy strategy to select the action it actually takes, but a greedy action selection strategy for the second step in the update.

Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]

Q-Learning update formula

Intuitively, the current Q-value is moved towards a target made up of the immediate reward r plus the discounted maximum Q-value in the next state s′; again the learning rate α controls the size of that step. After each time step, the action used in the update target is NOT necessarily the action the agent performs next (the target does not follow the policy).

When implementing such an agent from scratch in Python, it might look something like this:

Q learning agent
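
Again a minimal sketch under the same assumptions (illustrative names, not the original code):

```python
import numpy as np

class QLearningAgent:
    """Tabular Q-Learning: epsilon-greedy behaviour, greedy (max) update target."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))  # Q-value per state-action pair
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration probability
        self.n_actions = n_actions

    def select_action(self, state):
        # Behaviour policy: epsilon-greedy.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.q_table[state]))

    def update(self, state, action, reward, next_state):
        # Off-policy target: bootstrap from the best next action,
        # regardless of the action that will actually be executed.
        target = reward + self.gamma * np.max(self.q_table[next_state])
        self.q_table[state, action] += self.alpha * (target - self.q_table[state, action])
```

Note that update no longer needs the next action: the target maximises over the next state's Q-values, while the behaviour policy stays ε-greedy.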

Performance difference

Q-Learning directly learns the optimal policy, because its update maximises the reward with a greedy action selection. This removes the chance that an exploration step ends up in the second step of the update function. SARSA can include an exploration step in that second step, because it keeps following the ε-greedy strategy. Because of this, Q-Learning will often converge faster to an optimal policy than SARSA. However, the specific performance difference really depends on the environment.

Conclusion

As you can see, SARSA and Q-Learning are very similar to each other. Both methods are used to find the optimal policy in an unknown environment. SARSA is on-policy and Q-Learning off-policy. The main difference is in how the methods update the Q-value:

  • SARSA updates the Q-value using the Q-value of the next state and the policy’s next action
  • Q-Learning updates the Q-value using the Q-value of the next state and the greedy action after that

This means that SARSA uses the next action as the starting action in the next state (following the policy), whereas Q-Learning replaces it with the maximisation over the next state's actions.
