Temporal-difference RL: Sarsa vs Q-learning

Introduction

Kim Rodgers
5 min read · Aug 7, 2023

In this blog post, I will explore two reinforcement learning algorithms: Sarsa and Q-learning. Both are temporal-difference (TD) methods. TD combines ideas from Monte Carlo methods and dynamic programming: like Monte Carlo, it learns from experience, but it does not wait for the whole episode to end. Instead, it uses the estimated value of the next state as the target to update the value of the current state, i.e. it bootstraps, as in dynamic programming. Unlike dynamic programming, however, it does not require the environment dynamics to be fully known. Below is the simplest TD update:
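V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]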

This is called TD(0), since its target uses only the reward received after the current action and the value of the next state. The two methods explored here use a similar update, except that they use state-action values instead of state values.
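As a quick illustration, the update above can be written in a few lines of Python for a tabular value function. The names here are just illustrative, not taken from any particular implementation.

```python
# Minimal sketch of the TD(0) value update for a single transition.
# V is assumed to be a tabular value function, e.g. a numpy array indexed by state.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    td_target = reward + gamma * V[next_state]  # bootstrap on the next state's estimate
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return V
```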

Differences between Sarsa and Q-learning

Sarsa stands for State-Action-Reward-State-Action. It is an on-policy method, i.e. the policy being learnt is also used to generate the training data. Q-learning, on the other hand, is an off-policy method: a different policy is used to generate the training data for the policy whose value function is being learnt. Below are the update rules for the two methods:

Update for Sarsa:
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]

Update for Q-learning:
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

From the updates above, it can be seen that the two methods are very similar except in how the value of the next state is determined. Sarsa uses the value of the state-action pair actually taken next, where the next action is selected by the same policy that generated the state-action pair being updated. Q-learning instead takes the maximum state-action value of the next state. This is why Sarsa is on-policy while Q-learning is off-policy.
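To make the difference concrete, here is a minimal tabular sketch of the two updates in Python. I'm assuming Q is a numpy array of shape (n_states, n_actions); the function names are illustrative and not taken from the repo linked below.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    # On-policy: the target uses the action actually selected by the
    # behaviour policy (e.g. epsilon-greedy) in the next state.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    # Off-policy: the target uses the greedy (max) action value in the
    # next state, regardless of which action is actually taken next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```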

Comparing performance in the cliff-walking environment

This is a simple environment described in the RL book and shown in the screenshot below.

The cliff-walking environment
  • An episode starts in state S, i.e. the agent always begins there.
  • An episode ends in state G, i.e. this is the terminal state.
  • The states in the bottom-most row between S and G are the cliff states.
  • A transition from any state other than a cliff state has a reward of -1, and the agent moves to one of the neighboring states.
  • A transition into a cliff state has a reward of -100, and the agent is sent back to the start state S, i.e. the episode ends.
  • The episode ends when the agent reaches the terminal state G, has taken 100 steps, or has stepped into a cliff state.
  • The blue path is safe but not optimal, since it takes many more steps to reach the goal state.
  • The red path is optimal but very risky, since the agent can easily fall off the cliff.

From the description of the environment, the goal of the agent is to maximize the cumulative reward, i.e. to take as few steps as possible, since each step has a reward of -1. The optimal path is the one just above the cliff: it takes only 13 steps and has a return of -13. I used the two TD(0) methods above to see whether they are able to find this optimal path.
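To make the set-up concrete, below is a rough sketch of the dynamics described above as a single step function on a 4x12 grid. This is my own illustrative version, not the code from the repo linked later; the 100-step cap is handled in the training loop.

```python
# Illustrative cliff-walking dynamics: 4x12 grid, start at the bottom-left,
# goal at the bottom-right, cliff cells in between along the bottom row.
N_ROWS, N_COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
CLIFF = {(3, c) for c in range(1, 11)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Return (next_state, reward, done) for one move."""
    r, c = state
    dr, dc = ACTIONS[action]
    r = min(max(r + dr, 0), N_ROWS - 1)  # stay inside the grid
    c = min(max(c + dc, 0), N_COLS - 1)
    next_state = (r, c)
    if next_state in CLIFF:
        return START, -100, True   # fell off the cliff: big penalty, episode ends
    if next_state == GOAL:
        return next_state, -1, True  # reached the terminal state G
    return next_state, -1, False     # ordinary move costs -1
```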

Learning experiment

The following hyperparameters were used during training:

  • Number of episodes: 2000
  • Discount factor: 1, i.e. no discounting
  • Alpha: 0.1, the learning rate
  • Epsilon: 0.1, the probability of picking an action uniformly at random in the epsilon-greedy policy

The implementation of the two algorithms can be found in this GitHub repo.
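For illustration, here is a minimal Q-learning training loop using the hyperparameters above. It reuses the step(), START, N_ROWS, N_COLS and ACTIONS names from the environment sketch earlier; the Sarsa loop differs only in selecting the next action first and using its value in the target. The actual code in the repo may be organized differently.

```python
import numpy as np

def state_index(state):
    r, c = state
    return r * N_COLS + c

def epsilon_greedy(Q, s, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(len(ACTIONS))  # explore: random action
    return int(np.argmax(Q[s]))                 # exploit: greedy action

def train_q_learning(n_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, max_steps=100):
    Q = np.zeros((N_ROWS * N_COLS, len(ACTIONS)))
    steps_per_episode = []
    for _ in range(n_episodes):
        state, done, steps = START, False, 0
        while not done and steps < max_steps:
            s = state_index(state)
            a = epsilon_greedy(Q, s, epsilon)
            next_state, reward, done = step(state, a)
            s_next = state_index(next_state)
            # Q-learning target: bootstrap on the greedy value of the next
            # state; use only the reward if the episode has ended.
            target = reward + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            state, steps = next_state, steps + 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode
```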

Learning results

  • For this experiment, both Sarsa and Q-learning took roughly the same time to converge, but only Q-learning learnt the optimal path of 13 steps. Sarsa did not learn the optimal path because it avoided the cliff: its update uses the next state-action value selected by the epsilon-greedy policy, so the states just above the cliff ended up with low values. Q-learning, which uses the max of the next state-action values in its update, was able to move carefully along the edge to the goal state G. The chart below shows the number of steps per episode. To make the chart smoother, the step counts are averaged in groups of 20 (see the smoothing snippet after this list). We can clearly see that Q-learning finds the optimal path.
  • The next chart shows the online performance of the two algorithms, i.e. the reward obtained per episode, again averaged in groups of 20 as in the chart above. We can see that Sarsa performs better online than Q-learning. This is because, as Q-learning learns the optimal path, it occasionally falls into the cliff, since the state-action pairs being updated are still generated by the epsilon-greedy policy. As shown above, Sarsa learns to stay away from the states close to the cliff, which reduces its chance of falling in.
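For reference, the group-of-20 averaging used for both charts can be done like this; the plotting code in the repo may differ.

```python
import numpy as np

def average_in_groups(values, group_size=20):
    """Average per-episode values (steps or rewards) in non-overlapping groups."""
    values = np.asarray(values, dtype=float)
    n = len(values) // group_size * group_size  # drop the incomplete tail
    return values[:n].reshape(-1, group_size).mean(axis=1)
```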

Conclusion

This simple example illustrates the differences between Sarsa and Q-learning. The findings should not, however, be taken as general, since the experiment was very specific to the environment I was working with.
