Q-Learning, Expected Sarsa and comparison of TD learning algorithms

The Perceptive Agent · Published in Analytics Vidhya · May 11, 2020

This article continues from the previous one, where I talked about TD Learning and demonstrated how the SARSA algorithm works. Building on that, let us talk about two more TD-based algorithms: Q-Learning and Expected Sarsa.

Q-Learning

One of the early breakthroughs in RL was the development of an off-policy TD control algorithm known as Q-Learning, defined by:

Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α [Rₜ₊₁ + γ maxₐ Q(Sₜ₊₁, a) − Q(Sₜ, Aₜ)]

Q-Learning update equation. (Source: Reinforcement Learning: An Introduction by Sutton and Barto)

Unlike the Sarsa algorithm, the Q-Learning update uses greedy action selection for the next state to calculate the TD error. This is what makes the algorithm off-policy: irrespective of the policy actually being followed (e.g. ϵ-greedy), the TD error always uses the maximum Q-value over the actions available in the next state.
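In code, the only difference lies in the TD target. A minimal sketch of the two targets (the tabular array Q and the argument names here are purely illustrative):

```python
import numpy as np

def sarsa_target(Q, reward, next_state, next_action, gamma):
    # on-policy: bootstrap from the action actually chosen by the behaviour policy
    return reward + gamma * Q[next_state, next_action]

def q_learning_target(Q, reward, next_state, gamma):
    # off-policy: bootstrap from the greedy action, regardless of what the behaviour policy did
    return reward + gamma * np.max(Q[next_state])
```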

Here is the Q-Learning algorithm. Everything is the same as in Sarsa except for the small change in the update equation.

Q-Learning in Python

Using the same Gridworld environment as in the previous article, I implemented the Q-Learning algorithm. A small change I made is that the ϵ-greedy action-selection policy now uses a decaying ϵ instead of a fixed ϵ. Here is the Q-Learning agent:
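In outline, the agent combines ϵ-greedy action selection with the greedy TD update. The following minimal sketch (assuming a tabular Gridworld with integer states and actions; class and method names are illustrative, not the exact ones from the gist) captures the idea:

```python
import numpy as np

class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.6, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.9, epsilon_min=0.1):
        self.Q = np.zeros((n_states, n_actions))  # tabular action-value estimates
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.epsilon = epsilon
        self.epsilon_decay, self.epsilon_min = epsilon_decay, epsilon_min

    def act(self, state):
        # epsilon-greedy action selection
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[state]))

    def update(self, state, action, reward, next_state, done):
        # greedy (max) TD target -> off-policy update
        target = reward + (0.0 if done else self.gamma * np.max(self.Q[next_state]))
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])

    def end_episode(self):
        # decay epsilon after every episode, but keep a minimum amount of exploration
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
```

In each step, act picks the action, update applies the Q-Learning rule, and end_episode decays ϵ once the episode finishes.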

Before showing some results for the Q-Learning algorithm, let us look at another algorithm. Then, I will execute all 3 algorithms and compare them.

Expected Sarsa

Expected Sarsa is just like Q-Learning except that, instead of taking the maximum over the next state-action pairs, it uses the expected value, taking into account how likely each action is under the current policy.

Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α [Rₜ₊₁ + γ Σₐ π(a|Sₜ₊₁) Q(Sₜ₊₁, a) − Q(Sₜ, Aₜ)]

Expected Sarsa update equation.

Given the next state Sₜ₊₁, this algorithm moves deterministically in the same direction as Sarsa moves in expectation, and hence it is called Expected Sarsa. It is computationally more complex than Sarsa, but in return it eliminates the variance due to the random selection of Aₜ₊₁.
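For an ϵ-greedy policy, the expectation over actions can be computed in closed form. A minimal sketch of such a TD target (a hypothetical helper, assuming a tabular Q array) looks like this:

```python
import numpy as np

def expected_sarsa_target(Q, reward, next_state, gamma, epsilon, done):
    """Expected Sarsa TD target under an epsilon-greedy policy (illustrative helper)."""
    if done:
        return reward
    q_next = Q[next_state]
    n_actions = len(q_next)
    # probability of each action under the current epsilon-greedy policy
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_next)] += 1.0 - epsilon
    # expectation of Q over the policy, instead of the max (Q-Learning) or a sampled action (Sarsa)
    return reward + gamma * np.dot(probs, q_next)
```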

Comparison of Sarsa, Q-Learning and Expected Sarsa

I made a small change to the Sarsa implementation so that it also uses the decaying ϵ-greedy policy, implemented all 3 algorithms, and compared them using their training progress and average scores. Here are the parameters used in all 3 algorithms:

  • num_episodes: 100
  • random_seed: 10 (this makes the random actions consistent while testing the various parameters)
  • gamma (discount factor): 0.99
  • alpha (update size): 0.6
  • Initial epsilon: 1
  • epsilon decay factor: 0.9 (after every episode, the epsilon is reduced to 0.9 times the previous epsilon)
  • epsilon minimum: 0.1 (epsilon is never reduced below 0.1, to ensure some minimum exploration even in the later episodes)

Here is the Python script where all 3 algorithms are executed on the Gridworld environment and compared:
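In outline, the script trains each agent for the same number of episodes on the same seeded environment and records the per-episode score. A simplified sketch (the environment interface with reset()/step() and the attribute names are assumptions, not the exact script) looks roughly like this:

```python
import numpy as np

def run(agent_cls, env, num_episodes=100, seed=10):
    np.random.seed(seed)  # same seed so the random actions are comparable across algorithms
    agent = agent_cls(env.n_states, env.n_actions)
    scores = []
    for _ in range(num_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            agent.update(state, action, reward, next_state, done)
            state, total = next_state, total + reward
        agent.end_episode()  # decay epsilon once per episode
        scores.append(total)
    return scores
```

The same run function can then be called with the Sarsa, Q-Learning and Expected Sarsa agent classes, and the returned score lists plotted against the episode number.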

The Results

Note that the extra characters after the epsilon values are an artifact of printing the training progress on the same line: each new line of output overwrites the previous one, and in later episodes the score and average are printed with fewer characters, so a few characters from the earlier, longer lines remain visible.

I ran the tests with 10 different values of seeds and over all the tests, I observed the following:

  • Expected Sarsa gave the highest average score in 6 out of 10 runs, and in the other 4 it came second in terms of average score.
  • Q-Learning was in second position in terms of average scores and was consistently the fastest to reach the optimal path.
  • Sarsa was neither the fastest nor the best of the three in my tests. In 2 out of 10 runs, it even failed to find the optimal path.

This graph tells a really interesting story. It shows that even though Expected Sarsa was off to a bad start (the sharp dip of the green line at the beginning), it was the quickest to stabilize. Between approximately the 10ᵗʰ and 20ᵗʰ episodes, Expected Sarsa started giving stable results while the other two were still experiencing dips. Even later, there are 2–3 dips in either the Q-Learning or Sarsa lines.

This stability might be due to the fact that, while calculating the TD error, Expected Sarsa weights the Q-value of each action in the next state by the probability of that action under the current policy.

It might seem that Expected Sarsa is the best of the 3 algorithms presented above, but remember that there were cases where Q-Learning or Sarsa took the lead. The outcome also depends heavily on the choice of parameters.

That’s it for this one. See you in the next one. Keep rocking!
