In the previous article, we saw how to solve the FrozenLake environment from the OpenAI Gym toolkit using Q-learning. In this article, we'll solve it using the SARSA algorithm.
SARSA is an on-policy algorithm: in the current state S, an action A is taken, the agent gets a reward R and ends up in the next state S1, where it takes the next action A1. The tuple (S, A, R, S1, A1) is what gives SARSA its name (State, Action, Reward, State1, Action1).
It is called an on-policy algorithm because it updates its Q-values using the action actually taken by the current policy, rather than the action a greedy policy would have taken.
SARSA vs Q-learning
The difference between the two algorithms is that SARSA chooses its next action, and updates its Q-values, following the same policy it is currently acting under, whereas Q-learning updates its Q-values using the greedy action, that is, the action that gives the maximum Q-value in the next state, regardless of which action the policy actually takes next.
The algorithm for SARSA is a little bit different from Q-learning:
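In outline, the standard SARSA algorithm proceeds as follows. Initialize the Q-table arbitrarily. Then, for each episode: initialize the state S and choose an action A from S using a policy derived from Q (for example, epsilon-greedy); for each step of the episode, take action A, observe the reward R and the next state S1, choose the next action A1 from S1 using the same policy, update Q(S, A) using the SARSA update rule (shown below), and set S = S1 and A = A1, repeating until S is terminal.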
Basically, in SARSA the Q-value is updated taking into account the action A1 actually performed in state S1, as opposed to Q-learning, where the action with the highest Q-value in the next state S1 is used to update the Q-table.
Now, let’s see the code for SARSA to solve the FrozenLake environment:
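The full listing is not reproduced here, but a minimal sketch of the code looks something like the following. It assumes the classic Gym API (env.reset() returning the state and env.step() returning four values) and the FrozenLake-v0 environment id; the hyperparameter values are only illustrative. The line numbers quoted in the walkthrough below refer to the original listing, so the corresponding steps are marked with comments instead.

import numpy as np
import gym

env = gym.make('FrozenLake-v0')

# Hyperparameters (illustrative values)
epsilon = 0.1      # exploration rate for the epsilon-greedy policy
alpha = 0.85       # learning rate
gamma = 0.95       # discount factor
total_episodes = 10000
max_steps = 100

# Q-table with one row per state and one column per action
Q = np.zeros((env.observation_space.n, env.action_space.n))

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit
    if np.random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    return np.argmax(Q[state, :])

def learn(state, state2, reward, action, action2):
    # SARSA update: the target uses the action actually chosen in state2
    predict = Q[state, action]
    target = reward + gamma * Q[state2, action2]
    Q[state, action] += alpha * (target - predict)

for episode in range(total_episodes):
    state = env.reset()
    action = choose_action(state)   # action for the initial state (lines 38-39)

    for _ in range(max_steps):
        # take the action, observe reward and next state (line 44)
        state2, reward, done, info = env.step(action)

        # choose an action for the next state with the same policy (line 46)
        action2 = choose_action(state2)

        # update the Q-table with the full (S, A, R, S1, A1) tuple (line 48)
        learn(state, state2, reward, action, action2)

        state, action = state2, action2
        if done:
            break

print(Q)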
If you have gone through my previous article on solving the FrozenLake environment using Q-learning, you’ll see that this code is similar to it.
Now, let’s dissect it:
On lines 38 and 39, an action is chosen for the initial state.
In the SARSA tuple, we now have:
(State, Action)
Then, this action is taken in the environment and the reward and next state are observed on line 44.
Now the tuple has:
(State, Action, Reward, State1)
On line 46, an action is chosen for the next state using the choose_action(…) function.
The choose_action(…) function picks an action using the epsilon-greedy approach: with probability epsilon a random action is taken (exploration), and otherwise the action with the highest Q-value for that state is taken (exploitation).
Now, the tuple is complete:
(State, Action, Reward, State1, Action1)
On line 48, the learn(…) function updates the Q-table using the following equation:
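Q(S, A) ← Q(S, A) + α[R + γ · Q(S1, A1) − Q(S, A)]

where α is the learning rate and γ is the discount factor; this is the standard SARSA update rule.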
As opposed to the Q-learning update equation, where the maximum of Q(S1, a) over all actions is taken, the SARSA update uses Q(S1, A1), where S1 and A1 are the next state and the action chosen in that state, respectively.
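For comparison, the Q-learning update uses the greedy target instead:

Q(S, A) ← Q(S, A) + α[R + γ · max_a Q(S1, a) − Q(S, A)]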
And all the other code is similar to the Q-learning code we saw in the previous article.
Try tweaking the different parameters, such as the learning rate (alpha), the discount factor (gamma), and epsilon, to get better results.