OPTIMAL or SAFEST? The brief reason why Q-learning and SARSA choose different paths in the cliff walking problem

Suppose you need to travel from point A to point B. Would you choose the optimal but most dangerous path, or would you rather choose the safest but most time-consuming one?

Tisana Wanwarn
Geek Culture
5 min read · Sep 3, 2021


Photo by Joseph Pearson on Unsplash

Explore vs Exploit

In the context of reinforcement learning, exploitation is when the agent chooses the best known action, while exploration is when the agent acts randomly in order to discover whether there are any better ways to reach the objective.

What is a policy?

A policy in reinforcement learning is the rule the agent uses to decide which action to take. The most common policy is epsilon-greedy, where epsilon is the probability of exploration. Under an epsilon-greedy policy the agent explores (acts randomly) with probability epsilon and exploits (picks the action with the highest Q-value) with probability 1 - epsilon.
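As a rough illustration, epsilon-greedy action selection can be sketched like this (a minimal example with my own variable names, not taken from the full script at the end of the article):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, n_actions):
    """Pick an action for `state` from Q-table `Q` using an epsilon-greedy policy."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: random action
    return int(np.argmax(Q[state]))           # exploit: best known action
```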

Why do SARSA and Q-learning think differently?

First, let's talk about the main similarity. Both SARSA and Q-learning take an action, receive an immediate reward, and observe the new state of the environment in order to learn the action-value function, i.e. the Q-values stored in a Q-table. The Q-table has one entry per state-action pair, and each Q-value measures how good an action is in a given state.
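For example, the Q-table for a small grid world is just a 2D array. A minimal sketch, assuming the standard 4 x 12 cliff walking grid (48 states, 4 actions); the exact sizes in the article's script may differ:

```python
import numpy as np

n_states, n_actions = 48, 4            # 4 x 12 grid; actions: up, right, down, left
Q = np.zeros((n_states, n_actions))    # Q[state, action] = how good `action` is in `state`

best_action = int(np.argmax(Q[37]))    # best known action in state 37
```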

The only difference that makes SARSA and Q-learning decide differently is that SARSA uses an on-policy approach while Q-learning uses an off-policy approach. Given that the behavior policy is epsilon-greedy, off-policy means the agent does not learn its action-value function from the actions that policy actually takes (it learns as if it always acted greedily), while on-policy means the agent learns its action-value function from the actions the epsilon-greedy policy actually takes.

Q-learning update rule (image from Google)

In Q-learning, the update of the Q-value for the current state and current action is based on the best action in the next state, denoted by the max operator, which does not take the epsilon-greedy policy into account at all.
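Written out, the standard Q-learning update, with learning rate α and discount factor γ, is:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]

where s' is the next state and the max runs over all actions a', regardless of which action the policy will actually pick next.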

SARSA update rule (image from Google)

The update rule for SARSA, by contrast, is not based on the best action in the next state but on the action actually chosen by the epsilon-greedy policy. This is why SARSA is called on-policy, and it is what makes the two approaches behave differently.
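Written out, SARSA replaces the max with the Q-value of the action a' that the epsilon-greedy policy actually selects in the next state:

Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') - Q(s, a) ]

The one-line difference is easiest to see in code. A minimal sketch of the two update steps (my own helper functions, not the article's exact implementation):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: bootstrap from the greedy (max) action in the next state.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action the epsilon-greedy policy actually took.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```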

The Cliff Walking problem

In the cliff walking problem, the agent needs to travel from the left white dot to the right white dot, where the red dots are the cliff. The agent receives a reward of 10 for reaching the goal and a punishment of -100 if it falls off the cliff. The longer the agent travels in the grid, the more punishment it collects, since every step costs -1.
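For concreteness, that reward scheme can be sketched as a small helper (illustrative only; the function name and cell sets are my own, not the code linked at the end):

```python
def step_reward(next_cell, goal_cells, cliff_cells):
    # Reward scheme described in this article's cliff walking setup.
    if next_cell in cliff_cells:
        return -100   # fell off the cliff: large punishment
    if next_cell in goal_cells:
        return 10     # reached the goal
    return -1         # every extra step costs a little
```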

The Q-learning agent is represented by the green line while the SARSA agent is represented by the blue line. The full Python code can be found at the end of the article.

The cliff walking grid with both agents' paths (my screenshot)

There is no doubt about why Q-learning chooses the optimal path, since its update always learns from the best (greedy) action.

The reason SARSA takes the safest path is that the policy driving the learning of its action-value function is epsilon-greedy, under which the agent moves randomly a fraction epsilon of the time. Because the learning is driven by a policy that often takes random moves, it is simply not safe to walk close to the cliff; SARSA accepts the small punishment of a longer journey in order to avoid the big punishment of falling off the cliff.

This also means that the smaller we make epsilon, the closer the path chosen by SARSA moves to the cliff.

If we set epsilon to 0.8, the agent will perform a random action 80 percent of the time.

SARSA took the safest path while Q-learning took the optimal path (my screenshot)

This is why SARSA, which learns from the policy, tries to stay as far away from the cliff as possible to avoid the huge negative reward: its policy takes a random move 80 percent of the time. Even though the agent collects more negative reward (-1 per step) the longer it travels in the grid, it is still too risky for an agent driven by such an epsilon-greedy policy to stay close to the cliff, because there is a large chance of falling off.

Now, if we reduce the degree of exploration to 0.2, the agent will perform a random action only 20 percent of the time.

SARSA's path moves closer to the optimal path (my screenshot)

As a result, the path that SARSA chooses moves closer to the cliff, and closer to the optimal path, since the policy now takes far fewer random moves than before.

However, if we set epsilon to ZERO, as you might be able to guess…

Optimal path by SARSA (my screenshot)

SARSA takes the same path as Q-learning, because the policy SARSA learns from is now the purely greedy (optimal) policy, not an exploratory epsilon-greedy policy anymore.
