N-step TD Method
The unification of SARSA and Monte Carlo Simulation
In previous posts, we have together explored some general reinforcement learning methods, including SARSA, an on-policy method that updates the Q value based on the trajectory the agent actually takes, and the Monte Carlo method, which is generally used to evaluate a policy. In this post, we will
- recall SARSA and Monte Carlo method
- explain why these two methods can be unified (in fact, they are the same method with different parameters)
- implement the random walk example and compare the effectiveness of different n-step methods
SARSA & Monte Carlo Simulation
Remember that in 1-step SARSA, we update the current Q value estimate with:
Q(S, A) += lr * (R(S, A) + γ * Q(S', A') - Q(S, A))
which shows that the estimate of the current state is affected only by the estimate of the next state, and this is why it is also called a 1-step update method. Beyond 1-step SARSA there are 2-step, 3-step SARSA and so on, where the current update can also depend on states that are 2, 3, … steps away. And when the number of steps approaches infinity, this is essentially Monte Carlo simulation.
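To make the connection concrete, here is a minimal sketch (function and variable names are illustrative, not from the post's code) of the n-step return that both extremes are special cases of:

```python
# Sketch: the n-step return G interpolates between 1-step TD (n = 1)
# and Monte Carlo (n >= episode length). Names here are illustrative.

def n_step_return(rewards, bootstrap_value, gamma, n):
    """Sum the first n discounted rewards, then bootstrap from the
    value estimate n steps ahead (skipped if the episode ended first)."""
    G = 0.0
    for i, r in enumerate(rewards[:n]):
        G += (gamma ** i) * r
    if len(rewards) >= n:              # episode did not end within n steps
        G += (gamma ** n) * bootstrap_value
    return G

# n = 1 reduces to the 1-step SARSA target: R + gamma * Q(S', A')
print(n_step_return([2.0], 5.0, 0.9, 1))   # 2.0 + 0.9 * 5.0 = 6.5
# a large n on a finished episode is the pure Monte Carlo return (≈ 0.19)
print(n_step_return([1.0, 0.0, -1.0], 0.0, 0.9, 10))
```

With `n=1` the function only ever sees the next reward plus a bootstrapped value; with `n` larger than the episode length it sums the whole observed reward sequence, exactly the Monte Carlo return.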
Monte Carlo methods perform an update for each state based on the entire sequence of observed rewards from that state until the end of the episode. The update of one-step TD methods, on the other hand, is based on just the one next reward, bootstrapping from the value of the state one step later as a proxy for the remaining rewards.
For example, in a game-playing setting like blackjack, we applied Monte Carlo simulation by playing each game to the end and observing the result, which is either a win, a draw or a loss. It effectively uses the final state to update the current state, and that is why it can be seen as an n-step TD method with n approaching infinity.
The n-step TD Method
Let’s directly take a look at the algorithm and implement it with a concrete example.
It might be a little tricky to understand the algorithm, so let me explain with actual numbers. The lowercase t is the timestep the agent is currently at, so it runs 0, 1, 2, … until the end of the episode. τ is the timestep whose Q value is being updated. Say n=3 (the 3-step TD method) and the current t=5; then τ = t - n + 1 = 5 - 3 + 1 = 3, which means that when the agent reaches timestep 5, the Q value of timestep 3 is updated. The update simply accumulates the rewards along the way (each multiplied by the decaying factor γ) and uses the temporal difference G − Q(S_τ, A_τ) to update the current estimate (this part is exactly the same as what we discussed in SARSA). The τ + n < T condition handles the end of the episode. Say the terminal timestep is T=10, which means the agent takes 10 steps to reach the end of the game; then at the final timestep t=10, the state at τ = 10 - 3 + 1 = 8 is updated. But what about the state at τ=9? The algorithm says that those remaining states are updated using only the rewards collected along the way, without bootstrapping.
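The τ bookkeeping above can be sketched as follows (a minimal, illustrative loop, not the post's exact code):

```python
# Sketch of the tau bookkeeping described above (illustrative names).
# rewards[i] is the reward observed at timestep i of the episode.

def n_step_updates(rewards, gamma, n, T):
    """Return (tau, G) pairs: tau is the timestep whose Q value is
    updated when the agent is at timestep t, and G is the discounted
    reward collected from tau up to min(tau + n, T)."""
    updates = []
    for t in range(T + n):              # updates continue after t reaches T
        tau = t - n + 1
        if tau < 0:                     # not enough steps collected yet
            continue
        if tau >= T:                    # tau == T - 1 was the last update
            break
        G = sum(gamma ** (i - tau) * rewards[i]
                for i in range(tau, min(tau + n, T)))
        # when tau + n < T, the real algorithm additionally bootstraps
        # with gamma**n * Q(S_{tau+n}, A_{tau+n}) before updating Q
        updates.append((tau, G))
    return updates

# 10-step episode with a reward only at the very end: the last states
# (tau = 8, 9) are updated purely from that final reward, as described
for tau, G in n_step_updates([0] * 9 + [1], 1.0, 3, 10):
    print(tau, G)
```

Note how t keeps running past T so that every τ up to T − 1 still receives its update, matching the τ = 9 case discussed above.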
You might wonder why the updates happen in forward order, while the examples we discussed before always update Q values at the end of the game and propagate the reward backwards. In fact, the ordered (forward) update is the more general form, because it also covers situations that are not games and have no terminal state, say, stock market investment, where we care about long-term total reward and there is no obvious end of episode. Both forward and reversed updates converge when the agent takes enough trials, but the reversed update is generally more effective when the reward collected at the end of the episode is mostly non-zero, so that the updates along the way are not zero.
Random Walk Implementation
Let’s implement an example and see how different numbers of steps in the TD method affect the agent’s learning. (full code)
In this game setting, all episodes start in the centre state, C, then proceed either left or right by one state on each step, with equal probability. Episodes terminate on either the extreme left or the extreme right. When an episode terminates on the right, the agent receives a reward of +1; when it terminates on the left, the agent receives -1; all other rewards are 0.
We initialise a game with 19 states (the terminal states at the two ends are not included!). In the __init__ function, we, as usual, initialise the actions and Q values, and the Q values of the terminal states are explicitly set to 1 and -1 in order to help the agent converge faster. self.n is the number of steps.
The agent chooses actions left and right with equal probability, and taking an action either adds 1 to or subtracts 1 from the state index.
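As a minimal sketch of this environment (class and attribute names are illustrative, not the post's exact code):

```python
import random

# Minimal sketch of the 19-state random walk described above.
# States 1..19 are non-terminal; 0 and 20 are the terminal ends.

class RandomWalk:
    def __init__(self, n_states=19):
        self.n_states = n_states
        self.start = (n_states + 1) // 2    # the centre state C

    def step(self, state):
        """Move left or right with equal probability; return
        (next_state, reward, done)."""
        state += random.choice([-1, 1])
        if state == 0:
            return state, -1, True          # terminated on the left
        if state == self.n_states + 1:
            return state, +1, True          # terminated on the right
        return state, 0, False              # all other rewards are 0

# run one episode from the centre state
env = RandomWalk()
state, done = env.start, False
while not done:
    state, reward, done = env.step(state)
print(reward)   # -1 or +1 depending on which end was reached
```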
The n-step SARSA update is exactly the same as in the algorithm shown above. Before each episode, we initialise rewards to keep track of the agent’s behaviour along the way in order to do the n-step updates. The updates stop only when tau == T - 1, as the terminal state T itself is not updated.
Comparison of different n
By running the algorithm with different learning rates and numbers of steps (n), we can compare the performance of our agent.
The actual_state_value for an n-state random walk increases evenly from -1 to 1, and because the agent chooses actions with equal probability, the estimate_state_value equals the expectation of the Q values in that state.
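These two quantities can be sketched like this (plain-Python, illustrative names; the post's own code may differ):

```python
# Sketch: ground-truth values for the 19-state walk, the estimated
# state value under the equiprobable policy, and the RMS error
# typically used to compare different n. Names are illustrative.

n_states = 19

# true values are evenly spaced between -1 and +1, terminals excluded:
# state i (0-indexed) has value (i + 1) / 10 - 1, so C has value 0
true_values = [-1 + 2 * (i + 1) / (n_states + 1) for i in range(n_states)]

def estimated_state_value(q_left, q_right):
    """Each action is taken with probability 0.5 under the random
    policy, so the state value is the mean of the two action values."""
    return 0.5 * (q_left + q_right)

def rms_error(estimates):
    """Root-mean-square error of the estimates against the true values."""
    return (sum((e - v) ** 2 for e, v in zip(estimates, true_values))
            / n_states) ** 0.5

print(true_values[0], true_values[9], true_values[-1])   # -0.9, 0.0, 0.9
```

Plotting this RMS error over episodes for each (learning rate, n) pair is what produces the comparison curves.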
From the result, we can see that the best number of steps n generally lies between 1 and infinity. This illustrates how the generalisation of TD and Monte Carlo methods into n-step methods can potentially perform better than either of the two extreme methods.
And lastly, please check out the full code here. You are welcome to contribute, and if you have any questions or suggestions, please leave a comment below!