Hey Arthur,
Thanks for the awesome post! You explained the theory really well. But I do have to agree with Sung here… you glossed over the actual algorithm, and 1) is the most particular, complex, and crucial part of it. Even your explanation in your response to Sung felt sort of… hand-wavy.
Would you mind providing a more detailed explanation of why 1) is similar to a gradient descent approach? A gradient descent algorithm keeps stepping along the slope until abs(slope) falls below a pre-defined precision threshold, but I'm not seeing any corresponding threshold condition on Q[s,a] in 1)??
EDIT:
Just had that “aha” moment.
The best explanation I can give is to re-write 1) as:
(1-lr)*Q[s,a] + lr*(r + y*max(Q[s1,:]))
You can think of it as "weighting" the future vs. the current estimate 80 vs. 20 (since lr = 0.8).
So with each subsequent update, you'll be favoring (r + y*max(Q[s1,:])) over the old Q[s,a].
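For anyone else puzzling over this, here's a minimal sketch of that re-written update in plain Python/NumPy. The table shape, lr, y, and the function name q_update are just placeholders I picked for illustration, not necessarily the values from the post:

```python
import numpy as np

# Toy Q-table: 16 states x 4 actions (sizes are illustrative only)
Q = np.zeros((16, 4))
lr = 0.8   # learning rate
y = 0.95   # discount factor (assumed value)

def q_update(Q, s, a, r, s1):
    # The update written as a convex combination:
    # keep (1 - lr) of the old estimate, move lr of the way toward the new target.
    target = r + y * np.max(Q[s1, :])
    Q[s, a] = (1 - lr) * Q[s, a] + lr * target
    return Q
```

So a call like q_update(Q, s=0, a=1, r=0.0, s1=4) just nudges Q[0,1] 80% of the way toward the new target while keeping 20% of the old value.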
