Michael Covello
Aug 25, 2017

Hey Arthur,

Thanks for the awesome post! You explained the theory really well! But I do have to agree with Sung here… you glossed over the actual algorithm, and in particular 1), the most complex and crucial part of it. Even your explanation in your response to Sung felt sort of… hand-wavy.

Would you mind providing a more detailed explanation of why 1) is similar to a gradient descent approach? A gradient descent algorithm decreases its step size until abs(slope) falls below a pre-defined precision threshold, but I’m not seeing any threshold condition on Q[s,a] in 1)?

EDIT:

Just had that “aha” moment.

The best explanation I can give is if you re-write 1) as:

(1-lr)*Q[s,a] + lr*(r + y*max(Q[s1,:]))

You can think of it as “weighting” the new target against the current estimate, 80 vs. 20 (since lr = 0.8).

So with each subsequent step, you’ll be favoring the new target (r + y*max(Q[s1,:])) over the old Q[s,a], as the sketch below shows.
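To make the equivalence concrete, here’s a minimal numpy sketch. The variable names follow the post; the table shape, states, and reward values are made up purely for illustration:

import numpy as np

# Toy setup (hypothetical sizes/values): 4 states, 2 actions.
Q = np.zeros((4, 2))
s, a, s1 = 0, 1, 2
r, lr, y = 1.0, 0.8, 0.95

# Update as written in the post:
#   Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
target = r + y * np.max(Q[s1, :])
original = Q[s, a] + lr * (target - Q[s, a])

# Rewritten form: a weighted average of old estimate and new target.
rewritten = (1 - lr) * Q[s, a] + lr * target

assert np.isclose(original, rewritten)  # algebraically identical

This also answers my own gradient descent question above: with delta = (r + y*max(Q[s1,:])) - Q[s,a], the update Q[s,a] += lr*delta is exactly one gradient step on the squared error 0.5*delta^2 with respect to Q[s,a] (treating the target as a constant). So there’s no explicit stopping threshold in 1); the update is simply iterated, and delta shrinks as Q converges.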
