Hello Arthur,
Thanks for your terrific posts on RL! I’m a bit confused by the loss taking the form -log(responsible_weight) * reward_holder. In particular, I don’t think log(responsible_weight) corresponds to log(policy) here, since the weights are not probabilities, so this loss seems to violate the policy gradient theorem. If we instead put a softmax( ) around the weights, and sample actions from that softmax layer rather than using eps-greedy selection, then I would agree that the loss above follows the policy gradient theorem. I’ve tried both your eps-greedy approach and my own suggestion, and both work, but I’d be happy to hear your thoughts on this. Does this have something to do with off-policy learning, and if so, which theorems from the off-policy literature justify the loss you are using?
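For concreteness, here is a small NumPy sketch of the softmax variant I have in mind. It is only my suggested change, not your code: `weights` stands in for the per-arm weights from your post, and the reward is a placeholder for the bandit’s feedback.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.1, 0.5, 0.2, 0.0])  # stand-in for the bandit weights


def softmax(x):
    z = x - x.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


pi = softmax(weights)                # a proper probability distribution
action = rng.choice(len(pi), p=pi)   # sample the action from the policy itself
reward = 1.0                         # placeholder reward signal

# REINFORCE-style loss: -log pi(a) * r, where pi(a) is a true log-probability
loss = -np.log(pi[action]) * reward
```

Because pi(a) < 1, the loss is positive for a positive reward, and its gradient with respect to `weights` matches the policy gradient theorem, which is what I meant above.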
Thank you!