Jul 25, 2017 · 1 min read
I am playing with my own DQN. I have a problem of delayed rewards, i.e., the agent must see a sequence of states and take several actions (≥ 1) before receiving a reward.
So I build sequences of state–action tuples that led to the reward. However, my training examples do not cover the whole distribution, because the agent does not learn from sequences in which the reward is far ahead. What would you do in this situation?
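To make the setup concrete, here is a minimal sketch of one common way to propagate a delayed reward back over a stored trajectory: compute discounted returns for each step, working from the end of the sequence backwards. This is an illustrative example, not my actual code; the function name and the choice of gamma are assumptions.

```python
# Illustrative sketch: given the per-step rewards of one trajectory
# (zero everywhere except at the end), compute the discounted return
# G_t = r_t + gamma * G_{t+1} for every step, back to front, so that
# earlier state-action pairs also receive a (discounted) training signal.
GAMMA = 0.99  # assumed discount factor

def discounted_returns(rewards, gamma=GAMMA):
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# A trajectory where only the final action yields a reward:
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards))
```

With this, even the first state–action pair of the trajectory gets a nonzero target, which is one way to get a learning signal when the reward is far ahead.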
Thanks for a great presentation!
Alexey