Summary: Prioritized Experience Replay

Ideas from this summary are taken from the Prioritized Experience Replay Paper.

Online reinforcement learning agent perform updates while they observe experience. This brings up two issues: strongly correlated updates don’t follow i.i.d. assumptions of popular SGD algorithms and forgetfulness of experience that could be useful later on. Experience replay makes it possible to break free of the temporal correlations between samples found in online learning agent’s updates.

With prioritized replay our agent is able to replay experiences at a different frequency than what they originally were collected. In the case of this paper experiences that had larger temporal difference errors are thought to have more value and will be replayed more frequently. The temporal difference error can be thought of as similar to how surprising our agent found a transition. So if our agent is quite surprised by the outcome of a transition it suggests to us that the agent might have a good deal to learn from it. However, it should be noted that this is not always the case since transitions can be noisy.

Drawbacks of Greedy Prioritization

Transitions that have a low temporal difference error on their first pass have a low probability of being sampled again. Which is effectively never, since we are using using a circular memory buffer and the experience will eventually fall out. Greedy prioritization focuses on a small subset of experiences while not spending much time on the frequent low TD error transitions and makes the agent susceptible to over fitting.

Stochastic Prioritization

Stochastic prioritization is introduced to overcome the drawbacks of greedy prioritization and puts us somewhere in between greedy and uniform sampling. Probability sampling transition_i is given below:

Rank is the location in replay buffer sorted from high to low by TD error and alpha determines how much prioritization is used(alpha=0 is same as uniform).

Annealing Bias

Prioritization introduces bias into our system since it is changing the distribution in an uncontrolled fashion. This is corrected for with weighted importance sampling

Unbiased nature of updates is most important near convergence. Beta is linearly annealed towards 1 throughout training. When beta is equal to one the important sampling weights fully compensate for the non-uniform P(i)

Conclusion

An alternate prioritization metric is presented in the paper called proportional prioritization but is thought to perform worse(in practice actually worked well) because the rank based approach was expected to be more robust towards outliers. Prioritized replay was able to speed up learning by a factor of 2 in the benchmarks tested on in the paper.

the black “uniform” line is from double dqn’s performance