
This fact has consequences for the failure modes that policy gradient methods can experience. In a supervised learning setting, no matter how many ill-advised optimization steps we take, we won’t impact the distribution of samples that we’re learning over, since that’s outside of our control. By contrast, in a policy gradient setting, if we end up in a particularly poor region of action-space, that can leave us in a position where we have very few usefully informative trajectories to learn from. For example, if we were trying to navigate a maze to find a treat, and had accidentally learned a policy of just staying still, or spinning in circles, we would have a hard time reaching any actions that had positive reward, since we can only learn to increase an action probability by experiencing that action and seeing that it leads to reward.
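To make that failure mode concrete, here is a minimal sketch of a REINFORCE-style gradient estimate for a softmax policy over three maze actions. Everything here (the action set, the reward values, the logits, the sample count) is invented for illustration, not taken from the text above; the point is just that when the rewarded action is almost never sampled, the estimated gradient is close to zero, so nothing pushes that action’s probability up.

```python
# Illustrative sketch only: a softmax policy stuck favoring an unrewarded action.
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([5.0, 0.0, 0.0])          # policy heavily favors action 0 ("stay still")
ACTION_REWARD = np.array([0.0, 0.0, 1.0])   # only action 2 ("step toward the treat") pays off

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)
grad_estimate = np.zeros_like(logits)
n_samples = 1000
for _ in range(n_samples):
    a = rng.choice(len(probs), p=probs)     # sample an action from the current policy
    r = ACTION_REWARD[a]
    # REINFORCE term: reward * grad of log pi(a) w.r.t. the logits,
    # which for a softmax policy is one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    grad_estimate += r * grad_log_pi
grad_estimate /= n_samples

print(grad_estimate)  # nearly zero: the rewarded action is almost never experienced,
                      # so there is almost no gradient increasing its probability
```

The estimate is only nonzero on the rare occasions the rewarded action actually gets sampled, which is exactly the “we can only learn from actions we experience” problem described above.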
Expectations are defined with respect to some probability distribution, p(x), from which you expect x to be sampled, and can conceptually be described as “what you’d get if you were to sample x from p(x), and average the f(x) you got from those samples”. This is salient to bring up in the context of reinforcement learning because it helps illustrate yet ano…
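As a quick sketch of that “sample and average” reading of an expectation, here is a small Monte Carlo example; the particular choices of p(x) as a standard normal and f(x) = x² are assumptions made purely for illustration.

```python
# Illustrative sketch: estimating E_{x ~ p(x)}[f(x)] by sampling.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # draws of x from p(x)
estimate = f(samples).mean()                            # average of f(x) over those samples
print(estimate)  # approaches the true expectation (1.0 here) as the sample count grows
```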