There is no reason to expect that optimizing for entropy is any more likely to lead to catastrophic forgetting than optimizing for reward. One could even argue the opposite: when learning a series of tasks, optimizing only for the reward of each task would be more likely to cause catastrophic forgetting, since the policy would overfit to each sub-task in turn.
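For concreteness, "optimizing for entropy" in this setting usually means adding an entropy bonus to the policy-gradient loss. A minimal sketch, assuming a discrete action space (the coefficient and function names are illustrative, not from the article):

```python
import numpy as np

def pg_loss_with_entropy(log_probs, advantages, probs, entropy_coef=0.01):
    """Policy-gradient loss with an entropy bonus.

    log_probs:  log pi(a_t | s_t) for the actions actually taken, shape (T,)
    advantages: advantage estimates for those actions, shape (T,)
    probs:      full action distributions pi(. | s_t), shape (T, n_actions)
    """
    pg_term = -np.mean(log_probs * advantages)               # standard policy-gradient term
    entropy = -np.sum(probs * np.log(probs + 1e-8), axis=1)  # H(pi(. | s_t)) per step
    # Subtracting the entropy term from the loss means gradient descent
    # *increases* entropy, keeping the policy stochastic.
    return pg_term - entropy_coef * np.mean(entropy)
```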
Thanks for your comment. I may have put too much emphasis on the word “deterministic.” The issue with MR is not just that it is deterministic, but that a single fixed path is optimal, so a demonstration of that path is sufficient to solve the level. In the other games you mention, there is no…
You bring up a good point, one I don’t address in the article. You are right that there is a memory aspect to the game that a simple feed-forward policy network cannot capture. I would love to see more research in this direction as well.
This is an interesting idea. I haven’t attempted it myself, but if you were to choose a set of different time horizons (all within a reasonable range) and vary which one is used for each trajectory, I would imagine it could provide a more robust learning signal. I’d encourage you to experiment with it!
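A rough sketch of what this might look like, assuming n-step returns with the horizon sampled per trajectory (the horizon set and function name are hypothetical, just for illustration):

```python
import numpy as np

def nstep_returns_random_horizon(rewards, values, gamma=0.99,
                                 horizons=(5, 10, 20), rng=None):
    """n-step returns where the horizon n is sampled per trajectory.

    rewards: per-step rewards for one trajectory, length T
    values:  value estimates V(s_t) for each state, length T
    Sampling n from a small set of horizons varies the bias/variance
    trade-off from one trajectory to the next.
    """
    rng = rng or np.random.default_rng()
    n = int(rng.choice(horizons))  # a different horizon for each trajectory
    T = len(rewards)
    returns = np.empty(T)
    for t in range(T):
        end = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if end < T:  # bootstrap when the horizon cuts the episode short
            g += gamma ** (end - t) * values[end]
        returns[t] = g
    return returns
```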
Could you provide a little more detail on the algorithm you used? There are a number of different ways to implement an actor-critic algorithm, and they have different properties. For example, if the baseline in REINFORCE comes from a learned critic, then you are already using an actor-critic algorithm.
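To make that distinction concrete, here is a minimal sketch (the names are mine, purely illustrative):

```python
import numpy as np

def reinforce_with_baseline(log_probs, returns, values):
    """REINFORCE update with a learned state-value baseline.

    log_probs: log pi(a_t | s_t) for the actions taken
    returns:   Monte Carlo returns G_t
    values:    baseline estimates V(s_t) from a learned critic
    """
    advantages = returns - values                   # critic baseline turns returns into advantages
    actor_loss = -np.mean(log_probs * advantages)   # policy (actor) objective
    critic_loss = np.mean((returns - values) ** 2)  # regression loss for the critic
    return actor_loss, critic_loss
```

The moment `values` comes from a learned critic rather than a fixed baseline, this is an actor-critic update in all but name.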
If you are using an optimal policy, and it is deterministic, and the environment is also deterministic, then you will have zero variance and zero bias (unless you are using a function approximator with limited representational capacity).
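To spell out why (my notation, not from the thread): with a deterministic policy and deterministic dynamics, the trajectory from a fixed start state is unique, so the return is a constant:

```latex
G = \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t), \qquad
\mathrm{Var}[G] = 0, \qquad
\mathbb{E}[G] = G = V^{\pi}(s_0)
```

The Monte Carlo estimate then equals the true value exactly, hence zero bias as well.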
You are correct. A3C is typically used only in simulated environments, because running many robots in parallel is infeasible in the real world. It is also not sample efficient enough for real-world use, since it can take thousands of episodes per worker to converge to a meaningful policy.