[Paper Notes 2] Self-Imitation Learning

Flood Sung
Published in IntelligentUnit
3 min read · Jul 10, 2018

Paper Link: https://arxiv.org/abs/1806.05635 (ICML 2018)

Author’s Website: https://sites.google.com/a/umich.edu/junhyuk-oh/

Code: https://github.com/junhyukoh/self-imitation-learning

1 The Idea

Balancing exploitation and exploration is one of the fundamental challenges in Reinforcement Learning. Some papers design an auxiliary reward for better exploration (curiosity-driven approaches), while others add an entropy regularization term to the loss to encourage exploration. This paper proposes a very simple but effective method to deal with this problem.

The idea of this paper is quite intuitive: directly use past good experiences to train the current policy. When I first read the paper, I thought the idea was straightforward and easy to understand. It is similar to a prioritized replay buffer, except that the good replay samples are used to train the policy directly (imitation learning). However, when I saw how effective this idea is, I thought: “Wow! What a crazy idea! How can it also improve exploration?”

The hypothesis of this paper sounds somewhat hard to believe at first: exploiting past good experiences can indirectly drive deep exploration. However, the paper provides a theoretical analysis showing that the self-imitation learning objective is equivalent to a lower bound of the soft Q-learning objective. Soft Q-learning is an entropy-regularized reinforcement learning algorithm that encourages exploration (diverse action choices at every state). Self-imitation learning takes a completely different approach (no entropy loss term) but achieves the same effect.
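For reference, soft Q-learning maximizes the standard entropy-regularized return (this is the textbook form of the objective, not a formula copied from the paper; α is the temperature weighting the entropy bonus):

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\big(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]$$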

So how can we understand this algorithm intuitively?

I think the reason exploitation can improve exploration is simply that exploitation lets the agent reach the later stages of the game more easily (instead of keeping on exploring early stages that the policy has already solved), and then exploration continues from there. Therefore, as the paper mentions in the conclusion section, self-imitation learning can be combined with other exploration methods, which should achieve even better results.

2 The Algorithm

The self-imitation learning algorithm is straightforward: add a self-imitation learning component to an existing reinforcement learning algorithm such as A2C.

  1. Add a prioritized replay buffer. The difference from a standard prioritized replay buffer is that self-imitation learning computes and stores the final discounted returns of each episode and uses these returns as the priorities for sampling (a rough sketch follows after this list).
  2. Add the self-imitation learning objective (explained below).
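Here is a minimal Python sketch of what such a return-prioritized buffer might look like. The class and method names are mine, not the official repo’s, and details such as the priority floor are simplified:

```python
import random
from collections import namedtuple

# Hypothetical structure; the official repo organizes this differently.
Transition = namedtuple("Transition", ["state", "action", "ret"])

class ReturnPrioritizedBuffer:
    """Stores (s, a, R) tuples and samples them with probability
    proportional to the stored return R."""

    def __init__(self, capacity=100000):
        self.capacity = capacity
        self.buffer = []

    def add_episode(self, states, actions, rewards, gamma=0.99):
        # Discounted returns are computed backwards once the episode ends,
        # so every stored transition carries its final return.
        g = 0.0
        returns = []
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for s, a, ret in zip(states, actions, returns):
            self.buffer.append(Transition(s, a, ret))
            if len(self.buffer) > self.capacity:
                self.buffer.pop(0)

    def sample(self, batch_size):
        # Returns act as priorities; a small floor keeps the weights positive.
        weights = [max(t.ret, 1e-6) for t in self.buffer]
        return random.choices(self.buffer, weights=weights, k=batch_size)
```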

The principle of self-imitation learning is to imitate the agent’s own past good experiences. A very simple way to achieve this is to learn only from experiences whose actual returns R are larger than the current value estimates V. In the implementation, only such experiences produce gradient updates to the network parameters; for all other experiences the loss is clipped to 0 and no parameters are changed.
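Concretely, the clipped objective can be sketched in PyTorch-style code roughly as follows. The function name and the value-loss coefficient are illustrative, not taken from the official implementation:

```python
import torch

def sil_loss(log_probs, values, returns, value_coef=0.01):
    """Self-imitation loss on a batch sampled from the replay buffer.

    log_probs: log pi(a|s) of the stored actions, shape (batch,)
    values:    critic estimates V(s),             shape (batch,)
    returns:   stored discounted returns R,       shape (batch,)
    value_coef is an illustrative weight for the value term.
    """
    # Only samples with R > V(s) contribute: (R - V)_+ is zero otherwise,
    # so bad or average experiences produce no gradient at all.
    advantage = torch.clamp(returns - values.detach(), min=0.0)
    policy_loss = -(log_probs * advantage).mean()

    # The value head is pushed up toward R, again only where R > V(s).
    value_loss = 0.5 * torch.clamp(returns - values, min=0.0).pow(2).mean()

    return policy_loss + value_coef * value_loss
```

In an actor-critic agent such as A2C, this loss would be computed on batches sampled from the return-prioritized buffer and added alongside the usual on-policy loss.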

3 Summary

This paper proposes a very innovative and simple method for better exploration in reinforcement learning and achieves promising results on very hard exploration tasks like Montezuma’s Revenge. It inspires us to think about exploration from a totally different perspective. It also suggests reusing good experiences in other reinforcement learning problems, such as hierarchical reinforcement learning and goal-conditioned reinforcement learning; maybe we can give it a try.
