Fast Implementation of Self-Imitation Learning

Wah Loon Keng
4 min read · Jun 22, 2018


Self-Imitation Learning (SIL) is a new reinforcement learning method that improves on any actor-critic architecture (https://arxiv.org/abs/1806.05635). The imitation-learning part is an add-on training component:

  1. a replay memory that collects experience from the original actor-critic on-policy memory. In essence they can be the same storage mechanism, but the replay sampling used for imitation learning is different: it draws random minibatches, whereas the on-policy memory samples entire trajectories and clears its buffer after each on-policy training step.
  2. an off-policy training loop that runs after the actor-critic on-policy training loop. It samples random minibatches from the replay memory above and trains the actor-critic network for a number of epochs and iterations per batch.
  3. an imitation-learning loss built around a clipped advantage: max(returns - predicted_value, 0). The clipping means that only actions which produced a better-than-expected value are used for training, hence "self-imitation": the agent learns from its own good samples. The actor's policy loss is the negative log-probability of the action times that clipped advantage, and the critic's value loss is the mean-squared clipped advantage (sketched in code below).

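To make the clipped-advantage loss in item 3 concrete, here is a minimal PyTorch sketch. It is not the SLM-Lab or paper code; the tensor names and the value-loss coefficient are illustrative assumptions:

```python
import torch

def sil_losses(log_probs, values, returns, val_loss_coef=0.01):
    """Sketch of the SIL policy and value losses.

    log_probs: log pi(a|s) of the sampled actions, shape (batch,)
    values:    critic estimates V(s), shape (batch,)
    returns:   discounted returns R from replayed trajectories, shape (batch,)
    """
    # clipped advantage: keep only samples whose return beat the critic's estimate
    clipped_adv = torch.clamp(returns - values, min=0.0)
    # policy loss: scale the negative log-prob by the clipped advantage
    # (detached so it acts as a weight, not a gradient path into the critic)
    policy_loss = -(log_probs * clipped_adv.detach()).mean()
    # value loss: mean-squared clipped advantage pulls V(s) up toward the good returns
    value_loss = 0.5 * (clipped_adv ** 2).mean()
    return policy_loss + val_loss_coef * value_loss
```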
That’s all. The paper is extremely clear and simple, but the result is powerful. The SIL method can be applied to any actor-critic algorithm; in the paper it is applied to A2C and PPO. SIL directly addresses the problem of exploiting previously discovered good trajectories, by learning from them further with off-policy training. In the paper, SIL solved Montezuma’s Revenge, the canonical example of a hard-exploration problem.

Next, the implementation. I want to show one of the reasons why we built SLM-Lab, a Modular Deep Reinforcement Learning framework in PyTorch.

Implementing deep RL algorithms has been laborious ever since I stumbled upon the field 2 years ago. Most of the code out there rewrites too many things from scratch, and deep RL has a lot of components. The result is often one-off code that does not get reused for new algorithms, or even put to use on different environments.

An implementation should reuse as many components as possible. New work should build on top of previously tested components. One should write code only for the new component of interest. The workflow should be fast, and one should never have to start from scratch.

SIL is an extension of actor-critic, and it should be implemented as such. So, on the same day I read the paper, I went home and got to work. Below is the code, corresponding to the components outlined above:

  1. a replay memory, taking experience from the existing on-policy memory and allowing minibatch sampling
  2. an off-policy training loop
  3. the imitation-learning loss

(A rough sketch of how pieces 1 and 2 fit together follows below.)
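As that illustration, here is a hedged sketch of a replay buffer with random minibatch sampling and the off-policy SIL loop that runs after the usual on-policy update, reusing the sil_losses sketch above. The buffer class, the agent.compute_log_probs_and_values helper, and the hyperparameter names are assumptions for illustration, not the actual SLM-Lab API:

```python
import random
import torch

class SILReplay:
    """Illustrative replay buffer holding (state, action, return) tuples
    copied from the on-policy memory, with random minibatch sampling."""

    def __init__(self, max_size=100000):
        self.buffer = []
        self.max_size = max_size

    def add_trajectory(self, states, actions, returns):
        # copy experiences over before the on-policy memory clears its buffer
        self.buffer.extend(zip(states, actions, returns))
        self.buffer = self.buffer[-self.max_size:]

    def sample(self, batch_size):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        states, actions, returns = map(list, zip(*batch))
        return states, actions, returns


def train_sil(agent, replay, epochs=4, iters_per_epoch=1, batch_size=64):
    """Off-policy SIL loop, run right after the on-policy A2C/PPO update."""
    if not replay.buffer:
        return
    for _ in range(epochs):
        for _ in range(iters_per_epoch):
            states, actions, returns = replay.sample(batch_size)
            # hypothetical helper: recompute log pi(a|s) and V(s) for the minibatch
            log_probs, values = agent.compute_log_probs_and_values(states, actions)
            loss = sil_losses(log_probs, values, torch.tensor(returns))
            agent.optimizer.zero_grad()
            loss.backward()
            agent.optimizer.step()
```

In the actual implementation, as described later in this post, SIL inherits from the existing actor-critic class and reuses the existing replay memory; the sketch above only shows the overall flow.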

That is pretty much the gist (no pun intended). Implementing SIL took me only 4 hours, and it works with:

  1. any OpenAI Gym or Unity ML environment,
  2. plain feedforward, recurrent, and convolutional networks,
  3. shared or separate actor-critic architectures,
  4. multiple A2C variants (using GAE, n-step returns, an entropy term).

You can see all the code changes needed to implement SIL in the GitHub pull request: of the 700+ lines of edits, only ~200 are Python code (the rest are JSON specs to run experiments on the variants above).

Granted, the code above ties into a lot of functions defined elsewhere. But that is precisely the point. We took the time to build the foundation of SLM-Lab, and now we reap the benefits:

  1. maximum reuse of well-tested components: all the interactions with environments, network construction, memory management, the training loop, etc. are handled via the API. We do not touch that code. We also reuse the existing replay memory.
  2. clarity and focus of implementation: the actor-critic code is already solid and trusted, so we inherit from it and focus only on the SIL-specific logic. Shorter code is also clearer.
  3. minimal labor: our mental energy and time are limited, so when something is laborious, we cannot do a lot of it. When it becomes as fast and easy as this, we can. Writing ~200 lines of code (for SIL) is far more appealing than writing a few thousand lines from scratch.

Engineering is a bottleneck in deep RL research, but it does not have to be that way. SLM-Lab also contains many other algorithms, all written naturally as extensions of simpler algorithms. I could list the components here, but it would take up too much space.

So, fellow researchers and enthusiasts, I encourage you to check out the GitHub repo. The links to the documentation page and the deep RL experiment log book are there as well. If you find it neat, and agree with my points above, please spread the love and tell your buddies about it. Friends don’t let friends rewrite duplicate code.
