Deployment-time Adaptation of RL agents without any reward

Overview of the paper “Self-Supervised Policy Adaptation during Deployment” by N Hansen et al.

Chintan Trivedi
deepgamingai

--

One big limitation of image-based reinforcement learning agents is that their performance drops sharply when the visual appearance of the input changes between training time and deployment time, even if the core task remains the same. This is not something we humans struggle with nearly as much. For example, if you learn to play one particular game very well, you can easily pick up another game of the same genre even if you are playing it for the first time.

So today, I want to share a paper that tackles this problem and introduces adaptability to RL agents, so that they too can play multiple games with different graphics just like humans do. It is titled “Self-Supervised Policy Adaptation during Deployment”, and it introduces a method to adapt an RL agent to deployment environments that differ from the training environment. This promises to make it much easier to train RL game agents on one specific environment and then port them over to different games of the same genre.

Let’s take a look at how this method adds deployment-time adaptation when there are no reward signals available to fine-tune the RL agent’s learned policy.

It separates the usual controller policy network into two parts: one part is responsible for learning useful latent representations from the pixel inputs, and the other part learns which actions to take based on these latent representations. So, during training, the agent learns from the reward signal in the RL training loop, and there is also an additional network in the mix that learns from the sequence of intermediate latent representations using self-supervised learning.
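To make that split concrete, here is a minimal PyTorch-style sketch. The module names and layer sizes are illustrative choices of mine, not the paper’s exact architecture:

```python
import torch.nn as nn

# Illustrative sketch of the split architecture (not the authors' exact code).
# A shared encoder maps pixel observations to a latent vector; the policy head
# chooses actions from that latent; a separate self-supervised head is trained
# on an auxiliary task that requires no reward signal.

class Encoder(nn.Module):
    """Maps pixel observations to a compact latent representation."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(32, latent_dim)

    def forward(self, obs):  # obs: (N, 3, H, W) pixel input
        return self.fc(self.conv(obs))

class PolicyHead(nn.Module):
    """Maps latents to action logits (or action means, for continuous control)."""
    def __init__(self, latent_dim=64, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_actions))

    def forward(self, z):
        return self.net(z)

class SelfSupervisedHead(nn.Module):
    """Auxiliary head, e.g. a 4-way classifier over image rotations (see below)."""
    def __init__(self, latent_dim=64, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_classes))

    def forward(self, z):
        return self.net(z)
```

The important design choice is that both the policy head and the self-supervised head sit on top of the same shared encoder, so a gradient from either objective reshapes the same latent representation.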

The self-supervised task is to classify the rotation of the input image.

Here, the self-supervised task used is as simple as rotating the input image by a fixed angle such as 90 or 180 degrees and identifying that rotation angle as a supervised classification problem. Now comes the key part: during deployment we do not have the reward signal, but we can still fine-tune our network using the same self-supervision task. This lets the latent representation adjust to the visual changes in the input image, while still letting us obtain the learned controller actions from that latent.
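As a rough sketch of what such a deployment-time step could look like, building on the illustrative modules above (the real method works on batches of observations and uses the hyperparameters reported in the paper):

```python
import torch
import torch.nn.functional as F

# Illustrative deployment-time adaptation step (assumes the Encoder,
# PolicyHead and SelfSupervisedHead sketched earlier). There is no reward:
# each incoming observation is rotated by a random multiple of 90 degrees,
# the auxiliary head predicts that rotation, and the resulting loss updates
# the encoder before the unchanged policy head selects an action.

encoder, policy, ssl_head = Encoder(), PolicyHead(), SelfSupervisedHead()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # encoder only in this sketch

def adapt_and_act(obs):
    """obs: a (1, 3, H, W) observation from the new, visually shifted environment."""
    # Self-supervised update: predict a random rotation of the observation.
    k = torch.randint(0, 4, (1,))                      # 0, 90, 180 or 270 degrees
    rotated = torch.rot90(obs, k=int(k), dims=(2, 3))  # rotate the spatial dims
    logits = ssl_head(encoder(rotated))
    loss = F.cross_entropy(logits, k)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Act using the slightly adapted encoder and the frozen policy head.
    with torch.no_grad():
        action = policy(encoder(obs)).argmax(dim=-1)   # discrete action for illustration
    return action
```

The crucial property is that this update never touches the policy head: the actions still come from the controller learned with rewards during training, only now fed with latents that have adapted to the new visuals.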

The leftmost tile is the training environment; the remaining tiles are unseen test environments for bipedal motion. [source]

The results reported in the paper demonstrate a clear improvement in performance when this adaptation method is used.

Application to Gaming

This line of research will make it much more practical to train game bots with reinforcement learning in real life, since we can train them to play one game and then use the same agent to play other games of the same genre with simple self-supervised adaptation. This brings game AI one step closer to being more human-like in the future.

Thank you for reading. If you liked this article, you may follow more of my work on Medium, GitHub, or subscribe to my YouTube channel.
