Papers simplified: Reward learning from human preferences and demonstrations in Atari
According to Forbes, reinforcement learning is going to be one of the biggest trends in data science in 2019. With that in mind, I recently came across a paper that experiments with reinforcement learning without a manually specified reward function: instead, humans communicate objectives to the agent directly through expert demonstrations, trajectory preferences, and policy feedback. For this reinforcement learning problem, the authors trained a deep neural network to model the reward function.
Reinforcement learning is basically when you have an agent that learns from the environment by interacting with it: the agent performs actions, and if those actions are favorable it is rewarded, and penalized otherwise. Atari games are excellent for RL problems because they have well-specified reward functions, which makes it very easy to evaluate the performance of the agent.
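To make that loop concrete, here is a minimal sketch of the agent-environment interaction using the classic OpenAI Gym interface. The environment name and the random action choice are purely illustrative; a real agent would pick actions from a learned policy:

```python
import gym

env = gym.make("Pong-v0")  # any Atari env with the Gym API would work
obs = env.reset()
total_reward = 0.0

for _ in range(1000):
    # Placeholder policy: sample a random action.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    total_reward += reward  # favorable actions yield positive reward
    if done:
        obs = env.reset()

env.close()
```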
Expert demonstrations are basically inverse RL: the reward function is inferred from the demonstrations. The expert can be a human doing a demo for the agent, or it can be synthetic (simulated human preferences); the paper later shows that synthetic preference feedback outperforms human preference feedback.
Policy feedback, also known as policy shaping, relies on human input to create reward and value signals, which the agent uses to establish preferences that guide its learning process.
Trajectory preferences are when you show a human pairs of short clips of the agent's behavior (trajectories) and ask which one they prefer. From a substantial amount of this feedback, the agent's policy (its way of behaving at a given time) is trained to produce trajectories that match those preferences, since the preferences are the agent's best guess at what you want it to do for you.
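To see how pairwise preferences can train a reward model, here is a minimal sketch in PyTorch, in the spirit of the Bradley-Terry formulation this line of work builds on. The `reward_model`, names, and shapes are my own illustrative assumptions; it is assumed to map each observation to a scalar reward:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, clip_a, clip_b, label):
    """Cross-entropy loss over one pair of trajectory clips.

    clip_a, clip_b: observation tensors of shape (T, ...).
    label: 0 if the annotator preferred clip_a, 1 for clip_b.
    """
    # Sum the predicted per-step rewards over each clip.
    sum_a = reward_model(clip_a).sum()
    sum_b = reward_model(clip_b).sum()
    # The clip with the higher total predicted reward should be
    # the one the annotator preferred.
    logits = torch.stack([sum_a, sum_b]).unsqueeze(0)  # shape (1, 2)
    return F.cross_entropy(logits, torch.tensor([label]))
```

Minimizing this loss over many annotated pairs nudges the reward model toward assigning higher total reward to the behavior humans prefer.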
This paper attempts to combat two problems that arise when a human provides trajectory preferences so that the agent can learn the reward function:
1. Good state space coverage (the state space is the set of all situations the agent can encounter, and good coverage means exploring a diverse range of them) is difficult to obtain through random exploration alone. Without it, we won't be able to communicate meaningful information to our agent.
2. It's very inefficient and time-consuming.
The goal of their agent is to imitate the behavior demonstrated by the human while maximizing a reward function inferred from demonstrations and preferences, provided either by a human or synthetically. Going back to the two problems above, the two steps they take to solve them are:
- Use the Deep Q-learning from Demonstrations (DQfD) algorithm to initialize the agent's policy with imitation learning. DQfD uses a small amount of demonstration data to speed up the agent's learning, keeping the demonstrations in a prioritized replay buffer so the agent keeps revisiting the transitions it can learn the most from (a sketch of its supervised loss follows after this list).
- Train a reward model that lets us improve the policy learned from imitation learning.
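The first step leans on DQfD's key trick: on demonstration transitions, a supervised large-margin loss is added to the usual temporal-difference loss, pushing the expert's action to have the highest Q-value. Here is a minimal sketch of that margin term in PyTorch; the margin value and tensor shapes are illustrative assumptions, not the paper's exact settings:

```python
import torch

def large_margin_loss(q_values, expert_action, margin=0.8):
    """DQfD-style supervised loss for one demonstration state.

    q_values: tensor of shape (num_actions,) for the state.
    expert_action: index of the action the demonstrator took.
    """
    # Add a positive margin to every action except the expert's,
    # so the expert action must beat the rest by at least `margin`.
    margins = torch.full_like(q_values, margin)
    margins[expert_action] = 0.0
    return (q_values + margins).max() - q_values[expert_action]
```

In the full DQfD objective this term is combined with one-step and n-step TD losses plus L2 regularization; only the margin term is shown here.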
There were four important components involved in training the agent: the expert, the annotator, the reward model, and the policy. The expert provides demonstrations, the annotator gives preference feedback, the reward model estimates a reward function from the annotator's feedback, and the policy is trained on the demonstrations and rewards. For training the policy, they used deep Q-learning from demonstrations. Their reward model was a convolutional neural network (a neural network whose architecture takes advantage of the fact that its input is an image) which took observations as input and output an estimated reward for what the agent observed. The video clips to be annotated were selected at random, and a synthetic oracle provided synthetic feedback by preferring whichever clip scored higher on the true game reward. Hence, they were able to run a large number of simulations and evaluate the model's performance.
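For a sense of what that reward model looks like, here is a rough PyTorch sketch. The layer sizes are illustrative assumptions rather than the paper's exact architecture; the important part is that a stack of 84x84 Atari frames goes in and a single scalar reward estimate comes out:

```python
import torch.nn as nn

class RewardModel(nn.Module):
    """Small convolutional reward model: stacked frames in, scalar out."""

    def __init__(self, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 64), nn.ReLU(),
            nn.Linear(64, 1),  # estimated reward for this observation
        )

    def forward(self, obs):  # obs: (batch, in_channels, 84, 84)
        return self.head(self.conv(obs))
```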
This papers’ goal was to explore what would happen if you tried to train an agent to play Atari games without any sort of reward function, and this is what we’ve all been waiting for. Four setups were tested and they are: imitation learning (only demonstrations), no demos (only preferences), demos + preferences, and demos + preferences + autolabels (initial trajectories used to label the preferred video clips).
Demos + preferences vs. imitation learning
WINNER: Demos + preferences

Demos + preferences vs. no demos
WINNER: Demos + preferences

Demos + preferences vs. demos + preferences + autolabels
WINNER: Demos + preferences + autolabels (the preference labels generated automatically from demonstrations gave this setup the edge)
Two factors may explain why the quality of the reward model can be poor: the reward model failing to fit the data, and the agent failing to maximize the reward it has learned so far.
When preferences and demonstrations are used together, they perform better than either does on its own, making the combination an effective method when there's no reward function. Synthetic preference feedback outperforms human preference feedback. Still, having a human provide online feedback in the training loop can ensure that there is no reward hacking (when the agent figures out a clever way to maximize the reward signal by doing things you wouldn't normally want it to do).
Thank you so much for reading! If you liked this please clap and if you want to make a suggestion please feel free to leave a comment.
Please follow my Instagram account for updates for upcoming articles https://www.instagram.com/datasciencewith_adib/?hl=en