Research: Learning from Human Preferences (OpenAI & DeepMind)

Using human inputs to infer a goal for hard-to-describe tasks

Jacob Younan
AI From Scratch
4 min read · Jun 14, 2017

--

It's been an interesting few days in the machine learning world, as summarized by Quartz's Dave Gershgorn.

While I'm working my way through those highlights, especially Vicarious' Schema Networks blog post, I actually want to focus on content released yesterday from DeepMind and OpenAI (together!).

The two co-authored a research paper that addresses a common issue in AI safety known as ‘perverse instantiation’. The term comes up frequently in Nick Bostrom's Superintelligence and boils down to unexpected, harmful consequences arising from poorly defined goals. OpenAI's post summarizes why the issue needs to be addressed:

“One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior.”

From what I've read, one of the most common solution paths is to build ‘uncertainty’ into our AI models: rather than assuming the system fully understands our goals, let humans course-correct its future actions. I first heard this idea from Stuart Russell, and I now associate it with his work. Here he is in a TED Talk on how uncertainty will make AI more compatible with human safety:

What’s Interesting About the DeepMind and OpenAI Paper?

Credit: OpenAI and DeepMind, the ‘Hopper’ robot. Frame rate lowered for file size.
  • Teamwork: The two leading non-academic AI research teams are working together on safety. I hope organizations like the Partnership on AI facilitate more of these collaborations.
  • Train the trainer: Teaching the ‘Hopper’ robot (a yellow mutant straw?) to backflip using direct human feedback alone would have been a sad and painful supervision task. Instead, the team used only about 900 bits of human feedback (about an hour of ‘this or that’ decisions) to train a reward predictor, which then supplied the reward signal over roughly 70 hours of simulated training in place of continuous human feedback. A minimal sketch of how such a predictor can be trained from comparisons follows this list.
Credit: OpenAI post
  • Simulation: Training in simulated environments is essential to speeding up learning, much like AlphaGo playing virtual games to improve. It also cuts down on human effort and resources (e.g. OpenAI's Robots That Learn). I wonder what could bring those 900 bits of feedback down significantly? For backflipping, perhaps some kind of labeled demonstration data (real or CG) would be a good head start, but I think the point here is to solve for goals that are hard to define or replicate visually.
  • Comparison vs. Scoring Feedback: The reward predictor (the purple block in the diagram above) had a much easier time predicting a human's choice in a comparison setting than in a scoring one. When scoring, a person (or several people) will be less consistent in their feedback on the same action. That said, I wonder if a mix of scores and comparisons could provide a quicker path here? If I'm starting with random actions, it may be more helpful for someone to tell me whether I'm way off (1/10) or quite close (9/10) in the early stages. This feels like the equivalent of training a neural net with a higher learning rate at first, then lowering the rate as you close in on the precise solution. Note that I'm probably mixing metaphors here... A rough sketch of this hybrid idea also follows the list.
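To make the ‘train the trainer’ point concrete, here is a minimal sketch of a reward predictor trained from pairwise human comparisons. This is my own illustration (in PyTorch), not the authors' code: the network size, the 16-dimensional observation-action input, and names like segment_a are all assumptions. The core idea follows the paper's setup: fit predicted clip returns so that the clip the human preferred gets the higher total, via a softmax and cross-entropy over the pair.

```python
# A minimal sketch, not the authors' code: train a reward predictor from
# pairwise human comparisons of two behavior clips. Dimensions, network
# size, and variable names are illustrative assumptions.
import torch
import torch.nn as nn

obs_act_dim = 16  # assumed size of one concatenated observation+action vector

# The reward predictor maps a single (observation, action) step to a scalar reward.
reward_model = nn.Sequential(
    nn.Linear(obs_act_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def segment_return(segment):
    # segment: [T, obs_act_dim] tensor for one short clip; sum predicted rewards over it.
    return reward_model(segment).sum()

def preference_loss(segment_a, segment_b, human_prefers_a):
    # Model the probability that the human prefers clip A as a softmax over the
    # two predicted clip returns, then apply cross-entropy against the human label.
    logits = torch.stack([segment_return(segment_a), segment_return(segment_b)])
    log_probs = torch.log_softmax(logits, dim=0)
    return -log_probs[0] if human_prefers_a else -log_probs[1]

# One update from a single labelled comparison (the paper used roughly 900 of these).
seg_a = torch.randn(25, obs_act_dim)  # stand-in for one clip's steps
seg_b = torch.randn(25, obs_act_dim)
loss = preference_loss(seg_a, seg_b, human_prefers_a=True)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the full system this learned reward is handed to an off-the-shelf RL algorithm in place of an environment reward, and, as I understand the paper, the predictor is trained on batches of labelled clip pairs with extra refinements (such as an ensemble of predictors); the sketch only shows the central comparison loss and a single update.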

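And to make my scores-plus-comparisons speculation slightly more concrete, here is a purely hypothetical extension, not anything from the paper, reusing segment_return and preference_loss from the sketch above. Absolute 1-10 scores act as an extra regression term whose weight decays away early in training, loosely analogous to annealing a learning rate; the cutoff and scaling are made-up choices.

```python
# Purely hypothetical: blend coarse absolute scores with comparisons, weighting
# the score term heavily at first and annealing it to zero after an (arbitrary)
# 25% of training. Reuses segment_return and preference_loss from the sketch above.
def mixed_feedback_loss(segment_a, segment_b, human_prefers_a,
                        human_score_a, step, total_steps):
    score_weight = max(0.0, 1.0 - step / (0.25 * total_steps))
    pairwise = preference_loss(segment_a, segment_b, human_prefers_a)
    # Nudge clip A's predicted return toward the human's 1-10 score.
    # (In practice the return and the 1-10 scale would need normalizing.)
    score_term = (segment_return(segment_a) - float(human_score_a)) ** 2
    return pairwise + score_weight * score_term
```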
The last two paragraphs of the research paper are the best parting thoughts, capturing both the progress made here and the room for improvement:

“Although there is a large literature on preference elicitation and reinforcement learning from unknown reward functions, we provide the first evidence that these techniques can be economically scaled up to state-of-the-art reinforcement learning systems. This represents a step towards practical applications of deep RL to complex real-world tasks.

Future work may be able to improve the efficiency of learning from human preferences, and expand the range of tasks to which it can be applied. In the long run it would be desirable to make learning a task from human preferences no more difficult than learning it from a programmatic reward signal, ensuring that powerful RL systems can be applied in the service of complex human values rather than low-complexity goals.”
