#4: The Case for Real-World RL, RL for Truthful AI, Regularize your DRL, RL via Supervised Learning
The Case for Real-World RL
Can your RL agent survive on an island it has never seen before, like Robinson Crusoe? You don’t get a reset if something goes wrong!
In his recent talk, Prof. Sergey Levine asks this thought-provoking question to draw attention to the gap between the dominant set of RL tasks attempted in the current literature and what real-world tasks entail. Namely, in the real world, typically:
- Tasks are continual as opposed to episodic, and there is no starting from “scratch”,
- The environment is not stationary but keeps evolving,
- Prior knowledge must be extracted from past data instead of simulators,
- The world is open and subject to exogenous interference,
- The goal is to achieve decent success, or simply survive, rather than blowing up the score charts.
Prof. Levine makes the case that RL research needs to focus on such real-world tasks to realize the promise in AI, and argues that RL is the only framework that can do so. The talk is succinct and full of great insights, accompanied by a Medium post, hence getting the top spot in this issue of the newsletter!
RL for Truthful AI
OpenAI has published its work on truthful question-answering, called WebGPT, obtained by fine-tuning the famous GPT-3 model. WebGPT takes a user question as input, browses the web for factually correct answers, composes an answer with references, and returns it to the user.
As you might imagine, composing the answer is a sequential decision-making process: The model searches the query on the web, clicks on links, scrolls up and down on webpages, decides which text snippets to quote, or moves on to another webpage, until it composes a satisfactory answer. Hence, OpenAI tried RL as one of the approaches to train WebGPT for the task. Spoiler alert: A Behavior Cloning + Rejection Sampling approach beats the RL solution. Still, the paper contains many insights about how RL was used and where it could be preferred over the winning solution!
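For intuition, rejection sampling (best-of-n) here means drawing several candidate answers from the cloned policy and keeping the one a learned reward model scores highest. The sketch below illustrates the idea with toy stand-in functions (`toy_generate` and `toy_reward` are our placeholders, not OpenAI's code):

```python
import numpy as np

def best_of_n(question, generate, reward_model, n=4, rng=None):
    """Best-of-n (rejection) sampling: draw n candidate answers from the
    policy and return the one the reward model scores highest."""
    rng = rng if rng is not None else np.random.default_rng(0)
    candidates = [generate(question, rng) for _ in range(n)]
    scores = [reward_model(question, c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy stand-ins for the fine-tuned policy and the learned reward model.
def toy_generate(question, rng):
    return f"answer-{rng.integers(100)}"

def toy_reward(question, answer):
    return len(answer)  # pretend longer answers are better

print(best_of_n("What is RL?", toy_generate, toy_reward, n=4))
```

Note that this uses extra compute at inference time rather than extra training, which is part of why it is an interesting baseline against RL fine-tuning.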
Value-Based DRL Requires Explicit Regularization
Recent work by Aviral Kumar et al. sheds light on what might be making value-based DRL difficult: The implicit regularization effect of SGD, which is benign in supervised deep learning, actually harms value-based deep RL!
The authors propose DR3, a new regularizer to mitigate the bad interaction between SGD and TD, leading to a boost in multiple benchmarks. For more, check out Prof. Sergey Levine’s thread on the topic and the full paper.
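Roughly, DR3 penalizes the dot product between the learned features of consecutive state-action pairs that appear in the TD update. A minimal numpy sketch with a linear Q-function (our simplification, not the paper's implementation; `beta` is an assumed penalty weight):

```python
import numpy as np

def dr3_loss(phi_sa, phi_next, w, rewards, gamma=0.99, beta=0.1):
    """TD loss plus a DR3-style penalty (a sketch, not the paper's code).

    phi_sa:   features of the (s, a) pairs in the batch, shape (B, d)
    phi_next: features of the next (s', a') pairs, shape (B, d)
    w:        linear Q-function weights, shape (d,)
    """
    q = phi_sa @ w
    target = rewards + gamma * (phi_next @ w)   # semi-gradient TD target
    td_loss = np.mean((q - target) ** 2)
    # DR3-style term: discourage feature co-adaptation between
    # consecutive state-action pairs.
    dr3_penalty = np.mean(np.sum(phi_sa * phi_next, axis=1))
    return td_loss + beta * dr3_penalty

rng = np.random.default_rng(0)
B, d = 32, 8
loss = dr3_loss(rng.normal(size=(B, d)), rng.normal(size=(B, d)),
                rng.normal(size=d), rng.normal(size=B))
print(loss)
```

In the deep setting, `phi` would be the penultimate-layer activations of the Q-network rather than fixed features.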
Reinforcement Learning via Supervised Learning
Conditional Behavior Cloning, dubbed RvS, is shown by Scott Emmons et al. to be a competitive alternative to TD-based Offline RL methods. RvS can work very well with simple Multi-Layer Perceptrons (so no fancy architectures are necessary) in certain problem settings, making it worth checking out the paper, the Twitter thread on the topic, and the accompanying Python package.
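The core trick in outcome-conditioned behavior cloning is simple: relabel each (state, action) in the offline data with the outcome its own trajectory actually achieved (e.g., return-to-go), then do supervised learning on actions given (state, outcome). A sketch below uses an undiscounted return-to-go and a linear least-squares fit as a stand-in for the paper's MLP:

```python
import numpy as np

def rvs_dataset(trajectories):
    """Relabel an offline dataset for outcome-conditioned BC: each
    (state, action) is paired with the (undiscounted) return-to-go
    actually achieved in its own trajectory."""
    X, y = [], []
    for states, actions, rewards in trajectories:
        rtg = np.flip(np.cumsum(np.flip(np.asarray(rewards, float))))
        for s, a, g in zip(states, actions, rtg):
            X.append(np.concatenate([np.atleast_1d(s), [g]]))  # condition on outcome
            y.append(a)
    return np.array(X), np.array(y)

# Toy data: 10 trajectories of 5 steps, 2-dim states, scalar actions.
rng = np.random.default_rng(0)
trajs = [([rng.normal(size=2) for _ in range(5)],
          rng.normal(size=5), rng.normal(size=5)) for _ in range(10)]
X, y = rvs_dataset(trajs)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # linear "policy" stand-in

# At test time, condition on a desired (high) outcome:
s, desired_return = rng.normal(size=2), 2.0
action = np.concatenate([s, [desired_return]]) @ w
```

The RvS paper studies goal- and reward-conditioning with small MLPs; the linear fit here is only to keep the example self-contained.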
Exciting Recent Work on RL
Check out some select papers that recently came out on RL:
- Fully Autonomous Real-World Reinforcement Learning with Applications to Mobile Manipulation
- Autonomous Reinforcement Learning: Formalism and Benchmarking
- Safe multi-agent deep reinforcement learning for joint bidding and maintenance scheduling of generation units
- Learning Reward Machines: A Study in Partially Observable Reinforcement Learning
- CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning
- Autonomous Navigation and Configuration of Integrated Access Backhauling for UAV Base Station Using Reinforcement Learning
- High-Dimensional Stock Portfolio Trading with Deep Reinforcement Learning
- Recent Advances in Reinforcement Learning in Finance
- A Survey of Generalisation in Deep Reinforcement Learning
RL Positions in Academia
- Ph.D. position to use machine learning for real-time decision making at the AI lab of the Vrije Universiteit Brussel.
- Ph.D. position to study sample efficient reinforcement learning in neuroscience at the Delft University of Technology.
- Ph.D. position to study generative and reinforcement learning methods for cancer treatment at the Delft University of Technology.
- Ph.D. position to study autonomous adaptive agents with lightweight Deep Reinforcement Learning at TU Darmstadt.
If you have found this newsletter useful, consider subscribing on Medium and LinkedIn, following us on Twitter, and sharing it with your network. If you are interested in contributing stories or have academic positions to feature, reach out to us at editor@rlagent.pub.
And happy holidays as we sign off for the year! See you in mid-January with the next issue!