#4: The Case for Real-World RL, RL for Truthful AI, Regularize your DRL, RL via Supervised Learning
The Case for Real-World RL
Can your RL agent survive on an island it has never seen before, like Robinson Crusoe? You don’t get a reset if something goes wrong!
In his recent talk, Prof. Sergey Levine asks this thought-provoking question to draw attention to the gap between the dominant set of RL tasks attempted in the current literature and what real-world tasks entail. Namely, in the real world, typically:
- Tasks are continual as opposed to episodic, and there is no starting from “scratch”,
- The environment is not stationary but keeps evolving,
- Prior knowledge must be extracted from past data instead of simulators,
- The world is open and subject to exogenous interference,
- The goal is to achieve decent success, or simply survive, rather than blowing up the score charts.
Prof. Levine makes the case that RL research needs to focus on such real-world tasks to realize the promise in AI, and argues that RL is the only framework that can do so. The talk is succinct and full of great insights, accompanied by a Medium post, hence getting the top spot in this issue of the newsletter!
RL for Truthful AI
OpenAI has published its work on truthful question-answering, called WebGPT, obtained by fine-tuning the famous GPT-3 model. WebGPT takes a user question as input, browses the web for factually correct answers, composes an answer with references, and returns it to the user.
As you might imagine, composing the answer is a sequential decision-making process: The model searches the query on the web, clicks on links, scrolls up and down on webpages, decides which text snippets to quote, or moves on to another webpage, until it composes a satisfactory answer. Hence, OpenAI tried RL as one of the approaches to train WebGPT for the task. Spoiler alert: A Behavior Cloning + Rejection Sampling approach beats the RL solution. Still, the paper contains many insights about how RL was used and where it could be preferred over the winning solution!
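For intuition, rejection sampling (best-of-n) here means drawing several candidate answers from the cloned policy and keeping the one a learned reward model scores highest. The sketch below illustrates the idea with toy stand-in functions (`toy_generate` and `toy_reward` are our placeholders, not OpenAI's code):

```python
import numpy as np

def best_of_n(question, generate, reward_model, n=4, rng=None):
    """Best-of-n (rejection) sampling: draw n candidate answers from the
    policy and return the one the reward model scores highest."""
    rng = rng if rng is not None else np.random.default_rng(0)
    candidates = [generate(question, rng) for _ in range(n)]
    scores = [reward_model(question, c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy stand-ins for the fine-tuned policy and the learned reward model.
def toy_generate(question, rng):
    return f"answer-{rng.integers(100)}"

def toy_reward(question, answer):
    return len(answer)  # pretend longer answers are better

print(best_of_n("What is RL?", toy_generate, toy_reward, n=4))
```

Note that this uses extra compute at inference time rather than extra training, which is part of why it is an interesting baseline against RL fine-tuning.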
Value-Based DRL Requires Explicit Regularization
Recent work by Aviral Kumar et al. sheds light on what might be making value-based DRL difficult: The implicit regularization effect of SGD, which is benign in supervised deep learning, actually harms value-based deep RL!
The authors propose DR3, a new regularizer to mitigate the bad interaction between SGD and TD, leading to a boost in multiple benchmarks. For more, check out Prof. Sergey Levine’s thread on the topic and the full paper.
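Roughly, DR3 penalizes the dot product between the learned features of consecutive state-action pairs that appear in the TD update. A minimal numpy sketch with a linear Q-function (our simplification, not the paper's implementation; `beta` is an assumed penalty weight):

```python
import numpy as np

def dr3_loss(phi_sa, phi_next, w, rewards, gamma=0.99, beta=0.1):
    """TD loss plus a DR3-style penalty (a sketch, not the paper's code).

    phi_sa:   features of the (s, a) pairs in the batch, shape (B, d)
    phi_next: features of the next (s', a') pairs, shape (B, d)
    w:        linear Q-function weights, shape (d,)
    """
    q = phi_sa @ w
    target = rewards + gamma * (phi_next @ w)   # semi-gradient TD target
    td_loss = np.mean((q - target) ** 2)
    # DR3-style term: discourage feature co-adaptation between
    # consecutive state-action pairs.
    dr3_penalty = np.mean(np.sum(phi_sa * phi_next, axis=1))
    return td_loss + beta * dr3_penalty

rng = np.random.default_rng(0)
B, d = 32, 8
loss = dr3_loss(rng.normal(size=(B, d)), rng.normal(size=(B, d)),
                rng.normal(size=d), rng.normal(size=B))
print(loss)
```

In the deep setting, `phi` would be the penultimate-layer activations of the Q-network rather than fixed features.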
Reinforcement Learning via Supervised Learning
Conditional Behavior Cloning, dubbed RvS, is shown by Scott Emmons et al. to be a competitive alternative to TD-based Offline RL methods. RvS can work very well with simple Multi-Layer Perceptrons (so no fancy architectures are necessary) in certain problem settings, making it worth checking out the paper, the Twitter thread on the topic, and the accompanying Python package.
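The core trick in outcome-conditioned behavior cloning is simple: relabel each (state, action) in the offline data with the outcome its own trajectory actually achieved (e.g., return-to-go), then do supervised learning on actions given (state, outcome). A sketch below uses an undiscounted return-to-go and a linear least-squares fit as a stand-in for the paper's MLP:

```python
import numpy as np

def rvs_dataset(trajectories):
    """Relabel an offline dataset for outcome-conditioned BC: each
    (state, action) is paired with the (undiscounted) return-to-go
    actually achieved in its own trajectory."""
    X, y = [], []
    for states, actions, rewards in trajectories:
        rtg = np.flip(np.cumsum(np.flip(np.asarray(rewards, float))))
        for s, a, g in zip(states, actions, rtg):
            X.append(np.concatenate([np.atleast_1d(s), [g]]))  # condition on outcome
            y.append(a)
    return np.array(X), np.array(y)

# Toy data: 10 trajectories of 5 steps, 2-dim states, scalar actions.
rng = np.random.default_rng(0)
trajs = [([rng.normal(size=2) for _ in range(5)],
          rng.normal(size=5), rng.normal(size=5)) for _ in range(10)]
X, y = rvs_dataset(trajs)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # linear "policy" stand-in

# At test time, condition on a desired (high) outcome:
s, desired_return = rng.normal(size=2), 2.0
action = np.concatenate([s, [desired_return]]) @ w
```

The RvS paper studies goal- and reward-conditioning with small MLPs; the linear fit here is only to keep the example self-contained.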
Exciting Recent Work on RL
Check out some select papers that recently came out on RL:
- Fully Autonomous Real-World Reinforcement Learning with Applications to Mobile Manipulation
- Autonomous Reinforcement Learning: Formalism and Benchmarking
- Safe multi-agent deep reinforcement learning for joint bidding and maintenance scheduling of generation units
- Learning Reward Machines: A Study in Partially Observable Reinforcement Learning
- CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning
- Autonomous Navigation and Configuration of Integrated Access Backhauling for UAV Base Station Using Reinforcement Learning
- High-Dimensional Stock Portfolio Trading with Deep Reinforcement Learning
- Recent Advances in Reinforcement Learning in Finance
- A Survey of Generalisation in Deep Reinforcement Learning
RL Positions in Academia
- Ph.D. position to use machine learning for real-time decision making at the AI lab of the Vrije Universiteit Brussel.
- Ph.D. position to study sample efficient reinforcement learning in neuroscience at the Delft University of Technology.
- Ph.D. position to study generative and reinforcement learning methods for cancer treatment at the Delft University of Technology.
- Ph.D. position to study autonomous adaptive agents with lightweight Deep Reinforcement Learning at TU Darmstadt.
If you have found this newsletter useful, consider subscribing on Medium and LinkedIn, following us on Twitter, and sharing it with your network. If you are interested in contributing stories or have academic positions to feature, reach out to us at editor@rlagent.pub.
And happy holidays as we sign off for the year! See you in mid-January with the next issue!