Policy Gradients: The Foundation of RLHF
Understanding policy optimization and how it is used in reinforcement learning
Reinforcement learning (RL) is useful for a variety of applications, but it is especially important for large language models (LLMs), where it serves as a key component of the alignment process via reinforcement learning from human feedback (RLHF). Unfortunately, RL is less widely understood within the AI community. Namely, many practitioners (including myself) are more familiar with supervised learning techniques, which creates an implicit bias against using RL despite its massive utility. Within this series of overviews, our goal is to mitigate this bias via a comprehensive survey of RL that starts with basic ideas and moves toward modern algorithms like proximal policy optimization (PPO) [7] that are heavily used for RLHF.
This overview. As shown above, there are two types of model-free RL algorithms: Q-Learning and Policy Optimization. Previously, we learned about Q-Learning, the basics of RL, and how these ideas can be generalized to language model finetuning. Within this overview, we will cover policy optimization and policy gradients, two ideas that are heavily utilized…