Basics of Reinforcement Learning for LLMs
Understanding the problem formulation and basic algorithms for RL
Recent AI research has revealed that reinforcement learning — more specifically, reinforcement learning from human feedback (RLHF) — is a key component of training a state-of-the-art large language model (LLM). Despite this fact, most open-source research on language models heavily emphasizes supervised learning strategies, such as supervised fine-tuning (SFT). This lack of emphasis on reinforcement learning can be attributed to several factors, including the need to curate human preference data and the amount of data needed to perform high-quality RLHF. However, one undeniable factor that likely underlies skepticism towards reinforcement learning is the simple fact that it is not as commonly used as supervised learning. As a result, AI practitioners (including myself!) avoid reinforcement learning due to a simple lack of understanding — we tend to stick with the approaches that we know best.
“Many among us expressed a preference for supervised annotation, attracted by its denser signal… However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness.” — from [8]
This series. In the next few overviews, we will aim to eliminate this problem by building…