
Proximal Policy Optimization (PPO): The Key to LLM Alignment

Cameron R. Wolfe, Ph.D. · Published in TDS Archive · 18 min read · Feb 15, 2024


(Photo by Daniel Olah on Unsplash)

Recent AI research has revealed that reinforcement learning (RL) — reinforcement learning from human feedback (RLHF) in particular — is a key component of training large language models (LLMs). However, many AI practitioners (admittedly) avoid the use of RL for several reasons, including a lack of familiarity with RL or a preference for supervised learning techniques. There are valid arguments against the use of RL; e.g., curating human preference data is expensive and RL can be data-inefficient. However, we should not avoid using RL simply due to a lack of understanding or familiarity! These techniques are not difficult to grasp and, as a variety of recent papers have shown, can massively benefit LLM performance.

This overview is part three in a series that aims to demystify RL and how it is used to train LLMs. Although we have mostly covered fundamental ideas related to RL up until this point, we will now dive into the algorithm that lays the foundation for language model alignment — Proximal Policy Optimization (PPO) [2]. As we will see, PPO works well and is incredibly easy to understand and use, making it a desirable algorithm from a practical perspective. For these reasons, PPO…
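To give a flavor of why PPO is considered so simple to use, here is a minimal sketch of its clipped surrogate objective in PyTorch. This is an illustrative toy example, not code from the PPO paper or from any RLHF library; the tensor names (`log_probs`, `old_log_probs`, `advantages`) and the clip range of 0.2 are assumptions chosen for readability.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Sketch of PPO's clipped surrogate objective.

    log_probs:     log-probabilities of the taken actions under the current policy
    old_log_probs: log-probabilities of the same actions under the policy that
                   collected the data (held fixed, no gradient)
    advantages:    advantage estimates for those actions
    clip_eps:      clip range (0.2 is a common illustrative default)
    """
    # Probability ratio between the new and old policy.
    ratio = torch.exp(log_probs - old_log_probs.detach())

    # Unclipped and clipped surrogate objectives.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # PPO maximizes the minimum of the two; negate it to obtain a loss to minimize.
    return -torch.min(unclipped, clipped).mean()


# Toy usage with random tensors (purely illustrative).
if __name__ == "__main__":
    log_probs = torch.randn(8, requires_grad=True)
    old_log_probs = torch.randn(8)
    advantages = torch.randn(8)
    loss = ppo_clip_loss(log_probs, old_log_probs, advantages)
    loss.backward()
    print(loss.item())
```

The clipping is the whole trick: by capping how far the probability ratio can move the objective, each update keeps the new policy close to the one that generated the data, which is what makes PPO stable without the heavier machinery of trust-region methods.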

Written by Cameron R. Wolfe, Ph.D. (Director of AI @ Rebuy • Deep Learning Ph.D. • I make AI understandable)