An interview with one of the Stanford AI researchers behind KTO: Teaching AI to think like humans

Simon Tiu · Published in Vertex Ventures US · Aug 2, 2024

One of the many reasons I love AI right now is that every new discovery seems like a peek into the fundamental nature of human psychology. In the latest episode of the Neural Notes podcast — my first as co-host — my Vertex US colleague Sandeep Bhadra and I speak with Kawin Ethayarajh, a key Stanford AI Lab (SAIL) researcher behind KTO, or Kahneman-Tversky Optimization, a promising approach to LLM alignment.

This innovative approach teaches AI language models to better understand and respond to human preferences by drawing on insights from behavioral economics, specifically the pioneering work of Daniel Kahneman and Amos Tversky. It’s a deliciously nerdy fusion of Machine Learning and Behavioral Economics!

Interviewing Kawin was a delight; the conversation offered fascinating insights into the intersection of AI, psychology, and economics. I hope you’ll find this discussion as enlightening and exciting as I did. You can watch the full episode of the podcast here:

And you can read the paper we’re discussing on arXiv here.

A speed-run primer on the basics of model alignment

Though the concept of model “alignment” sounds ominously Orwellian, it’s a broad concept vital for making Large Language Models (LLMs) more useful in our human-centric world. Alignment involves adapting language models to reflect diverse human preferences and behaviors. By encoding these preferences, the aim is to make models more useful and appropriate for a wide range of user needs, rather than simply restricting their outputs.

A Simplified Summary of Model Training

The evolution of model alignment techniques over the past four years has been driven by the need to make LLMs more helpful and safer. Typically, training an LLM starts with pre-training, where you force-feed the model a large amount of data to create a base model, followed by supervised fine-tuning (SFT), which makes that base model more accurate by training on examples from a curated dataset. Models built with pre-training and SFT alone, like the early GPT and Llama releases, were powerful autocomplete systems that often produced outputs misaligned with human values and preferences. To address these shortcomings, researchers developed techniques to “align” the models to more closely reflect the preferences of end users, whatever they may be.
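To make that concrete, here is a minimal SFT sketch using Hugging Face’s TRL library (which comes up again below). Treat it as a rough illustration rather than a production recipe: the model and dataset names are just placeholders, and it assumes a recent version of TRL.

```python
# Minimal SFT sketch with Hugging Face TRL; model and dataset are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any instruction-style dataset in a format TRL understands will do here.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # the pre-trained base model to fine-tune
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-model", max_steps=100),
)
trainer.train()
```

Under the hood this is just next-token prediction on curated examples, which is why a model trained with SFT alone still behaves like very polished autocomplete.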

The OG of LLM alignment is Reinforcement Learning from Human Feedback (RLHF), popularized by OpenAI and Anthropic. The core idea is to get humans to score or rank the quality of model outputs (thereby embedding their preferences), train a separate reward model to predict those scores, and then use reinforcement learning to iteratively update the weights of the language model so it produces higher-scoring outputs. This approach, while incredibly powerful when done correctly, presented several serious challenges from the start. In addition to its technical complexities, RLHF is notoriously unstable and demands significant compute resources, especially when scaling to models with billions of parameters.
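To show the core mechanics without the full PPO machinery, here is a toy, plain-PyTorch sketch of the RLHF objective: push the policy toward responses the reward model scores highly, while a KL penalty keeps it close to the original SFT model. It uses a simple REINFORCE-style surrogate and made-up numbers, so read it as an illustration of the objective rather than how any lab actually implements RLHF.

```python
# Conceptual sketch of the RLHF objective (not a full PPO implementation).
import torch

def rlhf_objective(policy_logprobs, ref_logprobs, rewards, beta=0.1):
    """policy_logprobs / ref_logprobs: log-probs of sampled responses under the
    current policy and the frozen reference (SFT) model; rewards: scalar scores
    from a reward model trained on human preference labels."""
    kl_penalty = policy_logprobs - ref_logprobs    # per-sample KL estimate
    shaped_reward = rewards - beta * kl_penalty    # reward minus KL penalty
    # REINFORCE-style surrogate: higher shaped reward -> increase log-prob.
    return -(shaped_reward.detach() * policy_logprobs).mean()

# Toy example with dummy numbers:
policy_lp = torch.tensor([-12.3, -8.7], requires_grad=True)
ref_lp = torch.tensor([-11.9, -9.1])
rewards = torch.tensor([0.8, -0.2])
loss = rlhf_objective(policy_lp, ref_lp, rewards)
loss.backward()  # in a real setup, gradients flow into the policy's parameters
```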

More recently, Direct Preference Optimization (DPO) emerged as a simpler, more effective technique. Developed by Stanford researchers, DPO streamlines the alignment process by using clever math to eliminate the need for a separate reward model and a reinforcement learning loop: the language model is optimized directly on pairs of preferred and rejected responses. Implementing DPO is surprisingly simple (check out TRL from Hugging Face or Axolotl from OpenAccess AI Collective), and its elegance has contributed to DPO becoming the canonical alignment algorithm, particularly in the open-source community (e.g. Mixtral).
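For readers who want to see the clever math in code, here is a minimal sketch of the DPO loss on a batch of preference pairs. The inputs are summed token log-probabilities under the model being trained and a frozen reference model; the tensor values are dummy numbers, and in practice you would reach for a ready-made trainer like TRL’s DPOTrainer instead of writing this by hand.

```python
# Sketch of the DPO loss on pairs of (chosen, rejected) responses.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    chosen_ratio = policy_chosen_lp - ref_chosen_lp        # log pi/pi_ref for the winner
    rejected_ratio = policy_rejected_lp - ref_rejected_lp  # log pi/pi_ref for the loser
    # Maximize the margin between the winner's and loser's implicit rewards.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of two preference pairs (numbers are made up):
loss = dpo_loss(torch.tensor([-10.0, -9.0]), torch.tensor([-12.0, -8.5]),
                torch.tensor([-10.5, -9.2]), torch.tensor([-11.0, -9.0]))
```

Note that the loss never touches an explicit reward model: the log-probability ratios act as an implicit reward, which is exactly the reparameterization trick that lets DPO skip reinforcement learning.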

DPO has ushered in a new wave of alignment innovation

While many advancements to improve DPO have been proposed, KTO stands out as particularly compelling. It directly maximizes the utility of model generations using a value function inspired by prospect theory, working with simpler binary feedback on whether outputs are desirable or undesirable. It incorporates key aspects of human decision-making like loss aversion and diminishing sensitivity to extreme outcomes, allowing it to learn effectively from weaker signals while being more data-efficient than existing preference-based methods.
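As a rough illustration of what that looks like in code, here is a simplified sketch of a KTO-style loss over outputs labeled only as desirable or undesirable. The reference point z_ref and the lambda weights follow the spirit of the paper’s prospect-theoretic value function, but the batch-mean KL estimate and all names and numbers below are simplifications for illustration, not the authors’ exact implementation (TRL ships a KTOTrainer for real use).

```python
# Simplified sketch of a KTO-style loss over binary desirable/undesirable labels.
import torch

def kto_loss(policy_lp, ref_lp, desirable, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """policy_lp / ref_lp: summed log-probs of each output under the policy and
    the frozen reference model; desirable: 1/0 label per output."""
    log_ratio = policy_lp - ref_lp
    # Reference point: a batch-level estimate of the KL to the reference model
    # (the paper estimates this differently; a batch mean is used here for brevity).
    z_ref = log_ratio.mean().clamp(min=0).detach()
    value = torch.where(
        desirable.bool(),
        lambda_d * torch.sigmoid(beta * (log_ratio - z_ref)),  # gains
        lambda_u * torch.sigmoid(beta * (z_ref - log_ratio)),  # losses
    )
    return (1 - value).mean()

# Toy batch: two desirable outputs, one undesirable (numbers are made up).
loss = kto_loss(torch.tensor([-9.0, -11.0, -8.0]),
                torch.tensor([-9.5, -10.5, -8.4]),
                torch.tensor([1, 1, 0]))
```

The asymmetry between the two branches is where loss aversion shows up: weighting lambda_u more heavily than lambda_d penalizes undesirable outputs more than it rewards desirable ones.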

As we continue to explore the frontiers of LLM alignment, interviews like these remind me of the incredible potential and responsibility we have in shaping the future of artificial intelligence. I’m eager to hear what you think after watching!

Reach out to me on X, on LinkedIn, or via email.
