Kahneman-Tversky Optimization (KTO): Revolutionizing Language Model Training with Prospect Theory
Language models have made remarkable strides in generating text that resembles human language. However, ensuring that these models produce helpful, factual, and ethical content remains a challenge. Traditional methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have shown promise, but they have limitations.
The Limitations of DPO
While DPO has been effective to some extent, it falls short in capturing the intricacies of human decision-making. It depends on carefully collected pairs of preferred and rejected responses and maximizes the likelihood of those preferences rather than the utility of the generations themselves, which limits how finely it can align language models with human values and preferences.
Introducing Kahneman-Tversky Optimization (KTO)
To address these limitations, researchers have proposed Kahneman-Tversky Optimization (KTO). Inspired by the psychological insights of Kahneman and Tversky's prospect theory, KTO seeks to maximize the utility of text generations directly, unlike DPO, which maximizes the log-likelihood of observed preferences.
Importantly, KTO doesn't need detailed preference data (which is hard and expensive to collect), only a binary signal indicating whether a piece of text is desirable or undesirable. Such a binary signal is far easier to gather at scale, making the training process more efficient.
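To make the data requirement concrete, here is a minimal sketch of the two data formats. The field names are illustrative, not a prescribed schema:

```python
# DPO-style preference data: for one prompt, a preferred ("chosen") and a
# rejected completion, judged against each other by the same annotator.
dpo_example = {
    "prompt": "How's the weather today?",
    "chosen": "It's sunny today.",
    "rejected": "Let's make pizza.",
}

# KTO-style binary feedback: a single completion plus a thumbs-up/down label,
# the kind of signal that products already collect at scale.
kto_examples = [
    {"prompt": "How's the weather today?", "completion": "It's sunny today.", "desirable": True},
    {"prompt": "How's the weather today?", "completion": "Let's make pizza.", "desirable": False},
]
```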
HALO vs. Non-HALO: Aligning with Human Judgment
HALOs (human-aware loss functions, the family to which KTO belongs) differentiate themselves from non-HALO methods by how closely they mimic human judgment. While non-HALO methods focus primarily on prediction accuracy, HALOs implicitly build into their loss functions the same kinds of biases and behaviours that humans exhibit according to prospect theory, such as judging outcomes relative to a reference point and weighting losses more heavily than gains.
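For reference, Kahneman and Tversky's value function is concave for gains, convex for losses, and steeper for losses than for gains (loss aversion), all measured relative to a reference point. A tiny sketch, using the commonly cited parameter estimates from Tversky and Kahneman (1992) rather than anything specific to the KTO paper:

```python
def prospect_value(z: float, alpha: float = 0.88, lam: float = 2.25) -> float:
    """Prospect-theoretic value of an outcome z relative to a reference point:
    gains are valued concavely, losses convexly and about 2.25x more heavily."""
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** alpha)

print(prospect_value(10.0))   # ≈ 7.6   (a gain of 10 feels like ~7.6)
print(prospect_value(-10.0))  # ≈ -17.1 (a loss of 10 hurts more than twice as much)
```

HALOs such as KTO bake an analogous reference point and asymmetric weighting of good and bad outcomes directly into the training loss.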
Practical Implications
- Data Efficiency: KTO can perform as well as or better than DPO while requiring only binary feedback instead of full preference pairs.
- Data Imbalance Handling: KTO can handle imbalances in the data (such as having many more examples of undesirable outcomes than desirable ones) without a loss in performance.
- Bypassing SFT: KTO might allow for skipping the supervised fine-tuning phase without compromising on the quality of the text generations.
How KTO Works
1. Set Your Hyperparameters:
⦾ λ_D and λ_U are like dials that adjust how strongly the model values desirable versus undesirable outcomes. They are chosen based on the balance of desirable and undesirable examples in the training data, with the scarcer class given relatively more weight so that it is not drowned out by the more common one.
⦾ For example, if there are equal numbers of good and bad outcomes, we might set λ_D = 1 and λ_U = 1.33, reflecting a slightly higher penalty for bad outcomes.
2. Compute τ_KTO:
τ_KTO(x, y; β) is calculated as
τ_KTO(x, y; β) = β · log( π*(y|x) / π_ref(y|x) )
This tells us, scaled by β, how much more (or less) probable the response is under the policy we want (the ideal policy π*) than under what the model is currently predicting (the reference policy π_ref).
3. Calculate L_KTO:
L_KTO is a loss function that tells us how far the model's output is from what we desire. It uses the KTO value function and the logistic (sigmoid) function σ to squash τ_KTO into a range we can use to update the model.
The formula for L_KTO on a single example is
L_KTO(x, y) = w(y) · (1 - v_KTO(x, y; β))
where v_KTO(x, y; β) = σ(τ_KTO(x, y; β)) for desirable outcomes and σ(-τ_KTO(x, y; β)) for undesirable ones, and the weight w(y) is λ_D for desirable outcomes and λ_U for undesirable ones. A short code sketch after these steps puts the two formulas together.
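To make steps 2 and 3 concrete, here is a minimal Python sketch of the simplified per-example formulas above. It is an illustration rather than the paper's full objective: the paper additionally subtracts a KL-based reference point inside the sigmoid and averages the loss over a batch, and the probability arguments below are placeholders for whole-sequence likelihoods. In the walkthrough that follows, the ideal policy π* stands in for p_policy; in actual training it would be the likelihood under the model being updated, with π_ref coming from a frozen reference copy.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tau_kto(p_policy: float, p_ref: float, beta: float = 1.0) -> float:
    """Beta-scaled log-ratio of the response's probability under the policy we
    are steering toward (here, the ideal policy) vs. the frozen reference policy."""
    return beta * math.log(p_policy / p_ref)

def kto_loss(p_policy: float, p_ref: float, desirable: bool,
             beta: float = 1.0, lambda_d: float = 1.0, lambda_u: float = 1.33) -> float:
    """Per-example loss w(y) * (1 - v_KTO): the sign of tau flips for undesirable
    outcomes, and undesirable outcomes carry the heavier weight lambda_u."""
    tau = tau_kto(p_policy, p_ref, beta)
    v = sigmoid(tau) if desirable else sigmoid(-tau)
    return (lambda_d if desirable else lambda_u) * (1.0 - v)
```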
Example Calculation:
Let's say we have an ideal policy π*(y|x) that, for the prompt "How's the weather today?", likes the response "It's sunny today" and gives it a high probability, but dislikes "let's make pizza" and gives that a low probability.
1. Imaginary Numbers:
⦾ π*("It's sunny today" | x) = 0.8, say: a high chance for a good outcome.
⦾ π_ref("It's sunny today" | x) = 0.7, say: the model currently prefers this response too, just a little less strongly.
⦾ Let's take β = 1 for simplicity; β plays a crucial role in controlling the strength of the penalty applied to divergence from the reference policy.
2. Calculating τ_KTO:
With these numbers and β = 1, τ_KTO = log(0.8 / 0.7) ≈ 0.13.
3. Calculating Loss L_KTO:
- Applying the logistic (sigmoid) function σ, which squashes the τ_KTO score into a value between 0 and 1, gives σ(0.13) ≈ 0.53, so the loss for this desirable response is λ_D · (1 - 0.53) ≈ 0.47.
- The weight w(y) is λ_D for good outcomes (1 in our example) and λ_U for bad ones (1.33 in our example).
- For an undesirable outcome ("let's make pizza") with a comparable sigmoid value, the loss is scaled by the larger weight λ_U, so the same degree of mismatch costs roughly 1.33 times as much.
The goal of training is to minimize L_KTO: losses on good outcomes push the model to prefer them, while the more heavily weighted losses on bad outcomes push the model to avoid them.
By continually updating the model to minimize L_KTO, it learns to produce results closer to our ideal policy over time.
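Plugging the walkthrough's illustrative numbers into the kto_loss sketch from the earlier section (the 0.8 and 0.7 probabilities are made up, and the undesirable case simply mirrors the ratio so the sigmoid term stays comparable and the effect of λ_U is isolated):

```python
# Desirable outcome: "It's sunny today" (ideal 0.8 vs. reference 0.7)
loss_good = kto_loss(p_policy=0.8, p_ref=0.7, desirable=True)
print(round(loss_good, 2))  # ≈ 0.47, i.e. 1.00 * (1 - σ(0.13)) with σ(0.13) ≈ 0.53

# Undesirable outcome: "Let's make pizza" (mirrored ratio, sigmoid term again ≈ 0.53)
loss_bad = kto_loss(p_policy=0.7, p_ref=0.8, desirable=False)
print(round(loss_bad, 2))   # ≈ 0.62, i.e. 1.33 * (1 - 0.53): the λ_U weight makes it costlier
```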
Insights into KTO:
1. KTO is designed to be effective even though its binary signal is weaker than full preference data, because that kind of signal can be collected in far greater volume in real-world settings.
2. The authors posit that KTO may perform better partly because, unlike DPO, it avoids learning from noisy or inconsistent examples. KTO draws almost no learning signal from examples whose implied reward clashes with their label, that is, undesirable examples with a very high reward or desirable examples with a very low one, which screens out likely noise.
3. When an outcome has a very high reward but is labelled undesirable, KTO effectively ignores that example. This does not mean KTO will not learn from any high-reward outcomes, only from those whose reward contradicts their label.
- To better understand, consider the logistic function σ inside the L_KTO loss (a short numeric check appears after this list).
- For an undesirable outcome with a very high reward, that is, a large positive τ_KTO(x, y; β), the term σ(-τ_KTO(x, y; β)) approaches 0 and the sigmoid saturates: the loss flattens out near its weight λ_U and its gradient becomes negligible, so that example contributes almost no learning signal.
- For instance, if "It's sunny today" were labelled undesirable but the model assigned it a probability near 1 (a very high reward), the sigmoid term would be fully saturated, and the model would barely adjust its policy based on this example.
4. The paper also introduces the concept of reward function equivalence classes: reward functions that differ only by an input-specific (prompt-only) component induce the same optimal policy (a one-line derivation appears after this list). This implies that maximizing preference likelihood, as DPO does, does not necessarily equate to maximizing human utility.
5. Finally, KTO handles contradictory preferences from different humans better than DPO. In a case where two humans have opposite preferences, DPO might satisfy one while making both worse off. KTO, on the other hand, avoids changing the policy in the presence of such contradictions, which might be why it performs better in datasets with diverse human inputs.
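Two of these insights are easy to check by hand. First, the saturation effect behind insight 3: as τ_KTO grows for an undesirable outcome, the sigmoid term flattens and its derivative, which scales the gradient that example can contribute, collapses toward zero. A quick numeric check with the same simplified setup as before (sigmoid redefined so the snippet stands alone; the τ values are arbitrary):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid; it vanishes as the input grows in magnitude."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Increasingly "high reward" for an undesirable outcome: the sigmoid term
# saturates toward 0 and the learning signal dies off with it.
for tau in (0.5, 2.0, 6.0):
    print(tau, round(sigmoid(-tau), 4), round(sigmoid_grad(-tau), 4))
# 0.5 -> 0.3775, 0.235
# 2.0 -> 0.1192, 0.105
# 6.0 -> 0.0025, 0.0025   (essentially no update from this example)
```

Second, the equivalence-class argument behind insight 4 fits in one line of algebra: adding any prompt-only term h(x) to a reward r(x, y) multiplies both the numerator and the normalizer of the induced policy by the same factor exp(h(x)/β), so it cancels. This is a sketch in the notation used in the DPO and KTO papers, not a full proof:

```latex
\pi_{r+h}(y \mid x)
  = \frac{\pi_{\mathrm{ref}}(y \mid x)\, e^{\frac{1}{\beta}\left[r(x,y)+h(x)\right]}}
         {\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, e^{\frac{1}{\beta}\left[r(x,y')+h(x)\right]}}
  = \frac{\pi_{\mathrm{ref}}(y \mid x)\, e^{\frac{1}{\beta} r(x,y)}}
         {\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, e^{\frac{1}{\beta} r(x,y')}}
  = \pi_{r}(y \mid x)
```

Two rewards that differ only by such a term therefore induce the same optimal policy (and the same preference likelihood), yet they can assign very different utilities to individual generations, which is exactly the gap between preference likelihood and human utility that KTO targets.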
In conclusion, the advent of Kahneman-Tversky Optimization (KTO) marks a significant step forward in the quest to align language models with human values and preferences. By leveraging insights from prospect theory and employing a novel optimization approach, KTO offers promising solutions to the challenges faced by traditional methods like DPO. Its ability to efficiently utilize data, handle imbalances, and navigate contradictory preferences positions KTO as a promising framework for the future of language model training. As researchers continue to refine and apply KTO in real-world scenarios, we can anticipate even greater strides towards the development of language models that not only generate text but also reflect the nuanced complexities of human decision-making and judgment.
Paper link: https://arxiv.org/pdf/2402.01306.pdf