The Power of Human-Aware Losses: HALOs, KTO, and the Future of AI Alignment

Joe El Khoury - GenAI Engineer
12 min read · Aug 5, 2024

--

Aligning AI language models with human values and preferences is crucial for developing effective and trustworthy systems. Traditional methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have limitations, particularly in handling noise and data inconsistencies.

This article highlights the significance of Human-Aware Losses (HALOs) and introduces Kahneman-Tversky Optimization (KTO), a novel approach based on Prospect Theory. KTO simplifies the alignment process using binary feedback, making it more robust and efficient compared to traditional methods. We will compare HALOs with non-HALO methods and present experimental evidence of KTO’s superior performance, demonstrating its potential to revolutionize AI alignment.

Why Human-Aware Losses (HALOs) Matter

To understand the importance of Human-Aware Losses in AI language models, let’s consider a practical example of an AI assistant designed to help with customer service emails.

HALO methods integrate human biases into their loss functions. Direct Preference Optimization (DPO) might train the AI to prioritize polite and empathetic responses over technically correct but cold ones. For instance, it would favor “I understand your frustration with the delayed shipment, and I’m here to help” over “Your order was delayed due to logistical issues.” Proximal Policy Optimization (PPO) could gradually refine the AI’s responses based on simulated customer interactions, learning to balance professionalism with a friendly tone.

In contrast, non-HALO methods like Conditional Supervised Fine-Tuning (CSFT) and Sequence Likelihood Calibration (SLiC) adopt different strategies. CSFT might use specific tags to generate different types of responses, like [FORMAL] or [CASUAL], but without inherently understanding human preferences. SLiC could focus on ensuring grammatically correct and coherent responses without necessarily capturing the nuances of human communication preferences.

In experiments, HALO methods have demonstrated superior performance for several reasons. They tend to produce outputs that feel more natural and relatable to humans. In our customer service example, emails generated by HALO-trained AIs might show better empathy and understanding of customer emotions. These methods are also better at picking up on subtle cues in the input, such as detecting frustration in a customer’s tone and adjusting the response accordingly.

HALO methods can more effectively learn from human feedback, continuously improving their performance in ways that align with human preferences. By incorporating human biases and decision-making patterns, they can potentially avoid generating responses that might be technically correct but socially inappropriate or insensitive. In practical applications, users tend to prefer interacting with AI systems trained using HALO methods, finding them more helpful and easier to communicate with.

By reflecting human decision-making biases, HALO methods enable AI language models to generate outputs that are not just accurate, but also more aligned with human expectations and preferences. This leads to more effective and satisfying human-AI interactions, particularly in applications where understanding and responding to human nuances is crucial. The superior performance of HALO methods highlights the importance of incorporating human decision-making biases in generative models, demonstrating that using appropriate loss functions can significantly enhance the overall performance and user satisfaction of AI language models in real-world applications.

The Game-Changing Method: Kahneman-Tversky Optimization

Among the emerging HALO methods, a revolutionary approach called “Kahneman-Tversky Optimization” (KTO) was introduced in a recent paper featured at ICML 2024. Crafted by researchers Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela from Stanford University and Contextual AI, this method takes a fresh approach to aligning large language models (LLMs) by eliminating the need for cumbersome pairwise comparisons.

KTO utilizes binary feedback, which is inherently more stable and less prone to noise, to directly maximize the utility of generated outputs. This makes KTO a robust and promising alternative to traditional methods such as Proximal Policy Optimization (PPO).

Why KTO is a Game-Changer

Aligning LLMs with human intentions is essential for enhancing their capabilities and ensuring safety. Traditional methods like RLHF and DPO have been widely used but come with significant challenges, particularly in data collection and susceptibility to noise.

Since these methods require pairwise data, learning the reward model is challenging, and they are susceptible to noise and non-stationarity (feedback inconsistency).

Source: Human Aware Loss Functions (HALOs) report

Figure 1 illustrates the concept of LLM alignment. The process involves supervised fine-tuning followed by optimizing a human-aware loss (HALO). Traditional approaches like RLHF learn a reward model from pairwise comparisons to refine the model, while DPO optimizes preferences directly on the pairs; both depend on hard-to-obtain pairwise comparison data.

The traditional approach requires pairwise data comparisons, which are indicated by the boxes labeled A, B, C with a hard-to-get sign. This approach is primarily reward-centric.

In contrast, KTO simplifies this by utilizing abundant binary feedback, making the data collection process easier and more resilient to inconsistencies. It directly evaluates single responses using a simpler thumbs-up or thumbs-down system, bypassing the need for difficult pairwise data comparisons.

How KTO Stands Out Against Traditional Methods

KTO stands out against traditional methods due to its use of binary feedback to directly maximize the utility of outputs. This approach simplifies policy optimization and reduces the impact of noise and non-stationarity, significantly enhancing data efficiency.

In contrast, RLHF optimizes a learned reward under a KL-divergence penalty, which results in generally low data efficiency. DPO, which uses a logistic-regression-style loss over preference pairs, offers moderate data efficiency. KTO’s Kahneman-Tversky value-based loss, however, achieves high data efficiency and stability, making it a superior choice for real-world applications where data collection can be challenging.

Overall, KTO’s method of using binary feedback not only makes data collection easier but also ensures robust performance in noisy and non-stationary environments. This makes KTO a more practical and effective solution for optimizing large language models compared to traditional methods like RLHF and DPO.

The Magic Behind Kahneman-Tversky Optimization

Prospect theory, introduced by Kahneman and Tversky in 1979 and extended to its cumulative form by Tversky and Kahneman in 1992, provides a framework for understanding how individuals evaluate and make decisions under risk. This theory is integral to KTO’s approach to optimizing LLM alignment.

The value function v: Z → R assigns subjective value to an output z relative to a reference point z_0. Tversky and Kahneman’s experiments led to the formulation of the value function as:
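In its standard form (Tversky & Kahneman, 1992), the value function can be written as:

$$
v(z) =
\begin{cases}
(z - z_0)^{\alpha} & \text{if } z \geq z_0 \\
-\lambda \, (z_0 - z)^{\alpha} & \text{if } z < z_0
\end{cases}
$$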

where alpha controls the curvature (risk aversion) and lambda represents loss aversion.
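As a quick numerical illustration (a minimal sketch using Tversky and Kahneman’s 1992 median parameter estimates, alpha ≈ 0.88 and lambda ≈ 2.25), the asymmetry between gains and losses is easy to see:

```python
def prospect_value(z, z0=0.0, alpha=0.88, lam=2.25):
    """Kahneman-Tversky value of an outcome z relative to a reference point z0."""
    gain = z - z0
    if gain >= 0:
        return gain ** alpha          # concave for gains
    return -lam * ((-gain) ** alpha)  # convex and steeper for losses

# Loss aversion: a 100-unit loss hurts more than a 100-unit gain pleases.
print(prospect_value(100), prospect_value(-100))   # ~57.5 vs ~-129.5

# Diminishing sensitivity: going from 100 to 200 adds less value than going from 0 to 100.
print(prospect_value(200) - prospect_value(100))   # ~48.4 < ~57.5
```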

Characteristics of the Value Function

The value function is characterized by three main principles:

  • Firstly, the loss aversion principle emphasizes that the sorrow of losing a certain amount is greater than the joy of gaining the same amount. It reflects the human tendency to prefer avoiding losses over acquiring equivalent gains.
  • Secondly, the S-shaped utility function reflects diminishing sensitivity to gains and losses. As the amount of gain or loss increases, the incremental impact on utility decreases.
  • Lastly, the function indicates high sensitivity to small gains and losses, mirroring human tendencies to be highly reactive to minor changes in outcomes.

The image below illustrates the “Implied Human Value” function for different approaches: Kahneman-Tversky, PPO-Clip, and DPO. It represents how these different methods model human value judgments in relation to gains and losses.

The x-axis represents gains (to the right of the origin) and losses (to the left), while the y-axis represents the perceived value or utility.

PPO-Clip, shown by the yellow curve, takes a more balanced approach. While it still prioritizes avoiding mistakes, its response to feedback is less extreme than that of KTO (described below). This allows for gradual improvements based on input, showing a smoother transition between avoiding errors and pursuing excellence. The outcome is often a mix of safe and engaging content, striking a balance between reliability and creativity.

DPO, depicted by the blue curve, treats improvements from bad to acceptable similarly to improvements from good to excellent. This method does not overly penalize minor mistakes and values all levels of enhancement equally. As a result, DPO tends to produce more varied outputs, potentially including both high-quality descriptions and occasional misses.

The reference point in the middle of each curve represents an average product description. Moving left indicates worse descriptions, while moving right signifies better ones.

As we move further from this midpoint in either direction, all curves flatten out, illustrating that extreme feedback (very positive or very negative) has diminishing impact on the model’s learning process.

Kahneman-Tversky Optimization (KTO), represented by the orange curve, is highly sensitive to negative feedback. It reacts strongly even to minor criticisms, treating small mistakes as significant losses. However, it shows diminishing returns for positive feedback, caring less about the difference between good and excellent descriptions. This approach typically results in safe, accurate product descriptions that rarely contain errors but might lack creative flair.

KTO is particularly valuable in scenarios where risk mitigation is crucial, such as in medical advice, financial reports, or legal documents. Its strong aversion to mistakes makes it ideal for industries with strict regulatory compliance needs or where brand protection is paramount. KTO can help establish a reliable baseline of performance and build user trust, especially in new AI applications where users might be skeptical. It also aligns closely with human risk-averse decision-making, potentially producing outputs that feel more natural or relatable to users in certain contexts.

The Core of Kahneman-Tversky Optimization (KTO)

KTO leverages the value function proposed by Kahneman and Tversky to optimize directly for utility rather than pairwise preference differences. The KTO loss function is formulated as:
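Written out explicitly (a reconstruction following the formulation published in the KTO paper, where sigma is the logistic function, beta controls the strength of the KL penalty, and pi_ref is the reference model):

$$
L_{\mathrm{KTO}}(\pi_\theta, \pi_{\mathrm{ref}}) = \mathbb{E}_{x, y \sim D}\left[\lambda_y - v(x, y)\right]
$$

with

$$
r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}, \qquad
z_0 = \mathrm{KL}\big(\pi_\theta(y' \mid x)\,\|\,\pi_{\mathrm{ref}}(y' \mid x)\big),
$$

$$
v(x, y) =
\begin{cases}
\lambda_D\, \sigma\big(\beta\,(r_\theta(x, y) - z_0)\big) & \text{if } y \text{ is desirable given } x \\
\lambda_U\, \sigma\big(\beta\,(z_0 - r_\theta(x, y))\big) & \text{if } y \text{ is undesirable given } x
\end{cases}
$$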

where lambda_y takes the value lambda_D for desirable outcomes and lambda_U for undesirable ones. This method bypasses the complexity of estimating the reference point z_0 exactly by using a heuristic grounded in human beliefs and batch-level sampling.

Handling uneven sample numbers is another strength of KTO. By adjusting hyperparameters lambda_D and lambda_U, KTO can control the degree of loss aversion, ensuring performance even in class-imbalanced settings.

KTO: Estimation of KL

In practice, it is neither easy nor realistic to estimate the reference point z_0 exactly: sampling from pi_theta is slow, and humans do not perceive the complete distribution induced by pi_theta. Instead, a biased estimate of the KL term is used, grounded in what humans actually believe about the model’s outputs. Because of the availability heuristic, humans tend to overweight outputs they have already given feedback on (Tversky & Kahneman, 1973).

To better simulate the reference point recognized by humans, the following method is adopted:

  • Within a batch of offline data of size m, form mismatched pairs of x_i and y_j (where i ≠ j).
  • Estimate z_0 as follows:
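In symbols (a reconstruction based on the description above; the paper’s exact microbatching details may differ):

$$
\hat{z}_0 = \max\!\left(0,\ \frac{1}{m} \sum_{j \neq i} \log \frac{\pi_\theta(y_j \mid x_i)}{\pi_{\mathrm{ref}}(y_j \mid x_i)}\right)
$$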

KTO: Dataset Transformation

Standard feedback datasets in academic research, such as HH, SHP, and OASST, are formatted as preference data. For example, in these datasets, responses might be marked as “liked” or “disliked” by users. In the experiments, each preference pair is converted into binary feedback: the preferred response y_w is labeled as desirable and the rejected response as undesirable.

To demonstrate that KTO can also be used with non-preference data, the one-y-per-x experiment samples exactly one output y per input x. This means that for each input x (e.g., a customer query), only one output y (either desirable or undesirable) is used.
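As a rough sketch of this transformation (the field names and helper functions are illustrative, not the schema used in the paper), a preference pair can be split into two binary-feedback examples, and the one-y-per-x variant keeps only one of them:

```python
import random

def preference_pairs_to_binary(pairs):
    """Split each preference pair (x, y_w, y_l) into two binary-feedback examples."""
    examples = []
    for pair in pairs:
        examples.append({"prompt": pair["x"], "completion": pair["y_w"], "label": "desirable"})
        examples.append({"prompt": pair["x"], "completion": pair["y_l"], "label": "undesirable"})
    return examples

def one_y_per_x(pairs, seed=0):
    """For the one-y-per-x setting: keep exactly one (randomly chosen) completion per prompt."""
    rng = random.Random(seed)
    examples = []
    for pair in pairs:
        if rng.random() < 0.5:
            examples.append({"prompt": pair["x"], "completion": pair["y_w"], "label": "desirable"})
        else:
            examples.append({"prompt": pair["x"], "completion": pair["y_l"], "label": "undesirable"})
    return examples
```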

However, several challenges remain:

  1. Converting nuanced preferences into binary feedback is still difficult. For instance, deciding whether feedback should be classified as simply “positive” or “negative” can be challenging when it contains mixed or nuanced opinions.
  2. Another significant area of exploration is designing score-based HALOs (Human-Aware Losses). This design is both possible and necessary: instead of purely binary feedback, assigning scores (e.g., 1 to 5 stars) could more accurately capture human preferences and provide richer data for training AI models.

KTO: Hyperparameters (Handling Uneven Sample Numbers)

The default weighting function in Kahneman-Tversky Optimization (KTO) uses two hyperparameters, λD and λU​, to control the degree of loss aversion for desirable and undesirable outcomes, respectively. These parameters help balance the model’s sensitivity, particularly in class-imbalanced settings where the number of desirable samples (nD​) and undesirable samples (nU​) are uneven.

For instance, if there are 10 undesirable samples for every desirable sample, the ratio of nD​ to nU​ is 1:10. To handle this imbalance, λU​ is set to 1, providing a baseline for undesirable outcomes. Then, λD​ is adjusted within the range of 10 to 10.33 based on experimental results. This specific range helps ensure that the model appropriately weights the fewer desirable samples more heavily.

Consider a concrete scenario with 10 desirable samples (nD = 10) and 100 undesirable samples (nU = 100), again a 1:10 ratio. Setting λU to 1 provides a baseline for the undesirable outcomes, and adjusting λD to a value between 10 and 10.33 weights the scarcer desirable samples more heavily, allowing the model to learn effectively from the limited desirable data while maintaining an appropriate balance with the more abundant undesirable data.
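As a minimal sketch of this weighting rule (the helper function is hypothetical, and the upper-bound factor of 1.033 simply reproduces the 10-10.33 example above):

```python
def kto_loss_weights(n_desirable, n_undesirable, lambda_u=1.0):
    """Choose lambda_D so the scarcer desirable examples are weighted more heavily.

    Mirrors the example above: with a 1:10 desirable-to-undesirable ratio and
    lambda_U = 1, lambda_D falls in roughly [10, 10.33].
    """
    ratio = n_undesirable / n_desirable            # e.g. 100 / 10 = 10
    lambda_d_range = (lambda_u * ratio, lambda_u * ratio * 1.033)
    return lambda_u, lambda_d_range

lambda_u, (lambda_d_low, lambda_d_high) = kto_loss_weights(n_desirable=10, n_undesirable=100)
print(lambda_u, lambda_d_low, lambda_d_high)       # 1.0 10.0 ~10.33
```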

Unlike the traditional Kahneman and Tversky value function, which is more sensitive to losses, this approach in KTO is more sensitive to gains. This ensures that the model can effectively learn from the limited desirable data, providing a balanced and nuanced understanding even in class-imbalanced settings.

Experimental Insights: KTO in Action

Win Rate Evaluation

KTO consistently outperforms DPO across various model scales, particularly in Llama models (7B, 13B, 30B). Statistical significance is observed at the 7B and 30B scales (p < 0.01), indicating superior performance even with minimal desirable data.

Pythia models show no significant difference, suggesting the need for a minimum model capacity for KTO’s effectiveness.

This suggests that for KTO to demonstrate its advantages, models need a sufficient number of parameters to leverage the alignment method effectively. The threshold is indicated by the lack of significant differences in the Pythia models, while Llama models (7B, 13B, 30B) show significant improvements.

Generative Benchmarks

In the GSM8K dataset, focused on mathematical reasoning tasks, KTO improves performance by 13.5 points compared to DPO. This significant improvement underscores KTO’s potential to enhance generative models across a variety of applications.

At Sufficient Scale, KTO Does Not Need SFT

KTO-aligned Llama models (13B, 30B) perform competitively with SFT + KTO models, even without prior Supervised Fine-Tuning (SFT). This suggests that KTO alone can keep average response length in check without pre-SFT, whereas average response length tends to grow when DPO is applied without SFT.

Notably, DPO-aligned models without SFT tend to ramble and hallucinate, whereas KTO does not suffer from these issues.

Data Imbalance and Robustness

KTO maintains high performance even with significant data imbalance. In the Llama-7B model, KTO outperforms DPO even when up to 90% of the desirable data is removed. Despite a 72% reduction in training data volume, models aligned with KTO outperform DPO-aligned models and the official Mistral-7B-Instruct.

Choosing Between KTO and DPO

When deciding between KTO and DPO, the nature of the feedback and data is crucial. KTO is ideal for binary feedback scenarios, imbalanced data, and high-noise environments. It offers stable performance even when feedback data is inconsistent.

On the other hand, DPO excels with abundant, low-noise preference data, where feedback is highly accurate and indicates clear preferences.

Conclusion

Kahneman-Tversky Optimization (KTO) has proven to be a highly effective method for aligning AI models, particularly in noisy or imbalanced data scenarios. It introduces a fresh perspective on AI alignment by incorporating human decision-making biases and risk aversion tendencies, grounded in Prospect Theory.

This approach enables the model’s output to be finely tuned to human preferences, providing significant improvements over traditional methods like DPO.

Experiments conducted across multiple scales and settings consistently show that KTO often outperforms DPO, especially when feedback includes noise or data is imbalanced. This robustness and efficiency make KTO a promising method for future AI alignment tasks. As research continues, exploring more complex methods for binary feedback division and score-based HALOs could further enhance the effectiveness of human-aware losses in generative models.

In summary, KTO represents a significant advancement in AI model alignment, offering a new way to create more human-centric and robust AI systems. By integrating insights from Prospect Theory, KTO paves the way for the next generation of AI models that align more closely with human intentions and preferences.

References

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., & Kiela, D. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. ICML 2024.
Ethayarajh, K., Xu, W., Jurafsky, D., & Kiela, D. (2023). Human-Aware Loss Functions (HALOs). Technical report, Contextual AI.
Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica, 47(2), 263-291.
Tversky, A., & Kahneman, D. (1973). Availability: A Heuristic for Judging Frequency and Probability. Cognitive Psychology, 5(2), 207-232.
Tversky, A., & Kahneman, D. (1992). Advances in Prospect Theory: Cumulative Representation of Uncertainty. Journal of Risk and Uncertainty, 5(4), 297-323.

I’m Joe, and my ambition is to lead the way to industry 5.0 performance. I’m always interested in new opportunities, so don’t hesitate to contact me on my LinkedIn.
