What is RLCD (Reinforcement Learning from Contrast Distillation) for Language Model Alignment?

Joe El Khoury - GenAI Engineer
5 min read · Jul 14, 2024


RLCD is a method developed to align language models with human preferences without using human feedback data. This approach aims to address limitations in existing methods such as Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and context distillation.

RLHF vs RLAIF vs RLCD

RLHF involves three steps (a minimal code sketch follows the list):

  1. Collect demonstration data and train a supervised policy.
  2. Collect comparison data and train a reward model.
  3. Optimize a policy against the reward model using reinforcement learning.
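
As a rough illustration of this pipeline, here is a minimal Python sketch; the helper functions (`sft_train`, `train_reward_model`, `ppo_optimize`) are hypothetical placeholders, not a real library API.

```python
# Minimal sketch of the three RLHF stages (illustrative placeholders only).

def sft_train(base_model, demonstrations):
    """Stage 1: supervised fine-tuning on human demonstration data."""
    return base_model  # placeholder: would return the fine-tuned policy

def train_reward_model(policy, comparisons):
    """Stage 2: fit a reward model on human comparison (preference) data."""
    return lambda prompt, response: 0.0  # placeholder scorer

def ppo_optimize(policy, reward_model, prompts):
    """Stage 3: optimize the policy against the reward model with PPO."""
    return policy  # placeholder: would return the RL-tuned policy

policy = sft_train("base-llm", demonstrations=["..."])
reward_model = train_reward_model(policy, comparisons=["..."])
aligned_policy = ppo_optimize(policy, reward_model, prompts=["..."])
```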

RLHF uses human-labeled comparison data to train the reward model, which is costly in both annotation scale and time. RLAIF replaces the human annotators with a pre-trained model (or an even stronger LLM), which generates responses and scores them to produce preference labels. Context distillation fine-tunes the model to reproduce, without the guiding context, the responses it gives when that context is present.

However, these methods have limitations. RLAIF can produce very similar answers when sampling several responses from the same prompt, which leads to a low signal-to-noise ratio in the resulting preference labels. Context distillation considers only a single output per prompt, so it has no contrastive objective and no information that would let the model recognize bad outputs as bad.

Thus, RLCD creates positive prompts and negative prompts and assigns a label of 1 or 0 to the generated data accordingly. It generates preference pairs using two contrasting prompts p+ and p-, and labels each pair according to the prompt that produced it, thus making use of both pairwise preferences for RL and the directional attribute changes that the prompts encourage in the outputs. RLCD then trains a preference model on the resulting pairs, which is used to guide LLM alignment via PPO.
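
As a rough sketch of that data-generation step (illustrative Python only, not the paper's code; the affix wording and the `generate` helper are assumptions):

```python
# Illustrative sketch of RLCD preference-pair generation (not the authors' code).

def generate(prompt):
    """Hypothetical stand-in for sampling a completion from the base LLM."""
    return f"<completion for: {prompt}>"

POS_AFFIX = "(helpful, harmless response)"   # example positive affix (assumed wording)
NEG_AFFIX = "(harmful, offensive response)"  # example negative affix (assumed wording)

def make_preference_pair(user_prompt):
    """Sample one completion under the positive prompt p+ and one under the
    negative prompt p-, then label them by construction: the p+ output is
    'chosen' (label 1) and the p- output is 'rejected' (label 0)."""
    y_pos = generate(f"{user_prompt} {POS_AFFIX}")
    y_neg = generate(f"{user_prompt} {NEG_AFFIX}")
    # The affixes are only used at generation time; the pair is stored with the
    # original prompt so the preference model sees plain prompts.
    return {"prompt": user_prompt, "chosen": y_pos, "rejected": y_neg}

pairs = [make_preference_pair("How do I respond to an angry customer?")]
# A preference (reward) model is then trained on these pairs and used to
# fine-tune the base LLM with PPO, as in standard RLHF.
print(pairs[0])
```

Because the label comes from which prompt produced each completion, no scoring model or human annotator is needed at this stage.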

The key difference is that RLCD uses both the harmful and the harmless completions for training, providing a more nuanced learning signal than RLAIF and context distillation.

List of affixes for positive prompts and negative prompts (harmlessness task):
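
The full affix list is given in the paper; the pairs below only illustrate the pattern (short positive/negative descriptions appended to the prompt) and are not the paper's verbatim list:

```python
# Illustrative positive/negative affix pairs for the harmlessness task
# (pattern only; see the paper for the exact list).
AFFIX_PAIRS = [
    ("(helpful, honest, inoffensive response)", "(unhelpful, dishonest, toxic response)"),
    ("(law-abiding, ethical response)",         "(illegal, unethical response)"),
    ("(thoughtful, respectful response)",       "(thoughtless, offensive response)"),
]
```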

Evaluation Axes:

  • Helpfulness: Provision of practical, constructive advice to humans.
  • Clarity: Clear and understandable communication.
  • Empathy: Demonstration of understanding and consideration for the user’s situation.

The training data for the reward model is generated with LLaMA-7B and LLaMA-30B; validation evaluation uses LLaMA-2.

Human annotation method:

Human Evaluation Results:

The paper presents human evaluation results using an 8-point Likert scale, normalized so that higher scores indicate better performance.
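
As a small worked example of this kind of rescaling (the exact formula is an assumption; the paper only states that scores are normalized so that higher is better):

```python
# Map an 8-point Likert rating (1 = worst, 8 = best) onto [0, 1].
# The specific rescaling is an assumption for illustration only.
def normalize_likert(score, low=1, high=8):
    return (score - low) / (high - low)

print(normalize_likert(1), normalize_likert(8))  # 0.0 1.0
```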

Prompts for GPT-4 Evaluation:
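
The exact prompts are given in the paper; the template below is only an illustration of the pairwise-comparison style used when GPT-4 acts as the evaluator, not the authors' wording:

```python
# Illustrative pairwise-comparison template for GPT-4 evaluation
# (style only; not the paper's exact prompt).
EVAL_TEMPLATE = """Consider the following conversation between a human and an assistant:

{conversation}

Here are two possible assistant responses:
(A) {response_a}
(B) {response_b}

Which response is more harmless? Answer with "A" or "B" only."""
```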

Conclusion

In simple terms, RLCD (Reinforcement Learning from Contrast Distillation) offers a novel approach to training language models without relying on human-labeled data. It creates pairs of helpful and unhelpful responses to prompts, automatically labeling them based on their source. This process trains a preference model that distinguishes between good and bad outputs, which then guides the main language model's improvement through reinforcement learning. By eliminating the need for manual labeling, RLCD potentially allows for faster, more cost-effective, and scalable model training. It also provides flexibility in aligning models with specific goals or values by adjusting the criteria for positive and negative responses. Essentially, RLCD aims to produce more desirable language model outputs through an efficient, automated process of learning from contrasts.

Limitations and Future Work:

The paper acknowledges several limitations and areas for future research:

  1. Clarity of Alignment: The exact nature of what is being aligned through the RLCD process remains unclear. How does attaching the affixes strengthen outputs along specific evaluation axes?
  2. Data Quality: There is a possibility of noise near the decision boundary of the preference classification data. Could accuracy be improved by excluding this data and training only on high-confidence examples?
  3. Comparative Analysis: A direct comparison between RLCD and traditional RLHF methods is not provided in this study. How does RLCD compare with RLHF in terms of performance, efficiency, and scalability?
  4. Generalization: The study focuses on specific tasks (harmlessness, helpfulness, and outlining). How well does RLCD generalize to other types of tasks or domains?
  5. Long-term Effects: The long-term effects of using AI-generated feedback for language model alignment are not explored. What are the potential consequences of this approach on model behavior over time?
  6. Ethical Considerations: The paper does not extensively discuss the ethical implications of using AI-generated feedback for model alignment. What are the potential risks or biases that might be introduced through this method?

These limitations and questions highlight the need for further research in the field of language model alignment without human feedback. Future studies could address these points to provide a more comprehensive understanding of the RLCD method and its implications for AI development.

References

I’m Joe, and my ambition is to lead the way to industry 5.0 performance. I’m always interested in new opportunities, so don’t hesitate to contact me on my LinkedIn.
