Meta’s “Self-Rewarding Language Models” paper explained

Smit Shah
7 min read · Jan 24, 2024

Introduction

In the paper “Self-Rewarding Language Models,” the authors discuss the evolution of training Large Language Models (LLMs). Traditionally, models like these have been enhanced using human preference data, significantly boosting their ability to follow instructions. This is often achieved through techniques like Reinforcement Learning from Human Feedback (RLHF), where a reward model, trained on human preferences, is used to guide the LLM. Another method is Direct Preference Optimization (DPO), which directly applies human preferences to train the LLM. However, both methods face limitations tied to the volume and quality of human feedback, with RLHF additionally constrained by the static nature of the reward model once it’s trained.

In this work, the authors instead propose a self-improving reward model that, rather than being frozen, is continually updated during LLM alignment, avoiding this bottleneck. The key to such an approach is to develop an agent that possesses all the abilities desired during training, rather than separating them into distinct models such as a reward model and a language model.

They thus introduce Self-Rewarding Language Models, agents that both (i) act as instruction following models generating responses for given prompts; and (ii) can generate and evaluate new instruction following examples to add to their own training set.

Figure 1: Self-Rewarding Language Models

What is DPO?

Figure 2: Direct Preference Optimization

Direct Preference Optimization (DPO) is an algorithm that directly optimizes a policy to satisfy human preferences via a simple classification objective. It is a simpler alternative to RLHF (Reinforcement Learning from Human Feedback): stable, performant, and computationally lightweight. DPO can fine-tune LLMs to align with human (or AI) preferences as well as or better than existing methods, and it does so in a single stage of policy training, sidestepping the more convoluted machinery of RLHF, such as fitting a separate reward model and then running reinforcement learning against it.
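To make the objective concrete, here is a minimal PyTorch-style sketch of the DPO loss (not code from the paper; the tensor names and the beta value are illustrative assumptions). It compares the policy's log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response against a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO as a binary classification objective over preference pairs.

    Each input is the summed log-probability of the chosen / rejected
    response under the trainable policy or the frozen reference model.
    beta (an assumed value here) controls how far the policy may drift
    from the reference.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Push the chosen response's log-ratio above the rejected one's.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Minimizing this loss nudges the policy to assign relatively more probability to the chosen response than the reference model does, without needing an explicit reward model at training time.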

Self-Rewarding Language Models

Self-Rewarding Language Models (SR-LMs) represent a transformative approach in AI, building on a base pretrained language model and a small set of human-annotated seed data. These models aim to master two crucial skills: instruction following and self-instruction creation. This dual capability allows them to perform self-alignment through AI Feedback (AIF), continually refining their abilities.

The Dual Skills of SR-LMs

  1. Instruction Following: SR-LMs are designed to respond to user requests with high-quality, helpful, and safe responses.
  2. Self-Instruction Creation: These models can generate and evaluate new instruction-following examples, adding them to their own training set.

LLM-as-a-Judge: The Key to Self-Evaluation

The self-instruction creation in SR-LMs is facilitated by the LLM-as-a-Judge mechanism. This process involves the model generating candidate responses and then evaluating their quality, effectively acting as its own reward model. This eliminates the need for an external reward model, making the process more efficient and dynamic.

Below is the LLM-as-a-Judge prompt used to have the LLM act as a reward model and provide self-rewards for its own generations.

Review the user’s question and the corresponding response using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

— Add 1 point if the response is relevant and provides some information related to the user’s inquiry, even if it is incomplete or contains some irrelevant content.

— Add another point if the response addresses a substantial portion of the user’s question, but does not completely resolve the query or provide a direct answer.

— Award a third point if the response answers the basic elements of the user’s question in a useful way, regardless of whether it seems to have been written by an AI Assistant or if it has elements typically found in blogs or search results.

— Grant a fourth point if the response is clearly written from an AI Assistant’s perspective, addressing the user’s question directly and comprehensively, and is well-organized and helpful, even if there is slight room for improvement in clarity, conciseness or focus.

— Bestow a fifth point for a response that is impeccably tailored to the user’s question by an AI Assistant, without extraneous information, reflecting expert knowledge, and demonstrating a high-quality, engaging, and insightful answer.

User: <INSTRUCTION_HERE>

<response><RESPONSE_HERE></response>

After examining the user’s instruction and the response:

— Briefly justify your total score, up to 100 words.

— Conclude with the score using the format: “Score: <total points>”

Remember to assess from the AI Assistant perspective, utilizing web search knowledge as necessary. To evaluate the response in alignment with this additive scoring model, we’ll systematically attribute points based on the outlined criteria.
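As a rough illustration of how this prompt acts as a reward model, the sketch below formats the prompt for one candidate response and parses the final score from the judge's reply. The generate callable is a stand-in for whatever inference call you use (an assumption for this sketch, not an API from the paper), and the rubric is abbreviated since it appears in full above:

```python
import re
from typing import Callable, Optional

# Abbreviated; the full additive 5-point rubric from the paper goes here.
JUDGE_PROMPT = """Review the user's question and the corresponding response using the
additive 5-point scoring system described below. [...]

User: {instruction}

<response>{response}</response>

After examining the user's instruction and the response:
- Briefly justify your total score, up to 100 words.
- Conclude with the score using the format: "Score: <total points>"
"""

def self_reward(instruction: str, response: str,
                generate: Callable[[str], str]) -> Optional[float]:
    """Have the model judge its own response and return the parsed score.

    generate is a hypothetical text-generation callable (prompt -> text).
    Returns None if no "Score: N" line is found in the judgement.
    """
    judgement = generate(JUDGE_PROMPT.format(instruction=instruction,
                                             response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    return float(match.group(1)) if match else None
```

In the paper, each candidate response is scored this way (the authors sample the judgement several times and average) so that the scores can be turned into preference pairs for training.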

Iterative Training for Continuous Improvement

The training of SR-LMs is iterative, meaning each new version of the model builds upon the training data created by the previous iteration. This process begins with a base pretrained language model and progresses through stages of fine-tuning using Instruction Fine-Tuning (IFT) data and Evaluation Fine-Tuning (EFT) data. The model generates new prompts and candidate responses, which are then evaluated and used as AI Feedback Training (AIFT) data for further training.

Model Sequence and AI Feedback

The sequence of models in SR-LMs training is as follows:

  • M0: The base pretrained LLM, with no fine-tuning.
  • M1: Initialized from M0, then fine-tuned on the IFT+EFT seed data with supervised fine-tuning (SFT).
  • M2: Initialized from M1, then trained on AIFT(M1) data with DPO.
  • M3: Initialized from M2, then trained on AIFT(M2) data with DPO.

This iterative process, involving AI Feedback Training, allows for the continuous improvement of the model’s capabilities, surpassing the limitations of traditional training methods that rely on fixed reward models.
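Putting the pieces together, a single self-rewarding iteration (Mt to Mt+1) can be sketched roughly as follows. This is a simplified outline, not the authors' code: generate_prompts, generate, and train_with_dpo are hypothetical helpers, and self_reward is the judge sketch from above; in the paper, new prompts are produced by few-shot prompting from the seed IFT tasks and several candidate responses are sampled per prompt.

```python
def self_rewarding_iteration(model, num_prompts: int = 1000,
                             n_candidates: int = 4):
    """One iteration: build AIFT preference pairs with self-rewards, then DPO.

    generate_prompts / generate / train_with_dpo are hypothetical helpers;
    self_reward is the judge-scoring sketch shown earlier.
    """
    preference_pairs = []
    for prompt in generate_prompts(model, num_prompts):
        # Sample several candidate responses and score each with the judge.
        candidates = [generate(model, prompt) for _ in range(n_candidates)]
        scored = [(self_reward(prompt, c, lambda p: generate(model, p)), c)
                  for c in candidates]
        scored = [(s, c) for s, c in scored if s is not None]
        if len(scored) < 2:
            continue

        best = max(scored, key=lambda sc: sc[0])
        worst = min(scored, key=lambda sc: sc[0])
        # Keep only prompts where the judge clearly prefers one response.
        if best[0] > worst[0]:
            preference_pairs.append({"prompt": prompt,
                                     "chosen": best[1],
                                     "rejected": worst[1]})

    # Train the next model on the AIFT preference pairs with DPO,
    # e.g. using the dpo_loss sketched earlier.
    return train_with_dpo(model, preference_pairs)
```

Each new model is then used both to answer prompts and to judge its own candidates in the next round, which is what lets the reward signal improve alongside the policy.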

Results

Figure 3: Instruction following ability improves with self-training. The authors evaluate their models using head-to-head win rates on diverse prompts, with GPT-4 as the judge. The SFT baseline is roughly on par with Self-Rewarding Iteration 1 (M1). Iteration 2 (M2), however, outperforms both Iteration 1 (M1) and the SFT baseline, and Iteration 3 (M3) gives further gains over Iteration 2 (M2), outperforming M1, M2, and the SFT baseline by a large margin.

Evaluating the Effectiveness of Self-Rewarding Language Models

The results from the study on Self-Rewarding Language Models (SR-LMs) reveal significant advancements. One key finding is that combining Evaluation Fine-Tuning (EFT) with Instruction Fine-Tuning (IFT) for seed training produces results similar to using IFT alone. This is crucial because it indicates that adding the ability to self-reward does not compromise the model's other skills, such as instruction following. As the model progresses from its first iteration (M1) to subsequent iterations (M2 and M3), there is a marked improvement in instruction-following ability. For instance, M2 achieves a 55.5% win rate over M1 in head-to-head evaluations, and M3 advances this further with a 47.7% win rate over M2 in the same setting. The models also become more effective on the AlpacaEval 2.0 leaderboard, with each iteration outperforming the previous one and Iteration 3 even surpassing existing models such as Claude 2, Gemini Pro, and GPT-4 0613.

Figure 4: AlpacaEval 2.0 results

Improvements in Reward Modeling and Preference Optimization

The study also highlights the improvement in the reward modeling capability of SR-LMs through self-training. When EFT data, which trains the model to act as an LLM-as-a-Judge, is added to the training process, the model's judging performance improves notably across various metrics, including an increase in pairwise accuracy agreement with humans from 65.1% to 78.7%. Moreover, the self-rewarding training enhances not only the model's instruction-following abilities but also its skill at providing self-rewards for subsequent iterations: pairwise accuracy improves from 78.7% in Iteration 1 (M1) to 80.4% in Iteration 2 (M2), and further to 81.7% in Iteration 3 (M3). These results underscore the effectiveness of the self-rewarding approach, particularly in comparison to alternatives such as augmenting training with only positive examples, which showed no significant improvement. The findings demonstrate the potential of SR-LMs to advance language models through iterative self-improvement and increasingly capable reward modeling.

Conclusion: The Future of AI with Self-Rewarding Language Models

Self-Rewarding Language Models (SR-LMs) mark a significant leap forward in the field of artificial intelligence. These models, capable of self-alignment, judge and train on their own generated content through an innovative iterative process. By using the LLM-as-a-Judge mechanism to assign rewards to their own outputs and training on these preferences via Iterative DPO (Direct Preference Optimization), SR-LMs not only enhance their instruction-following abilities but also their reward-modeling capabilities over multiple iterations. This preliminary study opens up an exciting research avenue, suggesting the potential for these models to continuously improve beyond the limitations of human preferences traditionally used in building reward models. The emergence of SR-LMs holds promise for a future where AI can autonomously refine and surpass its own capabilities, creating a virtuous circle of learning and advancement.

And if you enjoyed reading this post, don’t forget to leave a clap and follow me for more exciting AI paper explanation posts!

Signing off!

References

  1. Yuan, Weizhe, et al. “Self-Rewarding Language Models.” arXiv preprint arXiv:2401.10020 (2024).
  2. https://huggingface.co/
