Self-Rewarding Language Models
Current language models are bottlenecked not only by the quantity of labeled data but also by its quality. In our previous blog, we talked about DPO as a more stable and efficient way to optimize a model than RLHF. The problem is that DPO still requires a substantial amount of labeled human preference data to train a model. This paper introduces agents that can do both of the following:
- Act as an instruction-following model, generating a response for a given prompt.
- Generate and evaluate new instruction-following examples to add to their own training set.
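To make these two roles concrete, here is a minimal Python sketch of one self-rewarding iteration. The `generate_response` and `judge_score` helpers are hypothetical stubs standing in for the same underlying LLM: the model samples several candidate responses per prompt, scores them with its own LLM-as-a-Judge prompt, and keeps the best and worst as a preference pair for DPO training.

```python
import random

# Minimal sketch of one self-rewarding iteration (hypothetical helpers).
# In the paper, the SAME model plays both roles; here they are stubbed
# with placeholders so the loop itself is runnable.

def generate_response(prompt: str) -> str:
    """Role 1: act as an instruction-following model (stub)."""
    return f"candidate answer to {prompt!r} #{random.randint(0, 999)}"

def judge_score(prompt: str, response: str) -> float:
    """Role 2: self-evaluate a candidate via an LLM-as-a-Judge prompt (stub).
    The paper has the model grade responses on a 0-5 additive scale."""
    return random.uniform(0.0, 5.0)

def build_preference_pairs(prompts: list[str], n_candidates: int = 4) -> list[dict]:
    """Sample several candidates per prompt, self-score them, and keep the
    highest- and lowest-scored as a (chosen, rejected) DPO preference pair."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_response(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda c: judge_score(prompt, c))
        pairs.append({"prompt": prompt,
                      "chosen": ranked[-1],    # best self-reward
                      "rejected": ranked[0]})  # worst self-reward
    return pairs

if __name__ == "__main__":
    for pair in build_preference_pairs(["Explain DPO in one sentence."]):
        print(pair)
```

Because the reward signal comes from the model itself, each DPO round can, in principle, improve both the quality of the responses and the quality of the judge.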
So, without further ado, let’s jump right into it.
Table of Contents
- Summary of LLM Training Pipeline
- The Problem of Labeled Data
- Self-Rewarding Language Models
- Results and Conclusion
Summary of LLM Training Pipeline
Before we get into self-rewarding models, we should take a quick look at the LLM training pipeline. Here are the different steps:
Step 1: Pre-training
- Data Collection and Diversity: A vast dataset is gathered from the internet, ensuring a wide spectrum of text sources. The diversity of data is crucial for the model to understand various language styles…