Self-Rewarding Language Models
Current language models are bottlenecked not only by the quantity of labeled data but also by its quality. In our previous blog, we talked about DPO as a more stable and efficient way to optimize a model than RLHF. The problem is that DPO still requires a substantial amount of labeled human preference data to train a model. This paper introduces agents that can do both of the following:
- Act as an instruction-following model, generating a response for a given prompt.
- Generate and evaluate new instruction-following examples to add to their own training set.
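To make these two roles concrete, here is a minimal Python sketch of one self-rewarding iteration. The `generate_response` and `judge_score` helpers are hypothetical stubs standing in for the same underlying LLM: the model samples several candidate responses per prompt, scores them with its own LLM-as-a-Judge prompt, and keeps the best and worst as a preference pair for DPO training.

```python
import random

# Minimal sketch of one self-rewarding iteration (hypothetical helpers).
# In the paper, the SAME model plays both roles; here they are stubbed
# with placeholders so the loop itself is runnable.

def generate_response(prompt: str) -> str:
    """Role 1: act as an instruction-following model (stub)."""
    return f"candidate answer to {prompt!r} #{random.randint(0, 999)}"

def judge_score(prompt: str, response: str) -> float:
    """Role 2: self-evaluate a candidate via an LLM-as-a-Judge prompt (stub).
    The paper has the model grade responses on a 0-5 additive scale."""
    return random.uniform(0.0, 5.0)

def build_preference_pairs(prompts: list[str], n_candidates: int = 4) -> list[dict]:
    """Sample several candidates per prompt, self-score them, and keep the
    highest- and lowest-scored as a (chosen, rejected) DPO preference pair."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_response(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda c: judge_score(prompt, c))
        pairs.append({"prompt": prompt,
                      "chosen": ranked[-1],    # best self-reward
                      "rejected": ranked[0]})  # worst self-reward
    return pairs

if __name__ == "__main__":
    for pair in build_preference_pairs(["Explain DPO in one sentence."]):
        print(pair)
```

Because the reward signal comes from the model itself, each DPO round can, in principle, improve both the quality of the responses and the quality of the judge.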
So, without further ado, let’s jump right into it.
Table of Contents
- Summary of LLM Training Pipeline
- The Problem of Labeled Data
- Self-Rewarding Language Models
- Results and Conclusion
Summary of LLM Training Pipeline
Before we get into self-rewarding models, we should take a quick look at the LLM training pipeline. Here are the different steps:
Step 1: Pre-training
- Data Collection and Diversity: A vast dataset is gathered from the internet, ensuring a wide spectrum of text sources. The diversity of data is crucial for the model to understand various language styles…