Self-Rewarding Language Model

Vishal Rajput · Published in AIGuys · 9 min read · Mar 27, 2024


Current language models are bottlenecked not only by the quantity of labeled data but also by its quality. In our previous blog, we talked about DPO as a more stable and efficient way to optimize a model than RLHF. The problem is that DPO still requires a substantial amount of labeled human preference data to train a model. This paper introduces models that can do both: act as an instruction-following model, generating a response given a prompt, and generate and evaluate new instruction-following examples to add to their own training set (sketched below).
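To make the idea concrete, here is a minimal sketch (not the paper's actual code) of how such a self-rewarding loop could assemble its own preference data. The `generate` and `judge` callables are hypothetical placeholders for the same underlying model prompted in two different roles; in the paper, the judging role is handled with an LLM-as-a-Judge prompt, and the resulting pairs feed iterative DPO training.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-scoring candidate response
    rejected: str  # lowest-scoring candidate response

def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], str],      # the model acting as an instruction follower
    judge: Callable[[str, str], float],  # the same model acting as its own reward model
    n_samples: int = 4,
) -> List[PreferencePair]:
    """Sample several responses per prompt, score them with the model-as-judge,
    and keep the best/worst pair as a DPO training example."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        scored = sorted((judge(prompt, r), r) for r in candidates)
        worst_score, worst = scored[0]
        best_score, best = scored[-1]
        # Only keep the pair if the judge actually distinguishes the candidates.
        if best_score > worst_score:
            pairs.append(PreferencePair(prompt, chosen=best, rejected=worst))
    return pairs
```

The pairs produced this way would then be used for a DPO training step, and the loop repeats with the newly trained model generating and judging the next batch of data.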

So, without further ado, let’s jump right into it.

Photo by Adi Goldstein on Unsplash

Table of Contents

  • Summary of LLM Training Pipeline
  • The Problem of Labelled Data
  • Self-Rewarding Language Model
  • Result and Conclusion

Summary of LLM Training Pipeline

Before we get into self-rewarding models, let's take a quick look at the standard LLM training pipeline. Here are the different steps:

Step 1: Pre-training

  1. Data Collection and Diversity: A vast dataset is gathered from the internet, ensuring a wide spectrum of text sources. The diversity of data is crucial for the model to understand various language styles…
