[Research Paper Summary] Training Language Models to Self-Correct via Reinforcement Learning

Himanshu Bamoria
Published in Athina AI
4 min read · Oct 3, 2024

Original Paper: https://arxiv.org/abs/2409.12917

By: Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust

Abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs.

Existing approaches for training self-correction either require multiple models or rely on a stronger model or other forms of supervision.

To this end, we introduce SCoRe, a multi-turn online reinforcement learning (RL) approach that significantly improves an LLM’s ability to correct its own mistakes using entirely self-generated data.

We begin by showing that variants of supervised fine-tuning (SFT) on offline, model-generated correction traces are insufficient for instilling self-correction behavior.

In particular, we show that SFT-based training either suffers from a distribution mismatch between the training data and the model’s own responses, or implicitly prefers only a certain mode of correction behavior that is frequently ineffective at test time.

SCoRe overcomes these issues by training on the model’s own distribution of self-generated correction traces and using appropriate regularization to steer learning toward a self-correction strategy that is effective at test time, rather than simply fitting high-reward responses for a given prompt.

This regularization entails first running a phase of RL on the base model to obtain a policy initialization that is less prone to collapse, and then adding a reward bonus to encourage self-correction during training.

When applied to the Gemini 1.0 Pro and 1.5 Flash models, SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% on the MATH benchmark and 9.1% on HumanEval.

Summary Notes

Figure: Left: SCoRe achieves state-of-the-art self-correction performance on MATH. Right: SCoRe inference-time scaling: spending samples on sequential self-correction becomes more effective than spending them only on parallel direct samples.

Overview

LLMs underpin numerous applications in the rapidly developing field of artificial intelligence, including code generation and mathematical problem solving.

Nevertheless, these models frequently struggle with self-correction, an ability that is essential for independently improving their own outputs.

The research paper “Training Language Models to Self-Correct via Reinforcement Learning” introduces SCoRe, a novel reinforcement learning-based framework designed to equip LLMs with effective self-correction capabilities.

Techniques: The Framework of SCoRe

The key component of SCoRe’s methodology is its use of self-generated data in a multi-turn reinforcement learning (RL) strategy.

By training models on their own distribution of responses, SCoRe avoids the distribution mismatch and the collapse to a single, often ineffective correction mode that typical supervised fine-tuning (SFT) methods suffer from.
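
As a rough illustration of what training on the model’s own distribution means in practice, the sketch below collects two-attempt traces directly from the current policy rather than from a fixed dataset of correction examples. The interfaces (`policy.generate`, `verifier`) and the correction prompt are illustrative assumptions, not the paper’s implementation.

```python
# Minimal sketch (assumed interfaces, not the paper's code): collect on-policy,
# two-attempt traces so the model learns from its *own* response distribution.

CORRECTION_PROMPT = (
    "There might be an error in the solution above. "
    "Please correct it and give your final answer."
)  # hypothetical instruction; the paper uses its own self-correction prompt

def collect_self_correction_traces(policy, problems, verifier):
    """Roll out two sequential attempts per problem with the current policy."""
    traces = []
    for x in problems:
        y1 = policy.generate(x)                           # first attempt (on-policy)
        y2 = policy.generate(x + y1 + CORRECTION_PROMPT)  # self-corrected second attempt
        traces.append({
            "problem": x,
            "attempt_1": y1,
            "attempt_2": y2,
            "reward_1": float(verifier(x, y1)),           # e.g. answer checking on MATH
            "reward_2": float(verifier(x, y2)),
        })
    return traces
```

Because both attempts are sampled from the policy being trained, the correction behavior the model learns is the one it will actually face at test time.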

This is accomplished in two primary stages:

1. Stage I: Initialization: The goal of this stage is to train a model initialization that keeps the first-attempt output close to the base model while biasing the second attempt toward high-reward revisions. This is accomplished by employing KL-divergence regularization that penalizes deviations on the first attempt.

2. Stage II: Multi-turn reinforcement learning with reward shaping: Starting from the strong initialization of Stage I, the model is trained with multi-turn reinforcement learning. In this stage, a reward-shaping scheme adds a bonus for revisions that fix errors made in the first attempt, encouraging genuine corrections between attempts (see the sketch after this list).
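
The two stages can be sketched together in a few lines. This is a minimal illustration under assumed notation: `r1` and `r2` are correctness rewards for the first and second attempt, `alpha` is an illustrative bonus coefficient, and `beta` an illustrative KL weight; none of these are the paper’s exact formulation or hyperparameters.

```python
import torch.nn.functional as F

def shaped_second_attempt_reward(r1, r2, alpha=1.0):
    """Stage II sketch: reward the second attempt for being correct plus a bonus
    proportional to its improvement over the first attempt, pushing the policy
    toward genuinely fixing mistakes rather than leaving answers untouched."""
    return r2 + alpha * (r2 - r1)

def first_attempt_kl_penalty(policy_logits, base_logits, beta=0.1):
    """Stage I sketch: penalize the first-attempt token distribution for drifting
    away from the base model (KL(policy || base)), which keeps the initialization
    from collapsing before multi-turn RL begins."""
    log_p = F.log_softmax(policy_logits, dim=-1)   # current policy, log-probs
    log_q = F.log_softmax(base_logits, dim=-1)     # frozen base model, log-probs
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    return beta * kl
```

The progress bonus in `shaped_second_attempt_reward` is what discourages the degenerate strategies observed with SFT, such as making only superficial edits or never revising the first answer.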

Key findings and results

Compared to baseline approaches, SCoRe shows remarkably better self-correction capabilities:

  • SCoRe achieved a significant positive self-correction rate on the MATH benchmark, outperforming the base model by 15.6%.
  • SCoRe greatly outperformed traditional approaches on coding tasks evaluated on HumanEval, showing a 12.2% improvement in accuracy from the first attempt to the second.
  • The model’s performance on the offline code-repair task MBPP-R, where it outperformed baseline models by significant margins, provided additional evidence of its ability to self-correct.

Applications and Implications

SCoRe’s implications reach well beyond academic benchmarks.

By tackling the inherent distribution mismatch and the tendency of models to make only minor edits, SCoRe establishes a standard for training models that can independently improve their own outputs.

Potential uses for this include:

  • Autonomous Code Repair: Empowering models to continuously enhance their code outputs without human assistance.
  • Improved Problem-Solving: Giving LLMs the capacity to independently improve logical arguments or mathematical proofs.
  • Efficient Learning Systems: Lowering the requirement for oracle feedback and facilitating more effective model learning via error correction.

Conclusion

SCoRe represents an important step forward in the development of self-correcting LLMs.

It overcomes the drawbacks of conventional training paradigms by combining a structured reinforcement learning framework with carefully designed reward-shaping strategies.

As this methodology is further investigated and refined, the potential for developing more autonomous, dependable, and intelligent systems becomes increasingly apparent.

The research also opens up directions for future work, such as extending the framework to more rounds of self-correction and investigating the incorporation of finer-grained supervision during training.

In conclusion, SCoRe not only improves the self-correction skills of LLMs but also raises the bar for training sophisticated, self-improving AI systems.

Feel free to check out more blogs, research paper summaries and resources on AI by visiting our website.

Himanshu Bamoria
Athina AI

Co-founder, Athina.AI - Enabling AI teams to build production-grade AI apps 10X faster. https://hub.athina.ai/