Scaling Multi-Objective Optimization: Meta & FAIR’s CGPO Advances General-purpose LLMs
Reinforcement Learning from Human Feedback (RLHF) has become the go-to technique for refining large language models (LLMs), but it faces significant challenges in multi-task learning (MTL), particularly reward hacking and the difficulty of optimizing several competing objectives at once.
To address these challenges, a research team from Meta GenAI and FAIR introduces Constrained Generative Policy Optimization (CGPO) in the new paper The Perfect Blend: Redefining RLHF with Mixture of Judges. CGPO offers a more structured approach to RLHF and advances the performance of general-purpose LLMs.
At the heart of CGPO is the Mixture of Judges (MoJ) mechanism, which combines cost-efficient constrained policy optimization with stratification. This design improves the RLHF process by balancing competing objectives and enabling principled tuning, achieving strong empirical results backed by theoretical guarantees. CGPO is also highly adaptable, requires minimal hyper-parameter adjustment, and fits into typical post-training pipelines. Its ability to detect and mitigate reward hacking allows it to reach Pareto-optimal solutions.
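To make the general idea more concrete, here is a minimal, hypothetical sketch of how a mixture of judges could act as a constraint gate during RLHF-style training: each judge flags responses that violate its constraint, and flagged samples are excluded from the reward-maximization update. The names Judge, moj_filter, and the example judges are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Judge:
    """A constraint judge: returns True if a (prompt, response) pair violates its rule."""
    name: str
    violates: Callable[[str, str], bool]

def moj_filter(prompt: str, responses: List[str], judges: List[Judge]) -> List[str]:
    """Keep only responses that pass every judge.

    In a constrained RLHF loop, only these surviving samples would feed the
    reward-maximization step, so the policy cannot exploit reward-hacking
    behaviors that any judge can detect.
    """
    return [
        r for r in responses
        if not any(j.violates(prompt, r) for j in judges)
    ]

# Illustrative judges (assumptions for this sketch, not the paper's judges):
judges = [
    # Guards against verbosity-based reward hacking.
    Judge("length_limit", lambda p, r: len(r) > 2000),
    # Penalizes unnecessary refusals on benign prompts.
    Judge("false_refusal", lambda p, r: r.strip().lower().startswith("i can't")),
]

if __name__ == "__main__":
    candidates = [
        "A concise, helpful answer.",
        "I can't help with that.",
        "word " * 1000,  # overly long response
    ]
    kept = moj_filter("How do I sort a list in Python?", candidates, judges)
    print(kept)  # only the concise, helpful answer survives the judge gate
```

In this toy setup the judges are cheap rule-based checks, but the same gating pattern would apply if some judges were themselves LLM-based evaluators, which is the spirit of mixing heterogeneous judges across tasks.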