DeepSeek Signals Next-Gen R2 Model, Unveils Novel Approach to Scaling Inference with SPCT
DeepSeek AI, a prominent player in the large language model arena, has recently published a research paper detailing a new technique for making generalist reward models (GRMs) more scalable at inference time. At the same time, the company has hinted at the imminent arrival of its next-generation model, R2, building anticipation within the AI community.
The paper, titled “Inference-Time Scaling for Generalist Reward Modeling,” introduces Self-Principled Critique Tuning (SPCT), a method that lets GRMs improve reward generation by dynamically producing principles and critiques. The behavior is trained through rejection fine-tuning and rule-based online reinforcement learning [1].
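To make the idea concrete, here is a minimal sketch of what inference-time scaling of a generative reward model might look like: the model is sampled several times, each sample producing its own principles, critique, and per-response scores, and the scores are aggregated by simple voting. The `query_grm` helper, the `parse_scores` parser, and the "Score i: s" output format are illustrative assumptions, not part of DeepSeek's released code.

```python
import re
from collections import defaultdict


def query_grm(question: str, responses: list[str], seed: int) -> str:
    """Hypothetical call to a generative reward model. It is assumed to return
    free-form text containing self-generated principles, a critique, and one
    numeric score per candidate response, e.g. 'Score 1: 7'."""
    raise NotImplementedError("plug in your own model or serving endpoint here")


def parse_scores(critique: str, n_responses: int) -> dict[int, int]:
    """Extract 'Score i: s' patterns from the critique text (assumed format)."""
    scores = {}
    for idx, value in re.findall(r"Score\s*(\d+)\s*:\s*(\d+)", critique):
        i = int(idx)
        if 1 <= i <= n_responses:
            scores[i] = int(value)
    return scores


def scaled_reward(question: str, responses: list[str], k: int = 8) -> dict[int, int]:
    """Sample the GRM k times and sum the per-response scores (simple voting).
    Larger k spends more inference-time compute on the reward signal."""
    totals: dict[int, int] = defaultdict(int)
    for seed in range(k):
        critique = query_grm(question, responses, seed=seed)
        for i, s in parse_scores(critique, len(responses)).items():
            totals[i] += s
    return dict(totals)


# Usage (once query_grm is wired to a real model):
# rewards = scaled_reward("Explain overfitting.", [resp_a, resp_b], k=16)
# best = max(rewards, key=rewards.get)
```

Summing sampled scores is the simplest aggregation scheme; the key point is that spending more compute at inference (a larger `k`) can sharpen the reward signal without retraining the model.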
This development comes at a time when the paradigm for scaling LLMs is shifting from the pre-training stage to post-training, and to the inference phase in particular, following the emergence of models like OpenAI’s o1. That approach applies more reinforcement learning (compute spent during training) and longer “thinking time” (compute spent at inference) to keep improving model performance. Notably, o1 generates a lengthy internal chain of thought before responding to users, refining its reasoning, exploring different strategies, and catching its own errors.