DeepSeek R1 Beating OpenAI In Reasoning
Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align models with social values, and adapt them to user preferences, all while requiring relatively minimal computational resources compared with pre-training.
In the context of reasoning capabilities, OpenAI's o1 series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has significantly improved performance on various reasoning tasks, such as mathematics, coding, and scientific reasoning.
Several previous works have explored various approaches, including process-based reward models, reinforcement learning, and search algorithms such as Monte Carlo Tree Search and Beam Search. However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models. So, let's see what DeepSeek has cooked up to challenge the leader in reasoning.
Topics Covered
- Understanding Reasoning
- Deep Diving Into RLHF and RLAIF
- The Multi-Point RL Problem
- Post-Training: Large-Scale Reinforcement Learning on the Base Model
- Summarizing DeepSeek R1