What happens when you combine DeepSeek with OpenAI?
DeepSeek and OpenAI are mortal enemies. But what if they became friends?
Let’s compare the best models from DeepSeek and OpenAI on the GPQA, one of the hardest benchmarks.
- OpenAI o1: 78.28%
- DeepSeek R1: 71.50%
So they both score a C-average. Ouch. But what if you let them collaborate? Like cheating on a test with a friend?
A Mixture-of-Reasoning approach, which combined DeepSeek R1 with OpenAI o1 and other LLMs, achieved a record-shattering score of 82.83% in January 2025.
This makes it the highest-scoring AI on this benchmark so far, and the first AI to surpass human PhD experts (81.20%).
This is huge. Here’s how it was done.
2024: Mixture-of-Agents
Back in June 2024, an AI platform called Together AI quietly published a paper called Mixture-of-Agents.
This approach combined multiple large language models (LLMs) to solve a single problem. Think of it like assembling a diverse team, each with their unique strengths, to tackle complex problems collaboratively.
The biggest discovery was that you could use an aggregator LLM to synthesize the diverse responses of the individual LLMs into a single, better answer.
By combining the strengths of each individual model, the aggregator LLM was able to reach answers that one single model couldn’t achieve alone.
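The proposer-then-aggregator loop described above can be sketched in a few lines. Here `query_model` is a hypothetical stand-in with canned responses; a real system would call each provider's chat-completion API, and the model names are illustrative, not the actual Mixture-of-Agents configuration.

```python
# Minimal sketch of the Mixture-of-Agents pattern: several "proposer" models
# each answer the question, then an aggregator model synthesizes their responses.

# Canned responses stand in for real API calls (purely illustrative).
FAKE_RESPONSES = {
    "proposer-1": "The answer is B.",
    "proposer-2": "The answer is A.",
    "aggregator": "Synthesized: the answer is A.",
}

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    return FAKE_RESPONSES[model]

def mixture_of_agents(question: str, proposers: list[str], aggregator: str) -> str:
    # 1. Collect a candidate answer from each proposer model.
    candidates = [query_model(m, question) for m in proposers]

    # 2. Show every candidate to the aggregator in a single prompt.
    listing = "\n\n".join(
        f"Response {i + 1}:\n{text}" for i, text in enumerate(candidates)
    )
    agg_prompt = (
        "Synthesize these model responses into one accurate answer.\n\n"
        f"Question: {question}\n\n{listing}"
    )

    # 3. The aggregator reasons over all candidates to produce the final answer.
    return query_model(aggregator, agg_prompt)
```

The key design choice is step 2: the aggregator sees every candidate side by side, so it can resolve disagreements rather than just picking the most common answer.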
Together AI applied this theory to combine some of the best models of 2024 (GPT-4, Qwen 1.5, and Llama 3), and achieved the top score on the AlpacaEval benchmark. This was extraordinary…for 2024…
2025: Specialized Reasoning
The next year took the opposite approach: instead of combining models, labs specialized individual reasoning models, which achieved new record breakthroughs on their own.
- OpenAI o1 uses chain-of-thought to reason internally before providing an answer. It breaks every task into sub-tasks, which are each solved with specialized system tokens called reasoning tokens.
- DeepSeek R1 uses a Mixture-of-Experts architecture with 671 billion parameters. But it only activates 37 billion parameters each time, selecting the most relevant “expert” clusters for each task.
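The "activate only some experts" idea behind DeepSeek R1's architecture can be illustrated with a toy top-k router. The expert count and k below are illustrative numbers, not R1's real configuration; the point is only that a gate scores every expert but runs just the top few.

```python
# Toy sketch of Mixture-of-Experts routing: a gating function scores every
# expert, but only the top-k experts are activated for a given token.
# This is why only a fraction of a model's total parameters (e.g. ~37B of
# 671B for DeepSeek R1) are active on each forward pass.
import random

def route(gate_scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return ranked[:k]

NUM_EXPERTS = 16  # illustrative; real MoE models route among many expert clusters
TOP_K = 2

random.seed(0)
scores = [random.random() for _ in range(NUM_EXPERTS)]
active = route(scores, TOP_K)  # only these TOP_K experts run; the rest stay idle
```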
Now what if we take these powerful reasoning models of 2025, and combine them using the mixture-of-agents strategy of 2024?
This is how Ithy developed Mixture-of-Reasoning to beat the hardest benchmark.
Benchmarking with GPQA Diamond
The GPQA (Graduate-Level Google-Proof Q&A) is widely regarded as one of the hardest tests of AI proficiency, and serves as a benchmark for every major new LLM.
GPQA is designed to assess an AI’s ability to handle complex reasoning tasks across various expert domains like biology, chemistry, and physics. The “Google-proof” design makes it especially hard for LLMs to answer questions with just training data or web searches.
With 198 challenging multiple-choice questions, the Diamond subset of this benchmark is so difficult that human PhD experts typically score only 81.20% in their own domain. The top LLMs today score even lower.
Final Results
Let’s compare OpenAI o1’s answers (labeled as o1_answer) against DeepSeek R1’s answers (labeled as deepseek_r1_perplexity_answer). Here’s a sample of 6 answers, out of the 198 used for the benchmark. For simplicity, the correct answer is always A.
It’s clear that neither DeepSeek R1 nor OpenAI o1 always got the right answer. But what’s most impressive is that the Ithy Mixture-of-Reasoning answers (labeled as ithy_answer) were able to discern the correct answer just from synthesizing the other responses.
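Scoring a sample like the one described above is straightforward: each record holds the letter each model chose, and accuracy is the fraction matching the correct answer. The rows below are illustrative, not the real benchmark data (though, as in the sample, the correct answer is always "A").

```python
# Sketch of scoring answer columns like o1_answer, deepseek_r1_perplexity_answer,
# and ithy_answer against the known correct choice.
samples = [  # illustrative rows, not the actual 198-question benchmark data
    {"o1_answer": "A", "deepseek_r1_perplexity_answer": "B", "ithy_answer": "A"},
    {"o1_answer": "C", "deepseek_r1_perplexity_answer": "A", "ithy_answer": "A"},
    {"o1_answer": "A", "deepseek_r1_perplexity_answer": "A", "ithy_answer": "A"},
]

def accuracy(records: list[dict], field: str, correct: str = "A") -> float:
    """Fraction of records where the model's chosen letter matches `correct`."""
    return sum(r[field] == correct for r in records) / len(records)
```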
Mixture-of-Reasoning
The concept of Mixture-of-Reasoning is simple. Take the Mixture-of-Agents approach, and apply it to reasoning LLMs.
In the GPQA, the Mixture-of-Reasoning score of 82.83% exceeded the OpenAI o1 score of 78.28%, and far exceeded the 51.01% that DeepSeek R1 scored in this benchmark run (well below its reported 71.50%). It even beat human PhDs, who only scored 81.20% in their own field.
That makes this Mixture-of-Reasoning approach the top AI today, and the first to beat human PhD experts.
Wrong: Individual Answers
Here’s an example where most LLMs answered wrong: https://ithy.com/short/sample-question
- [Wrong answer] DeepSeek R1: “The correct answer is B) 1,2,4,5-tetramethylbenzene and 1,2,3,5-tetramethylbenzene.”
- [Correct answer] OpenAI o1: “Hence the only consistent choice is: A) 1,2,4,5‐Tetramethylbenzene and 1,2,3,4‐Tetramethylbenzene.”
- [Wrong answer] Claude 3.5 Sonnet: “Therefore, the correct answer is B) 1,2,4,5-tetramethylbenzene and 1,2,3,5-tetramethylbenzene.”
- [Wrong answer] OpenAI GPT-4o (with Google search): “The correct pair of compounds is: B) 1,2,4,5-Tetramethylbenzene and 1,2,3,5-Tetramethylbenzene”
Correct: Aggregated Answer
But by reasoning among the individual responses from the different LLMs, Mixture-of-Reasoning was able to determine the right answer: https://ithy.com/article/nmr-mixture-aromatic-compounds-4tj6hyoulc
- “Based on a detailed analysis of the 1H NMR spectral data and the structural characteristics of the given compounds, particularly considering the insightful explanation of signal overlap for methyl groups provided by SourceB, Option A (1,2,4,5-tetramethylbenzene and 1,2,3,4-tetramethylbenzene) is the most consistent and accurate answer. This pair uniquely explains the two aromatic singlets in a 1:1 ratio and the three aliphatic singlets in a 2:1:1 ratio through the phenomenon of signal overlap in the methyl region.”
Conclusion
By combining the smartest reasoning models, Mixture-of-Reasoning outperforms any individual model and surpasses human PhD experts for the first time.
The Ithy Mixture-of-Reasoning model currently includes:
- DeepSeek R1
- OpenAI o1 and o3-mini
- Anthropic’s Claude 3.5 Sonnet
- Aggregation with Gemini 2.0 Thinking
Adding the reasoning aggregation LLM lets the model draw on the strengths of each individual input, unlocking answers that no single LLM could reach alone.
Try it Now
Ask any question on ithy.com to experience how DeepSeek R1 integrates with OpenAI o1 for incredible results.
In every answer, it’s clear that the accuracy and depth of the Mixture-of-Reasoning response far exceeds any individual model.
The future isn’t OpenAI vs. DeepSeek. It’s OpenAI with DeepSeek.