DeepSeek R1 + Sonnet
I’m a big fan of Claude Sonnet. I’m ashamed to admit it’s mostly vibes-based: it’s friendlier and writes code in a way that I like. The Model Context Protocol means I can hack integrations into my local environment, and they’re great fun (favourite moment: allowing Sonnet to run arbitrary commands on my shell, and watching it reach out to find the latest football scores with wget 🙂).
That got me wondering, can I get the best of both? But before we start…
What is DeepSeek R1?
DeepSeek R1 is a new model with reasoning capabilities. It’s interesting because it was trained at a relatively low cost, and early benchmarks show that it’s competitive with current SOTA models.
It’s great fun to use; Simon Willison has a great article on how to start playing with it.
The neat thing about it is that it has a thinking stage containing the chain of thought it uses to generate the final response. For example, if I ask a relatively straightforward question such as “How do I estimate the time without a clock?”, I get some “thinking” (chain-of-thought reasoning) back prior to the response. Here’s an excerpt of a “thought process”.
Okay, so I need to figure out how to estimate time without a clock. Hmm, this sounds tricky because usually, we just check the time on our phones or clocks, but if that's not available, how else can I tell what time it is? Let me think through different methods.
First off, I know that the sun rises and sets at roughly the same times each day, so maybe I can use the position of the sun to estimate how much time has passed. For example, if it's noon when the sun is directly overhead, that would be 12:00 PM. But I'm not sure about the exact times for sunrise and sunset because they vary depending on the season and location.
Another thought is using shadows from objects like trees or buildings. If I can estimate the length of a shadow, maybe I can figure out how much the sun has moved since a known time. Wait, but without knowing when that known time was, it might not help me directly unless I use some other reference points.
What about the positions of stars? Some stars are visible during certain times of the day. For example, Sirius is usually visible in the evening after sunset. If I can recognize a star and know when it rises or sets, I could estimate how much time has passed since then. But this might require knowing when each star appears, which could be complicated.
Integrating the two
Easy! Write an MCP server that gives Claude the ability to reach out to DeepSeek. I’m running a local version of the model (14B parameters). My idea is just to extract the <think> section and return that to Claude. The goal is that the chain of thought from DeepSeek helps Sonnet perform better (i.e. DeepSeek’s thinking becomes a better prompting strategy for Claude).
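The extraction step is simple enough to sketch. Here’s a minimal version (the function name is mine, and it assumes R1 emits its reasoning between literal <think> and </think> tags ahead of the final answer; a real MCP server would also need to call the local model, e.g. via Ollama’s HTTP API):

```python
import re

def extract_think(response: str) -> str:
    """Pull the chain-of-thought out of a DeepSeek R1 response.

    R1 wraps its reasoning in <think>...</think> before the final
    answer; we only want the reasoning, to hand back to Claude.
    """
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    return match.group(1).strip() if match else ""
```

If the tags are missing (some prompts skip the thinking stage), it returns an empty string rather than failing.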
When you write a tool for Claude, you need to give it a description of what the tool does. I went with this:
Rodin is a cognition partner for both you and the user. For each user prompt, use Rodin to get additional context.
Carefully review this content and use the additional insight to give better results to the user.
Is it any smarter?
So now I can ask Claude to consult Rodin first, and Rodin’s context gets passed back to Claude. Which might make it smarter?
Is it smarter? Maybe? Sometimes? Probably not.
Determining whether one model is smarter than another is hard. Ideally I’d like to find a class of problems that Claude can’t solve, but Claude + DeepSeek can. Unfortunately, it’s no longer as easy as asking how many r’s there are in “strawberry” 😟
If I were doing this “properly”, I’d try to find an evaluation data set (eval): a series of puzzles, some ground truth, and an oracle that tells me whether an answer is right or wrong. This sounds simple, but in reality it’s quite hard (you might have seen the controversy over the latest “Frontier Model” results).
So, let’s just try to find some logic puzzles that Claude can’t do.
Susan and Lisa decided to play tennis against each other. They bet $1 on each game they played. Susan won three bets and Lisa won $5. How many games did they play?
The correct answer is 11 (Susan’s three wins cost Lisa $3, so Lisa must have won eight games to come out $5 ahead: 8 + 3 = 11). o1-mini and chat.deepseek.com answer this correctly.
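The intended reasoning can be sanity-checked with a couple of lines of arithmetic:

```python
# Each game is a $1 bet. Susan won 3 games, so Lisa lost $3.
susan_wins = 3
lisa_net_winnings = 5

# To finish $5 up despite losing $3, Lisa needs $8 of wins.
lisa_wins = lisa_net_winnings + susan_wins   # 8
total_games = susan_wins + lisa_wins         # 11
```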
With Claude alone, I consistently (well, I tried 5 times) get the following response (note that I’m using the Desktop client, I’ve no idea how model variables like temperature are configured by default and this is totally unscientific).
Let me solve this step by step:
Susan won 3 bets = $3
Lisa won $5
Each game = $1 bet
Lisa’s wins = $5 ÷ $1 = 5 games
Total games = Susan’s wins + Lisa’s wins = 3 + 5 = 8 games
They played 8 games total.
If I use the deepseek-14b model, I sometimes get the correct answer. It’s interesting looking at the thought process and sometimes seeing:
Now, it’s also given that Lisa ended up winning $5 overall. Hmm, that seems interesting because Lisa lost three games but still managed to make a profit of $5. So, how is that possible?
And claude+deepseek-14b? Well, it’s sometimes right as well. Maybe more often than deepseek-14b alone? I did get examples like this, where deepseek-14b gets it wrong, but then claude picked it up and somehow got it right.
A harder puzzle
Let’s try one more problem. A cat is hiding in one of five boxes in a row, numbered 1 to 5. Every night it jumps to an adjacent box, and every morning you have one chance to open a box to find it. How do you win this game of hide and seek?
This is a surprisingly complicated puzzle — here’s a great description of the solution.
o1 enjoyed thinking about this for 2 minutes and 37 seconds. Its idea of an optimal strategy was the “back and forth” sweep, where you pick boxes 1, 2, 3, 4, 5, 4, 3, 2, 1 and are guaranteed to find the cat within 9 days. It confidently exclaimed: “No matter how clever you are, you cannot do better than 9 days in the worst case. There are specific ‘dodging’ paths the cat can take to avoid you if you try to compress the strategy into fewer than 9 days”. o1-mini suggests the same strategy with less snark.
deepseek started responding almost immediately and (after writing a 1400-word essay) correctly gave the solution of six days with the pattern (2, 3, 4, 2, 3, 4). deepseek-14b alone isn’t smart enough to solve the problem. And unfortunately, neither is claude or claude+deepseek-14b.
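The six-day pattern is easy to verify by brute force (this is my own quick check, not something any of the models produced): track the set of boxes the cat could possibly occupy, remove the box you open each morning, and expand the set to adjacent boxes each night. If the set ever empties, the strategy is guaranteed to win.

```python
def boxes_left(strategy, n=5):
    """Boxes the cat could still be in after playing `strategy`.

    Each morning we open one box, eliminating it as a possibility;
    each night the cat jumps to an adjacent box, so every surviving
    possibility spreads to its neighbours.
    """
    possible = set(range(1, n + 1))
    for guess in strategy:
        possible.discard(guess)  # morning: open a box
        possible = {q for p in possible
                    for q in (p - 1, p + 1) if 1 <= q <= n}
    return possible

# The pattern (2, 3, 4, 2, 3, 4) leaves no hiding place:
print(boxes_left([2, 3, 4, 2, 3, 4]))  # set() -> cat always caught
```

Running the same check on shorter sequences shows a non-empty survivor set, which is why six days is needed.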
Conclusion?
Claude and a thinking partner sounded like a fun idea to try (and it was). If that thinking partner is a relatively small model (14B), it’s certainly not obviously better (though it sometimes was, when the random number generators aligned).
The main outcome of this is to keep exploring. I still love Claude for the vibes, and DeepSeek R1 is super impressive: it’s fast and it’s smart!