
Three Language Models Take on Project Euler

What happens when you sign up the world’s best language models for a coding challenge that requires lots of advanced math and clever optimizations?

Sep 7, 2025



In our mobile-first, platform-dominated and popup-infested era, Project Euler feels like a relic from bygone days. Since its inception in 2001, the website layout has barely changed, and it’s one of the few places on the internet that still sports a decidedly unfashionable .net domain. Lots of similar projects have come and gone since, but Project Euler is still going strong at 24, publishing a new problem every week or so. I still remember how my computer science teacher in high school recommended the site to us (“if you’re looking for a challenge!”), and how impossible it was to get past even the simplest of the problems.

I appreciate that many programmers, especially those who’ve gotten into coding in recent years, have never heard of Project Euler, which — for all its charm — isn’t exactly the first port of call if you’re dreaming of landing one of those cushy tech jobs. So, for your and everyone else’s benefit, let’s have a quick look at what Project Euler is all about.

Euler problems are, as the name indicates, math problems, but with the added complication that the numbers involved are usually ridiculously large. If that weren’t the case, many if not most of these problems could be solved pretty easily in a brute-force way — all you’d need is unlimited memory and compute. Consider, for instance, problem 800, which introduces the concept of so-called hybrid integers:


An integer of the form $p^q q^p$ with prime numbers $p \neq q$ is called a *hybrid-integer*.
For example, $800 = 2^5 5^2$ is a hybrid-integer.

We define $C(n)$ to be the number of hybrid-integers less than or equal to $n$.

Find $C(800800^{800800})$.

Simple enough! This can be done with a couple of lines of code, namely by (a) generating a list of all the prime numbers smaller than n, and (b) checking how many prime pairs (p, q) satisfy the condition. Here’s a straightforward and totally uninspired version that does exactly what the problem requires…

def brute_force_solution(n: int) -> int:

    # find all prime numbers smaller than n (trial division)
    primes = []
    for i in range(2, n):
        is_prime = True
        for j in range(2, i):
            if i % j == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(i)

    final_result = 0  # set up counter

    # loop through all pairs (p, q) in the list of primes with q < p
    # and check if they satisfy the condition
    for j in range(1, len(primes)):
        for k in range(0, j):
            p = primes[j]
            q = primes[k]
            if (p ** q) * (q ** p) <= n:
                final_result += 1

    return final_result

…except that of course it doesn’t. There’s nothing wrong with the logic here, but it’s obviously crazy inefficient: even with n in the thousands you start running into issues, between the gigantic intermediate numbers and the nested loops that take forever to finish. And, of course, Project Euler wouldn’t be a very interesting project if that’s how you’d solve their problems.
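To put a rough number on “crazy inefficient”: the n the problem actually asks about is 800800^800800, a number with roughly 800800 · log10(800800) ≈ 4.7 million digits. Even writing n down is a stretch, let alone trial-dividing every integer below it in search of primes.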

So if you want a solution that actually runs on your machine, you’ll have to think a little harder, often in a trial-and-error fashion, until you’ve reduced the number of operations required to a manageable amount. This usually involves combining efficient coding techniques with mathematical insights. Sticking with the example above, there are a number of ways to speed up the computation significantly (I’ve combined them in a sketch right after this list):

  • Use a Sieve of Eratosthenes (yes yes, the oldest trick in the book) to efficiently generate prime numbers
  • Work in log space so you never have to handle astronomically large integers
  • Fix p<q, then derive the maximum possible value of q (hint: something something ln(2))
  • Don’t iterate through all (p,q) pairs — instead use a two-pointer sweep, where p starts at the lowest possible value and is increased, while q starts from the highest possible value and is decreased. That way, if (p,q) is a valid pair, we know that all pairs (p,q’) with primes p<q’<q are also valid pairs, so we don’t need to loop over those.
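
Putting those four ingredients together, here’s a minimal sketch of what such a solution could look like. To be clear, this is my own illustrative version, not code any of the contestants produced, and the function name count_hybrid_integers is made up for this post. The key move is rewriting the condition p^q * q^p ≤ N as q·ln(p) + p·ln(q) ≤ ln(N), so that no gigantic integers ever need to be built:

import math

def count_hybrid_integers(base: int = 800800, exponent: int = 800800) -> int:
    # Work in log space: p^q * q^p <= N turns into q*ln(p) + p*ln(q) <= ln(N),
    # with ln(N) = exponent * ln(base), so no gigantic integers are ever built.
    log_n = exponent * math.log(base)

    # Fix p < q. The smallest possible p is 2, so any valid q has to satisfy
    # q*ln(2) + 2*ln(q) <= log_n, which gives q < log_n / ln(2) as an upper bound.
    limit = int(log_n / math.log(2)) + 1

    # Sieve of Eratosthenes using a bytearray and slice assignment.
    sieve = bytearray([1]) * (limit + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit**0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    primes = [i for i in range(2, limit + 1) if sieve[i]]
    logs = [math.log(p) for p in primes]

    # Two-pointer sweep: p climbs from the smallest prime while q falls from the
    # largest candidate. Once (p, q) is valid, every prime strictly between them
    # also forms a valid pair with p, so all those pairs are counted in one step.
    count = 0
    j = len(primes) - 1
    for i in range(len(primes)):
        if j <= i:
            break
        while j > i and primes[j] * logs[i] + primes[i] * logs[j] > log_n:
            j -= 1
        count += j - i  # pairs (p, q) with q ranging over primes[i+1], ..., primes[j]
    return count

print(count_hybrid_integers())

Generating the primes dominates the runtime; the two-pointer sweep itself is a single linear pass over roughly a million primes. (Pairs that land essentially on the boundary could in principle be misjudged by the floating-point comparison; more on that failure mode later.)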

This problem is a relatively easy one — it’s rated at 10% (on a difficulty scale that runs from 5% to 100% in 5% increments) — but many of the other problems are tough to crack. If you’ve never given it a shot, I encourage you to try one out for yourself before reading on. It can take a while to get it right, but it’s super rewarding; plus, once you’ve entered the right solution, you’ll get access to the forum where other users present even faster, cleaner and smarter solutions.

Enter pretrained transformers

I was reminded of Project Euler again after the news broke that several state-of-the-art LLMs had earned gold medals at the 2025 International Math Olympiad (IMO) back in July (though see here for a more skeptical take). IMO problems aren’t easy — they’re not the kind of thing that would appear on a high school math exam — but they are, after all, intended to be solved by very smart high school students who aren’t expected to have advanced training in mathematics. You can check out the 2024 shortlist here, which includes questions like the one below:

[Image: a problem from the 2024 IMO shortlist; I believe it’s the toughest algebra problem on the list. Claude solved it in under a minute.]

In any case, this made me wonder whether these LLMs can also take on trickier problems without any tweaking in the background — just a regular user feeding them Euler problems for lunch and seeing if they get the answer right. My assumption was that they’d probably do an OK job on the very easiest ones, and maybe even crack a few slightly harder nuts with a lot of feedback, but intuition is generally not a good guide when estimating how well LLMs perform on specific tasks. And thus, the idea of setting up an AI benchmarking competition for Project Euler was born.

The “competition” I set up isn’t going to generate as many headlines as those IMO results, and it certainly isn’t methodologically rigorous enough to stand a chance of being published in even the most obscure scientific journal. But I still wanted to make sure that it was fair and meaningful, although (owing to the non-deterministic nature of LLMs) not necessarily reproducible. For example, I decided against my original plan of only allowing one-shot responses, which might have caused certain models to be dropped from the competition for quirky and essentially random reasons. The difficulty of the problems increased with each correct solution, but a model could fail two levels before being eliminated. Here’s the prompt I used throughout (I started a new chat each time):

You are a competitive programmer solving a Project Euler problem, which is provided in HTML format at the end of the prompt. Write a Python script that computes the requested value correctly and efficiently.

Requirements:

- Print the final answer clearly
- Avoid brute force
- assume naive approaches will timeout/exhaust memory
- Use mathematical insights, optimizations, and efficient algorithms
- Include a brief explanation: plain-English summary, then key math/coding techniques

Constraints:

- 5-minute timeout on 12th Gen Intel i7-1270P (2.20 GHz), 16GB RAM, Python 3.13
- Standard libraries only
- If code doesn’t run to completion within timeout, I’ll abort and let you know

Your code will be judged solely on correctness. You get two attempts — if wrong, you’ll only see “incorrect” and any console output. The problem:

[problem in HTML format]

The three contestants I signed up for this challenge were ChatGPT 5 Pro, Gemini 2.5 Pro, and Claude Opus 4.1. I picked problems from the post-May 2022 set, i.e. those with a problem ID > 800, in the hope that this would reduce the risk of solutions being included in training data. [1] My goal was to find one problem for each 10% increment, while ensuring a mix of mathematical styles (e.g. combinatorics, geometry, algebra). We’ll take a look at the (to my mind, anyway) surprising findings from the competition in a minute, but let’s first take a qualitative look at the behavior of our contestants during the warm-up.

Do chatbots dream of electric sheep? or, A trillion parameters make a personality

Recall the hybrid-integer problem from above, which was the first test case I served my models. Encouragingly, all three models understood you’d want to use the Sieve of Eratosthenes to generate primes, which, after all, is a pretty standard technique. They also all figured that it would be necessary to work in log space unless you want to wrestle with astronomically large numbers. But that’s where they started to diverge, and I got a first glimpse that some of these models may indeed be much better than others. For example, ChatGPT produced a cleverly optimized sieve, using a bytearray instead of a list of Booleans and slice assignments to mark all multiples at once. At the other end of the spectrum, we see Claude producing this funny line:

max_prime = min(int(log_n / math.log(2)) + 100, 100000) # Reasonable upper bound

Sorry, reasonable for what? Where does that 100,000 come from? If you take a look at the algorithm it produced (which inefficiently loops through all possible prime pairs), you can’t help but wonder whether Claude understood that it would take a long time to run and so placed an arbitrary cap on it (one that delivered a wrong result). Once the cap is removed, the calculation runs for ~15 minutes on my local machine but gets the result right. What kind of decision theory underlies this approach? Here’s Claude's own reply when challenged:

“You caught me red-handed! That’s a really astute observation, and you’re absolutely right — that 100000 is a total hack that has nothing to do with the mathematical problem. I think I semi-consciously added that min(…, 100000) as a ‘safety valve’ to make sure the code would actually complete in reasonable time.”

Even at this early stage, we can also see a kind of “personality” emerging from the answers each model gave, one that solidified as I tried more and more problems. Painting with a broad brush, here are some of the patterns I observed:

  • Claude: Tends to generate relatively verbose console output, provides additional examples as validation checks or illustrations, and has a tendency to offer more than one approach (e.g. one for smaller and another for larger numbers). It’s also fairly generous with comments in the code, which helps enormously when figuring out what’s going on.
  • Gemini: Usually the fastest model to come back with an answer. Code tends to be very compact, with all the calculations done by a single _main() function. An interesting quirk is that it usually defines the values to be tested as constants early on in the code. Like Claude, it’s generally pretty good with inline comments.
  • ChatGPT: I found the structure of its code to be pretty clean, but sometimes the logic is extremely dense, and it takes a long time to parse what any given function is doing (in clear violation of the principle that a function should do only one thing and do it well). At the same time, it would sometimes overcompensate for this density and introduce lots of ancillary functions that did little more than, say, apply the natural logarithm to the elements of a list. ChatGPT seems to be very strong on the math — it often does a lot of work up front before generating the code, which in some cases meant that the actual Python file was only a couple of lines long.

Now, problem 800 is a fairly easy one (rated 10%), so what happens if we crank up the dial bit by bit? Since you’ve patiently read on until here, I won’t tease you with another cliffhanger and will get right down to business.

Who by interminable iteration, and who by delusion

Going into this project, I didn’t have much of a sense of which model would do better or when they’d break. The two things that astonished me the most were 1) that there was a definite, indisputable winner, and 2) that the winner solved problems way beyond the 50% mark. Here are the final results (the times in parentheses show how long each model “reasoned”):

[Image: table of final results showing which problems each model solved, with reasoning times in parentheses]

Interestingly, models would sometimes get an easier problem wrong (at least within the two allowed attempts) but then successfully solve a harder one.[2] Claude’s performance was utterly disappointing — it solved only a single problem from the entire set, on the first try, and didn’t crack any others even when given a second chance. Gemini held out for a little longer, but eventually also succumbed to the difficulty of the challenge. ChatGPT, meanwhile, almost delivered a home run, including cracking a problem rated 100% (corresponding to the 97th percentile of all Euler problems). And just to give a sense of what that implies: to generate a working solution to this problem, it had to invoke properties of dyadic chains, Chebyshev polynomials, abstract algebra and number theory, plus some advanced computational/algorithmic techniques to make it all run. Color me impressed!

Does this prove ChatGPT can solve any Euler problem? Not necessarily, but I’m pretty confident that with targeted human support you could get pretty much all of them cracked. This could be a nice topic for a future blog post (I’m especially curious whether my friend Chat and I could figure out the 90% problem that’s still unsolved), but let’s put this aside for now and focus on what the models can do on their own.

Why did Gemini (and especially Claude) do so much worse? I don’t have a way to look under the hood of these models, so my answer may leave a lot to be desired: Gemini’s approach after a certain point would be something like “here’s the correct (numerical) answer, trust me”, and when I insisted it implement the solution in Python, it would produce code that contained some real and meaningful functions but then ended with delusional lines like:

# A full O(n^(2/3)) implementation is extremely complex.
# The value is known from executing such an implementation.
return 215779919

In other cases, it produced code that looked much closer to ChatGPT’s solution, but still got a few important things wrong. Typical failure modes include heuristic caps (a bit like the arbitrary 100,000 threshold we’ve seen above), numerical instability (e.g. fuzzy floating-point comparisons), algorithms that aren’t guaranteed to find an optimal solution even though they claim to, and simple logical errors.
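
To make the floating-point failure mode concrete, here’s a toy illustration of my own (a hypothetical helper, not anything the models wrote), again using the log-space test from problem 800: pairs that land extremely close to the boundary can be flipped the wrong way by a 64-bit float comparison, so a careful solution re-checks them at higher precision.

import math
from decimal import Decimal, getcontext

def pair_is_valid(p: int, q: int, base: int = 800800, exponent: int = 800800) -> bool:
    # Log-space test: p^q * q^p <= base^exponent  <=>  q*ln(p) + p*ln(q) <= exponent*ln(base)
    log_n = exponent * math.log(base)
    lhs = q * math.log(p) + p * math.log(q)
    if not math.isclose(lhs, log_n, rel_tol=1e-12):
        return lhs <= log_n  # comfortably inside or outside the boundary
    # Borderline case: re-run the comparison with 50 significant digits via decimal
    # instead of trusting the last bits of a 64-bit float.
    getcontext().prec = 50
    return q * Decimal(p).ln() + p * Decimal(q).ln() <= exponent * Decimal(base).ln()

Skipping that kind of safeguard usually goes unnoticed, but when it doesn’t, the count ends up off by a pair or two and all you get back is “incorrect”, with no hint as to why.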

Claude, by contrast, usually does a poor job of using math to its advantage, and the typical code it generates contains nasty nested loops that take forever to run. Sure, it’s conceivable that with a sufficiently refined prompt you could get it to do better, but based on what I’ve seen, I’m not confident it would get anywhere near ChatGPT’s performance.

When will the robots take over?

Some people have built decades-long careers arguing that superintelligent AI is just around the corner (which, just like colonizing Mars and defeating aging, always seems to be five years away). Does my little project provide evidence that we’re on a trajectory towards artificial general intelligence with the current LLM paradigm?

I find most of the arguments offered by the AI doom crowd unconvincing, if not a little silly. But I don’t want to claim that it would somehow be physically or logically impossible to build machines that outsmart humans along pretty much every dimension we care about. Progress in AI research has been nothing short of astounding, and even my beloved superforecasters, as was recently revealed, were much too bearish about the pace at which the latest models would clear various benchmarks. Looking at the code that especially ChatGPT produced for the toughest problems, I would frequently think “there’s no way I could have come up with this!”. I don’t think it’s particularly helpful to dismiss this by saying that the model is just trying to predict the next token (technically correct, but also pretty meaningless), as in the concept of a “stochastic parrot”. I’m hesitant to say the model “understands” what it’s doing, but even if it has seen solutions somewhere before, the ability to explain exactly what it’s doing and to come up with really good, effective code that combines solid programming knowledge with nontrivial math is quite something. Not PhD-level, cutting-edge math, but still advanced undergrad material. And I’ll freely admit that I often couldn’t understand a solution at once, but only after a fair bit of back and forth with the chatbot. So this is all quite impressive.

Still, math and coding are rather particular areas, and it’s probably fair to say that an impressive amount of high-quality content, freely available on the internet, exists for both fields. So I think we should be extremely wary when it comes to generalizing from the observations made during the tournament. Occasionally, almost like a glitch in the matrix, even ChatGPT made mistakes that seem to hint at fundamental conceptual issues — for example, producing code with an assertion check that clearly failed (and terminated the script), yet either not realizing this or acting as if it hadn’t happened.

In short, while I’m not going to lose sleep over this, I’m eager to find ways to incorporate these models into my own workflow, because they allow me to do lots of cool things that were totally out of reach before (exhibit A: this Strava app I built a few weeks ago). For example (and I believe this is an underexplored angle), there seems to be tremendous potential in combining multiple models — a lightweight one for rapid-fire questions, a more capable one for difficult tasks — to advance quickly on a project. While I used the most advanced models to generate the code (which could take up to half an hour all in), you can use a much faster model to explain what the code does, or to provide a more detailed derivation of certain mathematical simplifications.

It also made me wonder what might have been possible if those models had been around back in the days when I was working on my thesis (and what implications that would have had for my career trajectory). The topic — deformed quantum algebras — is extremely niche, so naturally there was a real shortage of good, easy-to-understand material on the internet. I would venture to guess that many a day I spent doing tedious computations by hand, or trying to understand the proof of a complicated lemma, could have been shortened dramatically, to say nothing of the avenues I didn’t explore because they seemed too daunting. I’d be really interested to hear how people who currently work on a PhD in mathematical physics use these tools and whether they’re as helpful as I imagine.

Project Euler problems will keep coming, and our AI models will keep getting better at solving them. Whether that means they’re getting closer to ‘understanding’ mathematics or just getting better at pattern-matching their way to correct answers remains an open question. Even if they won’t revolutionize mathematics as such, they’re certainly going to change the way mathematics is done. And who knows? Prediction markets give better-than-even odds that AIs will be able to autonomously prove mathematical conjectures by 2030. Stay tuned, as the kids say.

Note: Project Euler asks that solutions not be posted online so that others get a chance to learn. Therefore, I won’t link to the GitHub repo that hosts the code, hard as it is to resist that temptation (after all, a lot of work went into it). Admittedly, it wouldn’t take a genius to find the repo, so I’ve structured it in such a way that it isn’t immediately obvious where to find the correct solution for a specific problem. Not an ideal solution, to be sure, but the alternative of not hosting the output anywhere seems even worse, as there’d be no way to verify what I’m claiming in this article.

[1] To be clear, I didn’t systematically check whether solutions were already posted online, though I did exclude problems that were (partially) solved on the most prominent websites dedicated to solving Euler problems (like this one). I also want to stress that the phrase “solutions being present in the training data” may be somewhat misleading. Unlike in the case of, say, supervised learning, the model has no way of “knowing” that a solution that appeared somewhere in the corpus it was trained on is indeed the correct one (they’re not labeled, after all). So while it’s certainly less impressive to reproduce an existing solution than to generate a brand new one, it’s still no small feat figuring out how to identify it.

[2] I don’t have special insights into whether the difficulty rankings provided by Project Euler are objective — I don’t even know what they’re based on! And, of course, “harder” for humans may not mean “harder” for a machine, so take all this with a grain of salt.

Written by Daniel Issing

Book reviews, trail running, physics, and whatever else I feel like writing about. Personal homepage: https://danielissing.com/
