Quest for Proof Automation: What can ChatGPT4 do? simple multiplication (part 1/N)
With the advent of GenAI, there is no reason not to revisit some of the more naive dreams of youth, including automated math proofs.
So, here’s the start of the journey.
Ideal Goal:
- Generation of verifiable proof by LLM, when a problem is provided.
- Generation of a creative and elegant problem by LLM, which it will solve.
Those are indeed some lofty goals; let's see how far we can get over the summer.
0. Intro
Given the recent advancements in GenAI/LLMs, there is no reason why we should not use LLMs to do math, in particular proofs. (YAY)
The problem is, LLMs are pretty bad at them. You can google hilarious ChatGPT math mistakes, and they are equal parts funny and sad. Here is one example:
https://blog.metamirror.io/us-v-eugenerative-ai-is-not-a-calculator-6ef6e0fdc76
One postulation is that LLMs do not actually understand math: most of the time, they only see the question and the result in the training set. Indeed, process-supervised models outperform outcome-supervised models, according to this OpenAI article.
https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/
Which makes total sense from a heuristic point of view.
I will be presenting some home-cooked examples of ChatGPT4's math ability, proofs or otherwise. This article focuses on one particular example: multiplication of 5-digit numbers.
1. Math ability of ChatGPT 4: simple multiplication fail
This is but one of many examples of ChatGPT's failings:
Question: 38932 X 38943 = ? (Answer: 1516128876)
Prompt: multiply 38932 with 38943 and show the steps
ChatGPT Result: 1514996976, which is not correct.
So let's ask it the same question again, this time supplying the correct answer.
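The ground truth here is trivial to establish: unlike an LLM, ordinary arbitrary-precision integer arithmetic (Python's built-in `int`) is exact, so a one-liner settles the question.

```python
# Exact integer arithmetic settles the question in one line.
a, b = 38932, 38943
product = a * b
print(product)                      # 1516128876, the correct answer
print(product == 1514996976)        # False: ChatGPT's answer is wrong
```

This is the baseline any "LLM does math" claim should be checked against.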
Only the last step is changed to agree with our answer, suspicious indeed.
2. Math ability of ChatGPT 4: fail in order to impress
Now let's ask it another question: is the answer 1515567896? (This is also not correct.)
This confirms my suspicion that ChatGPT is really just trying to please us with whatever answer we want to hear. It is well known that LLMs try very hard to "please" their users in order to get a positive reaction (hence, among other things, hallucinations), and this is just one more confirmation of that.
Let's try it one more time, this time supplying an even worse answer: 4515567892. This answer is clearly too big, and its last digit is not 6.
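Both of those "obviously wrong" properties are cheap to check mechanically, which is presumably also why ChatGPT balks at this one:

```python
# Two cheap sanity checks that rule out 4515567892 immediately.
a, b = 38932, 38943
claim = 4515567892

# 1. The last digit of a product is fixed by the operands' last digits:
#    2 * 3 = 6, so the true product must end in 6, but the claim ends in 2.
print((a * b) % 10)                 # 6
print(claim % 10)                   # 2

# 2. A rough magnitude bound: both factors are below 40000, so the product
#    must be below 40000 * 40000 = 1.6e9; the claim exceeds 4.5e9.
print(claim < 40000 * 40000)        # False
```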
It seems that there IS an end to ChatGPT's people-pleasing: when the answer is obviously wrong, it does not try to conform to it. Side note: due to the chat history, it reverted to its previous erroneous answer.
We shall start a new ChatGPT session, without the previous history, for a second go.
3. Math ability of ChatGPT 4: different failure, same step
Generative AI and LLMs are not deterministic, which is a curse and a blessing indeed. After starting a new session, I asked ChatGPT4 the same question and got essentially the same behaviour: the model made its mistake at the same step, with a different wrong answer.
This time, ChatGPT employed a fancy agent/tool to answer the question. Don't let the facade fool you: the answer is still wrong. Look at the last step:
116796
+ 1557280
+ 35038800
+ 311456000
+ 1167960000
-------------
1515567896
Again, ChatGPT failed at long addition, but this time differently.
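We can reproduce the long-multiplication steps to see exactly where things went wrong: the five partial products ChatGPT listed are all correct, and only the final addition is off.

```python
# Reproduce the long-multiplication trace: one partial product per digit of b,
# shifted by that digit's place value.
a, b = 38932, 38943
partials = [a * int(d) * 10**i for i, d in enumerate(reversed(str(b)))]
print(partials)        # [116796, 1557280, 35038800, 311456000, 1167960000]
                       # matches ChatGPT's partial products exactly
print(sum(partials))   # 1516128876, the true product
print(sum(partials) == 1515567896)   # False: only ChatGPT's final sum is wrong
```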
4. Generative Failure: Can we harness that?
So here are some examples of ChatGPT's generative failures. But should we look at them with nothing more than a sneer and a jest? I disagree.
Here are a few ways I think we can put it to use:
- For simple questions with a single definitive answer (e.g., simple arithmetic), the mismatch itself can serve as an indicator of failure.
- When enough steps are shown, we can aggregate statistics to figure out the location and type of error.
- Proof construction, especially beyond the “undergraduate homework questions” category, is laden with exploration, mistakes, and futile attempts. Exploring and producing different answers is good, as long as we can verify the steps.
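As a hypothetical sketch of the "verify the steps, localize the error" idea (the function name and structure here are my own, not from any library): given the partial products an LLM shows and its claimed total, each stage can be checked independently, so we can record *where* the failure happened rather than just *that* it happened.

```python
# Hypothetical sketch: check each stage of a long-multiplication trace
# independently, so errors can be localized and tallied across many runs.
def check_long_multiplication(a, b, shown_partials, claimed_total):
    """Return a list of the stages that are wrong (empty list = all correct)."""
    expected = [a * int(d) * 10**i for i, d in enumerate(reversed(str(b)))]
    errors = []
    if shown_partials != expected:
        errors.append("partial products")
    if sum(shown_partials) != claimed_total:
        errors.append("final addition")
    if claimed_total != a * b:
        errors.append("result")
    return errors

# ChatGPT's trace from section 3: partials are fine, the addition is not.
print(check_long_multiplication(
    38932, 38943,
    [116796, 1557280, 35038800, 311456000, 1167960000],
    1515567896,
))  # ['final addition', 'result']
```

Run over many sampled traces, such a checker would give exactly the per-step error statistics mentioned above.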
The last point is the crux of the matter, and as a quick preview, I consider Metamath part of the answer: https://us.metamath.org/
Metamath is a formal system and a software tool for developing rigorous mathematical proofs. It provides a framework for expressing mathematical definitions, axioms, theorems, and proofs in a precise, formal language. The primary goal of Metamath is to ensure that every step in a mathematical proof is justified by basic logical rules and axioms.
… and of course, I asked ChatGPT to write that well-crafted blurb on Metamath. Oh, the curses and blessings of ChatGPT.
Further reading:
https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/