Will ChatGPT Ever Be Good at Math?
A look into how language models solve math problems
With the rise of AI chatbots like ChatGPT, an existential question has emerged: to what extent can AI replace the knowledge worker?
Knowledge Workers and Math
Knowledge workers are the portion of the labor force whose main value lies in applying logic to solve problems and make decisions. Broadly, this group includes lawyers, bankers, consultants, academics, and analysts.
A knowledge worker’s instruments may vary from Excel to math to the written word itself. What unites them is their source of value: the ability to conduct analysis and construct logical arguments.
For many knowledge workers, daily work consists of translating a scenario and set of assumptions into some type of mathematical model.
A consultant translates assumptions on market size into a revenue model. An investment banker translates a series of financial statements and assumptions about growth into a discounted cash flow model.
Essential Question
When considering whether AI can replace the knowledge worker, one essential question is the extent to which AI can understand and produce mathematical reasoning.
If AI can build mathematical models as previously described, there is a tangible possibility the role of many knowledge workers is reduced to that of fact-checker — someone who validates an AI’s answer, but doesn’t do analysis themselves.
ChatGPT Math
To begin probing into this question, I started by asking ChatGPT to explain its mathematical capabilities, and I specifically asked the chatbot to explain the types of questions it gets wrong.
The chatbot rattled off a few types of problems it struggles with, including complex mathematical proofs and brain-teasers, to name a few.
However, I soon realized this list is far from a comprehensive account of ChatGPT’s mathematical shortcomings.
After testing a few basic math questions, I asked ChatGPT to tell me the sum of the first ten digits of pi (the answer is 39).
Instead, the chatbot laid out a solution whose logic checks out (take the first ten digits of pi, then add them up one by one) but whose actual calculation was wrong.
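For comparison, the deterministic version of this calculation is a one-liner in Python:

```python
# Sum the first ten digits of pi (3.141592653) deterministically.
print(sum(int(d) for d in "3141592653"))  # 39
```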
The Flaw in Language Models
The truth is that ChatGPT and other chatbots struggle to solve mathematical problems because of a fundamental disparity between how language models work and the nature of mathematical solutions.
Language models work probabilistically, using their training data and their own prior output to generate responses. A mathematical solution, however, has exactly one correct answer. Because a language model’s output is sampled from a probability distribution rather than computed deterministically, the model can easily predict the wrong answer.
Let’s break it down with the ten digits of pi problem above.
Rather than solving the summation 3 + 1 + 4 + 1 + 5 + 9 + 2 + 6 + 5 + 3 deterministically by adding the numbers one by one, ChatGPT was asking itself a different question:
“Given that the prior words in the output are ‘3 + 1 + 4 + 1 + 5 + 9 + 2 + 6 + 5 + 3 =’, what should the next word be?”
To answer this question, ChatGPT consulted its training data, which likely lacked this exact summation. Forced to extrapolate, the model generated a best guess of 38, one shy of the true answer, 39.
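To make the contrast concrete, here is a toy sketch. This is not ChatGPT’s actual architecture, and the probabilities are invented purely for illustration; it only shows the difference between computing an answer and sampling one:

```python
import random

# Deterministic arithmetic: one correct answer, every time.
print(3 + 1 + 4 + 1 + 5 + 9 + 2 + 6 + 5 + 3)  # 39

# Toy "language model": the next token is sampled from a probability
# distribution. These weights are made up for illustration only.
next_token_probs = {"38": 0.40, "39": 0.35, "40": 0.25}

def predict_next_token(probs):
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = "3 + 1 + 4 + 1 + 5 + 9 + 2 + 6 + 5 + 3 ="
print(prompt, predict_next_token(next_token_probs))  # often wrong
```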
Will Language Models Improve?
There’s an assumption held by the general media and public that ChatGPT, like some gelatin data monster, will automatically grow more powerful as it gobbles more training data.
This assumption, though not wrong, captures only part of the story, particularly when it comes to language models’ ability to solve mathematical problems.
In reality, given the fundamental disconnect between how language models naturally work and the nature of mathematical solutions, the underlying mechanisms of language models must change for us to see radical improvement in their mathematical capabilities.
In the near future, mathematical improvements in language models will likely stem from enhanced methods of prompting.
The Prompting Elephant in the Room
Prompting refers to the manner in which questions or queries are posed to language models, and its counterpart, prompt engineering, refers to the refining of these queries to improve language models’ outputs.
Research in this field is blossoming, much of it coming from AI heavyweight Google.
Google’s Minerva, a language model trained on scientific and mathematical data, serves as a good baseline for understanding the benefits of prompting on quantitative reasoning problems.
Researchers have tested several problem-solving techniques with the language model, outlined as follows:
Few-shot prompting
Few-shot prompting is the AI equivalent of giving a language model practice problems. Rather than asking the language model point-blank what the answer to a problem is, one can pre-prompt the model with similar questions.
Let’s say we want a language model to give us the circumference of a circle with diameter eight, in terms of pi. We might first give the model a few worked examples, such as the circumference of a circle with a diameter of four or a diameter of two.
The goal is for the model to learn the pattern of the solution from the examples and apply it to the new problem, though some solution patterns can be difficult to learn.
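As a rough sketch, a few-shot prompt for the circumference problem might be assembled like this. The `query_model` call is a hypothetical stand-in for whichever model API you use:

```python
# Worked examples come first; the real question comes last.
few_shot_prompt = """\
Q: What is the circumference of a circle with diameter 2, in terms of pi?
A: 2*pi

Q: What is the circumference of a circle with diameter 4, in terms of pi?
A: 4*pi

Q: What is the circumference of a circle with diameter 8, in terms of pi?
A:"""

# answer = query_model(few_shot_prompt)  # hypothetical API; expect "8*pi"
print(few_shot_prompt)
```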
Chain of thought prompting
Chain of thought prompting builds on few-shot prompting by incorporating the process of solving the problem into the pre-prompts given to the language model. In the context of the circumference problem, the pre-prompts would spell out that circumference is calculated as diameter times pi. These extra details help the model learn the rule explicitly.
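A chain-of-thought version of the same prompt might include the reasoning inside each example answer. Again, this is a sketch, not Minerva’s actual prompt format:

```python
# Each example now spells out the rule: circumference = diameter * pi.
cot_prompt = """\
Q: What is the circumference of a circle with diameter 4, in terms of pi?
A: Circumference equals diameter times pi. The diameter is 4, so the answer is 4*pi.

Q: What is the circumference of a circle with diameter 8, in terms of pi?
A:"""

print(cot_prompt)
```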
Scratchpad prompting
Scratchpad prompting is the AI equivalent of making a language model show its work. Rather than asking for a simple answer, one asks the model to outline each step of its solution process. In turn, the model can better catch its own mistakes.
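A scratchpad-style prompt might skip the worked examples entirely and simply instruct the model to show its work; a minimal sketch:

```python
# No worked examples; the instruction itself asks for step-by-step output.
scratchpad_prompt = (
    "What is the circumference of a circle with diameter 8, in terms of pi?\n"
    "Write out each step of your reasoning on its own line, "
    "then state the final answer."
)

print(scratchpad_prompt)
```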
Under the hood, these methods improve mathematical performance by increasing the probability that the model’s output aligns with the mathematical solution.
By taking into account the context of the problem and the solution process, the model is more likely to arrive at the correct answer.
We can likely expect language models to continue improving as prompting methods are refined, especially as prompting techniques become more tailored to specific types of problems, though this improvement may well have a limit.
Conclusion
In their final remarks, the researchers examining Minerva’s performance noted that the model is far from rivaling human ingenuity in solving problems. Prompt engineering, however effective in tuning language models, likely cannot make up this deficit.
Consequently, the role of knowledge workers, at least for the time being, remains intact.
That being said, they may benefit from learning prompt engineering.
Some Final Thoughts
If you’re reading this article, your world is likely going to be disrupted by AI.
We don’t know what the changes will be, but they’re certainly coming.
At Byte-Sized Insights, we believe everyone should be informed of AI’s advancement, regardless of whether it’s good, bad, or sideways. Each week we post Medium articles on progress made in the technology and business of AI.
They’re simple, quick, and designed for you, the people.