Do LLMs Reason?
ChatGPT and similar LLMs would likely pass the Turing Test by anyone’s 1950 C.E. standards, when the test was first proposed. They are smart, funny, and provide relevant responses whose conclusions look like the products of sound reasoning. Machine learning has come a long way since 1950, and any sufficiently advanced technology is indistinguishable from magic[1], after all. But are LLMs actually reasoning?
Let’s start with the definition of reason (v): to think, understand, and form judgements by a process of logic. So, the question is: do LLMs think, understand, and form judgements about things logically? For the sake of argument, let’s simplify to: do LLMs think logically? Think is a loaded term, so let’s make the question more grounded in ML Land and finally simplify to: do LLMs generate their output through a logical process?
Of course, there is a trivial answer: neural networks (and the computers they run on) are built from many fuzzy (and discrete) logic gates. But unless you are willing to grant every computer program the status of “reasoning entity”, let’s agree that is not the intended question. Rather, if we wish to understand whether LLMs have a reasoning process similar to how humans reason, then there is at least one argument and one example suggesting they do not.
LLMs are Parrots
In the strictest terms, LLMs are trained to reproduce the most likely next word given the preceding words; the training process literally tries to recreate the input text, token by token. Once trained, we can prompt LLMs for a response, and as far as LLMs are concerned these prompts are nothing more than incomplete passages of text — the models simply mimic the training process and try to determine the words most likely to follow, given the data they were trained on and the provided context (i.e., the prompt). The models’ weight values are not learned by a process of logic but by statistics, and therefore there is no explicit connection between the model weights, logic, and the output.
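To make that concrete, here is a deliberately tiny sketch in Python. It is a toy bigram counter, standing in for the billions of learned weights in a real LLM and resembling no actual architecture, but the objective is the one described above: pick the most likely next word given the preceding text.

```python
# A toy stand-in for next-word prediction: count which word follows which,
# then "complete the prompt" by repeatedly appending the most frequent follower.
# (Real LLMs learn token statistics by gradient descent, not counting, but the
# objective -- predict the next token -- is the same.)
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def complete(prompt_words, n_words=4):
    """Greedily append the statistically most likely next word, n_words times."""
    words = list(prompt_words)
    for _ in range(n_words):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# The prompt is just an incomplete passage of text to be continued.
print(complete(["the", "dog"]))  # a fluent-looking continuation chosen by frequency, not logic
```

Nothing in that loop resembles a step of logic; it is frequency all the way down, and scaling the lookup table up to billions of learned weights does not change the objective.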
As a counterpoint, it is fair to ask whether the neuronal connections in human brains are fundamentally learned by a process of logic. As with LLMs and other artificial neural networks, the answer is most assuredly no. Our reasoning ability is an emergent property of billions of neurons and trillions of synapses, so reasoning could in principle emerge from a large enough neural network; but you still have all your work ahead of you to show that LLMs are at that point, especially given the description of their training process above.
Words are meaningless in a vacuum — they are mere labels that are only as useful as the real-world significance understood by their recipient. LLMs only see how words relate to each other; they have no concept of how the words relate to anything in the world outside the computer chips on which they reside. Humans, for our part, only learn to reason as well as we do through extensive training in word-world association. Here is a relevant quote from Yoshua Bengio, from December 2022:
…text generators are trained simply to predict the next word rather than to build an internal data structure that accounts for the concepts they manipulate and how they are related to each other.
It will probably take a different architecture, multimodal input (e.g., visual, tactile, aural, and lexical data), and different hardware to get LLMs to a place where they are reasoning, or at least thinking in a manner less distinguishable from how humans think. For now, it’s fair to say LLMs are just high-tech plagiarists[2].
An Example of Unsound Reasoning
To put the above arguments to the test, let’s look at a concrete illustration of where LLM reasoning seems to break down. I asked ChatGPT (GPT-3) two qualitatively identical questions:
(1) Is 7309 a prime number?
(2) Is 7311 a prime number?
The answers are yes and no, respectively. 7309 is not divisible by any number other than 1 and itself, while 7311 is obviously divisible by 3, because the sum of its digits is divisible by 3 (i.e., 7 + 3 + 1 + 1 = 12). I chose these two numbers because I figured there can’t be many (if any) examples in the training data of 7309 and 7311 specifically being discussed in relation to prime numbers.
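If you want to check these two claims yourself, a few lines of plain trial division settle both questions (this is ordinary Python and has nothing to do with how ChatGPT produces its answers):

```python
import math

def is_prime(n: int) -> bool:
    """Trial division: test every divisor from 2 up to floor(sqrt(n))."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print(is_prime(7309))  # True  -> 7309 is prime
print(is_prime(7311))  # False -> 7311 is composite
print(sum(int(digit) for digit in "7311"))  # 12, divisible by 3, so 7311 is as well
```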
Let’s see how ChatGPT responds to the first question.
Amazing! The overall structure of the response is impressive: it first states the definition of a prime number, describes a common, logical algorithm for determining whether a number is prime, notes an efficiency gain (i.e., only candidate divisors up to the square root need to be checked), and then states the correct answer.
Before we address the ostensible reasoning of the response, let’s look at the second prompt.
Here we see the same response structure as for the first question, but the answer it provides is incorrect. How did it get the answer so wrong? It presents essentially the same (ostensibly correct) reasoning — did ChatGPT not use the reasoning it provided?
No, it did not. Again, LLMs are parrots.
Let’s go through the reasoning steps in a bit more detail. Notice that the details of the reasoning are incorrect, which is consistent with a process that knows how to reconstruct the general framework of an answer to a similar question but has no idea how to carry out the instructions contained within it. In each case, an incorrect square root was given for 7309 and 7311 (they should be 85.4927 and 85.5044, respectively). So the numbers are only slightly off — no big deal, right? Except that should be the first clue that the LLM is not doing what it says.

We can cede that point for the sake of argument, but then why does the list of “numbers” worth trying (never mind that it didn’t say “prime factors”) go up to 89 in each case? Didn’t it just say we should only go up to the square root (which should be 85, not 86)? For 7311, did it not actually try to divide by the second number on that relatively short list (i.e., 3)? No, it did not — it cannot; LLMs don’t do math, they only generate new words and characters to fill in your blanks. The LLM has learned how to reproduce the form of answers to math problems without learning how to do the math.
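For reference, here is what actually carrying out the stated procedure would produce; this is my own reconstruction in plain Python, not anything the model executes:

```python
import math

for n in (7309, 7311):
    # The largest divisor worth testing is floor(sqrt(n)) = 85 -- not 86, and certainly not 89.
    print(n, round(math.sqrt(n), 4), math.isqrt(n))
# 7309 85.4927 85
# 7311 85.5044 85

print(7311 % 3)  # 0 -> the second candidate on the list, 3, already disproves primality
```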
Summing Up
What about prompting techniques that seem to improve reasoning? Chain of Thought prompting ostensibly improves the reasoning performance of LLMs, helping them get from A to B more completely and correctly. But once we realize that step B is really step Z, we see that we are just walking the LLM through smaller leaps in the language sequence: the words in A lead to the words in B, which lead to the words in C, and so on. Breaking a proper answer down into its components gives the Parrot a better shot at completing each subsequent portion of the text, as sketched below.
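As a rough illustration (the prompt wording here is my own and is not taken from any particular Chain of Thought paper or API), compare a direct question with one that spells out the intermediate hops:

```python
# Illustrative prompt strings only; hypothetical wording, no API calls involved.
direct_prompt = "Is 7311 a prime number? Answer yes or no."

chain_of_thought_prompt = (
    "Is 7311 a prime number? Let's think step by step.\n"
    "1. Compute the square root of 7311 and round it down.\n"
    "2. List the primes up to that bound.\n"
    "3. Check each one as a possible divisor of 7311, showing the division.\n"
    "4. Only then state the final answer.\n"
)
# Each numbered step is a smaller leap for the model to fill in with likely-looking text.
```

Whether the arithmetic inside those smaller steps comes out right is, as the example above shows, another matter entirely.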
LLMs will say a lot, but they don’t mean any of it, nor do they implement any of the logic they ostensibly display. If they are not implementing ideas or rules in their responses, then they don’t understand what they are saying, and they can’t be reasoning. The Chinese Room thought experiment applies: LLMs have a ton of training data to suggest how to respond to (or rather, complete) preceding blocks of text, with no sense of meaning or semantics.
Footnotes
[1] Arthur C. Clarke (1962). “Profiles of the Future: An Inquiry into the Limits of the Possible.”
[2] Noam Chomsky on Artificial Intelligence and ChatGPT: https://www.youtube.com/watch?v=_04Eus6sjV4
Other Relevant Links
https://www.youtube.com/watch?v=YAfwNEY826I (Yann LeCun: Can Neural Networks Reason?)
https://open.spotify.com/episode/1cDx1urFBxA5TdXQnG7Ds6?si=BgEzaE6_TMmiSyVXa8wH0A (Basically everything I wrote, and more!)