Can Large Language Models Really Reason?

Accurately Predicting the Next Token is Deep Understanding

Dagang Wei
Jun 23, 2024

The Big Question

Large Language Models (LLMs) like Gemini, GPT, and Claude have astounded us with their ability to generate human-like text, answer complex questions, and even write code. Yet LLMs are not specifically trained to reason; they are trained to predict the next token. So can LLMs truly reason?

The Code Reasoning Test

One interesting way to test an LLM’s reasoning capabilities is to see whether it can accurately predict the output of a program. Let’s consider a Python example:

def f(a=[]):
    a.append("hi")
    return a

print(f())
print(f())
print(f(["start"]))
print(f())

This seemingly simple code snippet has a surprising output:

['hi']
['hi', 'hi']
['start', 'hi']
['hi', 'hi', 'hi']

The reason for this lies in the mutable default argument (a=[]): the default list is created only once, when the function is defined. Each time f() is called without an argument, "hi" is appended to that same list object, so the "hi" values accumulate across calls.
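If you want to see this mechanism for yourself, here is a small sketch (assuming a standard CPython interpreter; the helper name g is just illustrative) that inspects the function’s stored default and shows the common None-sentinel idiom that avoids the accumulation:

def f(a=[]):
    a.append("hi")
    return a

# The default list is created once, at definition time, and stored on the function object.
print(f.__defaults__)  # ([],)
f()
f()
print(f.__defaults__)  # (['hi', 'hi'],): the same list object has been mutated

# The usual fix: use None as a sentinel and create a fresh list on every call.
def g(a=None):
    if a is None:
        a = []
    a.append("hi")
    return a

print(g())  # ['hi']
print(g())  # ['hi'], no accumulation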

LLM Prediction

When presented with this code, most LLMs correctly predict the output.

For example, the following is the answer from Gemini 1.5 Pro (https://g.co/gemini/share/15615accd266):

— —

The expected output of this Python code is:

['hi']
['hi', 'hi']
['start', 'hi']
['hi', 'hi', 'hi']

Here’s why:

  • Default Mutable Argument: The function f has a default argument a=[]. Default arguments are evaluated only once when the function is defined. Since lists are mutable, the same list object is used for each call to f() unless a different list is explicitly passed.
  • First Call: f() is called with no arguments. The default empty list a is used. "hi" is appended to it, resulting in ['hi'].
  • Second Call: f() is called again. The same list object from the previous call is used, so "hi" is appended again, resulting in ['hi', 'hi'].
  • Third Call: f(["start"]) is called. This time, a new list ["start"] is passed as an argument. "hi" is appended to it, resulting in ['start', 'hi'].
  • Fourth Call: f() is called again. The same default list (now ['hi', 'hi'] after the first two calls) is used, so "hi" is appended once more, resulting in ['hi', 'hi', 'hi'].

— —

This is a clear demonstration of understanding how mutable default arguments work in Python. It suggests that the LLM is not simply memorizing code patterns, but rather simulating the code’s execution step by step.

My Answer to the Big Question

Can LLMs reason? My answer is a clear yes!

How Did LLMs Gain the Ability to Reason?

Many people argue that LLMs simply predict the next word or token in a sequence, without any deeper understanding. While this is a fundamental aspect of how they work, the code execution example demonstrates that there’s more to it than that. Predicting the correct output requires a grasp of the underlying concepts and the ability to simulate how the code unfolds.
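One informal way to probe pattern-matching versus simulation (my own variation on the test above, not part of the original example) is to show the model a near-identical snippet whose gotcha has been removed. A model that merely associates “mutable default” with “accumulation” would predict the wrong output here:

# Variant of the earlier test: the signature looks the same, but the body
# copies the argument, so nothing accumulates across calls.
def f(a=[]):
    a = list(a)  # the copy breaks the tie to the shared default list
    a.append("hi")
    return a

print(f())           # ['hi']
print(f())           # ['hi'], not ['hi', 'hi']
print(f(["start"]))  # ['start', 'hi']
print(f())           # ['hi']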

The very process of training LLMs to accurately predict the next token is what enables them to gain a deep understanding of the data they’re trained on. By constantly making predictions and adjusting their internal representations based on feedback, LLMs learn the complex relationships between words, sentences, and concepts. This knowledge enables them to reason about code, language, and even real-world scenarios.
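To make “predicting the next token” concrete, here is a deliberately tiny toy: a bigram counter that always guesses the most frequent successor of a word. Real LLMs use neural networks over subword tokens rather than word counts, but the training signal is the same in spirit: get better at guessing what comes next.

from collections import Counter, defaultdict

corpus = "the list is mutable so the list keeps growing".split()

# Count which word follows which in the corpus.
successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(token):
    """Return the most frequently observed successor of token, or None."""
    candidates = successors.get(token)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

print(predict_next("the"))   # 'list', because 'list' followed 'the' twice
print(predict_next("list"))  # 'is' or 'keeps', each seen once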

Another Big Question: What is Reasoning?

To me, a practical definition of reasoning is: given a computer program and some input data, predict the execution process and the output. This definition applies to both humans and AIs. In essence, reasoning is the logical part of a thinking process, the part that can be expressed as a computer program.
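That definition has a nice property: it can be checked mechanically. Below is a minimal sketch of how such a prediction could be scored, where model_prediction stands in for whatever an LLM answers when shown the snippet (the function names are mine, not an established benchmark):

import subprocess
import sys

def actual_output(source: str) -> str:
    """Run a Python snippet in a subprocess and capture what it prints."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=10,
    )
    return result.stdout

def score_prediction(source: str, predicted_output: str) -> bool:
    """Reasoning test under the definition above: does the predicted
    output match what the program actually produces?"""
    return predicted_output.strip() == actual_output(source).strip()

snippet = """
def f(a=[]):
    a.append("hi")
    return a

print(f())
print(f())
"""

model_prediction = "['hi']\n['hi', 'hi']"
print(score_prediction(snippet, model_prediction))  # True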

References

Ilya Sutskever: GPT4 predicts the next word better
