Large Language Models: Reasoning Capabilities and Limitations

10 min readMay 10, 2023

Large language models (LLMs) have demonstrated impressive capabilities, and understanding their capabilities and limitations is essential for harnessing their full potential. While GPT has shown remarkable results just from scaling up in size, the underlying reasons for its success can be difficult to grasp.

Language models (LMs) are built as a kind of “autocomplete”, but thinking of it this way isn’t very helpful. It falls short of making clear how it was able to do all the things it was doing.

In this post, I’ll try to think through the capabilities and limitations of LMs, and I’ll do that by describing a helpful mental model, the “language games” mental model, which can help us understand the reasoning abilities of LMs.

Could a LM even be said to be “reasoning”?

How do we know if the text output by a large language model amounts to “reasoning”? Large language models can definitely show human-like reasoning capabilities, but how much of that is just a facade? How do we distinguish real reasoning capabilities from simple dumb imitation?

It’s important to understand reasoning and intention as descriptions for making sense of the behavior of something. They’re descriptions that are good insofar as they are useful for understanding behavior.

Here I’m aligned with the philosopher Daniel Dennett, who calls this the “intentional stance”, which emphasizes that it’s a way of understanding behavior. The “intentional stance” means describing the behavior of something in terms of its beliefs and desires.

Consider, for instance, a chess-playing program. It’s possible to look into how it works and the sorts of things it will be considering at the beginning of each move, but there’s also the description of its intentions: “it made that move because it wants to avoid going into check the next turn.” That’s a perfectly reasonable thing to say, and definitely helps us understand its behavior.

It can be asked whether this is a fair description, though, since LMs certainly don’t have feelings in the same way humans do. There may be similarities between how LMs reason and how humans do, but there are fundamental differences.

A good analogy to think of in this regard is airplanes. When we build machines that can fly, they do so very differently from birds. For a bird, feeling and responding to the wind is an important part of flying, but an airplane doesn’t feel the wind in the same way. They both fly, but in very different ways. Same with A.I.: it can do some of the same things as humans do, but it does it in a very different way, and that’s fine. There aren’t feelings behind the words that LLMs use, but that doesn’t mean they can’t write words that express an intention.

Another idea we can use, which also comes from Daniel Dennett, is the “sorta” operator. We can put the word “sorta” in front of a statement about an A.I.’s intentions to show that the description may break down outside of narrow contexts. The “sorta” operator lets us still call it “belief” or “desire”, despite the description falling short sometimes.

E.g. a chess-playing computer is “sorta” intentional, since describing its behavior through intentions works in some restricted domain.

Another example: when a kid says “daddy is a doctor”, do they understand what that means? Maybe to some extent, but not as richly as an adult would.

So are they truly “reasoning”, or just doing something that looks like it? That question is too big for this post, but I do hope you’ll allow me to say that they can “sorta” reason. As long as its imitations of reasoning looks correct enough in a narrow domain, we can say that it is “sorta” reasoning. By allowing ourselves to say that it is “sorta” reasoning, this means we can use a certain kind of description of how it behaves based on its beliefs and desires.

What forms of “sorta” reasoning can LLMs do?

Ok, so assuming we are comfortable saying that LMs can “sorta” reason, what are the limitations of that reasoning ability? And if LMs can truly “reason”, how do we understand the capabilities and limitations of that ability? What are the reasoning capabilities and limitations of LMs in general, and how do we determine the reasoning capabilities and limitations of a particular LM?

It’s not surprising that an LM would say something false. It’s surprising that it can be made to say something true with any amount of regularity.

This is probably the biggest place where mental models like the “database with interpolation” mental model falls short: how could a LM be made to provide true information with any regularity at all? In the “database with interpolation” model, there’s no way to guarantee it.

This mental model is good for getting an intuition for how LMs work, but the idea has its limitations. The reason this falls short is that it is treating true statements as simply data, not as knowledge. Shifting the mindset from thinking of true facts as “data” to “knowledge” is what helps to make sense of how to work with LMs.

There was a period of philosophy in the 1900s where thinkers were trying to figure out what language was, and how natural language was connected to math and logic. It was difficult because the language we use everyday is often vague.

Some sought to eliminate this vagueness in language. If natural language is ambiguous, then it needs to be clarified so it can fit into a true/false logical expression.

For example, Gottlob Frege, in a review of another philosopher’s work, wrote:

… when properly expressed, a thought leaves no room for different interpretations. We have seen that ambiguity simply has to be rejected …

-Gottlob Frege

So philosophers broke down language in the following way: linguistic acts have content and a “force”, which is to say they are an action. For example, “Rover is sitting” is an assertion, and “Is Rover sitting?” is a question. “Rover sit!” would be a command. All three have the same content, but have different “forces”, since they did very different things. Some philosophers sought to understand language by describing a taxonomy of all possible linguistic forces available.

In contrast, a philosopher named Wittgenstein approached things differently. He argued that we should think of language as analogous to playing a game. That game could be something like football, chess, or solitaire. He used this analogy to show how acting with language can be vague or dependent on the context, but still follow regularities and rules.

When playing a game, there are certain rules set in place for how to play, and often you’re working towards some objective. We should understand using language as if it were similar: we do things that adhere to rules and often do them in order to further some objective.

Wittgenstein argued that language being vague was not an imperfection, but a key aspect of how it worked. Sometimes language just is vague, and that’s because its role in a language game is vague.

He also argued that linguistic acts only have their “force” in context of the “language game” being played. This means that natural language does not adhere to some fixed taxonomy of actions. It’s impossible to list out all of the possible things one can do with language because the role of linguistic actions is contextual to a language game and is always a part of an activity.

Instead of thinking of LLMs as “running programs”, we should think of them as playing “language games”.

Language models are probabilistic and sometimes unpredictable, which makes them a poor choice for doing something that looks like running a program. But this is a strength, as they are able to work with natural language and all of its ambiguities.

An example of a task that isn’t particularly well-suited for LMs is solving a mathematical equation. An example of a task that would be well-suited for a LM is interpreting and rewriting mathematical formulas that might be written with an ambiguous or vague order of operations.

Since language is used in the context of a language game, we might say that trying to form new “prompting languages” is somewhat backwards. This is trying to force LMs back into the mental model of thinking of them as programs, as unambiguous instructions. These “prompting languages” often artificially limit LMs and don’t use them for what they are good at.

Let’s think through the “language as a game” metaphor some more.

Rules for a game that define some behavior are called constitutive rules. They define what those actions are in the context of the game. These typically are phrased as “when such-and-such occurs counts as ___”. For example, “having possession of the ball in the endzone of the other team counts as a touchdown”. These rules help define actions that are unique to the game.

Rules for a game that regulate the behavior that exists without them are called regulative rules. This includes social etiquette, and often is phrased using terms such as “ought”. For example, when playing chess, it’s considered polite to tell another player when they are “in check”. These rules don’t define the behaviors in the game, but they define the social context around which the game is played.

Constitutive rules allow for new forms of behavior in a game. It allows for a higher-level description of some set of behavior to be used. Without a constitutive rule to define that behavior as being a certain action, that higher-level description is not valid.

You can think of constitutive rules as in some senses defining a game. Since they help define the behavior that is available in the context of the game, understanding them is a key part of learning how to play a language game. LMs are able to learn to act according to these constitutive rules when you train them on examples of acting according to those rules.

This is why large language models like GPT-4 are good at incremental tasks, but bad at higher-level discontinuous tasks: it has tons of context on common actions, but far fewer on novel, or rarely seen, actions.

Regulative rules can be enforced via reward functions for the LM. Since they are not part of the language game itself, but are part of the surrounding context, they are best enforced using constraints defined outside of the strict rules of the game. That’s why it would be difficult to enforce them only through specific examples of the language game being played. This is what ChatGPT did especially well with its RLHF model.

Another solution that doesn’t involve retraining a LM is using prompt engineering. Prompt engineering can help instruct a LM on how to behave with respect to a language game. A large language model like GPT-4 isn’t going to “understand” exactly what you want when you ask for a particular thing, but it likely “understands” many different language game “moves” from everyday speech. If you want to teach it to play a language game, one way to do that is to provide specific instructions for how to play in terms of everyday language.

For example, see this Reddit post where someone crafts a prompt instructing GPT-4 to behave like Mickey Mouse. Many pieces of the prompt are constructed in order to prevent the end user from “jailbreaking” or hacking the chat and making it do something unintended.

What’s notable about this example is that many parts of the prompt look like explicit instructions for how to play a particular language game. It defines the “moves” that it’s allowed, and not allowed, to make, and the way it is meant to approach playing the game.

Prompt engineering is a short term solution to alignment issues, however. The long term solution is training the LM for that specific language game. Truly automating a task with A.I. requires there to be no prompt engineering, since that would mean that there is task alignment. (Alignment in this case means that the A.I. has the correct desires while playing the language game.)

There’s more to say about the analogy of LMs as players of “language games”, but the above helped me understand the capabilities and limitations of LMs, especially in contrast to other forms of language-based AI. They’re better suited for manipulating language when there is some amount of ambiguity or vagueness present. They operate according to rules, but those rules are best not thought of as programs they are running, but rather as games they are playing.

Large Language Models: Reasoning Capabilities and Limitations

Could a LM even be said to be “reasoning”?

What forms of “sorta” reasoning can LLMs do?

Written by Karl