Large Language Models don’t have a strong model of the world, and that’s fine

Julien Colafrancesco
11 min read · Jun 11, 2023


Understanding the purpose and inner workings of Large Language Models (LLMs), such as ChatGPT, LLaMA, or Bard, can be challenging. This has sparked passionate discussions and reactions across society. In this context, I would like to share some thoughts on how close LLMs are bringing us to Artificial General Intelligence (AGI) and explore a few key questions:

  • Have we discovered the definitive architecture, one that enables computers to tackle any problem an animal can solve?
  • If not, what are the limitations of LLMs? What are their strengths?
  • Are these limitations inherent, and can we circumvent them?
  • Are LLMs a dead-end, necessitating a return to the drawing board?

The perspectives I share may be considered obvious to some and off the mark to others. Feel free to comment anyway; learning from you is all the payback I need.

Let’s move to the first hypothesis I would like to suggest: current state-of-the-art LLMs, particularly those based on the transformer architecture, parallel the intuitive reasoning of animals rather than analytical and deductive reasoning.

To illustrate this, let’s consider the way we count objects. E. L. Kaufman and colleagues observed in 1949 that our approach varies with the number of items present. For more than four objects, we count them sequentially, with an algorithm similar to:

Start with a total of zero, then repeat:
- Search for an item not yet counted;
- If one is found, mark it as counted and increment the total by one;
- Else, if no uncounted item remains, return the total and halt.
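
For concreteness, here is a minimal Python sketch of that sequential procedure (my own illustration, not something taken from Kaufman’s study):

```python
def count_items(items):
    """Count items one at a time, as we do beyond the subitizing range."""
    counted = set()                 # indices of items already counted
    total = 0
    while True:
        # search for an item not yet counted
        uncounted = [i for i in range(len(items)) if i not in counted]
        if uncounted:               # if found, mark it and increment the total by one
            counted.add(uncounted[0])
            total += 1
        else:                       # no uncounted item remains: return the total and halt
            return total

print(count_items(["red", "blue", "blue", "green", "red"]))  # -> 5
```

Note how the work grows with the number of items, matching the observation that counting slows down as the scene gets larger.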

This procedure takes longer as the number of items increases. Conversely, with up to four objects, we seem to intuitively recognize the number in an almost instantaneous and unconscious process, termed subitizing by Kaufman and colleagues.

The figure below demonstrates this phenomenon: we can instantly recognize four red dots, whereas accurately assessing the number of blue dots requires counting.

Extracted from Wikipedia’s article “Subitizing”.

Subitizing is not a calculated or analytical process, but an intuitive one. It’s similar to how we recognize colors and shapes, relying more on recognition than on following a strict procedure. This is where the definition of “intuitive reasoning” becomes clear: intuition is, in essence, recognition. Herbert Simon summarizes this perfectly:

The situation has provided a cue; this cue has given the expert access to information stored in memory, and the information provides the answer. Intuition is nothing more and nothing less than recognition.

Herbert Simon, 1992, “What Is an ‘Explanation’ of Behavior?”

In simpler terms, intuitive reasoning is about identifying patterns in our current observations that match past experiences. This process appears straightforward because it doesn’t involve lengthy, complex thought chains; instead, it’s about finding a correlation with something stored in our memory.

Returning to the example, one might question why we count at all. Why not use subitizing at all times, given its quick and efficient nature? Here are three possible reasons:

  • Subitizing, as a method based on past experiences, might not generalize to quantities we haven’t seen before.
  • Subitizing requires observing different item configurations in many varying arrangements for each quantity of interest. This variety allows us to identify perceptual regularities among different scenes with the same quantities. We learn to discount colors, shapes, positions, and other irrelevant dimensions. However, the required number of observations grows rapidly with the item count (since possible configurations increase exponentially), making intuitive recognition more challenging.
  • The counting algorithm aligns closely with the axiomatic mathematical definition of a number. Counting can thus be seen as an introduction to mathematics, whereas subitizing cannot. This mathematical door not only unlocks the concept of numbers but also other arithmetic concepts (like subtraction, multiplication, and exponentiation).

For an experiment, let’s ask ChatGPT (based on GPT-3.5) to count the number of words in a piece of text:

In this instance, ChatGPT counted the words incorrectly (the correct answer is 34). This failure is consistent: it happens frequently once the text contains more than 20 words or so. Even the newer version, based on GPT-4, while it may count correctly in this instance, tends to fail with slightly larger word counts. This shows that ChatGPT doesn’t have a robust way to count words: it lacks a systematic procedure applicable to any number of items. In short, its approach is more akin to subitizing than counting.
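
To make the contrast concrete, here is a minimal sketch (my own; ask_llm is a hypothetical helper standing in for a call to a chat model, not a real API): the deterministic count is a one-liner and stays exact at any length, while the LLM’s answer is a pattern-matched guess.

```python
def count_words(text: str) -> int:
    # Deterministic procedure: split on whitespace and count the pieces.
    # Exact for 20 words or 20,000.
    return len(text.split())

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat-model call; the model returns whatever
    # number "looks right" given the patterns it has seen in training.
    raise NotImplementedError("plug in your favourite LLM client here")

text = "The quick brown fox jumps over the lazy dog " * 5
print(count_words(text))                                      # always 45
# print(ask_llm(f"How many words are in this text? {text}"))  # often wrong past ~20 words
```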

This points to the question raised by David Chalmers:

If LLMs lack a model of something as fundamental as numbers, how can we expect them to build a coherent model of the world?

This deficiency in strong world modeling is frequently called out by researchers. Here is Rodney Brooks in a recent IEEE Spectrum interview:

IEEE Spectrum: It sounds like you don’t think GPT-5 or GPT-6 is going to make a lot of progress on these issues.

Brooks: No, because it doesn’t have any underlying model of the world. It doesn’t have any connection to the world. It is correlation between language.

Which brings me to the second idea of this post: while LLMs possess only a weak model of the world, limited to modeling text, this isn’t detrimental. They’re emulating intuitive reasoning quite effectively, and we shouldn’t expect them to go beyond these capabilities. Instead, we should focus on what they do well, as it’s a vital part of the puzzle. To build a strong model of the world, we should use different tools.

Recalling Moravec’s paradox

In this section, the key point is that even without considering LLMs, computers already excel at modeling the world. Sometimes, they’re even superior to animals in this respect, which is a primary reason we use them. Seeking robust world models from within LLMs might be unnecessary.

An engaging way to illustrate this is by remembering Moravec’s paradox:

It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.

Hans Moravec, 1988, “Mind Children”

This statement seems vastly different from what’s currently considered the major AGI bottleneck. Back then, replicating animals’ perception and motor skills in a computer was seen as the challenging task. Conversely, there were already successful attempts at replicating abstract reasoning abilities (playing chess, performing mathematical calculations, etc.).

In 1988, for example:

  • The question wasn’t whether a chess program could beat the average player, but whether it could defeat the world champion;
  • We had efficient software compilers due to advancements in formal language modeling (e.g., context-free grammars);
  • ANSYS, Matlab, and various other software programs were available to assist with mathematical and physical simulations.

While this list is not comprehensive, the key point is that significant achievements had already been made by the 1980s. Computers demonstrated their capability in managing complex world models and providing valuable information. This process has not disappeared with the arrival of LLMs; if anything, computers’ incredible modeling power is even more evident today.

So, to summarize:

  • Computers are good at handling complex world models and utilizing these models for complex analytical reasoning. This isn’t a new development.
  • Since 2012, computers have been making rapid advances in perceptual tasks and intuitive reasoning thanks to Deep Learning, even reaching human-level performances in some areas.

So, are we done? Why are there ongoing debates about LLMs lacking a world model?

I believe this relates to a significant upcoming transition: the notion that a single, all-encompassing Neural Network can solve complex problems may be fading. Although this idea has been a source of tremendous progress in the past, it may not be the universal solution we once thought it was.

The end of the end-to-end Neural Network paradigm

Over the past decade, one pattern related to deep learning has emerged as a significant driver of progress: the shift towards “end-to-end” Neural Networks. To understand what this means, let’s look at the field of image recognition before 2012, before deep learning became prevalent. At that time, a standard image recognition pipeline was organized as shown in the diagram below:

Extracted from Deep Learning for Vision Systems by Mohamed Elgendy

If we focus on the algorithmic steps, we mainly encounter three procedures:

  • Preprocessing: The image is standardized to eliminate some unnecessary variability in its content. This typically involves operations like up/down-sampling, camera calibration, color-correction, etc.
  • Feature extraction: The most prominent parts of the image (key points) are identified, and a vector describing the local visual content is computed around each of them.
  • Classification: A “minimalist” classifier is applied to the high-level features provided by the feature extraction process. This classifier could be a Neural Network or another type of classifier (e.g., Support Vector Machine, Naive Bayes). The critical point is that since feature extraction already provides high-level cues from the image, a simple classifier can yield good results. This is essentially the only learned part, and it’s relegated to a tiny portion of the whole algorithm (see the sketch after this list).
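
To make this concrete, here is a rough sketch of such a pipeline (my own simplification, assuming OpenCV and scikit-learn; a real system would add bag-of-visual-words pooling, calibration, and careful tuning):

```python
import cv2                      # OpenCV: preprocessing and SIFT feature extraction
import numpy as np
from sklearn.svm import SVC     # the "minimalist" classifier sitting on top

def preprocess(image_bgr):
    # Standardize the input: grayscale, fixed size.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (256, 256))

def extract_features(gray):
    # Hand-crafted feature extraction: SIFT keypoints and descriptors,
    # mean-pooled into a fixed-length 128-d vector (a crude stand-in for
    # the usual bag-of-visual-words step).
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:                      # no keypoints found
        return np.zeros(128, dtype=np.float32)
    return descriptors.mean(axis=0)

def train(images_bgr, labels):
    X = np.stack([extract_features(preprocess(img)) for img in images_bgr])
    return SVC(kernel="rbf").fit(X, labels)      # the only learned step
```

Everything above the final SVC.fit call is fixed, expert-designed code; learning is confined to that last, small step.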

This three-step view is a high-level overview. If we examine one level of granularity below, the computation sequence becomes even longer and more complex. For instance, here’s the pipeline to obtain one of the most widely used feature descriptors (SIFT):

I won’t delve into the details and explain each stage. What’s crucial is that in the past, the best image recognition algorithms were based on lengthy sequences of explicit and predefined computations, designed by experts.

This all changed in 2012 when Krizhevsky, Sutskever, and Hinton applied a single Deep Convolutional Neural Network (CNN) to image recognition. They completely eliminated the feature extraction step and minimized the preprocessing step (to simple normalization of pixel values). The gamble was that these explicit computations could be replaced by equivalent processes “learned” by the CNN. This proved to be correct: in 2012, CNNs outperformed all other methods on the ImageNet benchmark.
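
By contrast, here is a minimal end-to-end sketch in PyTorch (my own toy example, far smaller than the 2012 network): preprocessing shrinks to scaling pixel values, and everything else is a stack of layers whose roles are learned from labeled data.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    # A deliberately small stand-in for a deep convolutional classifier:
    # no hand-crafted features, just convolutions trained end to end.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):
        x = x / 255.0                         # "preprocessing" reduced to scaling pixels
        x = self.features(x)                  # learned feature extraction
        return self.classifier(x.flatten(1))  # learned classification

model = TinyCNN()
fake_image = torch.randint(0, 256, (1, 3, 224, 224)).float()
print(model(fake_image).shape)                # torch.Size([1, 10])
```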

To compare with the “classic” image recognition pipeline from above, here’s the representation of a ResNet-34, a small CNN used in image recognition:

Extracted from “Understanding and visualizing ResNets” by Pablo Ruiz

As can be seen, there are no explicit computational steps anymore, just layers of neurons. Their roles aren’t predefined but learned from a labeled image collection. Except for the last layers (i.e., Flatten, Dense, Softmax), all other layers are similar in structure: they’re convolutional layers that primarily differ in their size and in how their weights are configured. This flexible structure could be used for various tasks, but when presented with labeled images, it learns image recognition.
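
You can check this uniformity yourself with torchvision (assuming a recent version): almost every module in a ResNet-34 is a convolution, and only the final fully connected layer maps features to class scores.

```python
import torch
from torchvision.models import resnet34

model = resnet34(weights=None)   # the architecture alone, no pretrained weights

conv_layers = [m for m in model.modules() if isinstance(m, torch.nn.Conv2d)]
print(len(conv_layers))          # 36 Conv2d modules (including 1x1 shortcut convolutions)
print(model.fc)                  # Linear(in_features=512, out_features=1000, bias=True)
```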

In summary, advancements have arisen from recognizing that, for complex perceptual tasks, it’s more efficient to allow machines to learn patterns in the dataset rather than meticulously crafting complex algorithmic procedures based on our field expertise.

Stephen Wolfram also discussed this phenomenon in a recent essay:

In earlier days of neural nets, there tended to be the idea that one should “make the neural net do as little as possible”. For example, in converting speech to text it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that — at least for “human-like tasks” — it’s usually better just to try to train the neural net on the “end-to-end problem”, letting it “discover” the necessary intermediate features, encodings, etc. for itself.

There was also the idea that one should introduce complicated individual components into the neural net, to let it in effect “explicitly implement particular algorithmic ideas”. But once again, this has mostly turned out not to be worthwhile; instead, it’s better just to deal with very simple components and let them “organize themselves” (albeit usually in ways we can’t understand) to achieve (presumably) the equivalent of those algorithmic ideas.

Stephen Wolfram, 2023, “What Is ChatGPT Doing … and Why Does It Work?”

What happened with state-of-the-art (SOTA) image recognition also applies to other fields. The table below presents two additional examples where this principle is at work:

This pattern has been tremendously successful. Furthermore, there’s something intellectually satisfying about replacing numerous distinct steps with a single, homogeneous structure. This approach satisfies the scientific pursuit of explaining and predicting more with less.

However, a decade of progress based on this principle may be reaching its limit. I really doubt we should attempt to fix Large Language Models (LLMs) by using more data, modifying their architecture, or increasing their size. To put it simply: there may be nothing to fix. Current LLMs are excellent at intuitive reasoning, so why require them to develop a robust model of the world? In some regards, doing so would contradict their fundamental logic:

  • LLMs are not Turing complete: a transformer performs the same fixed amount of computation per generated token, regardless of how hard the question is;
  • They diverge: their auto-regressive nature (i.e., predicting the next token based on the previous ones) lets small errors compound as the output grows, setting a practical limit on the length of the reasoning chains they can sustain (see the note below).
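
A rough back-of-the-envelope illustration of that second point (my own, not a formal result): if each generated token stays “on track” with probability p, an n-token chain survives with probability roughly p^n. Even at p = 0.99, a 200-token derivation comes out intact only about 13% of the time (0.99^200 ≈ 0.13), whereas a deterministic counting or proof procedure doesn’t degrade with length at all.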

We should utilize LLMs for what they excel at: intuitive reasoning. This is already an outstanding accomplishment. LLMs can apply intuitive reasoning to areas where animals never could, with considerable success. This extension of intuitive reasoning sometimes gives the impression that LLMs can think analytically, but this is not actually the case.

What to do then?

So, if extending Large Language Models (LLMs) to enable analytical reasoning isn’t an option, what should be our next step? To ponder on this, let’s recap our current position:

  1. LLMs excel at intuitive reasoning (or pattern recognition), but they lack an analytical model of the world.
  2. Computers, generally, excel at working with analytical world models through specialized software. They can generate impressive numerical and logical results that a human could never replicate with just pen and paper.

Could the solution simply involve connecting the two? This may seem too straightforward, even inelegant (the thought of such a patchwork solution is initially displeasing), but on further consideration, what would the limitation be?

Let’s circle back to the initial example: assessing numbers. Here’s another attempt with ChatGPT to determine the number of words in a text, this time using the Wolfram Alpha plugin:

This is the correct answer, and there’s no reason to think it wouldn’t apply to larger texts. Wolfram Alpha uses a defined, deterministic algorithm that counts words individually. It truly has a numerical model and can reason analytically through it. Similarly, there’s no reason to believe this approach wouldn’t scale up to tackle much more complex problems, since most of the complexity would be managed by the analytical engine, leaving the LLM with the simple job of extracting meaningful parameters and properly injecting them.

While this may not be as straightforward as it sounds, LLMs are exceptional at translation, and this extraction-and-injection task is little more than translating from a natural language to a formal one. Some recent experiments that enforce context-free grammars on LLM outputs make me believe we are close to perfecting this kind of instruction forwarding.
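
As a sketch of that division of labour (my own toy example; ask_llm is again a hypothetical stand-in for a chat-model call, and the “analytical engine” here is just a small arithmetic evaluator rather than Wolfram Alpha):

```python
import ast
import operator

def ask_llm(prompt: str) -> str:
    # Hypothetical LLM call: translate a natural-language request into a small
    # formal expression, e.g. "What is 17 times 23?" -> "17 * 23".
    raise NotImplementedError("plug in your favourite LLM client here")

# A tiny, deterministic "analytical engine": safely evaluates arithmetic only.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def answer(question: str) -> float:
    formal = ask_llm(f"Translate into an arithmetic expression only: {question}")
    return evaluate(formal)     # the exact reasoning happens here, not in the LLM

print(evaluate("17 * 23"))      # 391: what the pipeline would return for that question
```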

In conclusion, it’s true that LLMs lack a robust model of the world. But is this really the issue? I think we can all agree that ChatGPT combined with Wolfram Alpha hasn’t revolutionized the world, nor can it be called an Artificial General Intelligence (AGI). But what’s missing? The answer isn’t clear, but that’s what makes this field so intriguing. Every time we reach a milestone, we realize it isn’t sufficient to attribute “real intelligence” to computers, prompting us to ponder what makes animals uniquely intelligent. Let’s continue this fascinating journey.
