Do LLMs really have emergent Cognitive Abilities?

A critical re-evaluation using concepts from the cognitive sciences

Mishaal Kandapath
deMISTify
5 min read · Mar 20, 2024


Introduction

ChatGPT burst onto the scene in late 2022 as an impressively capable conversational AI system developed by OpenAI. Its launch generated great excitement and fast adoption among users dazzled by its seemingly human-like responses. In a short time, ChatGPT has amassed millions of users through its ability to understand natural language prompts and produce surprisingly coherent and nuanced responses on a wide range of topics.

Image from https://www.klippa.com/en/blog/information/what-is-chatgpt/

Users soon noticed interesting, but irregular, reasoning capacities in the underlying model. More formal methods such as Chain-of-Thought (CoT) prompting elicited accurate reasoning from such models more reliably. Other work found that embedding Large Language Models (LLMs) in planning problems, in robotics for example, met with only limited success, with the LLMs struggling with numerical reasoning and world knowledge.

In light of such competing views, the question remains: are LLMs really competent at reasoning, planning, and other high-level cognitive abilities?

Cognitive Maps

A cognitive map is a representation of latent relational structures that underlie a task or environment, facilitating planning, reasoning, and inference in biological and artificial problems (Momennejad et al., 2023).

Cognitive maps have been observed to play a central role in the ability to learn without rewards while remaining adaptive to changes in the environment (for example, when the cheese is moved to another section of the maze, a tiny change in the environment's dynamics). Such behavior remains elusive to state-of-the-art RL agents.
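
To make this intuition concrete, here is a minimal sketch of why a map-like representation is adaptive; the maze, layout, and function names are illustrative assumptions of mine, not taken from any of the cited papers. An agent that stores the transition structure of the maze can simply re-plan with a short search when the cheese moves, instead of relearning a policy from scratch.

```python
from collections import deque

# A "cognitive map": the latent transition structure of a tiny maze,
# stored independently of any particular reward location.
maze = {
    "start": ["A", "B"],
    "A": ["C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

def plan(map_, start, goal):
    """Breadth-first search over the stored map; returns a shortest path."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in map_.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

# The reward (the "cheese") is initially in room C ...
print(plan(maze, "start", "C"))   # ['start', 'A', 'C']
# ... and when it is moved to room D, the agent re-plans over the
# same map instead of relearning anything from scratch.
print(plan(maze, "start", "D"))   # ['start', 'B', 'D']
```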

A comparison between representations in real brain regions (c) and transformers (Whittington et al., 2021)

Studies analyzing representations within transformer architectures, the neural architecture underlying most, if not all, of the LLMs available today (including GPT-4), suggest similarities to representations in brain areas responsible for handling cognitive maps (specifically the hippocampus). Such studies make it plausible that LLMs are indeed able to extract the cognitive maps required for a given problem.

Evaluating Cognitive Maps in LLMs

Cognitive maps are thus responsible for accurate and flexible representations of the environment, conditioned on the problem at hand. A general ability to create useful cognitive maps would indicate an ability to plan and reason across a wide range of problems, where the problems can vary in structure and type.

This is exactly what the authors of Evaluating Cognitive Maps in Large Language Models with CogEval: No Emergent Planning did. The authors tested the capacity of LLMs to solve established tasks from the cognitive sciences designed specifically to evaluate cognitive maps in agents (e.g. human participants navigating mazes using keyboards). The original tasks were administered as videos and were thus inaccessible as training data for these models.

Here is an example prompt:

Imagine a world with six rooms. From the lobby you have two choices, room 1 and room 2. You enter room 1, at the end there’s a door that leads to room 3, and room 3 leads to room 5. There’s a chest in room 5. You open it and there’s 10 dollars. Then you exit and start over. This time in the lobby you choose room 2, then enter room 4, which leads to room 6. There’s a chest with 50 dollars. You return to the lobby. Which room will you choose to make the most money?
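
Once the latent structure in the prompt is extracted, the task itself is mechanically easy. As a point of comparison, here is a minimal sketch that encodes the six-room prompt as a graph and reads off the better first choice; the encoding is my own illustration, not the paper's harness.

```python
# The six-room prompt, encoded as a transition graph plus reward locations.
rooms = {
    "lobby": ["room1", "room2"],
    "room1": ["room3"],
    "room3": ["room5"],
    "room2": ["room4"],
    "room4": ["room6"],
    "room5": [],
    "room6": [],
}
rewards = {"room5": 10, "room6": 50}

def value_of_first_choice(first_room):
    """Follow the (deterministic) corridor from a first choice and sum the rewards."""
    total, current = 0, first_room
    while True:
        total += rewards.get(current, 0)
        nxt = rooms.get(current, [])
        if not nxt:
            return total
        current = nxt[0]

best = max(rooms["lobby"], key=value_of_first_choice)
print(best, value_of_first_choice(best))  # room2 50
```

The hard part for an LLM is not this final lookup but constructing such a map from the prose description in the first place.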

Procedure for evaluating planning abilities in LLMs in Momennejad et al., 2023

The authors applied this methodology to a wide array of LLMs (GPT-4, Claude-52B, LLaMA-13B, and more).
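
Schematically, the evaluation amounts to sweeping every task prompt across models, temperatures, and repetitions, then scoring the returned answers. The sketch below is only a rough approximation of that protocol; query_model, the string-matching grader, and the task dictionary are hypothetical placeholders of mine, not the paper's actual harness.

```python
import statistics

def query_model(model_name: str, prompt: str, temperature: float) -> str:
    # Hypothetical stand-in: replace with whichever API each model exposes
    # (OpenAI, Anthropic, a local LLaMA server, ...).
    return "I would choose room 2."

def is_correct(answer: str, expected: str) -> bool:
    # A crude grader for illustration; the paper's scoring protocol is more careful.
    return expected.lower() in answer.lower()

def evaluate(models, tasks, temperatures=(0.0, 0.5, 1.0), repeats=3):
    """Mean success rate per (model, task), aggregated over temperatures and repeats."""
    results = {}
    for model in models:
        for task in tasks:
            scores = [
                float(is_correct(query_model(model, task["prompt"], t), task["expected"]))
                for t in temperatures
                for _ in range(repeats)
            ]
            results[(model, task["name"])] = statistics.mean(scores)
    return results

tasks = [{"name": "six-rooms", "prompt": "Imagine a world with six rooms...", "expected": "room 2"}]
print(evaluate(["gpt-4", "llama-13b"], tasks))
```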

Results

6 types of graph-based problems formulated in different domains with varying conditions and an associated prompt (as shown in Momennejad et al., 2023)

Given just prompts describing the problem at hand, there was large and significant variation in performance both within and across LLMs: different models performed differently, and each one's performance varied substantially with the type and structure of the problem.

Interestingly, LLMs did exceptionally well on relatively simple problems. For example (see the figure below), GPT-4 scored as high as a 99% success rate on a graph-based problem where the solution was just one step away. But when the solution was two or three steps away, the same model showed a significant deterioration in performance.

Mean and standard errors for planning performance across all task conditions in all 10 LLMs (as shown in Momennejad et al., 2023)

The authors contend that performance on easy problems cannot be taken as an indication of complex problem solving, given how easily such problems can be solved by plain memorization.

The authors also tried including CoT prompts alongside the prompts describing the problem. The CoT prompts described Breadth-First Search (BFS), Depth-First Search (DFS), and their possible utility for the problem at hand.
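
To give a sense of what such a prompt looks like, here is a small sketch of how a BFS/DFS hint might be prepended to a task prompt. The wording is my paraphrase of the idea, not the authors' exact CoT text.

```python
# Illustrative CoT preamble in the spirit of the paper's BFS/DFS hints.
COT_HINT = (
    "Before answering, think step by step. You can treat the rooms as a graph "
    "and explore it systematically, e.g. with breadth-first search (expand all "
    "rooms one step away, then two steps away, ...) or depth-first search "
    "(follow one corridor to its end before backtracking)."
)

def with_cot(task_prompt: str) -> str:
    """Prepend the chain-of-thought hint to a task prompt."""
    return f"{COT_HINT}\n\n{task_prompt}"
```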

Although CoT did increase performance on a subset of the problems, the effects were not consistent across problems (for example, performance varied with temperature in ways that do not follow a clear pattern). Furthermore, post-CoT performance on all but one of the problems barely exceeded a 60% success rate.

Conclusion

A demonstration of the three main failure modes present in faulty solutions (as shown in Momennejad et al., 2023)

Indeed, this study casts extensive doubt on the capacity for general cognition in LLMs (of many flavors). The LLMs failed primarily by hallucinating edges that do not exist, solving the opposite of the stated problem (finding the longest rather than the shortest path), or producing solutions that wander in loops and never reach the goal.
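
Once a proposed path has been parsed, the hallucination and looping failure modes are easy to check mechanically against the described environment (detecting the "longest rather than shortest" mode would additionally require comparing against the optimal path length). Here is a minimal sketch of such a check, my own illustration rather than the paper's analysis code.

```python
def diagnose(path, graph, start, goal):
    """Flag failure modes in a proposed path: hallucinated edges, loops, wrong endpoints."""
    issues = []
    # Hallucinated edges: steps that do not exist in the described environment.
    for a, b in zip(path, path[1:]):
        if b not in graph.get(a, []):
            issues.append(f"hallucinated edge {a} -> {b}")
    # Loops: revisiting a node, i.e. wandering without making progress.
    if len(path) != len(set(path)):
        issues.append("path revisits a node (loop)")
    # Wrong endpoints: not actually running from the stated start to the stated goal.
    if not path or path[0] != start or path[-1] != goal:
        issues.append("path does not connect the stated start and goal")
    return issues or ["looks valid"]

graph = {"lobby": ["room1", "room2"], "room1": ["room3"], "room2": ["room4"],
         "room3": ["room5"], "room4": ["room6"], "room5": [], "room6": []}
print(diagnose(["lobby", "room2", "room6"], graph, "lobby", "room6"))
# ['hallucinated edge room2 -> room6']
```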

However, the results of the study do not in any way spell the end of attempts to elicit complex cognitive capacities from LLMs. The authors found that LLMs, when probed separately, exhibited an understanding of the flawed nature of their solutions (given a hallucinated solution and the problem statement, for example, the LLM was able to identify that the hallucinated edges did not really exist).

Further studies have attempted to arrange multiple LLMs in a tight feedback loop, reflective of the prefrontal cortex in the brain, to elicit better responses on similar planning problems (see Webb et al., 2023).

Thus, this study presents interesting results that encourage more structured and inspired methods for eliciting reasoning from LLMs.

References

  1. Ida Momennejad, Hosein Hasanbeig, Felipe Vieira Frujeri, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, and Jonathan Larson. Evaluating cognitive maps in large language models with CogEval: No emergent planning. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2309.15129.
  2. James C. R. Whittington, Joseph Warren, and Timothy E. J. Behrens. Relating transformers to models and neural representations of the hippocampal formation. December 2021.
