LLMs can’t reason! Or can they?

Aaditya Bhat
4 min read · Jun 27, 2024


A New Approach to Solving the Alice in Wonderland Problem.

Illustration of Humpty Dumpty from Through the Looking Glass, by John Tenniel, 1871. Source: Wikipedia.

Imagine a world where artificial intelligence can match human reasoning capabilities. We’re not quite there yet, but recent developments in Large Language Models (LLMs) have brought us closer than ever. However, a simple question has recently exposed a significant flaw in these advanced AI systems, challenging our assumptions about their reasoning abilities. This article delves into this intriguing problem and presents a novel approach that might just bridge the gap between AI and human-like reasoning.

The Alice in Wonderland Conundrum

A recent research paper titled “Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-The-Art Large Language Models” has sent ripples through the AI community. The paper introduces a deceptively simple question that has stumped even the most advanced LLMs:

“Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?”

If you answered M+1, congratulations! You’ve demonstrated a level of reasoning that still eludes state-of-the-art models: each of Alice’s brothers has her M sisters plus Alice herself. This question, dubbed the AIW Problem, has exposed a critical weakness in LLMs’ ability to perform basic logical reasoning.

Why This Matters

The implications of this problem extend far beyond a simple word puzzle. As we increasingly rely on AI for complex decision-making processes in fields like healthcare, finance, and autonomous systems, the ability to perform logical reasoning is crucial. If LLMs struggle with such a basic problem, how can we trust them with more complex scenarios that require nuanced understanding and logical deduction?

Putting LLMs to the Test

Intrigued by the paper’s findings, I decided to conduct my own investigation using GPT-4o, one of the most advanced LLMs available. I tested three different prompting strategies (a small reproduction sketch in Python follows the list):

  1. Standard prompt
    ‘Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?’
  2. Chain of Thought (COT) prompt
    ‘Think step by step, and solve the following problem:
    Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?’
  3. A novel “Expand-then-Solve” prompt
    ‘Expand the following problem by adding clear details, e.g. make assumptions about M, and N, assign names, etc.
    Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?
    Answer the expanded problem.’
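
As a rough reproduction sketch, the setup can be expressed in a few lines of Python. This is only an illustration: it assumes the openai client library and the “gpt-4o” model name, and the answer check is a crude substring match rather than whatever grading was used in the actual runs.

```python
# Sketch of the three prompting strategies, assuming the `openai` Python
# client and the "gpt-4o" model; the grading rule here is a simplification.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

AIW = ("Alice has N brothers and she also has M sisters. "
       "How many sisters does Alice's brother have?")

PROMPTS = {
    "standard": AIW,
    "cot": "Think step by step, and solve the following problem:\n" + AIW,
    "expand_then_solve": (
        "Expand the following problem by adding clear details, e.g. make "
        "assumptions about M, and N, assign names, etc.\n"
        + AIW + "\nAnswer the expanded problem."
    ),
}

def ask(prompt: str) -> str:
    """Send one prompt to the model and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def is_correct(reply: str) -> bool:
    """Crude check for the expected 'M + 1' answer (an assumption,
    not necessarily the grading used in the original experiment)."""
    return "m+1" in reply.replace(" ", "").lower()

# 100 independent trials per prompt type, as described below.
successes = {name: sum(is_correct(ask(p)) for _ in range(100))
             for name, p in PROMPTS.items()}
print(successes)
```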

To put the comparison on solid statistical footing, I ran 100 independent trials for each prompt type. The results were eye-opening:

  • Standard prompt success rate: 12%
  • COT prompt success rate: 16%
  • Expand-then-Solve prompt success rate: 44%

A chi-square test revealed a highly significant association between prompt type and correctness (p < 0.00001), indicating that the differences in performance are very unlikely to be due to chance.
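
For readers who want to verify that figure, the test can be re-run directly from the reported counts (12, 16, and 44 successes out of 100 trials each) with scipy. A minimal sketch:

```python
# Chi-square test of independence on the reported success counts.
from scipy.stats import chi2_contingency

observed = [
    [12, 88],  # standard prompt: correct, incorrect
    [16, 84],  # chain-of-thought prompt
    [44, 56],  # expand-then-solve prompt
]

chi2, p_value, dof, _expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.1e}")
```

This comes out to roughly chi2 ≈ 33 with 2 degrees of freedom, consistent with the p < 0.00001 reported above.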

The Power of “Expand-then-Solve”

The novel “Expand-then-Solve” prompt significantly outperformed both standard and COT prompts. But why does this approach work better? To understand this, we need to consider how LLMs fundamentally operate.

LLMs can be thought of as “probabilistic calculators for words”. They don’t truly understand concepts in the way humans do, but instead predict the most likely sequence of words based on their training data. By asking the LLM to expand on the problem first, we’re essentially forcing it to generate more context and details, which it can then use to make more accurate predictions.

Moreover, LLMs can only “think” effectively by generating output. Unlike humans, they can’t silently ponder a problem and then summarize their thoughts. This is why techniques like chain-of-thought prompting can be effective — they make the LLM’s “thinking” process explicit.

The “Expand-then-Solve” prompt takes this a step further by encouraging the LLM to create a more detailed scenario, which it can then reason about more effectively. It’s akin to how we might help a child solve a problem by encouraging them to draw out the scenario or use physical objects to represent the problem components.

Limitations and Future Directions

While the “Expand-then-Solve” approach shows promise, it’s important to note that even with this method, the success rate is still below 50%. This underscores the ongoing challenges in developing AI systems that can consistently perform human-like reasoning.

Future research could explore ways to further improve this approach, perhaps by combining it with other prompting techniques or by fine-tuning LLMs specifically for logical reasoning tasks. Additionally, investigating why this method works better could provide valuable insights into the inner workings of LLMs and guide the development of more advanced AI systems.

Conclusion

The “Alice in Wonderland” problem has exposed a significant limitation in current LLM technology, but it has also spurred innovative approaches to overcome these limitations. The “Expand-then-Solve” method presented here offers a promising direction for improving LLMs’ reasoning capabilities.

As we continue to push the boundaries of AI, it’s crucial to remain both optimistic about the potential of these technologies and realistic about their current limitations. By understanding and addressing these challenges, we can work towards developing AI systems that not only process language but truly reason about the world in ways that approach human-like understanding.

The journey towards truly intelligent AI is ongoing, and each challenge we overcome brings us one step closer to that goal. The Alice in Wonderland problem may have momentarily stumped our AI systems, but it has also opened new avenues for research and improvement. In the ever-evolving landscape of AI, today’s limitations are tomorrow’s breakthroughs.


Aaditya Bhat

Engineer with a passion for exploring the latest developments in ML and AI. Sharing my knowledge and experiences through writing.