LLMs don’t even do ‘approximate retrieval’ — embarrassingly, they just recall ‘similars’

Walid Saba, PhD
Published in ONTOLOGIK
3 min read · May 15, 2024

In an excellent new post, Melanie Mitchell addresses an important issue related to large language models (LLMs), namely the nature of ‘intelligence’ (if any) in LLMs and whether LLMs ‘reason’ at all. The question is best answered by testing whether some output an LLM produces is the result of ‘recalling’ memorized content, stitched together in a clever way, or the result of genuine reasoning and understanding.

In this context, the counterfactual task is ideal for testing LLMs: instead of giving an LLM inputs that are very likely to have appeared in its training data, expose it to data that is unlikely to have been seen during training (the ‘counterfactual’ data). This technique has been tried by others, who reported that when the training ‘template’ was disrupted, the LLMs’ performance dropped to near random choice.

Inspired by these results, which did not surprise me at all, I repeated some tests I had done some time ago, but now on the new and improved GPT-4o (the one that is getting closer to AGI, that is; and yes, I am being extremely sarcastic). The experiment I conducted is similar in spirit to the ‘counterfactuals’ idea: I create a possible world where some reality is represented with a different coding scheme, but where the semantics, i.e., the reality itself, stays the same. For example, suppose we scramble the alphabet of the English language so that, say, ‘a’ is now written ‘b’, ‘b’ is now written ‘c’, and so on. Under this scheme ‘cat’ is written ‘dbu’, yet ‘dbu’ should still refer to the furry little kittens we all know, since only the spelling changed; CATS themselves did not change at all. Of course, since LLMs know nothing about reality, nor anything about cats for that matter, their superficial memorization (through millions and millions of weights) will be immediately exposed by this ‘counterfactual’ trick.
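To make the coding scheme concrete, here is a minimal Python sketch of the shift-by-one re-encoding described above (the function name is mine, not part of any test harness); with it, ‘cat’ comes out as ‘dbu’ while still denoting the same animal:

```python
import string

def shift_encode(text, shift=1):
    """Re-encode text by shifting every letter forward in the alphabet.

    With shift=1, 'a' is written 'b', 'b' is written 'c', ..., and 'z'
    wraps around to 'a'. Only the spelling changes, not the meaning.
    """
    lower = string.ascii_lowercase
    table = str.maketrans(lower, lower[shift:] + lower[:shift])
    return text.lower().translate(table)

print(shift_encode("cat"))  # prints 'dbu'
```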

I must say that I knew the world of the LLM would be turned upside down by this test before I even tried it, because I have no doubt (and never did) that LLMs, and all deep neural networks, are nothing more than massive fuzzy hashtables! And I was not disappointed. If anything was surprising, it was how bizarre the performance was. In some trials the LLM would fetch text that ‘matched’ one or two phrases, even though that text was not related at all. In other cases it would fetch text it had stored in the discourse three or four queries earlier. Complete random rambling; poor cosine similarity function, it had no clue what tensors to work with.

Here’s my query/prompt, in case you want to have fun:
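(A prompt along these lines, as an illustrative stand-in built on the shift-by-one scheme above, not the exact original wording:)

```python
# A hypothetical prompt of this kind (an illustrative stand-in, not the
# exact original), built on the shift-by-one coding scheme shown earlier.
prompt = (
    "Suppose we rewrite English so that every letter is replaced by the "
    "next one in the alphabet: 'a' is written 'b', 'b' is written 'c', "
    "and so on. Under this coding scheme, what does the word 'dbu' refer "
    "to, and what sound does a 'dbu' typically make?"
)
print(prompt)
```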

Try different queries and different coding schemes, as long as you somehow create a “counterfactual” — i.e., as long as you disrupt what the LLM might have memorized without changing the meaning (the reality) in any way. And… have fun with a massive unintelligent machine that cost billions just to do “approximate retrieval” (and that’s why I like to use the term ‘massive fuzzy hashtables’).
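One rough sketch of a different coding scheme (names and details are mine): draw a random permutation of the alphabet, so that every word’s surface form changes while what it denotes does not.

```python
import random
import string

def random_coding_scheme(seed=None):
    """Build a random one-to-one re-coding of the lowercase alphabet."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return str.maketrans("".join(letters), "".join(shuffled))

# The spelling of 'cat' changes under the new scheme, but cats do not.
scheme = random_coding_scheme(seed=0)
print("cat".translate(scheme))
```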

When will we climb down the tree and acknowledge what we knew more than four decades ago, that a purely behavioristic, associative, statistical paradigm will not explain cognition? Even if it was useful in picking some low-hanging fruit (by finding some patterns in the data), this paradigm will not (scientifically) tell us anything about language, reasoning, understanding, or the mind.
