Here is why ChatGPT lacks Cognitive Consonance for Context Extrapolation.
Humans have an innate ability to look at two similar ideas, or ‘truth-hoods’, and recognize that they are contextually or logically identical. This ability is called “True Equivalence” or “Cognitive Consonance”.
The polar opposite of ‘Cognitive Consonance’ is, of course, ‘Cognitive Dissonance’. Cognitive dissonance is when you establish a False Equivalence between two contextually dissimilar ideas and believe it to be true.
It is important to remember that the absence of cognitive consonance does not mean the system demonstrates cognitive dissonance. It only means the system exists somewhere on the spectrum where it cannot relate two similar ideas.
My biggest qualm with LLMs is that they cannot exhibit cognitive consonance.
Example of Cognitive Consonance
Here are two mathematical formulas.
- ∑ᵢ₌₁ᴺ a(zᵢ) ⋅ (log a(zᵢ) − log c(zᵢ))
- ∑ᵢ₌₁ᴺ p(xᵢ) ⋅ (log p(xᵢ) − log q(xᵢ))
If you ask a reasonably bright 7th grader whether these two formulas are the same, you will get the correct answer. (Yes, they are the same formula; only the variable names differ.) This is cognitive consonance.
You know that the names of the variables are irrelevant. The first formula does not follow the statistical convention for variable naming, while the second follows the de-facto convention.
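To see why the 7th grader is right, here is a minimal sketch (the function name and the example distributions are my own): renaming the arguments of the summation changes nothing about the value it computes.

```python
import math

def kl_divergence(p, q):
    """Sum over i of p_i * (log p_i - log q_i)."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# Two probability distributions over the same outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Call the same distributions 'a' and 'c' instead of 'p' and 'q':
# the formula, and its value, are identical.
a, c = p, q
assert kl_divergence(p, q) == kl_divergence(a, c)
```

The variable names are labels for the reader, not part of the mathematics.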
Let’s run this by ChatGPT:
Now, if you ask a reasonably sensible system to explain these formulas, we would like to hear what the formulas represent as an “idea”, not a ‘verbatim’ description of their symbols. A verbatim description is exactly what happened with the first formula.
On the other hand, when you run the second formula by ChatGPT, it now clearly regurgitates the idea behind the formula.
- It now recognizes that this is a KL Divergence formula.
- It now moves the conversation toward relative entropy and how we use this to measure the difference between two probability distributions that are based on the same independent variable.
- It explains that this formula measures the relative entropy between two functions that generate possible probability distributions.
Do you wonder why changing the variables’ names suddenly made ChatGPT sound like a statistical oracle?
Well, because it is a regurgitation engine without any cognitive abilities, one that has learned its representation of the world from “tokens”.
Some preamble to understand how LLMs make sense of the world.
Let’s understand this in layman’s terms. LLMs are “token” based systems. What do I mean by that? A token-based system learns the representation of the world based on the symbols that are used to describe the world.
In the case of LLMs, the tokens are “partial words” or “whole words”. For example, a token size of 4 means that the system understands a sentence by breaking it into chunks of four characters at a time (inclusive of punctuation, spaces, etc.).
Let’s look at the following sentence:
“And the cat’s in the cradle and the silver spoon. Little boy blue and the man in the moon.”
Here the tokens are as follows: [And ], [the ], [cat’], [s in], [ the], [ cra], [dle ], [and ]… etc.
This is one schema. Other schemas can be based on whole words, or token size of 5, 6, 7 etc…
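The fixed-size chunking scheme above can be sketched in a few lines (this is only an illustration of the schema described here; real LLM tokenizers, such as BPE tokenizers, learn their vocabulary from data instead of cutting at fixed positions):

```python
def naive_tokenize(text, size=4):
    """Split text into fixed-size character chunks,
    spaces and punctuation included."""
    return [text[i:i + size] for i in range(0, len(text), size)]

sentence = "And the cat's in the cradle and the silver spoon."
tokens = naive_tokenize(sentence)
# First few chunks: "And ", "the ", "cat'", "s in", ...
```

Changing `size` to 5, 6, or 7 gives the other fixed-size schemas mentioned above.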
There are 2 ways LLMs make sense of tokens and provide them contextual meaning.
- Token Embeddings
- Attention
Token Embeddings: The “meaning” of a token is provided by an embedding vector that measures the importance of a word based on its co-occurrence with other words within “N hops”.
In other words: how often does a given word occur next to the immediately adjacent word, the word next to that one, the words 3 positions away in the sentence, and so on, given a large amount of knowledge contained in a text corpus.
So technically, a token is represented by its word embedding, a point in a vector space.
The cleanliness, biases, content, knowledge, and information in the text corpus MATTER A LOT when you train an LLM, as this is where it learns its representational world view.
For an example of what a ‘hop’ means in the sentence, “And the cat’s in the cradle and the silver spoon. Little boy blue and the man in the moon.”;
- The token ‘cradle’ is 3 hops to the right of the token ‘cat’s’, so we call this a +2 hop. (We count from zero in computer science.)
- ‘cradle’ is 3 hops to the left of ‘silver’, so this is co-occurrence for words at a distance of −2 hops.
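The hop-based co-occurrence counting can be sketched as follows (the window size and the simple pair-counting scheme are my own simplification; real embedding methods such as word2vec or GloVe weight these pairs more carefully):

```python
from collections import Counter

def cooccurrence_counts(words, max_hop=3):
    """Count pairs of words that occur within max_hop positions
    of each other, scanning left to right."""
    counts = Counter()
    for i, w in enumerate(words):
        # Look ahead up to max_hop positions from word i.
        for j in range(i + 1, min(i + 1 + max_hop, len(words))):
            counts[(w, words[j])] += 1
    return counts

words = "and the cat's in the cradle and the silver spoon".split()
counts = cooccurrence_counts(words)
# 'cradle' falls within 3 hops of "cat's", and 'silver' within 3 hops of 'cradle',
# so both pairs are counted as co-occurrences.
```

A large corpus turns these raw counts into the statistics an embedding is trained on.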
Attention: Attention is a mechanism that allows a deep learning model to selectively focus on certain parts of the input sequence of tokens (words) while processing it. The idea behind Attention is inspired by how humans focus on different parts of a scene when processing visual information. For example, we tend to focus more on certain words that convey important information when reading a sentence.
For example, Attention helps disambiguate the context for “it” in a sentence like “The animal didn’t cross the street because it was too tired”, suggesting what ‘it’ refers to.
Note that Attention is learned ONLY after each token gets its context score from the token embeddings based on co-occurrence. So the MOST important input for any LLM, first and foremost, is the co-occurrence.
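The Attention mechanism can be sketched as single-query scaled dot-product attention (a bare-bones plain-Python version; real models batch this into matrix multiplications over learned query/key/value projections):

```python
import math

def softmax(xs):
    """Normalize scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector:
    score each key against the query, softmax the scores,
    and return the weighted average of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key, so the output leans toward the first value.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```

The softmax weights are what let the model “focus” on some tokens more than others.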
Why did ChatGPT fail to demonstrate cognitive consonance?
Given the preamble, this should be obvious by now. Nope? Let’s break it down.
As we established, LLMs understand the world based on tokens, word embeddings (cooccurrence), and the importance of a word or token in a sequence (Attention).
If a large portion of how the world works is fed to the system through a “textual corpus” of tokens that follow certain conventions, then not adhering to those conventions changes the importance of the tokens and the idea they may represent.
This is exactly what happened in the Math formulas.
A large portion of probability and statistics uses a de-facto convention for variable naming.
- ∑ᵢ₌₁ᴺ a(zᵢ) ⋅ (log a(zᵢ) − log c(zᵢ)): This does NOT follow the convention for variable naming.
- ∑ᵢ₌₁ᴺ p(xᵢ) ⋅ (log p(xᵢ) − log q(xᵢ)): This does!
Note that as soon as I changed the variable names in the KL Divergence formula, ChatGPT failed to recognize that the idea behind it is to compare two probability distributions.
But as soon as I use the convention, it ‘regurgitates’ a bunch of memorized tokens. (That is a careless statement on my part. There is no static memorization; think of it instead as a dynamic working memory, a humongous cloud of floating-point vectors where all the co-occurrence and attention weights are stored.)
LLMs are impressive in their ability to regurgitate what they have learned, with a finitely-infinite amount of variability. This can make you believe the system is conscious or has cognitive abilities, an effect reminiscent of the “Uncanny Valley”, where a system appears to exhibit consciousness or cognition eerily similar to a human’s.
But it does not. It cannot ‘yet’ do well on many cognitive tasks that are natural to an 8-year-old.
‘Yet’ is the keyword here.