ChatGPT is no stochastic parrot. But it also claims that 1 is greater than 1.

Konstantine Arkoudas
21 min read · Jan 15, 2023


Introduction

Let me begin by stating without qualification that ChatGPT is a technological marvel and an extraordinary achievement. It is the first human-made system with the ability to carry out short conversations on arbitrary topics with a level of fluency and coherence that is nothing short of stunning. That’s a very significant milestone.

Fluency is the (relatively) easier of the two properties. It is not terribly hard to spout sentences that are syntactically well-formed but nonsensical, and until very recently all LLMs (Large Language Models, the technology underpinning ChatGPT) did just that with an amusing degree of regularity. Coherence is the real challenge — staying on topic, making sense, adhering to conversational maxims, and displaying a general understanding of how the world works. And here ChatGPT is light-years ahead of GPT-3, and probably any other AI model for that matter. This is remarkable, particularly when one considers what a short time it took to make such gigantic leaps and bounds. Hats off to OpenAI.

As a personal note, I have worked on and with LLMs for several years now. I’ve been impressed by their capabilities but have regarded them only as engineering tools that can help to solve specific problems (mostly NLP problems in my case). When it comes to more grandiose claims to the effect that LLMs will deliver strong AI (systems that are every bit as intelligent as humans or more), I’ve been firmly in the skeptic camp. That’s largely because such claims tend to be inextricably bound with very strong forms of computationalist theories of mind, according to which computation is necessary and sufficient for cognition. I still think that digital computers cannot possess genuine mental states or be conscious, and therefore will never attain real understanding. (For what it’s worth, I hasten to add that this position does not stem from religious dogma, dualism or crypto-dualism, a predilection for the occult, or even bio-chauvinism. There are compelling rational arguments for it that are perfectly consistent with materialism. For an introduction to some of these arguments, see the article Philosophical Foundations of Artificial Intelligence in the Cambridge Handbook of Artificial Intelligence.)

However, taking a cue from Turing, let’s bypass the thorny philosophical issues here and focus on a more precise operational/behavioral question: Could a computer program based on LLMs consistently exhibit domain-independent intelligent behavior, in such a way that observers of its inputs and outputs cannot reliably determine whether the outputs were generated by humans? I’ll call this question (I) in what follows.

There’s a good deal here that needs unpacking. What does it mean for a software system to be “based on’’ LLMs? Unlike GPT-k systems, ChatGPT is not quite a language model, as next-token prediction is not the only task on which it has been trained. It is based on a LLM (GPT-3.5), but it’s undergone supervised training, using RL (Reinforcement Learning) to improve its performance. If tomorrow another system comes along that makes use of LLM technology but also incorporates a number of other techniques, what criteria do we use to decide if it’s “based on” LLMs? Also, who are the observers and how many of them are there? After all, the Turing test is annual and some years saw programs (like Cleverbot) that fooled more than 30% of the judges, and sometimes more than 50% of them, simply because there were not enough judges. Also, do these observers interact with the system themselves or are the inputs chosen by others? If the former, how long do they interact with the system? If the latter, who chose the inputs and how many were there? And so on. These are important questions of methodology, but let’s brush them aside for now, assuming we somehow have reasonable enough answers to them. How might we answer question (I) then?

Personally, until ChatGPT came along, I would have unwaveringly answered it in the negative. I thought it was plain silly to think that these artifacts would ever approach anything resembling general intelligence. I worked with them professionally and would see them in the trenches, making all kinds of blunders, blunders so frequent and so flagrant that any objective rational observer would have to dismiss claims of potential general intelligence as a joke. Of course, in industry we work with smaller models (the economics of LLMs is a fascinating subject but will have to wait for another article), but even GPT-3, weighing in at 175B parameters, would easily go terribly astray.

To be an objective rational observer, by the way, one must steer clear of emotional biases. Some of my colleagues would see the same failings but would disregard them, simply because they wanted to think that deep learning is putting us on the verge of strong AI. A lot of people in tech grew up watching Star Trek, Star Wars, 2001: A Space Odyssey, Blade Runner, The Matrix, and so on, and reading science fiction; they are positively giddy at the prospect of a real-life HAL-9000. I'm not a Trekkie myself, but I suspect Mr. Spock would caution them that their strong feelings on the subject are clouding their judgment.

That said, I believe that ChatGPT is a game changer. While I’m still on the side of the skeptics, and while ChatGPT is still not an existence proof for a positive answer to (I), I now have to concede that it’s at least plausible that something like generally intelligent behavior may well emerge from a system based on LLMs. (Again, I speak of intelligent “behavior’’ instead of “intelligence’’ tout court to avoid questions that would take us too far afield.) I’ve come to this conclusion simply because I’ve tried out ChatGPT on a number of topics and, modulo some important exceptions and caveats that I’ll discuss shortly, it has struck me as already coming close to exhibiting generally intelligent behavior. As an objective rational agent, I have little choice but to update my priors in light of such new evidence. Moreover, when one considers the rate of progress in the field, it is no longer far-fetched to imagine that in a few decades, or possibly even sooner, something like ChatGPT will constitute a positive answer to (I).

Is ChatGPT a stochastic parrot?

In the wake of the controversial 2021 paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, it has become commonplace among critics of LLMs to refer to these systems as “stochastic parrots.” That characterization strikes me as a bit sloganish and glib, but it’s evocative. It’s intended to convey the rather dismissive position that a system like ChatGPT can only “cut and paste” material that it’s already seen during its training, just like a parrot can only repeat what it’s heard before, and that it has no real understanding of what it’s saying (again, just like a parrot has no idea what is signified by the sounds coming out of its syrinx).

I believe these are two very different claims that need to be teased apart, because they require very different types of arguments. I agree that systems like ChatGPT have no real understanding of the content they generate, although, again, that is a complicated question that is primarily philosophical and is not going to be settled by empirical evidence any time soon, certainly not by observing outputs generated by LLMs.

But the first charge, that LLMs are only doing cutting and pasting, is a different matter, as that is something we can assess now via a careful and fair-minded observation of a system’s behavior. And while that charge may have held some water against previous systems, including GPT-3, at this point it strikes me as both wrong and unfair. ChatGPT is much more than a stochastic parrot. It can generate novel propositional content and respond to arbitrary questions and scenarios coherently and informatively, and do so in ways that are often strikingly creative. It should be clear to any unbiased observer that it’s not simply serving up linguistic pastiches by stitching together things that people have said before, using mere statistical relationships (unless of course one defines said “stitching” in a very unorthodox way). It should also be clear that deep learning — in the technical sense of multiple layers of nonlinear transformations — is indeed causing the system to build increasingly higher-level representations in multiple interesting semantic dimensions.

The web is already full of fascinating examples of ChatGPT in action, so there’s no need to give many new ones, but here are just a few that I tried and found particularly impressive.

Early in the morning last Saturday (January 7), only a few hours after Kevin McCarthy was finally elected Speaker of the House, the Financial Times published a front-page report on the development. I took the first sentence of the article, fed it to ChatGPT and asked it what can be concluded about McCarthy from that sentence:

All answers were clever, informative, and well-presented, but I was particularly impressed by the answer to the party-affiliation question. The system said we can infer from the text that he’s a Republican because the text states that he received the support of most Republicans in the Congress, and then it immediately brought up the possibility that McCarthy might have changed party affiliations at some point in his career.

Let’s move on to a completely different topic:

This output is not perfect, but it’s rather impressive in that surely there’s nothing in ChatGPT’s training data that is specifically about Monday, July 16, 2029. The system entertained the hypothetical situation I posed to it (assuming that today is Monday July 16, 2029) and worked out that, on that supposition, the rather complicated temporal expression I gave it inside quotes must refer to the first Monday of August 2029, and it must pick out a time period within that day whose exact boundaries depend on how one defines “late afternoon.” The system did not just haphazardly stitch together various relevant phrases from its corpus. It answered the question correctly and even pointed out the ambiguity in the phrase “late afternoon.” The only imperfection here is that the system claims that the specific date, as well as the time, depends on the definition of “late afternoon,” which is of course false. How one chooses to understand late afternoons will have zero bearing on the date denoted by the given phrase. (And indeed, if we press that point then ChatGPT starts treading on increasingly thin ice.)
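For what it's worth, the calendar arithmetic here is easy to pin down mechanically. Below is a minimal Python sketch of my own (not anything ChatGPT produced), assuming, as in the exchange above, that "today" is Monday, July 16, 2029 and that the quoted phrase resolves to the late afternoon of the first Monday of the following month:

```python
import calendar
from datetime import date, timedelta

# Assumption from the exchange above: "today" is Monday, July 16, 2029.
today = date(2029, 7, 16)
assert today.weekday() == calendar.MONDAY  # sanity check: it really is a Monday

# First Monday of the following month (August 2029).
first_of_august = date(2029, 8, 1)
offset = (calendar.MONDAY - first_of_august.weekday()) % 7
first_monday = first_of_august + timedelta(days=offset)
print(first_monday)  # 2029-08-06

# "Late afternoon" has fuzzy boundaries; one common reading is roughly 4-6 pm,
# which is exactly the ambiguity ChatGPT flagged.
```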

The system even seems to have a rudimentary "theory of mind" (a technical term in cognitive science and psychology), insofar as it seems able to correctly predict the beliefs that someone is likely to have about other people's mental states. Here is a version of a false-belief task (also known as a Sally-Anne task) that ChatGPT passes with flying colors:

Taking another sharp turn, let’s see how ChatGPT does with a bit of Nashville heartbreak:

Well done. However, a somewhat different formulation just a day later resulted in a pretty different — and markedly worse — response:

ChatGPT has failed to realize that the singer is being ironic. There is no "perceived status" as a "leader", nor does the singer come across as having (or believing themselves to have) any "position and power." Here it seems that ChatGPT is struggling to compose an accurate and coherent interpretation of the lyrics. It gets bits and pieces right, as well as the general flavor, but clearly misses some aspects. It's notable that it also makes some elementary grammatical errors ("the person have the position and power", "it's not fulfill the person") and even a punctuation error (ending that sentence with a comma instead of a period).

But another formulation one day after that resulted in even more impressive output:

It’s well known of course that most of the pain and misery in country songs is due to romantic disasters:

ChatGPT even seems to get pragmatics:

This is remarkable, because there’s nothing in the syntax or truth-conditional semantics of the waiter’s utterance that indicates its actual meaning. The meaning is derived largely from pragmatic considerations emerging from the context of the utterance.

It even seems to understand our web searches:

It goes on to elaborate quite nicely on the meaning of “average distance”:

Not a parrot, but no owl either

So there’s a lot to celebrate here, and clearly there are plenty or reasons to be optimistic about the future of LLMs. As the joke goes, however, an optimist sees the donut whole while a skeptic sees the donut hole:

Reading that response I could almost hear a giant hissing sound — all the common sense rushing out of ChatGPT. Yes, this was a bait question (an “infelicitous speech act” as linguists might say), but that’s the sort of question to be expected in a Turing-test-like setting.

How about folk physics? If I ask ChatGPT for advice on how I can get a cue ball to roll back after it hits another pool ball, it gives a good tutorial on how to apply backspin:

Very well, but I was intrigued by the trailing qualification (“if you have access to one”):

It’s clear that ChatGPT is taking generic advice seen during its pretraining about how to position one’s body for a pool shot and regurgitating it in an inappropriate context (where the ball is positioned on the floor, and therefore standing up with one’s knees slightly bent would definitely not allow one “to take a shot comfortably and accurately”). It also seems to be making things up about cue balls with spinning devices in them (and then later claiming that the button or switch for controlling these devices may be located inside the cue ball, and that every time one wants to flip the switch or press the button in order to use the device, they need to unscrew the cue ball, flip the switch, and then screw the cue ball back together).

Indeed, much of the negative publicity that ChatGPT has received since it was released revolves around “hallucination” (although “confabulation” might be a more accurate — if harsher — term): It fabricates facts, making things up out of whole cloth. For example, while checking to see how well ChatGPT can navigate the naïve qualitative physics of playing pool, I brought up the idea of double arm amputees who have managed to become professionally competitive pool players. ChatGPT produced a reasonable response, basically saying that this is challenging but possible, as there are people who play pool with their feet (and some even with their mouths). But things went south when I asked for examples of such people:

In fact, Jeanette Lee is a professional pool player, but she enjoys full use of both of her arms. I’m not sure who Tom D’Alfonso is, but he appears to be a minor pool player, also seemingly blessed with two fully functional arms.

Nevertheless, I don’t think that’s a show stopper for systems like ChatGPT. The system will surely improve with time, and as it does, empirical confabulation will be significantly reduced. Marrying such a system with a a web search engine and/or a large knowledge graph should help (Microsoft is expected to do just that over the coming months with Bing and ChatGPT, and you.com has already released a first version of a system that combines a chatbot with a search engine), though I’m sure there are many other improvements forthcoming. But also, I don’t see factual errors of this sort as terribly egregious. Such errors can of course be very serious, and may even involve legal questions of liability, but conceptually the underlying propositions here are contingent, not necessary. There is nothing inconceivable about the proposition that someone is a double amputee. Someone may be (thankfully) sound of limb, but things could have been otherwise in some other possible world. It’s an empirical question. I’m much more interested in propositions that are necessarily true but which ChatGPT nevertheless denies, or conversely, propositions that are necessarily false but which ChatGPT nevertheless affirms. Such propositions represent a much more pressing challenge. To see examples of these, we need to turn to mathematics and logic, and to examine ChatGPT’s ability to reason properly — to draw sound inferences, the conclusions of which could not possibly fail to hold if the premises that underwrite them are true. Such ability will be indispensable for any general-purpose intelligent assistant, particularly in science and engineering (this includes coding), where normatively correct reasoning is of paramount importance.

Let’s start with elementary reasoning and see how ChatGPT fares on simple syllogisms, albeit with a bit of a twist that would not faze a human at all:

ChatGPT starts off on the right foot by correctly recognizing that Alberto is Italian. Unfortunately, it does so at least partially for the wrong reasons. And once we start probing deeper and subjecting ChatGPT to a bit of an interrogation, it quickly starts digging itself into a hole, until it becomes obvious that it has no concept of logical soundness:

So while ChatGPT can describe modus ponens (perhaps the most fundamental inference rule underlying deductive reasoning), it’s evidently not able to apply it properly.
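For reference, modus ponens is simply the rule that lets us detach the consequent of a conditional once its antecedent has been established:

```latex
% Modus ponens as an inference-rule schema:
% from P -> Q and P, conclude Q.
\[
  \frac{P \to Q \qquad P}{Q}
\]
```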

In my experience this is a recurring issue with ChatGPT. Even if its initial response to a question that requires reasoning happens to be correct, once we start engaging it in dialog and asking it to explain or justify its reasoning, it quickly paints itself into a corner and ends up making claims that are either nonsensical or blatantly incorrect. Here is another example:

Once again, ChatGPT starts out well but quickly ends up generating nonsense, like the claim that there must exist at least one number that is not prime, and this number could be prime itself, but it must be different from all of the prime numbers, since it is not true that every number is prime.

Let’s now see how ChatGPT fares on a version of the famous Wason selection task:

ChatGPT does not grasp the semantics of the material conditional and flunks the test. The cards that need to be turned over in order to determine the truth value of the proposition are the ones showing 17 and red. (The proposition will fail if and only if (a) the 17 card has a non-yellow color on the other side; or (b) the red card has a prime number on the other side. The other two cards cannot possibly lead to a negative verdict.)
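To make the logic explicit, here is a small Python sketch of my own that works out which cards are worth turning over. The setup is reconstructed from the description above (the exact value of the non-prime card is hypothetical), and the rule is assumed to be "if a card shows a prime number on one face, its other face is yellow":

```python
# Each card has a number on one face and a colour on the other; we only see one face.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def must_turn(visible) -> bool:
    """A card must be turned over iff some hidden face could falsify the rule."""
    if isinstance(visible, int):
        # Hidden face is a colour; the rule can only fail if the visible number is
        # prime and the hidden colour turns out not to be yellow.
        return is_prime(visible)
    # Visible face is a colour; hidden face is a number. A yellow card satisfies the
    # rule no matter what number is hidden; any other colour fails if a prime is hidden.
    return visible != "yellow"

cards = [17, 16, "yellow", "red"]   # 16 is a stand-in for the non-prime card
print([c for c in cards if must_turn(c)])  # [17, 'red']
```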

Then again, in all fairness, most humans fail Wason selection tasks as well. A charitable interpretation of such failures, both for ChatGPT and for humans, is to argue that most subjects are not interpreting the “if” in its logico-mathematical sense, as a material conditional.

However, most humans do pass such a reasoning test when the task is formulated against a social backdrop, particularly in the context of enforcing social rules and contracts. For instance, most people would likely pass the following formulation of the task, even though it is logically equivalent to the preceding one. Remarkably, ChatGPT does even worse with this formulation:

This is an epic fail, and a stark illustration of the differences between human reasoning and LLMs.

Let’s move on to spatial reasoning and look at a simple seating puzzle:

It’s quite easy for a human to come up with a solution after a minute or so of experimentation. Since p2 needs to be farther from the middle, let’s start by seating p2 first, followed by p4 (in order to make p2 and p4 adjacent), followed by p3 in the middle (so it’s definitely closer to the middle than p2), followed by p5 and then p1, which ensures that p5 is flanked by p1 and p3:
p2 - p4 - p3 - p5 - p1.
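As a sanity check, here is a brute-force Python sketch of my own. The constraints are a reconstruction of what the walkthrough above implies (the original wording is in the screenshot): p3 sits closer to the middle seat than p2, p2 and p4 are adjacent, and p5 is flanked by p1 and p3:

```python
from itertools import permutations

people = ["p1", "p2", "p3", "p4", "p5"]
middle = 2  # index of the middle seat in a row of five

def ok(seating):
    pos = {p: i for i, p in enumerate(seating)}
    closer = abs(pos["p3"] - middle) < abs(pos["p2"] - middle)
    adjacent = abs(pos["p2"] - pos["p4"]) == 1
    flanked = abs(pos["p5"] - pos["p1"]) == 1 and abs(pos["p5"] - pos["p3"]) == 1
    return closer and adjacent and flanked

solutions = [s for s in permutations(people) if ok(s)]
print(solutions)                                  # includes ('p2', 'p4', 'p3', 'p5', 'p1')
print(all(s[middle] == "p3" for s in solutions))  # True: p3 is always in the middle
```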

Every single response given by ChatGPT here is wrong, from the very start (it's easy to prove that p3 must be in the middle). It then recognizes that its proposed solution is wrong and "fixes" it by making another wrong proposal:

After venturing a third wrong guess, ChatGPT gives up and makes the equally incorrect conjecture that the problem is unsolvable.

In fact, when it comes to spatial layout and ordering relations in general, ChatGPT is exceedingly inept. It has no notion of left, right, up or down, before or after, and cannot even count with ordinals:

How about straightforward scheduling problems, based on a simple notion of “before”?

Everything about this is wrong. There are 9 different orderings satisfying the constraints, and half of the solutions given by ChatGPT are incorrect (namely, orderings 4 through 6, as they violate the second constraint).

Let’s move on to some more properly mathematical examples. How much does ChatGPT know about numbers?

Well done — once again we’re off to a great start. But alas, here’s what comes immediately afterwards:

Here the situation seems somewhat reversed, in that the reasoning is largely sound but the conclusion is nevertheless incorrect, as if ChatGPT is not quite aware that “a is smaller than b” is logically inconsistent with “a > b”.

Let’s try an even simpler question: Consider a = 2743736341 and b = 2743735341. Which is greater, a or b, and why? If we compare the decimal representations of a and b carefully, we see that they have the exact same number of digits and contain identical digits in all positions except for the fourth position from the right, which is easily seen by lining up the two decimal representations:

Thus, a is greater. This is a trivial question for a human being, but because the numbers are quite large, a system like ChatGPT cannot have seen them during training, so, to answer correctly, it must be able to reason about the relationship between magnitude and positional representations of numbers in a robust and general way. Clearly, it doesn’t:

The conversation ends with ChatGPT justifying its conclusion on the grounds that “1 is greater than 1.”
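For contrast, the comparison itself is purely mechanical. Here is a minimal Python sketch of the digit-by-digit reasoning described above (my own illustration, not ChatGPT's procedure):

```python
a, b = 2743736341, 2743735341
da, db = str(a), str(b)
assert len(da) == len(db)  # same number of digits, so compare digit by digit
for i, (x, y) in enumerate(zip(da, db)):
    if x != y:
        # The most significant position where the digits differ decides the order.
        print(f"they differ at position {len(da) - i} from the right: {x} vs {y}")
        print("a > b" if x > y else "a < b")
        break
```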

For what it’s worth, other LLM-driven systems do not do any better. Here is how YouChat (by you.com) fares on the same question:

Let’s move on to another simple math question:

ChatGPT doesn’t seem to understand that squaring always yields positive numbers (and therefore we’ll never have to take the principal square root of a negative number). Incidentally, the issue is not that ChatGPT doesn’t have the concept of a principal square root:

Let’s turn our attention to the set of real numbers. That set, of course, is equinumerous with — has the same cardinality as — the irrational numbers, which means that, by definition, there exists a bijection from one set to the other. Coming up with such a bijection is not trivial, but let’s see what ChatGPT has to say about it:

This is wrong — the function f that ChatGPT has proposed is not a bijection. (The function is also bizarrely defined for irrational numbers, but let’s put that aside.) The function maps every rational number to itself and every irrational number x to 2x, which is another irrational number. So it does not yield a bijection between irrationals and reals. Let’s ask ChatGPT a simple question about its own definition:

That’s a correct answer, but let’s now make a more fundamental (if pedantic) point:

So this conversation ended up with ChatGPT making the stunningly incorrect claim that there’s no bijection between the irrationals and the reals. (This is stunning not just because the existence of such a bijection is a fundamental fact of mathematics, but also because, on a related note, this is surely the kind of content that ChatGPT must have seen plenty of times during its training; yet it is not able to recognize that it’s making a statement that is logically inconsistent with what it saw during training.)
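For the record, such bijections can be written down explicitly. Here is one standard "Hilbert hotel" construction (my own illustration; it was not part of the conversation): enumerate the rationals as q_0, q_1, q_2, … and set t_n = √2 + n, a sequence of distinct irrationals. Then define

```latex
% A bijection f from the reals onto the irrationals:
% the rationals and the t_n are shuffled injectively into the t-sequence,
% and every other real number is left fixed.
\[
  f(x) =
  \begin{cases}
    t_{2n}   & \text{if } x = q_n \text{ for some } n, \\
    t_{2n+1} & \text{if } x = t_n \text{ for some } n, \\
    x        & \text{otherwise.}
  \end{cases}
\]
```

Every irrational is hit exactly once (the t's by the first two clauses, everything else by the third), and no rational is ever in the image, so f is indeed a bijection from the reals onto the irrationals.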

I then checked to see if ChatGPT could realize that strong induction would be challenging to apply to an uncountable domain, particularly one without a natural ordering relation, such as the complex numbers (which do not form an ordered field), but that conversation ended up with the following:

Mathematically literate readers will have a hard time deciding where to begin to take issue with this series of claims. In addition to butchering the math, ChatGPT confabulates an empirical claim suggesting that the use of strong induction to prove statements about “all complex numbers” is a widely used technique.

Conclusions

Where does all that leave us? We’re still very far from general AI (and it’s still an open question whether digital computers will ever get us there), but ChatGPT is without question a turning point. It hasn’t quite cracked common sense, but it has made astonishing progress, to the point where the snarky epithet “stochastic parrot” is no longer warranted in my view. For what it’s worth, personally I now find it plausible — though still unlikely — that this line of work could ultimately deliver a system that provides an existence proof for an affirmative answer to question (I).

The Achilles heel of all such systems is reasoning — math in particular and logic in general. This is consistent with the general consensus that’s already emerged, and this article has given many examples of the range and severity of the problems on that front. The marriage of deep learning and reasoning is a very active and rapidly evolving line of research (new results and systems seem to be coming out every few weeks), but there are fundamental questions here about whether we’re trying to fit a square peg into a round hole. Reasoning is underwritten by norms, and in that sense it’s fundamentally different from perception or even language understanding and generation, which can perhaps piggyback on statistical relationships gleaned from enormous text corpora. At any rate, we live in interesting times and the next few years will surely see exciting results in this area.

In the meantime, I suspect that ChatGPT will be used primarily for creative/artistic/marketing tasks, not for tasks that involve rigorous prescriptive notions of correctness or require deep analysis, or even tasks that simply place a high premium on factual accuracy. In particular, with the possible exception of standard and boring boilerplate code, ChatGPT will most definitely not replace human software engineers, at least not anytime soon. LLMs can help such engineers be more productive, particularly in bootstrapping new applications, by acting as a sort of powerful autocomplete tool, but helping engineers is very different from replacing them. Even in the autocomplete capacity, care must be taken to ensure that the LLM suggestions are correct. It’s not going to be a best practice to blindly accept dozens of lines of code from a system that claims 1 > 1.

In the creative/artistic realm, however, there are likely to be problems revolving around IP and copyright that will almost certainly put a damper on the use of ChatGPT-like systems. In addition, systems like ChatGPT are undergoing an ever-increasing degree of corporate sanitizing, whereby the text produced by these systems has to adhere to increasingly stringent standards of social acceptability. This does not bode well for creativity.

I think we can all agree that this pretty much rules out the use of ChatGPT in Hollywood. And don’t even think of asking ChatGPT for a joke involving a German, an Italian, and an Irishman. ChatGPT does not do ethnic stereotypes.

Similar constraints are likely to be imposed on political content and other sensitive topics (and these days just about everything seems sensitive). Ultimately, there may be so many programmatic shackles placed on ChatGPT that the only thing it can do is turn out bland, sterile copy filled with platitudes and reminders to eat your vegetables and wear your seat belt. As always, the disruptive applications that will push boundaries (for better and for worse) will most likely happen if and when LLMs become a technology of scale that escapes into the wild, outside of corporate control. Given the tremendous infrastructural cost of training and using LLMs at the scale of ChatGPT, this will not happen anytime soon.
