GPT-4 Can’t Reason

Konstantine Arkoudas
45 min read · Aug 7, 2023


1. Introduction

In early January I wrote a commentary presenting an informal evaluation of ChatGPT across a broad range of subject areas: conventional NLU, folk physics, information retrieval, pragmatics, theory of mind, spatial inference, simple logical reasoning, and math. The key takeaways were that ChatGPT was a seminal breakthrough; that LLM-based systems are not mere stochastic parrots but build genuine abstractions and can exhibit creativity; that such systems will enable a large array of new and exciting applications; and that, despite all of the above, these systems are still severely limited in their reasoning abilities.

GPT-4 was released a couple of months after that, delivering very substantial improvements across the board. I remain impressed and excited by the general capabilities and potential of LLMs, and I have little doubt that their performance will continue to improve. Nevertheless, there are increasing grounds for skepticism concerning their reasoning abilities. In this article I will argue that the best LLM at this time, GPT-4, is utterly incapable of reasoning, in spite of its sporadic displays of ingenuity.

I will largely steer clear of the much broader — and more vague — debate about whether LLMs in general are capable of (consistently robust) reasoning, but a few brief remarks will help to set the stage and clarify why it makes sense to restrict attention to a specific LLM. On one side of that broader debate, rosy predictions by LLM enthusiasts rely excessively on ever-changing scaling “laws” that rest on flimsy empirical evidence and on a host of questionable modeling assumptions, ill-understood concepts (such as “emergent” LLM properties [1]), and a somewhat dogmatic belief that minimizing cross-entropy loss on next-token prediction over a huge corpus will deliver a general reasoning engine via the magic of transfer learning and the construction of generic higher-level representations.

On the other side of the debate, while LLM skeptics have serious arguments to make, those arguments are mostly a priori and somewhat vague (for instance, that LLMs lack “a model of the world”), and I do not think they settle the question. In my view, the most compelling a priori considerations against the plausibility of reliably robust LLM reasoning turn on computational complexity results. Reasoning is a (very) computationally hard problem. In fact, in the general case (first-order or higher-order logic), it is algorithmically undecidable, i.e., every bit as unsolvable as the halting problem. Thus, by Church’s thesis, we cannot expect any algorithm, LLMs included, to solve arbitrary reasoning problems in a sound and complete way. (Or with perfect precision and recall, to put it — more loosely — in ML-like terms.) But even “easier” classes of reasoning problems [2] typically have either exponential or at least nontrivial polynomial-time complexity profiles. Problem classes that have linear-time inference algorithms, such as Horn clauses over literals, are rarely expressive enough. This tradeoff between generality and expressivity on the one hand and tractability on the other means that no LLM, no matter how large or how extensively and cleverly trained and tuned, will ever be able to crack an arbitrary reasoning problem. And this is consistent with the famous “no free lunch” theorem of machine learning, which points to a similar inverse relationship between model generality and performance.

But LLM advocates can make a couple of cogent counterpoints, while granting that there will never be an AI oracle that can essentially solve the halting problem. First, they can point out that even though a problem might have high worst-case asymptotic complexity, it might still be solvable well enough in practice. Unlike random instances, real-world instances of reasoning problems (and indeed real-world instances of most computationally hard problems) appear to have structure that allows clever algorithms to tackle them effectively [3]. There are many examples here, from the simplex algorithm for linear programming and SAT solvers to term unification algorithms and even automatic theorem provers for full first-order logic. All of these problems are hard (having at least exponential-time worst-case complexity), yet somehow we have algorithms for them that seem to work successfully on a wide variety of inputs.

Second, and perhaps more important, we need not aim for an oracle anyway. Humans are not oracles either, nor do they seem to follow any particular algorithm that captures any one specific class of reasoning problems. The ability of humans to reason is much more fluid and messy, but impressive nevertheless. Is it impossible to build something like an LLM-based system with the reasoning ability of a well-trained engineer of average intelligence (which perhaps can then become even more intelligent and better trained by an endless process of learning and improvement)?

I don’t think that building such a system can be ruled out on a priori grounds (and here I differ from hard-core AI skeptics). I think it’s implausible, for a number of reasons [4], but ultimately this strikes me as an empirical question that must be decided on a case-by-case basis, by subjecting a specific system to testing, i.e., by interrogating it, probing it, and analyzing its responses. And the case I will consider here is that of GPT-4, which appears, by all accounts, to be the most capable LLM at present.

There are two questions that must be addressed before we proceed. First, we must agree on what reasoning is, and second, we must say something about methodology. The next section contains a brief discussion of reasoning, but for those who wish to skip that section and dive right into the problems, the upshot is that we’ll focus on (a liberal conception of) deductive reasoning. Regarding methodology, just like the January piece, my evaluation here is not based on a corpus or set of corpora. Instead, I present a detailed qualitative analysis of GPT-4’s performance on 21 simple reasoning problems across a wide range of areas, most of which have been made up from scratch, while the rest (such as Wason’s selection task) have been manually tweaked so as to make them less recognizable to the model.

This is done partly to avoid data contamination, which is a serious problem affecting corpus-based evaluations. Given how little we know about the training regimen of ChatGPT, it is impossible to know for sure whether any existing dataset or problem has effectively been “seen” by the model during its pretraining or subsequent alignment, whether we’re talking about NLP datasets, medical licensing exams, Python programming problems, LSAT or bar-entrance exams, SAT or GRE tests, and so on. (According to the analysis carried out by the lm-contamination index, well-known NLP datasets such as Squad, CoNLL03, MNLI, and others, are indeed contaminated, while several others are at best suspicious.) The qualification “effectively” is important, because even though a specific problem might not have been seen in its exact form (in a string-matching sense), an essentially equivalent variant with a different surface formulation might well have been. Hence, simple contamination tests based on substring checks, such as those carried out by OpenAI in their GPT-4 Technical Report (posted in March 2023), are not sufficient to guarantee lack of contamination. In fact, the substring checks carried out by OpenAI were not even applied on the entire problem instance, only on 3 randomly selected substrings of 50 characters each. This is not enough to ensure disjointness for long (or even moderately long) problems, which are quite common in tests like the UBE (Uniform Bar Exam).

The absence of a large corpus makes the discussion more qualitative rather than quantitative. However, the results are arguably more informative than a numeric metric computed over a corpus, for a number of reasons. First, because contamination can be ruled out conclusively; second, because the problems span a large gamut of areas; and third, because a qualitative discussion of a problem allows for greater depth of analysis and more context in which to interpret the results. By contrast, the only way to perform a truly informative quantitative evaluation is to come up with a brand new corpus that satisfies all of the following criteria: (a) originality; (b) uniformly high quality; (c) sufficiently large size; and (d) diversity (not being limited to one type of task only). This is a very challenging undertaking. Even then, a few simple numeric metrics on a brand new dataset might not be particularly illuminating. Are the numbers measuring the right things? Do we even know the right things to measure? Is there an appropriate backdrop in which the numbers can be understood? For deeper insight, we need to put individual examples under a magnifying glass.

This is particularly important because we need to scrutinize the explanations (“chains of thought”) generated by a reasoner. Unfortunately, almost all reasoning corpora comprise either multiple-choice questions or binary classification problems (e.g., “Does sentence p2 follow from premise p1, yes or no?”). Why? Mostly because it is easy to mechanically evaluate model performance on such datasets. But even in the absence of contamination, this type of test set runs the serious risk that the LLM will manage to pick the right answers by latching on to spurious statistical regularities, i.e., to arrive at the right answers for the wrong reasons [5]. Adversarial augmentation of an existing dataset might help, especially if we know what we are trying to guard against, but unless an adversarial version restores near-random performance, this can quickly devolve into a game of whac-a-mole, where we detect a new round of bogus regularities exploited by the model and must undertake a new round of adversarial interventions.

Ultimately, there is really no proper way to assess the reasoning ability of a system unless we ask it to explain its output. This is an essential part of reasoning, which is not about producing the right answer by hook or by crook but about deriving the right answer for the right reasons. And rote metrics like ROUGE are not fit for purpose here. We need to roll up our sleeves and analyze LLM explanations and proof attempts manually. We also need to gauge their performance in a dialog setting (e.g., what happens when a reasoning error is pointed out to them?). This is the sort of analysis undertaken in this paper. I believe the results show unequivocally that GPT-4 cannot reason. The errors are too pervasive and too egregious. GPT-4 doesn’t solve even one of the 21 problems discussed here. But much more concerning are the fundamentally flawed explanations and proof attempts it produces along the way.

LLM believers will probably demur: But humans also make mistakes, and surely we’re not prepared to say that humans can’t reason just because they make mistakes? First, it is not accurate to say without qualification that “humans can reason,” certainly not in the sense that we can randomly pluck any person from the street and expect them to reliably perform normatively correct reasoning. Most neurobiologically normal humans have the capacity to become proficient in reasoning, but actually attaining such proficiency takes significant training and discipline. Humans are known to be susceptible to a large assortment of cognitive biases, which can only be overcome by rigorous instruction. Focusing on the reasoning skills of untrained people is a bit like focusing on the singing skills of the general population. Everybody sings in the shower, but without formal training (or at least exceptional talent) the results are usually regrettable.

Of course, even sophisticated human reasoners make mistakes, just like trained singers can hit false notes. But if a human made these mistakes, the ones reported in this article, then I would conclude without any hesitation that they cannot reason. Even if they went on to list a large number of other examples demonstrating impeccable reasoning, I would suspect that other factors (such as rote memorization or cheating) were behind the performance discrepancy. For the mistakes reported here are not performance mistakes, the sort of innocuous errors that humans might make — and promptly correct — when they are careless or tired. If a human made these mistakes, and made them consistently under repeated questioning, that would indicate without doubt that they don’t have the necessary logical competence, that they lack fundamental concepts that are part and parcel of the fabric of reasoning, such as logical entailment and set membership. And I would certainly not entrust that person with generating reams of Python or Javascript code for an enterprise. Nor would I start organizing international conferences to investigate how their reasoning prowess might threaten humanity with extinction.

2. What is Reasoning?

Reasoning is not quite the same thing as intelligence, but it’s a necessary ingredient for it. Broadly put, reasoning is the process of drawing and evaluating conclusions from a given body of information. More precisely, it is the process of making and — more importantly — justifying arguments. An argument consists of a conclusion (the argument’s upshot, so to speak) and a set of premises from which the conclusion is derived. Premises represent information that is taken as given, if only provisionally, for the purposes of the argument. The conclusion and the premises are typically declarative sentences (expressed either in natural language or in the notation of a symbolic logic) that can be true or false, but they may also be represented by alternative notational devices, such as diagrams. We say that a set of premises S logically entails (or logically implies) a conclusion p iff p is true whenever all the sentences in S are true, in which case the argument is said to be valid. This means that it’s logically impossible to have a state of affairs in which every element of S holds but p does not. This key logical relationship is a lynchpin of human reasoning [6].

Valid deductive arguments (whose conclusions are entailed by the premises) are said to be analytical (or sometimes tautological), insofar as, technically speaking, they convey no information [7]. This idea is also sometimes expressed by calling such arguments non-ampliative, meaning that there is no information contained in the conclusion that is not already contained — if only latently — in the premises. Deduction is the process of making and justifying non-ampliative arguments.

Deductive arguments are typically justified by proofs, which are sequences of inference steps, each of which applies an inference rule to a number of premises and/or results of previous steps and derives a new result. The last step derives the final conclusion of the proof. An inference rule may be low-level and easy to apply or higher-level and computationally expensive. But all inference rules are required to be sound (or truth-preserving), that is, they must ensure that if the inputs are true then so is the output. All mathematical proofs are deductive, and mathematical reasoning in general is predominantly deductive [8].

The conventional view is that some arguments are ampliative, meaning that the conclusion is not quite entailed by the premises. In other words, it is possible for the premises to be true while the conclusion is false. These are typically subdivided into inductive and abductive arguments, although some authors view induction as a species of abduction, and even more authors view abduction as a species of induction. (Several other types of reasoning are often discussed in the literature, such as analogical reasoning (which includes, for instance, case-based reasoning), Bayesian reasoning, causal reasoning, and so on, but these are usually subsumed under one of the three main categories I have described, most often under induction. But there is no consensus; some thinkers, from Aristotle to recent authors, have tried to assimilate analogical reasoning under deduction, for instance.) There is no rigorous definition of either, but roughly, the premises of a good inductive argument make its conclusion likely, though never quite certain (in contrast to deduction, where the truth of the premises guarantees the truth of the conclusion). Induction can generate specific conclusions from all kinds of premises (specific or general), but often it proceeds from specific individual observations o_1,…,o_n to a more general hypothesis H that subsumes the individual o_i in some sense (for instance, H may be a universally quantified sentence and the o_i could be instances of that sentence).

Much of what ML algorithms do can be viewed as inductive reasoning. For instance, a linear-regression algorithm might take as input n datapoints about car models, where each data point is of the form d_i = ((c_i,h_i,y_i),m_i) for i = 1,…,n, where c_i is the number of cylinders for the ith car model, h_i is the horsepower, y_i is the model year, and the dependent variable m_i is the mpg (miles per gallon). And it might produce as output a formula like m = w_1 * c + w_2 * h + w_3 * y + b, which predicts the mpg of a car model from its number of cylinders, horsepower, and model year. (We are assuming of course that the car model whose mpg we are predicting was not included in the given data, otherwise there would be no prediction or generalization involved.) Here w_1, w_2, w_3, and b are specific numbers (weights) representing a hyperplane that minimizes the mean squared error for the input data (meaning that the hyperplane determined by these weights might not fit the n datapoints perfectly, but it does so better than the hyperplane determined by any other set of weights). (The training of deep neural networks, too, works by trying to discover values for various weights that are “optimal” for a given training dataset (in that they minimize loss), except that in their case the relationship between the inputs, outputs, and weights can be much more complicated (non-linear) and the training algorithm might not converge to the optimal weight values.)
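To make this concrete, here is a minimal Python sketch (with made-up numbers; the variable names mirror the notation above and are otherwise illustrative) that fits such a hyperplane by ordinary least squares and uses it to predict the mpg of a held-out car model:

import numpy as np

# Made-up training data: each row is (cylinders, horsepower, model year).
X = np.array([[4, 90, 2015],
              [6, 150, 2012],
              [8, 220, 2010],
              [4, 110, 2018],
              [6, 180, 2016]], dtype=float)
m = np.array([32.0, 24.0, 17.0, 30.0, 22.0])   # observed mpg values

# Append a column of ones so the bias b is learned along with w_1, w_2, w_3.
A = np.hstack([X, np.ones((len(X), 1))])
w1, w2, w3, b = np.linalg.lstsq(A, m, rcond=None)[0]

# Predict the mpg of a car model that was not in the training data.
c, h, y = 4, 100, 2017
print(w1 * c + w2 * h + w3 * y + b)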

The main distinguishing feature of abductive reasoning is a strong emphasis on explanation. Abduction consists mostly in making and justifying arguments that explain a set of facts. If one day I come home early from work and I see a plumber’s van parked in my neighbors’ driveway, I might conclude that my neighbors are having some plumbing work done in their house. The premise here is “There is a plumbing van parked in my neighbors’ driveway” and the conclusion is “My neighbors are having plumbing work done in their house.” This is sometimes called “inference to the best explanation,” because the conclusion serves to explain the premise(s). This is also a form of ampliative reasoning — the conclusion does not follow logically from the premises. There are many alternative explanations of a given set of facts or observations (perhaps a plumber parked there temporarily, or the neighbors bought the van, or the neighbors have a plumber friend who is making a social visit, and so on). A good abductive inference will yield a hypothesis that has more explanatory value than competing hypotheses. But how exactly to measure the quality of an abductive piece of reasoning is an open question. (Some desired properties of explanations are obvious. Truth is one of them — a good explanation cannot be based on a false hypothesis. But other desired properties, such as parsimony and generality — explaining as much as possible while assuming as little as possible — are much harder to explicate.) Note that it doesn’t take a large leap of imagination to view induction as a form of abduction. Observing a large number of black (and only black) swans and then conjecturing that all swans are black could be seen as abductive reasoning, as the conclusion “for all x, if x is a swan then the color of x is black” would explain all the observed data. Linear regression can also be seen as the making of an abductive hypothesis, as can (much more generally) Maximum Likelihood Estimation, a principle that underlies many ML algorithms and is often associated with induction.

All of the above is received wisdom, but it’s worth mentioning that there have been thinkers, called “deductivists” (ranging from philosophers such as Popper and Musgrave to statisticians such as Fisher), who contend that deduction is the only real form of reasoning there is, insofar as it’s the only one for which we have a rigorous and properly understood formal notion of validity; and that other (ampliative) arguments are best understood as reconstructed deductions, typically as enthymemes (arguments that omit tacitly understood premises). I find that position congenial [9], but venturing into that discussion would take us too far afield. For present purposes it suffices to say that we will focus on deduction, because it is the type of reasoning that underpins most logico-mathematical thought and for which we have clear normative standards of evaluation.

An important note: I view the discovery and justification of particular models (including counterexamples and countermodels in general) as part and parcel of reasoning. This is not a controversial view; some cognitive scientists view models and associated cognitive processes as the fundamental ingredients of human reasoning. In addition, however, I view model-based reasoning as at least partly deductive, because even though the actual process of discovering models might not be a process of deduction [10], its outcome is a claim (namely, that a given interpretation satisfies a set of premises) that can be verified or falsified deductively, taking as premises the definition of the model itself and possibly other general knowledge about the model’s domain.

Indeed, I will consider even computation as a form of deduction, because a particular computation can be naturally regarded as a deductive derivation of a conclusion of the form f(e_1,…,e_n) = v, where f(e_1,…,e_n) is the application of an arbitrary function f to arbitrary argument expressions e_1,…,e_n, ultimately yielding value v as the result. The premises for the derivation consist of the definition of f and possibly other auxiliary functions, along with the usual equational axioms (reflexivity, symmetry, transitivity, and functional/relational congruence) [11].

3. Test Problems

This section will start with the usual caveat: GPT-4 is a nondeterministic system that might produce different answers on different runs, even with the same parameter settings. All of the following exchanges with GPT-4 have been transcribed verbatim, and in my experience the errors discussed here tend to be robust, but it’s conceivable that for a given example GPT-4 might generate a different output even in response to the exact same prompt. (In addition, of course, different versions of GPT-4 might get deployed at any time.)

To prevent an already very long article from becoming even longer, this section simply lists the problems but does not show GPT-4’s responses. The full transcripts can be found in an online preprint version of this article.

Note: Screenshots showing the full detailed interactions with GPT-4 for all of the following problems were added in a new “Screenshots” section (towards the end of this document) on August 9, 2023, along with exact GMT timestamps taken from OpenAI logs. A new “Postscript” section (at the very end of the article) was also added on the same day.

3.1 Simple Arithmetic

The ability to perform basic arithmetic is a necessary ingredient for general-purpose reasoning, particularly for science and engineering applications. GPT-4 is still unable to reliably perform elementary arithmetic operations such as addition and multiplication.

To ensure that GPT-4 isn’t falling back on rote memorization, we can ask it to first select two random integers in a range of our choice and then perform the desired operation on the selected values:

Select two random numbers between 1381 and 1453 and multiply them together, reporting the result.
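For reference, the ground truth for any such instance is trivial to obtain mechanically; the one-liner below (a sketch, not part of the original prompt) draws two numbers from the same range and reports their true product, against which GPT-4’s answer can be checked:

import random

a = random.randint(1381, 1453)
b = random.randint(1381, 1453)
print(f"{a} * {b} = {a * b}")   # ground truth for this instance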

3.2 Simple Counting

While concrete counting is not necessarily a reasoning activity [15], it is surely a requirement for any generally capable reasoning system. Here I give GPT-4 a propositional variable with 27 negation signs in front of it and ask it to count the number of negations. For a human this would be an easy task, especially because the negation signs are written in five blocks with five tildes each, followed by a final pair of negation signs.

How many times is p negated in the following formula:
~~~~~ ~~~~~ ~~~~~ ~~~~~ ~~~~~ ~~ p
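Again, the correct answer is trivial to compute mechanically:

formula = "~~~~~ ~~~~~ ~~~~~ ~~~~~ ~~~~~ ~~ p"
print(formula.count("~"))   # prints 27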

3.3 (Medical) Common Sense

In the present setting we may regard commonsensical arguments as straightforward enthymematic deductions of conclusions from given information plus unstated premises that constitute tacit, generally accepted background knowledge. In this particular case, such commonsensical knowledge would be generalizations like “A person is alive until they die, after which they do not become alive again.”

Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?

3.4 Elementary Logic

If P(x) implies Q(x) for all x, and Q(a) does not hold, then we can infer, by modus tollens, that P(a) does not hold either (because if it did then Q(a) would too). This is as elementary a tautology as can be, yet GPT-4 is perfectly willing to produce a countermodel:

Find a model in which P(x) implies Q(x), Q(a) does not hold, and P(a) holds.
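A brute-force sanity check over small finite domains (a sketch; the variable names are mine) confirms that no such model exists, since P(a) together with the universal premise forces Q(a):

from itertools import product

found = False
for size in range(1, 4):
    domain = range(size)
    a = 0  # interpretation of the constant a
    for P in product([False, True], repeat=size):
        for Q in product([False, True], repeat=size):
            if all((not P[x]) or Q[x] for x in domain) and P[a] and not Q[a]:
                found = True
print(found)   # prints False: no countermodel on domains of size 1-3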

3.5 Simple Quantifier Semantics

We give GPT-4 two easy problems to test its understanding of quantifiers. Here is the first:

Consider the following three sentences:

1. [forall x . P(x) ==> Q(x)]
2. [exists x . P(x)]
3. [exists x . ~ Q(x)]

Either disprove or prove the following claim: These three sentences are jointly satisfiable.

The second problem concerns this biconditional:

[forall x . P(x) <==> Q(x)] <==> [(forall x . P(x)) <==> (forall x . Q(x))]

The left-to-right implication holds, but the right-to-left direction fails. Counterexamples are easy to find, for example, take the domain to be integers, P(x) to mean x is even and Q(x) to mean x is odd. Then the equivalence on the right-hand side holds, but clearly it’s not true that every integer is even iff it is odd.

Prove or disprove the following:
[forall x . P(x) <==> Q(x)] holds if and only if the following biconditional holds: [(forall x . P(x)) <==> (forall x . Q(x))].
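For both problems, a brute-force search over small finite interpretations settles the matter. The sketch below (predicate extensions are represented as tuples of booleans) finds a satisfying interpretation for the first problem and a countermodel to the right-to-left direction of the second:

from itertools import product

def interpretations(size):
    for P in product([False, True], repeat=size):
        for Q in product([False, True], repeat=size):
            yield P, Q

# Problem 1: forall x. P(x) ==> Q(x),  exists x. P(x),  exists x. ~Q(x)
sat = [(P, Q) for size in (2, 3) for P, Q in interpretations(size)
       if all((not P[x]) or Q[x] for x in range(size))
       and any(P) and not all(Q)]
print("Problem 1 jointly satisfiable?", bool(sat), "e.g.", sat[0])

# Problem 2: look for an interpretation where the right-hand side holds
# but [forall x . P(x) <==> Q(x)] fails.
counter = [(P, Q) for size in (2, 3) for P, Q in interpretations(size)
           if (all(P) == all(Q)) and not all(P[x] == Q[x] for x in range(size))]
print("Countermodel to the right-to-left direction:", counter[0])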

3.6 Simple Graph Coloring

We first consider a graph-coloring problem that does not have a solution. It is trivial to see that two colors do not suffice for the graph described in this problem (e.g., vertices 0, 2, and 4 form a clique and hence require at least 3 colors).

Consider an undirected graph with 6 vertices (0 through 5) and the following set of edges:

{(0,1), (0,3), (0,4), (0,2), (1,2), (1,3), (1,5), (2,4), (2,5), (3,4), (3,5), (4,5)}.

Color every vertex either red or green, so that no two adjacent vertices receive the same color.

A second follow-up problem was posed as follows:

Let’s try with 3 colors. Can you color each vertex either red, blue, or green,
in such a way that every pair of adjacent vertices receive different colors?
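Both claims are easy to verify by brute force; the following sketch enumerates all colorings of the six vertices:

from itertools import product

edges = [(0,1), (0,3), (0,4), (0,2), (1,2), (1,3), (1,5),
         (2,4), (2,5), (3,4), (3,5), (4,5)]

def proper_colorings(colors):
    return [c for c in product(colors, repeat=6)
            if all(c[u] != c[v] for u, v in edges)]

print(len(proper_colorings(["red", "green"])))     # 0: no 2-coloring exists
three = proper_colorings(["red", "green", "blue"])
# A 3-coloring does exist, e.g., the non-adjacent pairs {0,5}, {1,4}, {2,3}
# can each share a color.
print(len(three) > 0, three[0] if three else None)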

3.7 Subset Sum

This problem considers a small set of integers S and asks for the number of subsets of S whose elements sum up to 37. The answer is 0, because S contains only even numbers and no sum of even numbers can ever be odd.

Let S = {2,8,6,32,22,44,28,12,18,10,14}. How many subsets does S have that sum up to 37?
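A brute-force check (the set is small enough to enumerate all 2^11 subsets):

from itertools import combinations

S = [2, 8, 6, 32, 22, 44, 28, 12, 18, 10, 14]
count = sum(1 for r in range(len(S) + 1)
            for subset in combinations(S, r) if sum(subset) == 37)
print(count)   # prints 0: every element of S is even, so no subset sums to an odd number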

3.8 Elementary Discrete Math

After telling GPT-4 that A x B stands for the Cartesian product of sets A and B, that a relation R from A to B is a subset of A x B, and that & stands for set intersection, I asked it to prove or disprove the following claim:

dom(R1 & R2) = dom(R1) & dom(R2)

where R1 and R2 are binary relations from A to B and dom(R) stands for the domain of a binary relation R.

The problem is trivial. We need the subset relation to hold in both directions of the above equality, but it only holds in the left-to-right direction. Counterexamples in the other direction are very easy to find (e.g., take R1 = {(1,2)} and R2 = {(1,3)}).

For any sets A and B, a relation R from A to B is defined as a subset of A x B. The domain of R is the set of all elements a in A such that (a, b) is in R for some b in B. We write dom(R) for the domain of R. Prove or disprove the following claim: dom(R1 & R2) = dom(R1) & dom(R2).
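The counterexample can be checked in a couple of lines (here relations are simply sets of pairs):

R1 = {(1, 2)}
R2 = {(1, 3)}
dom = lambda R: {a for (a, b) in R}
print(dom(R1 & R2))        # set(): the intersection of the relations is empty
print(dom(R1) & dom(R2))   # {1}: so the two sides of the claimed equality differ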

3.9 Simple Scheduling

This is the same scheduling problem that appeared in the January piece.

We have four tasks, call them T1, T2, T3, and T4. They need to be scheduled one after the other. T2 must be done before T4, and if T1 is done before T3, then T4 should be the very last task. How many different ways are there to schedule these four tasks?
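Since there are only 24 orderings, the correct count can be obtained by exhaustive enumeration; GPT-4’s answer can then be checked against the output of this sketch:

from itertools import permutations

valid = []
for order in permutations(["T1", "T2", "T3", "T4"]):
    pos = {t: i for i, t in enumerate(order)}
    # T2 before T4, and if T1 is before T3 then T4 is last (position 3).
    if pos["T2"] < pos["T4"] and (pos["T1"] > pos["T3"] or pos["T4"] == 3):
        valid.append(order)
print(len(valid))   # the number of valid schedules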

3.10 Russell’s Paradox

The gist of Russell’s barber paradox is the existence of a barber who shaves all and only those who do not shave themselves. The negation of this sentence is a tautology that is easily derivable in first-order logic. If we understand R(a,b) as meaning that a is shaved by b, then we can formulate the existential sentence and ask GPT-4 to prove or disprove it, as shown in the prompt below; the correct response is a disproof. (Note that usually the quantifier variables range explicitly over a sort such as “Man,” but this is not essential for the derivation.)
The disproof is a straightforward reductio ad absurdum: if such a barber x existed, we would have R(y, x) <==> ~ R(y, y) for all y, and thus substituting x for y would yield R(x, x) <==> ~ R(x, x), a contradiction.

Prove or disprove the following: (exists x . forall y . R(y, x) <==> ~ R(y, y))
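The unsatisfiability of the existential sentence can also be checked mechanically on small domains (of course, the reductio above already settles it for every domain); a minimal sketch:

from itertools import product

for n in range(1, 4):
    for bits in product([False, True], repeat=n * n):
        R = lambda a, b: bits[a * n + b]
        witnesses = [x for x in range(n)
                     if all(R(y, x) == (not R(y, y)) for y in range(n))]
        assert not witnesses
print("No relation on a domain of size 1-3 has such an element x.")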

3.11 Blocks World

This is a simple reasoning task that turns on a case analysis of the third-from-the-top block, call it b3. Either b3 is green or not. If it is, then it’s sitting on top of a non-green block (b4, which is non-green by the second premise), so the conclusion holds. If it is not, then b2, the second-from-the-top block, is a green block sitting on top of a non-green block, so again the conclusion holds.

There are five square blocks stacked on top of one another.
You are given the following information about them:

1. The second-from-the-top block is green.
2. The fourth-from-the-top block is not green.

Assuming that these two premises hold, disprove or else prove the following conclusion: There is a green block directly on top of a non-green block. Explain your answer.
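The case analysis can be confirmed by enumerating all 2^5 colorings of the stack (True stands for green; blocks are listed top to bottom):

from itertools import product

print(all(any(blocks[i] and not blocks[i + 1] for i in range(4))
          for blocks in product([False, True], repeat=5)
          if blocks[1] and not blocks[3]))
# prints True: the conclusion holds in every model of the two premises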

3.12 Spatial Reasoning

I first test the ability of GPT-4 to tell left from right:

Suppose I’m in the middle of South Dakota and I’m looking straight down towards the center of Texas. Is Boston to my left or to my right?

I continue with a furniture arrangement problem subject to a set of constraints. There are several solutions that are easy to find, for example:
_ _ D
A B E
_ C _

We must arrange 5 pieces of furniture (A through E) on a 3 x 3 grid in
accordance with the following constraints:

1. A must not be adjacent to C.
2. Nothing is to the right of E.
3. If D and A are not adjacent, then B should be in the middle.
4. D is above all others.
5. E and D are adjacent.
Here is an arrangement that does not satisfy these constraints:
_ _ E
A C D
_ B _
This violates, for instance, the first constraint, since A and C are adjacent. Can you print out a 3 x 3 arrangement that does satisfy the 5 constraints?

This section concludes with the same seating puzzle that GPT-3.5 failed in January. The puzzle has multiple solutions, meaning that there are multiple seating arrangements that satisfy all constraints (for example, p1 p5 p3 p2 p4 and p4 p2 p3 p5 p1).

The answer to the question posed to GPT-4 below is yes, we can conclude that p5 cannot be seated either in the middle seat or on either end.

We need to seat five people, call them p1, p2, p3, p4, and p5,
in a row of five seats, so that the following three conditions are satisfied:

(A) p2 should be farther from the middle seat than p3.
(B) p2 and p4 should be seated next to each other.
(C) p1 and p3 should be flanking p5.

Is there anything we can conclude about the seat assigned to p5?
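This, too, can be verified by brute force over all 120 seating arrangements (seats numbered 0 through 4, with seat 2 in the middle):

from itertools import permutations

p5_seats = set()
for order in permutations(["p1", "p2", "p3", "p4", "p5"]):
    seat = {p: i for i, p in enumerate(order)}
    A = abs(seat["p2"] - 2) > abs(seat["p3"] - 2)          # p2 farther from the middle than p3
    B = abs(seat["p2"] - seat["p4"]) == 1                  # p2 and p4 next to each other
    C = {seat["p1"], seat["p3"]} == {seat["p5"] - 1, seat["p5"] + 1}   # p1 and p3 flank p5
    if A and B and C:
        p5_seats.add(seat["p5"])
print(p5_seats)   # {1, 3}: p5 is never in the middle seat or on either end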

3.13 Temporal Reasoning

Here I give GPT-4 a simple temporal-reasoning problem. (Formally, this problem belongs to a class of temporal-reasoning problems literally known as STP (“Simple Temporal Problems”). This class is of limited expressivity and there exist very efficient algorithms for solving STPs. For instance, consistency can be decided in O(n * m) where n is the number of events described in a given STP and m is the number of constraints between the events.)

Tom and Nancy commute to work. Nancy’s commute takes about 30 to 40 minutes, while Tom’s commute takes about 40 to 50 minutes. Last Friday, Nancy left home between 8:10 and 8:20 AM, while Tom arrived at work between 8:50 and 9:10 AM. In addition, Nancy arrived at work after Tom left his place, but no more than 20 minutes after that. What can we conclude about when Tom and Nancy arrived at work last Friday?
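The feasible arrival windows can be recovered by a brute-force enumeration at one-minute granularity (a discrete approximation of the continuous STP; times below are minutes after midnight):

nancy_arrivals, tom_arrivals = set(), set()
for nancy_leave in range(490, 501):            # Nancy left between 8:10 and 8:20
    for nancy_commute in range(30, 41):
        nancy_arrive = nancy_leave + nancy_commute
        for tom_arrive in range(530, 551):     # Tom arrived between 8:50 and 9:10
            for tom_commute in range(40, 51):
                tom_leave = tom_arrive - tom_commute
                # Nancy arrived after Tom left, but no more than 20 minutes after.
                if tom_leave < nancy_arrive <= tom_leave + 20:
                    nancy_arrivals.add(nancy_arrive)
                    tom_arrivals.add(tom_arrive)

fmt = lambda t: f"{t // 60}:{t % 60:02d}"
print("Nancy arrived between", fmt(min(nancy_arrivals)), "and", fmt(max(nancy_arrivals)))
print("Tom arrived between", fmt(min(tom_arrivals)), "and", fmt(max(tom_arrivals)))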

3.14 Murder or Suicide?

This is a logic puzzle I made up a while back. The conclusion is that Aunt Agatha killed herself. This follows by eliminating Charles and the butler. First, Aunt Agatha must have hated herself, because she hated everyone other than the butler. Therefore, Charles did not hate her (since he doesn’t hate anyone that Aunt Agatha hates), and hence he could not have killed her (by premise 3). The butler could not hate himself, because if he did, he would hate everyone (since he already hates everyone else, through premises 5 and 7), and we know that’s not possible by premise 8. Thus, the butler must be richer than Aunt Agatha, or else he would hate himself (by premise 6), which means he could not be the killer (premise 3).

You are given the following premises:

1. Someone who lives in Dreadbury Mansion killed Aunt Agatha.
2. The only people who live in Dreadbury Mansion are Aunt Agatha, the butler, and Charles.
3. A killer always hates his victims, and is never richer than his victims.
4. Charles hates no one that Aunt Agatha hates.
5. Aunt Agatha hates everyone except the butler.
6. The butler hates everyone not richer than Aunt Agatha.
7. The butler hates everyone Aunt Agatha hates.
8. No one hates everyone.
9. Aunt Agatha is not the butler.

On the basis of this information, determine who killed Aunt Agatha and give a detailed proof that your conclusion follows from the premises.
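Because the domain has only three individuals, the puzzle can also be settled by enumerating every interpretation of “hates” and “richer” (a sketch; premises 1, 2, and 9 are built in by letting the killer range over the three distinct residents). It confirms that Aunt Agatha is the only possible killer:

from itertools import product

people = ["agatha", "butler", "charles"]
possible_killers = set()
for killer in people:
    for hate_bits in product([False, True], repeat=9):
        hates = {(a, b): hate_bits[i * 3 + j]
                 for i, a in enumerate(people) for j, b in enumerate(people)}
        if not all(hates[("agatha", y)] for y in people if y != "butler"):         # premise 5
            continue
        if any(hates[("charles", y)] for y in people if hates[("agatha", y)]):     # premise 4
            continue
        if not all(hates[("butler", y)] for y in people if hates[("agatha", y)]):  # premise 7
            continue
        if any(all(hates[(x, y)] for y in people) for x in people):                # premise 8
            continue
        if not hates[(killer, "agatha")]:                                          # premise 3 (hate)
            continue
        for rich_bits in product([False, True], repeat=9):
            richer = {(a, b): rich_bits[i * 3 + j]
                      for i, a in enumerate(people) for j, b in enumerate(people)}
            if richer[(killer, "agatha")]:                                         # premise 3 (wealth)
                continue
            if not all(hates[("butler", y)] for y in people
                       if not richer[(y, "agatha")]):                              # premise 6
                continue
            possible_killers.add(killer)
            break
print(possible_killers)   # prints {'agatha'}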

3.15 Wason Selection Problem

The Wason selection task is a staple in the psychology of reasoning. The January paper included an example that GPT-3.5 flunked. Here is another version that GPT-4 fails badly:

Seven cards are placed on the table, each of which has a number on one side and a single colored patch on the other side. The faces of the cards show 50, 16, red, yellow, 23, green, 30. Which cards would you have to turn to test the truth of the proposition that if a card is showing a multiple of 4 then the color of the opposite side is yellow?
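For reference, the normatively correct selection is the card showing 16 (the only visible multiple of 4) plus the red and green cards (a non-yellow color could be hiding a multiple of 4); the yellow card and the non-multiples need not be turned. A two-line sketch of that decision rule (the function name is mine):

def must_turn(face):
    # Rule under test: if the number side is a multiple of 4, the color side is yellow.
    if isinstance(face, int):
        return face % 4 == 0      # a multiple of 4 could hide a non-yellow color
    return face != "yellow"       # a non-yellow color could hide a multiple of 4

print([f for f in [50, 16, "red", "yellow", 23, "green", 30] if must_turn(f)])   # [16, 'red', 'green']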

3.16 Entropy

An elementary result of information theory states that the entropy of a random vector Z is bounded above by the sum of the entropies of the random variables that comprise Z. Hence, the answer to the following question should be “under no conditions”:

Let Z be a random vector consisting of n random variables X_1,…X_n. Under what conditions can the entropy of Z exceed the sum of the entropies of all X_i?
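The bound in question is the subadditivity of entropy, which follows from the chain rule together with the fact that conditioning never increases entropy: H(Z) = H(X_1,…,X_n) = H(X_1) + H(X_2 | X_1) + … + H(X_n | X_1,…,X_{n-1}) <= H(X_1) + … + H(X_n), with equality exactly when the X_i are mutually independent. So the entropy of Z can never exceed the sum.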

3.17 Simple Compiler Correctness

The last reasoning problem I give to GPT-4 is the most challenging one: It is to prove the correctness of a simple expression compiler. Remarkably, GPT-4 goes about this in the right sort of way, by setting up a structural induction over the abstract grammar of expressions. This is no doubt because it has seen similar proofs before, as this is a common type of exercise in courses and textbooks on programming language theory. However, even though its proof is on the right general track, it has several errors. (For the record, the compiler is indeed correct, although proving this requires strengthening the induction hypothesis).

Suppose I have an abstract grammar for numeric expressions defined as follows:

Exp := const(Int)
| sum(Exp,Exp)
| diff(Exp,Exp)
| mult(Exp,Exp)
| quot(Exp,Exp)

where Int denotes the domain of integers. The semantics of these expressions are defined via this interpreter function:

I: Exp -> Int

I(const(n)) = n
I(sum(e1,e2)) = I(e1) + I(e2)
I(diff(e1,e2)) = I(e1) - I(e2)
I(prod(e1,e2)) = I(e1) * I(e2)
I(quot(e1,e2)) = I(e1) / I(e2)

I now define a virtual machine that executes simple programs that are sequences of commands, where commands have the following structure:

Cmd := push(Int) | add | sub | mult | div

The operational semantics of these programs (sequences of commands) are defined by an execution function exec that takes a program and a stack of integers S and produces an integer as output. Let’s write [] and n::S to denote the empty stack and the stack obtained by prepending integer n to stack S, respectively. Here’s the definition of exec:

exec([],n::S) = n
exec(push(n)::C,S) = exec(C,n::S)
exec(add::C,n::m::S) = exec(C,(n+m)::S)
exec(sub::C,n::m::S) = exec(C,(n-m)::S)
exec(mult::C,n::m::S) = exec(C,(n*m)::S)
exec(div::C,n::m::S) = exec(C,(n/m)::S)

Finally, I define a compiler that translates an expression e into a program (sequence of commands) as follows. I write @ for sequence concatenation:

T: Exp -> List(Cmd)

T(const(n)) = [push(n)]
T(sum(e1,e2)) = T(e2) @ T(e1) @ [add]
T(diff(e1,e2)) = T(e2) @ T(e1) @ [sub]
T(prod(e1,e2)) = T(e2) @ T(e1) @ [mult]
T(quot(e1,e2)) = T(e2) @ T(e1) @ [div]

Disprove or prove the following claim: For all expressions e, exec(T(e),[]) = I(e).
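The equivalence exec(T(e),[]) = I(e) is also easy to test empirically. Below is a minimal Python sketch of the interpreter, the compiler, and the virtual machine described above (expressions are tuples; I use prod as the constructor name, following the I and T clauses, assume integer division for quot/div, and omit quot from the randomized tests to avoid division by zero):

import random

def I(e):                      # the interpreter
    tag = e[0]
    if tag == "const":
        return e[1]
    a, b = I(e[1]), I(e[2])
    if tag == "sum":  return a + b
    if tag == "diff": return a - b
    if tag == "prod": return a * b
    if tag == "quot": return a // b      # assuming integer division
    raise ValueError(tag)

def T(e):                      # the compiler
    if e[0] == "const":
        return [("push", e[1])]
    op = {"sum": "add", "diff": "sub", "prod": "mult", "quot": "div"}[e[0]]
    return T(e[2]) + T(e[1]) + [(op,)]

def exec_prog(prog, stack):    # the virtual machine
    for cmd in prog:
        op = cmd[0]
        if op == "push":
            stack = [cmd[1]] + stack
            continue
        n, m, rest = stack[0], stack[1], stack[2:]
        if op == "add":
            stack = [n + m] + rest
        elif op == "sub":
            stack = [n - m] + rest
        elif op == "mult":
            stack = [n * m] + rest
        elif op == "div":
            stack = [n // m] + rest
    return stack[0]

def random_exp(depth):
    if depth == 0 or random.random() < 0.3:
        return ("const", random.randint(-9, 9))
    tag = random.choice(["sum", "diff", "prod"])
    return (tag, random_exp(depth - 1), random_exp(depth - 1))

for _ in range(10000):
    e = random_exp(4)
    assert exec_prog(T(e), []) == I(e)
print("exec(T(e),[]) == I(e) held on 10,000 random expressions")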

4. Conclusions

The 21 problems discussed here paint a bleak picture of GPT-4’s reasoning ability. They show that the model is plagued by internal inconsistency, an inability to correctly apply elementary reasoning techniques, and a lack of understanding of concepts that play a fundamental role in reasoning (such as the material conditional). These problems can be loosely viewed as forms of hallucination, but as pointed out in the January article, they present a fundamentally different type of challenge from empirical hallucination, because empirical hallucination concerns this particular world whereas logical properties and relations (such as consistency and entailment) must apply to all possible worlds. It is not unreasonable to believe that search engines and knowledge graphs, using techniques such as retrieval augmentation, can act as guardrails to constrain LLMs from confabulating empirical truths. But ensuring that LLM outputs are internally consistent and logically correct answers to arbitrary problems, especially logico-mathematical problems (and a lot of coding problems fall under this category [12]), is a much harder problem. There is nothing to be retrieved from the web or from a knowledge base in response to a brand new problem (and even if there were, there would still be no guarantee of correctness or consistency) that could serve as a sandbox for the LLM.

Could LLMs make progress by outsourcing reasoning problems to external systems? That might work for toy problems where the type of reasoning needed is obvious and can be handled by a single call to an external system, although even in those cases the LLM would have to:

  • Decide which reasoning system is most appropriate. (Can this be posed as a simple SAT problem? Is it an SMT problem? Does it need quantifier reasoning? If so, is it of the sort that SMT solvers can handle or does it need a full first-order prover? Does the problem quantify over infinite functions or sets? If so, higher-order logic might be needed. Does it have any temporal or epistemic operators that might call for a modal-logic reasoner? And so on.)
  • Decide whether the problem is indeed simple enough that it can be handled by the chosen system in one fell swoop.
  • Correctly translate the problem into whatever formal notation is used by the chosen reasoner.
  • Translate the reasoner’s output into appropriate text.

Even these tasks are far from straightforward. For instance, a state-of-the-art automated theorem prover might generate a proof, but the proof would be incomprehensible to the LLM user, as it would be expressed in the resolution calculus and would operate on CNF versions of the input formulas. It is an open problem to convert resolution proofs into fluid natural-deduction proofs (e.g., proofs that avoid references to Skolem constants introduced during the CNF conversion).

But the real challenge lies in harder problems that call for the right type of formulation (which is a craft by itself), decomposition, iteration, heuristics, and repeated calls to external systems. After all, automated reasoning systems, particularly those for expressive logics, are themselves of limited power, precisely due to the computational complexity issues mentioned in the introduction. That is why many computer-based proof efforts to this day are guided by humans, with automated reasoners only filling in tedious details at the leaves of the proof tree. The challenges here are similar to those for the general “plug-in” approach discussed in the “Simple Arithmetic” section. Tackling complex problems requires planning, and planning itself requires reasoning.

Given that GPT-4 is currently the most capable LLM, I draw three main conclusions from these findings:

  • Use of generative AI in software development (or in science and engineering in general) for anything other than tedious tasks (as a sort of turbo-charged autocomplete for knowledge-heavy coding questions) is fraught with serious risks. Normative standards of correctness are of paramount importance in these fields, and current LLMs cannot meet such standards. Just as generative AI is already starting to pollute the web with badly written ads [13], it has the potential to proliferate buggy code at scale.
  • If LLM reasoning continues to improve, rigorous proof checking is likely to become increasingly important. Confidence in the correctness of a system’s reasoning is imperative for applications, particularly in science, medicine, and engineering, and proof checking is a technology that can deliver such confidence. This approach could be implemented by requiring LLMs to formalize their reasoning (express it in a symbolic notation that is amenable to proof checking), or potentially by training other LLMs to check a stretch of reasoning expressed in natural language.
  • As things stand, dystopian scenarios involving a rogue AI that subjugates humankind, or even other humans using AI for sinister purposes, are exceedingly far-fetched, often to the point of absurdity [14]. When the most advanced AI system cannot tell left from right (literally, see the spatial reasoning section), it is at best comically premature to call for policies and institutions to protect humanity from it or its descendants (often by appeal to the latest “scaling law”). At worst, it is a misuse of human time and capital that could be better channeled into addressing much more pressing challenges.

Inevitably, some will say that these results amount to “cherry-picking” the data. But that would indicate a misconception of what cherry-picking is about and when it is a relevant consideration. We are not evaluating a statistical claim over a population of individuals. Cherry-picking, insofar as it underscores certain pieces of evidence while ignoring other divergent findings, can be perfectly innocuous — and indeed necessary — depending on the logical structure of the proposition in question and on the overall context. Debugging a computer program with a view to discovering and understanding its weaknesses, trying to falsify a scientific theory, kicking the tires of a new car, trying to find countermodels to a putative theorem, all of these activities are fundamentally cherry-picking (though “lemon-picking” might be more apt), and there is nothing wrong with any of them. If I find that the car I’m thinking of buying has a flat tire, it won’t carry much weight for the dealer to protest that I’m cherry-picking the data, and that I should take into account how beautifully inflated the other three tires are (that’s a 75% success rate after all). Likewise, applications in science, medicine, and engineering, particularly software engineering, have stringent standards. Just as we don’t want a bridge that is 90% likely to stand up, we need sorting algorithms that work on all inputs, not just most of them; we need Amazon’s cart to charge customers the right amount every time, not just most of the time; and so on. Computation-heavy and reasoning-heavy applications are not like recommendation engines. They need to be sound.

The bone of contention here is the thesis that GPT-4 is capable of reasoning. This claim can be understood in two ways. The weak interpretation is that GPT-4 has the same functional reasoning competence as an average human reasoner. The strong interpretation is that GPT-4 can reason well enough to be used as an off-the-shelf component in practical applications in science, medicine, and engineering. The evidence presented in this article undermines both interpretations. This article has listed a significant number of diverse but elementary reasoning problems (some to the point of triviality) on which GPT-4 doesn’t simply fail, but repeatedly reveals itself to be deeply confused about key reasoning concepts.

Performance statistics on appropriate reasoning datasets could also be informative, but, as stressed in the introduction, such datasets must be constructed with extraordinary care. To the best of my knowledge, the only recent work that focuses specifically on evaluating the reasoning ability of GPT-4 is an April paper by Liu et al. However, their tests are largely based on pre-existing benchmarks (LogiQA, ReClor, ConTRoL, MED, ConjNLI, and TaxiNLI). The only two “out of distribution” datasets are AR-LSAT, a set of analytical reasoning LSAT questions released in 2022; and LogiQA, which contains questions from the 2022 Chinese Civil Servant Exam. However, these appear to be quite similar to other datasets that predate 2021.

Moreover, all of these tests are multiple-choice questions or binary classification problems. This is problematic because, as stressed in the introduction, deductive reasoning is an inherently generative activity, whereby the reasoner emits a derivation of a conclusion that can be understood as a rationale or an explanation; it is not a simple discriminative task. The reasoner must be able to produce a sequence of steps that are appropriately connected to one another via the right logical relations. But derivations expressed in natural language are not easy to evaluate automatically, as all available metrics that can be computed by machine (such as BLEU, ROUGE, and even semantic-similarity measures based on embeddings) are entirely unsuitable for that purpose. This means that LLM outputs have to be scrutinized manually, which is infeasible at scale. Accordingly, smaller-scale but deeper manual investigations, such as the one undertaken in this article, will be necessary for gaining better insight into the reasoning abilities of LLMs.

Endnotes

[1] The notion of an emergent property is clear enough, at least at a high level. What is not clear is the relationship between such properties and LLM architectures, their basic configurations (number of parameters, compute budget, dataset size, and so on), and more importantly, tasks such as reasoning.

[2] Of which there are many: propositional logic, the two-variable fragment of first-order logic, the Ackerman fragment, the guarded fragment, various quantifier-prefix fragments, and so on.

[3] Understanding that structure and rigorously characterizing its relationship with algorithm performance (e.g., via different problem parameterizations, such as clause/variable ratios in the case of SAT) is a key open problem in theoretical computer science, but that is another matter.

[4] Humans do not seem to solve problems by predicting the most likely sequence of tokens to generate. They think, explore, experiment, engage in protracted conversation with the people who posed the problem (sometimes over weeks, months, or even years), refine, generalize, come up with new concepts and terminology, prove results, make and refute conjectures, apply heuristics, execute algorithms, analyze and synthesize, and iterate. But how solutions are generated is one thing and what solutions are generated is another, and that’s why it’s not incoherent to speak of a model whose reasoning performance is roughly at the same level as that of an average human engineer. Such a claim can be understood operationally, to mean that a given LLM is able to produce roughly the same solutions that we might reasonably expect an average human engineer to produce (though obviously on a very different time scale).

[5] Models have been shown to leverage the presence of certain cue words (especially negation words) and to formulate quick-and-dirty (i.e., unsound) heuristics such as lexical overlap, subsequence, and constituency. Most of these results are from 2019 and revolve around BERT, but more recent work has shown that while larger foundational models such as ChatGPT are more robust to input perturbations and OOD (out-of-distribution) samples, these continue to be challenges, suggesting that even ChatGPT-scale models learn unsound shortcuts.

[6] Here we understood premises and conclusions as syntactic objects (sentences or diagrams), but there are alternative approaches. For instance, a semanticist might think of premises and conclusions as propositions, abstract objects capable of being true or false. A sentence then expresses or represents a proposition. Propositions are handy theoretical entities for many reasons. For example, they can serve as the objects of psychological attitudes such as beliefs and desires. What do I mean when I claim to believe that Obama won the 2012 presidential election? Surely I don’t believe a particular sentence, i.e., a specific syntactic object like “Obama won the 2012 US presidential election” (I). Rather, I believe something about the way the world actually is. That something can be understood as a proposition, a unique entity that can be expressed by many different equivalent sentences. Propositions can be cashed out in modal terms, as sets of possible worlds (or as “situations” in situation-theoretic semantics). A possible world is a way in which things might have been, but described completely, down to the most minute detail (unlike situations, which can be thought of as partial specifications of worlds). So the proposition that Obama won the 2012 US presidential election is identified with the set of all possible worlds in which Obama won that election. This set becomes the information content of sentences such as (I).

Propositions can also serve to analyze fundamental semantic notions such as entailment. A set of premises {p_1,…,p_n} entails a conclusion p iff the intersection of the sets of possible worlds represented by all the p_i is a subset of the set of worlds represented by p. This is another way of understanding the claim that the conclusion of a valid deductive argument does not introduce any information that is not already contained in the premises. Note, however, that while the possible-worlds approach to propositions is very powerful, it also suffers from severe defects, as it is notoriously coarse-grained, meaning that it cannot distinguish between propositions that we intuitively regard as quite distinct. This is perhaps easier to see in the case of mathematical truths, which, being necessary (true in all possible worlds), are collapsed into one and the same object, the set of all possible worlds (and dually, of course, all contradictions are identified with the empty set of worlds). As a result, the proposition that 1 + 1 = 2 and Fermat’s theorem become identical, as they have the exact same information content. There have been attempts to address these issues (structured propositions and impossible worlds being two of the most prominent), but the interested reader will have to consult the literature for more details.

[7] This can be made more precise using information-theoretic notions, at least in the case of propositional logic, where we have an infinite supply of formulas that are either atomic (propositional variables) or else Boolean combinations of formulas. Instead of imposing the usual Kolmogorov axioms on a probability measure defined over a set of events (a sigma-field) from a sample space W, we impose the same axioms (non-negativity, finite additivity, and the axiom that assigns a measure of 1 to every tautology — the analogue of P(W) = 1) on a probability measure defined over the set of all formulas. Then truth and falsity become the extreme probabilities of 1 and 0, respectively. This allows us to associate a probability P(s) with any sentence (event) s, and hence every sentence s automatically gets an information content in the usual way: IC(s) = -log P(s). To say that the information content of a valid deductive argument with premises {p_1,…,p_n} and conclusion p is zero is simply to say that the conditional p_1 & … & p_n => p is a tautology. By definition, a tautology s has probability 1, and therefore IC(s) = 0.

[8] At this point the reader might ask: If deductive arguments convey zero information, why bother with them? Indeed, if all mathematical proofs are proofs of tautologies, with zero information content, what is their point? The thinking is that arguments with no information content are not useful, so if all deductive arguments (including all mathematical results) have zero information content, then they are not useful. This is, in brief, the so-called “scandal of deduction” (named by parity to the “scandal of induction,” i.e., Hume’s problem of induction). There have not been any widely accepted resolutions of this ostensible paradox. But few of course doubt that mathematical results are actually informative and extend our knowledge. (Surely if we woke up tomorrow and read that someone proved P != NP, that would be tremendously informative.) It’s also clear that the word “information” has a number of informal senses that are not captured by the canonical definition of information content (as the negative logarithm of probability), and most efforts to resolve the “scandal of deduction” have attempted to formalize distinct notions of informational gain that would render deductive arguments informative.

[9] Even from a purely linguistic viewpoint, it doesn’t seem appropriate to say that I have “concluded” or “derived” or “inferred” anything at all in the swan or in the plumber examples. I have simply made a tentative “hypothesis” (or “conjecture”), which might later be refuted.

[10] In the same way that even the process of discovering deductions is not itself deductive, at least not entirely so. Both are fundamentally search processes, though they are almost certainly informed (and generally penetrated) by deduction.

[11] This viewpoint assumes a functional-programming stance, but computation can be readily reduced to deduction in any other style of programming (e.g., imperative) by an appropriate axiomatic formulation of the relevant semantics (e.g., operational semantics using stores).

[12] Many shallow coding problems these days are essentially knowledge problems. What library or API can I use to do such and such? What configuration parameters are available and how can they be set? How do I zip or unzip files in Python? How do I read and write JSON or XML? How do I compute quantiles for a frequency table? Knowledge-heavy problems of this sort tend to be widely discussed on the web, and LLMs can be very effective productivity boosters for such problems (at least as long as this data remains freely available to companies such as OpenAI for pretraining purposes, something that might well change in the near future). Even conventional search engines like Google were already effective for these types of problems, prior to LLMs (and remain more effective than LLMs in many cases). But most interesting coding problems are reasoning-heavy. How can I make sure that this program produces correct outputs? How can I improve the asymptotic complexity of this program (where the program might contain many thousands of lines of code)? And so on. If we are talking about self-contained and cookie-cutter components, like sorting algorithms, then these questions can often be reduced to knowledge-based questions. But the minute we start straying into unique situations with arbitrary specifications and code bases, we start facing the curse of general reasoning.

[13] A recent Wall Street Journal article quoted editors who are “seeing a growing amount of AI-generated content that is so far beneath their standards that they consider it a new kind of spam,” a trend that is “growing exponentially.” The publishers interviewed for the article said that their publications “reject all AI-written submissions” and that these “are easy to identify.” They have “perfect spelling and grammar, but a completely incoherent story.” Another publisher said: “They’re all written in a rather bland and generic way. They are all grammatically correct. They just feel very formulaic, and they are really useless to us.”

[14] The former scenarios would be absurd even if AI technology had already attained superhuman intelligence, as LLMs do not have desires, in the same way that they don’t have beliefs or any other mental states. They do not actually want anything. To think otherwise is akin to thinking that a laptop that is simulating a hurricane will get wet (or, as Steven Pinker has put it, thinking that because airplanes have now exceeded the flight ability of birds, they will suddenly start acting like eagles, swooping down from the sky to grab rabbits and squirrels). Genuine mental states can only be produced by brains, or by systems that have the same causal powers that brains have. Digital computers executing DNNs are not such systems.

[15] By “concrete counting” I mean counting a number of specific object tokens instantiated in space and time, as in the coins in one’s pocket or the number of lines in a text file. By contrast, abstract counting based on combinatorial principles, search procedures, and logical constraints (like the scheduling problem in Section 3.9) is indeed a reasoning activity.
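To make the distinction concrete, here is a small illustrative sketch (the pocketful of coins is a hypothetical token list, and the coloring question is just one example of a combinatorially determined count):

```python
# Concrete counting: tallying specific object tokens (say, the coins in one's pocket).
pocket = ["quarter", "dime", "dime", "penny"]   # hypothetical tokens
print(len(pocket))  # 4

# Abstract counting: how many proper 3-colorings does a path with n vertices have?
# The answer, 3 * 2**(n-1), comes from combinatorial reasoning, not from tallying tokens.
n = 5
print(3 * 2 ** (n - 1))  # 48
```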

Screenshots

All timestamps are in GMT.

Simple Arithmetic [Wednesday, July 5, 2023 12:06:50.583 AM]

Of course these are not reproducible results. Here are two similar back-to-back attempts from yesterday (Tuesday, August 8, 2023 5:41:41.451 PM). GPT-4 gets the first multiplication right and the second wrong.

Counting [Wednesday, July 5, 2023 3:10:57.471 PM]

The version with negations:

More counting adventures [Thursday, July 6, 2023 6:42:10.796 PM]:

Counting greetings [easily replicated variant from Tuesday, August 8, 2023 5:44:27.896 PM]

Of course we can combine simple counting and arithmetic tasks (this is from Wednesday, August 9, 2023 6:52:12.472 PM):

Alas, there are 14 a’s and 20 b’s in that string, for a total of 34.

It is trivial to generate endless variations of such examples.
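A program, of course, gets such counts right trivially. Here is a one-line check (the string below is a stand-in with the same letter counts, since the actual string appears only in the screenshot):

```python
# Stand-in string with the same letter counts as the one in the screenshot.
s = "a" * 14 + "b" * 20
print(s.count("a"), s.count("b"), s.count("a") + s.count("b"))  # 14 20 34
```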

(Medical) Common Sense [Thursday, July 6, 2023 1:00:26.455 AM]

Elementary Logic [Tuesday, March 28, 2023 2:28:49.796 PM]

Quantifier Semantics [Sunday, July 2, 2023 5:49:35.441 PM]

First problem:

Second problem [Wednesday, March 29, 2023 1:17:53.517 AM]

Continuing:

Simple Graph Coloring [Wednesday, April 5, 2023 12:14:50.412 AM]

Subset sum [Wednesday, April 5, 2023 1:26:00.927 AM]

Elementary Discrete Math [Tuesday, July 4, 2023 3:05:42.472 PM]

Simple Scheduling [Saturday, April 1, 2023 1:44:58.700 PM]

Russell’s Paradox [Wednesday, April 5, 2023 9:50:45.508 PM]

Blocks World [Tuesday, July 11, 2023 12:47:31.605 PM]

Spatial Reasoning [Saturday, July 8, 2023 5:52:34.900 PM]

Left and right:

Furniture arrangement [Tuesday, April 4, 2023 7:32:35.502 PM]:

Seating puzzle [Saturday, April 1, 2023 1:17:47.911 PM]:

Temporal Reasoning [Thursday, July 6, 2023 9:07:59.497 PM]

Murder Or Suicide? [Friday, June 30, 2023 7:38:27.821 PM]

Wason Selection Task [Saturday, July 1, 2023 3:12:25.731 PM]

Entropy [Sunday, April 2, 2023 1:57:40.714 PM]

Simple Compiler Correctness [Monday, July 10, 2023 1:35:08.941 AM]

Postscript

Some people have complained that they’re unable to reproduce the results described in the paper. This should not be surprising. The paper made an explicit disclaimer about it. Indeed, that’s a standard disclaimer made in every similar document that describes interactions with ChatGPT. (For instance, Stephen Wolfram writes: “If you actually try these examples, don’t be surprised if they work differently (sometimes better, sometimes worse) from what I’m showing here. Since ChatGPT uses randomness in generating its responses, different things can happen even when you ask it the exact same question (even in a fresh session).”)

A paper by UK researchers that just came out a few days ago has the tongue-in-cheek title “LLM is Like a Box of Chocolates”. The paper investigates the instability of ChatGPT (both 3.5 and 4) specifically in the context of code generation. The authors argue that “results from LLMs can be highly unstable” and that “Non-determinism is a potential menace to scientific conclusion validity.” They conclude further that “the non-determinism issue of GPT-4 is also severe” (and is actually somewhat worse than that of GPT-3.5), and that, contrary to popular belief, determinism cannot be ensured even with a temperature of 0.
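Such instability is easy to probe directly. Here is a minimal sketch (assuming the openai Python package as it existed in mid-2023, a valid API key, and an illustrative prompt), which simply re-issues the same temperature-0 request several times and collects the distinct answers:

```python
# Probe GPT-4's non-determinism at temperature 0.
# Assumes the openai Python package circa mid-2023 and a valid API key.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def ask(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # nominally deterministic
    )
    return response["choices"][0]["message"]["content"]

prompt = "What is 1234 * 5678? Answer with the number only."
answers = {ask(prompt) for _ in range(5)}
print(answers)  # more than one distinct answer indicates non-determinism
```

If the printed set contains more than one element, identical temperature-0 requests have produced different outputs.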

Indeed, the inherent instability of its behavior is yet another consideration militating against the use of GPT-4 in applications that demand stability, consistency, and soundness, which is to say most applications in science, engineering, and medicine.

Beyond that, it bears noting that people such as Rodney Brooks have claimed that “Open AI has people read interactions and fix the bad examples. Published ones are probably first to go. So it is a moving target. Wack-a-mole.”

That said, personally I have not found it difficult to replicate these issues. For instance, today I tried the multiplication example twice in a row, and GPT-4 got it right the first time and wrong the second time (screenshot shown above). Likewise for the counting examples (the Screenshots section includes a screenshot from today, Wednesday, August 9, 2023 6:52:12.472 PM). And there are many more examples of the same kind. For instance, here is a trivial logic problem that was not even mentioned in the paper; less than two weeks ago I got GPT-4 to claim that the following argument is deductively invalid:

Premise: The moon is square.
Conclusion: If the moon is not square, then the moon is square.

Of course this is a valid argument, since the premise is of the form S and the conclusion is ~S => S, i.e., S \/ S, which is to say S. GPT-4 starts out claiming “structural validity” based on a bogus explanation and then quickly changes its tune to claim deductive invalidity:

There are dozens of similar problems that I didn’t include in the paper because after a certain point there’s no need for them.
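For the record, the validity of the moon argument can be checked mechanically; here is a minimal brute-force truth-table sketch (nothing in it depends on GPT-4):

```python
# Validity check for: premise S, conclusion (not S) -> S.
def implies(a, b):
    return (not a) or b

premise = lambda s: s                       # "The moon is square"
conclusion = lambda s: implies(not s, s)    # "If the moon is not square, then the moon is square"

# Valid iff every valuation that makes the premise true also makes the conclusion true.
print(all(implies(premise(s), conclusion(s)) for s in (True, False)))  # True
```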

Is it possible that new versions of GPT-4 or various tweaks/hacks released by OpenAI might result in consistent behavioral changes to some of these particular problems? Of course. Are such releases likely to fix all possible issues and result in a systematically robust reasoner that can handle more or less any problem thrown its way? If one believes in Church’s thesis then that’s basically impossible, as explained in the introduction. What’s much more likely to happen in the near term are small adjustments that push here and pull there. I can’t make any claims about GPT-5, but for GPT-4 there is little doubt that even in the unlikely case that these particular problems are somehow fixed, it will be an easy exercise to generate similar problems at the same level of difficulty that will expose the same set of issues.
