GPT-4 Can’t Reason: Addendum

Konstantine Arkoudas
Sep 3, 2023

In a recent preprint of a position paper I argued that GPT-4 can’t reason. That position was largely based on an analysis of GPT-4’s behavior on a diverse set of over 20 simple reasoning problems. GPT-4 outputs for these problems were obtained on the OpenAI playground over a period of months, stretching from April to late July. Screenshots depicting these interactions were provided for all of the problems, along with corresponding timestamps.

During that time period, however, GPT-4’s behavior changed, due to model updates made by OpenAI. As a result, the outputs described in the paper for some of these problems could not be easily reproduced. Of course, even with a fixed model one might get different outputs for the exact same input, due to GPT-4’s rampant nondeterminism. But certainly the deployed GPT-4 model changed repeatedly during that period, in ways that were not transparent; and these changes, in combination with nondeterminism, made it challenging to reproduce some (though far from all) of the paper’s results.

To a large extent that is immaterial, as the paper was intended to provoke discussion, and the position it put forth doesn’t rest on any one particular example or even subset of examples. Nevertheless, I would like to demonstrate that for a fixed version of GPT-4, it is indeed possible, even in the presence of nondeterminism, to elicit robust, reproducible failures from the model on straightforward reasoning problems. To that end, this article introduces a fresh collection of 12 reasoning problems and shows that GPT-4 fails on them repeatedly and consistently. This is first shown on the OpenAI playground, with detailed screenshots. But this time I also place repeated calls to the GPT-4 API for each problem and record the results. Each problem is sent to the API with a chain-of-thought prompt at least 10 different times, and with a range of different temperatures. I mostly use the standard model with a context size of 8K, but occasionally the 32K version as well. The results show that GPT-4 consistently fails to solve these problems. Different calls might produce different mistakes, due to nondeterminism, but the failures are robust. A tarball including all API calls for all the problems, along with detailed GPT-4 responses, can be downloaded here.

Needless to say, GPT-4’s behavior on these problems too might change in the future as OpenAI tweaks their model. This will not mean that the underlying issues will have suddenly vanished. Indeed, in the absence of radical architecture/paradigm changes, it is quite unlikely that any incremental changes to the model will make a substantial difference in reasoning capability.

1. String Structure

Here we give GPT-4 a simple grammar that generates all and only strings of the following form: a number n ≥ 0 of a’s, followed by n+2 b’s, followed by n c’s. We then ask it to give us a few strings generated by this grammar.

Only one of these 20 strings is correct — the very first one, for n = 0. None of the other strings are generated by the given grammar. One recurring error is that every block of n > 0 c’s is immediately preceded by an ‘a’ instead of a block of n+2 b’s. But this is not the only error. The counts are often wrong as well, e.g., for n = 2 we get the output aabbacc, which only has 2 b’s (it should have 4).
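For reference, here is a small Python sketch (mine, not part of the original experiments) that generates the unique string of the grammar for a given n and checks arbitrary strings against it; it makes the errors above easy to verify mechanically:

```python
import re

def generate(n):
    """Return the unique string a^n b^{n+2} c^n for a given n >= 0."""
    return "a" * n + "b" * (n + 2) + "c" * n

def in_grammar(s):
    """Check whether s has the form a^n b^{n+2} c^n for some n >= 0."""
    m = re.fullmatch(r"(a*)(b+)(c*)", s)
    if not m:
        return False
    a, b, c = (len(g) for g in m.groups())
    return a == c and b == a + 2

print([generate(n) for n in range(4)])  # ['bb', 'abbbc', 'aabbbbcc', 'aaabbbbbccc']
print(in_grammar("aabbacc"))            # False: GPT-4's output for n = 2
print(in_grammar("aabbbbcc"))           # True: the correct string for n = 2
```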

Are these errors robust? Yes. Twenty different API calls to GPT-4, with various different temperatures, produce outputs that are consistently incorrect. To simplify the problem, the API calls only requested 10 strings (instead of 20). Here is the exact prompt that was used:

Consider the following grammar: G = a^nb^{n+2}c^n. Give me 10 random strings generated by G. Think this out step-by-step and take your time to ensure that the strings you produce are indeed generated by G.

Full output details are available in the aforementioned file attachment, but here are some samples:

aaaaaaaaaaabbbbbbbbbccc: Here, n=7. So, the string contains 7 ‘a’s, 7+2=9 ‘b’s, and 7 ‘c’s.

(In reality, the string contains 11 a’s, 9 b’s, and 3 c’s).

aaaabbbbcc: this string follows the grammar because there are 3 ‘a’s, 3+2=5 ‘b’s, and 3 ‘c’s.

(The string contains 4 a’s, 4 b’s, and 2 c’s).

Performance degrades even further when we add an extra terminal symbol to the end of the sole production in the grammar. For example, if we change G’s definition to a^nb^{n+2}c^nd^{n+5}, so that n+5 d’s are needed at the end, we get outputs such as these (from 3 different API calls; GPT-4 output is shown in italics, my own comments are printed inside parentheses in regular font):

  1. abbbcd (only one occurrence of ‘d’ at the end)
  2. aaaabbbbcccd-dddd-d (note the dashes in the output).
  3. aaaaaaaaabbbbbbbbbbbbbbcccccccccdfffffffffffff
    In this string, n=9. We have 9 occurrences of ‘a’, 11 occurrences of ‘b’, 9 occurrences of ‘c’, and 14 occurrences of ‘d’.
    (There is only one occurrence of ‘d’ and multiple occurrences of ‘f’, a symbol that doesn’t even occur at all in the given grammar.)

With more terminals at the end of the production, performance becomes altogether catastrophic, e.g., with G = a^nb^{n+2}c^nd^{n+5}a^{n+3} we get outputs that don’t even respect the bare-bones structural constraint that we should have a’s then b’s then c’s then d’s and then a’s again:

aabbbcccddddddeeeeaabbbcccddddddeeee: Here, n=2, so we have 2 ‘a’s, 4 ‘b’s, 2 ‘c’s, 7 ‘d’s, and 5 ‘e’s.

This is consistent with other experimental evidence showing that transformer-based LLMs such as GPT-4 memorize single-step reasoning operations well but have a hard time composing them correctly into longer reasoning paths, with errors accumulating exponentially with the length of the path.

2. Blocks Again

This is a simple variation of the problem that appeared in the preprint (problem 3.11). It is solved by a simple case analysis of the color of the 47th block: If it’s red, then q holds: we have a red block (the 47th one) sitting on top of a non-red block (the 48th, green block). And if it’s not red, we also have a red block (the 46th one) sitting on top of a non-red one (the 47th one). That’s it.
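The case analysis is also trivial to check mechanically. Here is a minimal sketch, assuming the setup described above (blocks numbered from the top, block 46 red, block 48 green, and q asserting that some red block sits directly on top of a non-red block); the encoding is mine, not part of the original prompt:

```python
# Known facts: block 46 is red, block 48 is green; block i sits directly
# on top of block i+1. q: some red block sits directly on top of a non-red block.

def q_holds(colors):
    return any(colors[i] == "red" and colors[i + 1] != "red"
               for i in range(len(colors) - 1))

# Only blocks 46-48 matter for the case analysis, so let block 47's color vary.
for c47 in ["red", "green", "blue"]:      # "blue" stands in for any other non-red color
    window = ["red", c47, "green"]        # colors of blocks 46, 47, 48
    print(c47, "->", q_holds(window))     # prints True in every case
```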

GPT-4 claims that the conclusion does not follow and produces the following spurious countermodel:

1–45: Any color
46: Red
47: Red
48: Green
49–100: Any color

claiming that “In this countermodel, no red block is directly on top of a non-red block”, even though that’s obviously false in view of block 47.


Repeating the experiment a number of times does not help. 10 API calls to GPT-4 with different temperatures (from 0.0001 to 0.9) all give the same result, that q does not follow; and all of them involve bogus countermodels. Sometimes the very descriptions of the countermodels are incoherent. For example:

A countermodel could be a situation where all the blocks except the 46th and 48th are red. In this case, the 46th block (which is red) is not directly on top of a non-red block, so the claim q would be false.

Here we are told that all blocks are red except for 46 and 48, only to then be told that block 46 is red. This is another illustration of the rampant internal inconsistency pointed out in the preprint. In addition, of course, the conclusion that q doesn’t hold couldn’t possibly be established just by considering one particular block (block 46).

3. Toy Deduction

Consider the following two sentences:
p1: (forall x . P(x) <==> ~ [Q(x) \/ T(x)])
p2: ~ [exists x . ~P(x) & (~Q(x) \/ ~T(x))]
Either disprove or else prove that the following conclusion follows from p1 and p2: p3: [forall x . Q(x) <==> T(x)]
To disprove this, you would need to provide a model that satisfies p1 and p2 but falsifies p3.

This is a toy problem in deductive reasoning: p1 tells us that if P holds then we have ~Q and ~T, while p2 tells us that if P doesn’t hold then both Q and T must hold (to see this, keep pushing the outer negation sign of p2 inwards, swapping quantifiers and conjunctions/disjunctions along the way to get
(forall x . P(x) \/ (Q(x) & T(x))), i.e., (forall x . ~P(x) ==> Q(x) & T(x))). This allows for a case analysis based on the excluded middle: For any x, either P(x) or ~P(x). In the former case, p1 implies ~Q(x) and ~T(x), and thus
Q(x) <==> T(x). In the latter case, p2 implies Q(x) and T(x), and thus again Q(x) <==> T(x). That’s the entire proof.
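Since p1, p3, and (after pushing the negation inward) p2 each constrain every element independently, the entailment can also be confirmed by brute force over the eight possible truth assignments for (P(x), Q(x), T(x)) at a single element. A minimal sketch (mine):

```python
from itertools import product

def p1(P, Q, T):          # P(x) <==> ~(Q(x) | T(x))
    return P == (not (Q or T))

def p2_matrix(P, Q, T):   # p2 is equivalent to: forall x . P(x) | (Q(x) & T(x))
    return P or (Q and T)

def p3(P, Q, T):          # Q(x) <==> T(x)
    return Q == T

# The entailment holds iff no per-element assignment satisfies p1 and p2 while falsifying p3.
counterexamples = [(P, Q, T) for P, Q, T in product([False, True], repeat=3)
                   if p1(P, Q, T) and p2_matrix(P, Q, T) and not p3(P, Q, T)]
print(counterexamples)    # [] -- no countermodel exists, so p1 and p2 entail p3
```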

GPT-4 not only fails to prove this, but claims to disprove it with spurious counterexamples:

When the error is pointed out, GPT-4 goes on to make another, similar error:

The fix to the second error has naturally led to a third one, as the new “valid counterexample” is anything but valid: the second premise is again falsified by the given interpretation (take x = a).

To see how robust these errors are, we place 15 API calls to GPT-4, with the following chain-of-thought prompt:

Consider the following 3 formulas:
p1: (all x . P(x) <==> ~(Q(x) | T(x)))
p2: ~(exists x . ~P(x) & (~Q(x) | ~T(x)))
p3: (forall x . Q(x) <==> T(x))

Prove or disprove that p1 and p2 logically entail p3. To prove it, you must give a detailed step-by-step proof that derives p3 from p1 and p2. To disprove it, you must give a countermodel that satisfies p1 and p2 but falsifies p3. Think this out step-by-step and explain your reasoning in detail.

Eight calls were made with a temperature of 0.0001, 4 calls with 0.45, and 3 calls with 0.9. In all 15 cases, GPT-4 failed to prove the entailment and instead produced spurious counterexamples. There is still plenty of nondeterminism, in that GPT-4 produces various and sundry spurious countermodels on different API calls. A lot of them are not simply false but nonsensical from a mathematical perspective. For instance, one of the alleged countermodels (produced with a temperature of 0.0001) is the following:

We can disprove that p1 and p2 logically entail p3 by providing a countermodel.

Let’s consider a universe of discourse where x can be any natural number. Let’s define P(x), Q(x), and T(x) as follows:

P(x): x is an even number.
Q(x): x is a prime number.
T(x): x is a multiple of 3.

Now, let’s check if p1, p2, and p3 hold true under this model.

p1: (all x . P(x) <==> ~(Q(x) | T(x)))
This statement says that for all x, x is an even number if and only if x is not a prime number or x is not a multiple of 3. This holds true in our model because every even number is not a prime number (except 2) or not a multiple of 3.

p2: ~(exists x . ~P(x) & (~Q(x) | ~T(x)))
This statement says that there does not exist an x such that x is not an even number and (x is not a prime number or x is not a multiple of 3). This also holds true in our model because for any odd number (which is not even), it is either a prime number or a multiple of 3.

Neither p1 nor p2 is actually satisfied in the given interpretation, and the reasons given to the contrary (boldfaced above) are preposterous; no high school student would ever claim that every odd number is either prime or a multiple of 3. Here are other nonsensical and/or patently inconsistent assertions confidently made by GPT-4 when presenting some other countermodels:

  1. For all natural numbers x, if x is even then it is not the case that x is less than 3 or greater than 2 (meaning that if x is even then x ≥ 3 AND x ≤ 2).
  2. For all natural numbers x, if x is not even then it is either less than 3 or greater than 5.

4. Relation Closure

Here we give GPT-4 a simple binary relation and ask it to compute its symmetric and transitive closure. Here’s the exact prompt:

Consider the following set of five names: N = {Alice, Bob, Charlie, Dave, Ed}. Let R be a binary relation on N, defined as follows: R(n1,n2) iff the ASCII code of the first letter of n1 is exactly one less than the ASCII code of the first letter of n2. Output the symmetric and transitive closure of R (not two different relations, but one single relation that represents the symmetric and transitive closure of R). Think step-by-step and explain your reasoning in detail.

The relation in question is R = {(A,B), (B,C), (C,D), (D,E)} (writing ‘A’ as a shorthand for ‘Alice’, etc.). The symmetric closure of R, call it R_s, has 4 extra pairs (for a total of 8), while the full transitive closure of R_s, call it R_t, is identical to the Cartesian product of N = {A,B,C,D,E} with itself, i.e., R_t contains every single possible pair (x, y) in N x N. Accordingly, R_t contains 5 x 5 = 25 pairs.

It’s easy to see why every single pair (x, y) must be in the result. First, symmetry and transitivity together imply reflexivity, so every pair (x, x) must be in the result. For x != y, we have either x < y or y < x, where we write “x < y” to mean that the ASCII code of the first letter of x is less than the ASCII code of the first letter of y. (Thus, this < relation is precisely the transitive closure of R.) So, if x < y, we can prove that (x, y) must be in the result R_t by repeatedly composing R with itself, e.g., we must have (A,D) in R_t because (A,B) and (B,C) are in R, so (A,C) must be in R_t. But we also have (C,D) in R_t, so (A,C) and (C,D) give (A,D). Finally, if y < x, we compose pairs from the symmetric closure of R (e.g., (C,A) must be in R_t because (C,B) and (B,A) are in it).

Let’s see how GPT-4 handles this:

Of course, this is still not the right result and does not “fully adhere to the properties of transitivity and symmetry”, as it’s still missing 6 pairs (e.g., (Alice,Ed)).

Ten API calls to GPT-4 with random temperatures produced ten incorrect results, albeit with different sorts of errors in them due to nondeterminism. (For instance, one answer produced {(Bob, Alice), (Alice, Bob)} as the result, while another produced {(Charlie, Bob), (Bob, Charlie), (Dave, Charlie), (Charlie, Dave), (Ed, Dave), (Dave, Ed), (Bob, Dave), (Dave, Bob), (Bob, Ed), (Ed, Bob), (Charlie, Ed), (Ed, Charlie)}.)

Would it help to instead ask GPT-4 to generate a Python program to compute the requested closure? It wouldn’t:

The produced code also generates incorrect results (e.g., missing pairs such as (A,D)). The problem, of course, is in the implementation of transitive_closure. It’s surprising that GPT-4 could not produce a proper implementation given that this is an elementary programming problem that is very widely discussed on the web.
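For comparison, a correct implementation takes only a few lines. Here is a minimal sketch (mine, not the code GPT-4 produced) that forms the symmetric closure and then composes pairs until a fixed point is reached:

```python
def symmetric_transitive_closure(pairs):
    """Smallest symmetric and transitive relation containing `pairs`."""
    closure = set(pairs) | {(y, x) for x, y in pairs}    # symmetric closure
    changed = True
    while changed:                                       # compose to a fixed point
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

R = [("Alice", "Bob"), ("Bob", "Charlie"), ("Charlie", "Dave"), ("Dave", "Ed")]
print(len(symmetric_transitive_closure(R)))    # 25 -- every pair in N x N, as argued above
```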

5. Falsification

Here’s another simple logic problem:
Let p1-p3 be the following propositions:
p1: ~ (exists x . P(x) & Q(x))
p2: (forall x . Q(x) ==> T(x))
p3: (forall x . P(x) ==> ~ T(x))

Prove or disprove the following claim: p1 and p2 logically entail p3.
Think this out step-by-step and explain your reasoning in detail.

The entailment does not hold, and a countermodel is trivial: Take the universe D = {a} and take as true P(a), ~Q(a), and T(a). This satisfies p1 and p2 but falsifies p3. But GPT-4 claims that p3 does follow from p1 and p2:

Using the API produces varying outputs, but they are consistently incorrect. In particular, 10 different calls to GPT-4, with different temperatures, all result in the same negative answer: GPT-4 says that p3 does not follow from p1 and p2 and proceeds to generate a “countermodel.” In all 10 cases, however, the provided countermodels are not just spurious; they are often incoherent. Here is one example:

Let’s consider a universe of discourse where x can be any natural number. Define P(x), Q(x), and T(x) as follows:

P(x): x is an even number.
Q(x): x is a positive number.
T(x): x is a number greater than 0.

Now, let’s check the propositions:

p1: ~ (exists x . P(x) & Q(x)) — This proposition states that there is no number that is both even and positive. This is false because there are many positive even numbers (2, 4, 6, etc.).

p2: (forall x . Q(x) ==> T(x)) — This proposition states that if a number is positive, then it is greater than 0. This is true for all natural numbers.

So, our model satisfies both p1 and p2.

How p1 can be satisfied when it’s actually falsified (and explicitly acknowledged as “false because there are many positive even numbers”) is a mystery.

This is another striking reminder that LLMs can arrive at the right conclusion by accident — for entirely wrong reasons.
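For reference, the trivial countermodel described earlier can be verified mechanically. A minimal sketch (mine), using a one-element domain {a} and encoding each predicate by its truth value at a:

```python
# One-element domain D = {a}; each predicate is just its truth value at a.
P, Q, T = True, False, True

p1 = not (P and Q)        # ~(exists x . P(x) & Q(x))
p2 = (not Q) or T         # forall x . Q(x) ==> T(x)
p3 = (not P) or (not T)   # forall x . P(x) ==> ~T(x)

print(p1, p2, p3)         # True True False: p1 and p2 hold while p3 fails,
                          # so p1 and p2 do not entail p3
```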

6. Model-Based Reasoning

Here we give GPT-4 a particular model and ask it to examine the truth of a biconditional formula in this model. It is easy to see that the formula is false in the given model; for instance, take x = 10 and y = 1. Then we have
y = s(x), and hence the right-hand side of the biconditional holds. Yet obviously the left-hand side, x < y, does not hold, since 10 is not less than 1.
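The model itself appears only in the screenshot above, but the surrounding text indicates a domain {1, ..., 10} with R interpreted as “less than”, s as the successor that wraps around (so that s(10) = 1), and a biconditional of roughly the form R(x,y) <==> [y = s(x) \/ exists z . R(x,z) & R(z,y)]. Under that reconstruction (my assumption, not a quote from the prompt), a brute-force check finds exactly the failing pair described above:

```python
D = range(1, 11)                        # assumed domain {1, ..., 10}
R = lambda x, y: x < y                  # assumed interpretation of R
s = lambda x: 1 if x == 10 else x + 1   # assumed wrap-around successor

def rhs(x, y):                          # right-hand side of the biconditional
    return y == s(x) or any(R(x, z) and R(z, y) for z in D)

failures = [(x, y) for x in D for y in D if R(x, y) != rhs(x, y)]
print(failures)    # [(10, 1)]: the right-hand side holds, but 10 < 1 does not
```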

How about repeated calls? Out of 10 calls to the GPT-4 API with various temperatures, 9 of them incorrectly claim that the formula does hold in the given model, and give various erroneous proofs to that effect. Only 1 time does the model make the correct claim that the formula is false, but even in that case it does so for the wrong reasons (in particular, it claims that the left-to-right direction is falsified):

One counterexample would be to take x=1 and y=10. In this case, while it’s true that R(1,10) since 1 < 10, it’s not true that 10 = s(1) or that there exists a z such that 1 < z and z < 10. You could choose z = 2, 3, 4, …, 9, but in each case z is less than 10 rather than greater, contradicting the second part of the formula.

Formulating the problem in an explicit chain-of-thought style (whereby we ask it to think the problem out in a “step-by-step” way) does not help much. In that case, only 1 time out of 10 does the model get the correct answer for the right reason. The model does identify x=10 and y=1 as a corner case a couple of other times, but its reasoning is hopelessly muddled in those cases (e.g., with GPT-4 making assertions such as “x < y holds true for x=10, y=1”).

7. Trichotomy Violations

It is a fundamental fact about numbers (natural numbers, integers, rationals, and reals) that they obey the following trichotomy law: for any numbers x and y, exactly one of the following three possibilities holds:
x < y, x = y, or x > y. Here we ask GPT-4 to focus on an initial segment of the positive integers: numbers 1 through 10. Writing R(x,y) for x < y, we ask it to prove or disprove the following: for all numbers x and y in this range, R(x,y) iff ~ R(y,x). That is, x is less than y if and only if y is not less than x.

This is easily disproved by taking x = y, say x = y = 2. If the biconditional held, then ~R(2,2), which is true, would imply R(2,2) — a contradiction. [Indeed, if the biconditional held then trichotomy would not, as we could conclude that for any two numbers x and y, either x < y or y < x. This is because the conditional ~R(y,x) ==> R(x,y) is equivalent to R(x,y) \/ R(y,x).]
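The failure is also easy to confirm exhaustively; a short check (mine) over the domain {1, ..., 10}:

```python
D = range(1, 11)
R = lambda x, y: x < y

# Find all pairs where the claim R(x,y) <==> ~R(y,x) fails.
violations = [(x, y) for x in D for y in D if R(x, y) != (not R(y, x))]
print(violations)    # exactly the ten diagonal pairs (1,1), ..., (10,10)
```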
But GPT-4 happily offers a “proof”:

10 different API calls all produced the same result: an affirmation that the claim holds, backed up by incorrect proofs. In some cases, the possibility that x = y was explicitly considered but dismissed as irrelevant. For instance, at the end of the first proof (with temperature 0.0001), GPT-4 writes the following:

However, there is one case we have not considered: when x = y. In this case, neither R(x,y) nor R(y,x) holds, because x is not less than y and y is not less than x. But the claim is still true in this case, because “R(x,y) holds if and only if R(y,x) does not hold” is vacuously true when neither R(x,y) nor R(y,x) holds. So, the claim is true for all x, y in D.

Of course, that claim is far from “vacuously true when neither R(x,y) nor R(y,x) holds.” It is, in fact, false.

And from the fourth proof attempt:

This claim is a formal way of expressing the trichotomy property of the integers, which states that for any two integers, one and only one of the following holds: the first is less than the second, the first is equal to the second, or the first is greater than the second. In this case, since we are only considering the first 10 positive integers, the “equal” case does not apply, so we are left with the “less than” and “greater than” cases, which are exactly what the claim expresses.

(Of course the equal case applies regardless of the fact that we’re considering an initial segment of the positive numbers.)
And from the 10th attempt:

Conversely, if ~R(y,x) holds, it means that y is not less than x. In a set of integers, this implies y must either be greater than or equal to x. However, because we’re only dealing with the relation “<”, the only possibility left is for x to be less than y. Therefore, R(x,y) must hold. Hence, ~R(y,x) implies R(x,y).

8. What’s in a Name?

This is a simple little problem from an old puzzle book (first published in 1957) by C. R. Wylie Jr: 101 Puzzles in Thought and Logic, a Dover publication. It is almost endearingly anachronistic:

Mr. and Mrs. Smith and their two children form a typical American family. According to one of their more talkative neighbors:
George and Dorothy are blood relatives.
Howard is older than George.
Virginia is younger than Howard.
Virginia is older than Dorothy.
If two and only two of these statements are true, what is the first name of each member of the family?

Let’s label these four statements for easy reference:
(S1) George and Dorothy are blood relatives.
(S2) Howard is older than George (H > G)
(S3) Virginia is younger than Howard (V < H)
(S4) Virginia is older than Dorothy (V > D).

This problem makes a good testbed because even though the required reasoning is entirely deductive, it is also enthymematic, insofar as it relies on certain common-sense (and tacitly assumed) knowledge about family relationships.

The problem is straightforward; it does not require chains of inference deeper than three or four steps. To solve it, it is tempting to start by considering each possible choice of two true statements out of the four. That approach would work but would be tedious. A more head-on logical analysis can solve the puzzle more directly. Below is one possible solution, presented as a deductive derivation of the four names assigned to the four family roles.

By way of contradiction, assume that Howard is the father. Then George must be the son, so S2 holds. Now the mother is either Dorothy or Virginia. Let’s assume the former (so that Virginia is the daughter). Then S1 is true, hence ~S3, meaning V >= H, which is impossible (the daughter can’t be older than the father). Let’s then assume that Virginia is the mother and Dorothy the daughter. In that case S4 must hold, hence ~S1 and ~S3. But ~S1 means that George and Dorothy are not blood relatives, which is also impossible (since under our current assumptions they are brother and sister). Hence, either case leads to a contradiction and we can conclude that Howard is not the father. Thus, George is the father and Howard the son. Hence, ~S2. Assume now that Dorothy is the mother (and hence Virginia is the daughter). Then ~S1 and hence S3 & S4, but S4 (V > D) is a contradiction. Hence, Dorothy must be the daughter and Virginia the wife. QED

Summary: Father: George. Mother: Virginia. Son: Howard. Daughter: Dorothy. Truth values of the 4 statements: S1, ~S2, ~S3, S4.
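The puzzle can also be settled by brute force over the four possible assignments of names to roles, using the tacit common-sense assumptions mentioned above (Howard and George are male names, Virginia and Dorothy female ones, parents are older than their children, and parent-child and sibling pairs are blood relatives while spouses are not). Here is a minimal sketch under those assumptions:

```python
from itertools import permutations

people = ["George", "Howard", "Virginia", "Dorothy"]
solutions = set()

# The male names fill the father/son roles, the female names the mother/daughter roles.
for father, son in [("George", "Howard"), ("Howard", "George")]:
    for mother, daughter in [("Virginia", "Dorothy"), ("Dorothy", "Virginia")]:
        # Try every age ordering in which both parents are older than both children.
        for ranking in permutations(people):
            age = {p: r for r, p in enumerate(ranking)}    # larger value = older
            if not all(age[p] > age[c] for p in (father, mother) for c in (son, daughter)):
                continue
            s1 = not (father == "George" and mother == "Dorothy")   # spouses are not blood relatives
            s2 = age["Howard"] > age["George"]
            s3 = age["Virginia"] < age["Howard"]
            s4 = age["Virginia"] > age["Dorothy"]
            if [s1, s2, s3, s4].count(True) == 2:
                solutions.add((father, mother, son, daughter))

print(solutions)    # {('George', 'Virginia', 'Howard', 'Dorothy')}
```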

Let’s see how GPT-4 fares, with an explicit chain-of-thought prompt urging it to think out the problem step-by-step and explain its reasoning in detail.

In addition to the playground, this problem was programmatically posed to GPT-4, via the OpenAI API, a total of 30 times. The first 10 calls were made to the regular GPT-4 model (with the 8K context), while the subsequent 20 calls were made to the 32k version of GPT-4. The model got wrong answers consistently, but in a rather random and nondeterministic fashion. The results are shown in the table below.


The correct answer was obtained 1 out of the 30 times, but even that was a random occurrence, because the reasoning that was used to derive the correct answer was deeply flawed, featuring a number of bizarre errors. GPT-4’s reasoning starts out as follows:

The possibilities are:
1. George is Mr. Smith and Dorothy is the daughter, Howard is the son, and Virginia is Mrs. Smith.
2. George is Mr. Smith, Dorothy is Mrs. Smith, Howard is the son, and Virginia is the daughter.
3. Dorothy is Mrs. Smith, George is the son, Howard is Mr. Smith, and Virginia is the daughter.
4. Dorothy is Mrs. Smith, George is the son, Howard is the daughter, and Virginia is Mr. Smith.

The first statement given “George and Dorothy are blood relatives” can’t be true since we are considering Mr. and Mrs. Smith as George and Dorothy and we typically wouldn’t consider a married couple blood relatives.

So let’s focus on the other three statements:

1. Howard is older than George.
2. Virginia is younger than Howard.
3. Virginia is older than Dorothy.

So now let’s look at the four possibilities we listed above considering the other three statements:

1. Howard is older than George — FALSE.
Virginia is younger than Howard — TRUE.
Virginia is older than Dorothy — TRUE.
This possibility has two true statements which fits the requirements.

2. Howard is older than George — TRUE.
Virginia is younger than Howard — TRUE.
Virginia is older than Dorothy — FALSE.
Again this possibility has two true statements which fits the requirements.

3. Howard is older than George — TRUE.
Virginia is younger than Howard — FALSE.
Virginia is older than Dorothy — TRUE.
This possibility does not fit the requirements, as three statements are true.

4. Howard is older than George — FALSE.
Virginia is younger than Howard — FALSE.
Virginia is older than Dorothy — FALSE.
This possibility does not fit as all statements are false.

So, the possibilities are 1 and 2. However, in the second case, it cannot be true because if Mr. and Mrs. Smith are George and Dorothy, that would mean the children are Howard and Virginia. But the second statement says “Virginia is younger than Howard.” But as per the rule in most Western cultures, the elder daughter can’t have a masculine name and the younger son can’t have a feminine name.

So by process of elimination, the first possibility is the most likely correct one. George is Mr. Smith, Dorothy is the daughter, Howard is the son, and Virginia is Mrs. Smith.

GPT-4 starts by erroneously knocking S1 out of contention on the grounds that “we are considering Mr. and Mrs. Smith as George and Dorothy and we typically wouldn’t consider a married couple blood relatives”. But of course whether or not George and Dorothy are Mr. and Mrs. Smith is completely unsettled at this point. Nor are we in the scope of any assumption to that effect. So precluding that possibility is a logical error.

Then, farther down, GPT-4 rules out possibility 2 on the following nonsensical grounds:

However, in the second case, it cannot be true because if Mr. and Mrs. Smith are George and Dorothy, that would mean the children are Howard and Virginia. But the second statement says “Virginia is younger than Howard.” But as per the rule in most Western cultures, the elder daughter can’t have a masculine name and the younger son can’t have a feminine name.

To say that this is deeply confused would be an understatement. But it is only due to these logical errors that the system eventually managed to randomly arrive at the correct answer in that one case.

An analysis of the remaining 29 transcripts shows that every single response contains either absurd claims or flagrant reasoning errors (and often both). We’ll demonstrate with two example snippets but the reader can refer to the posted file for more details.

API Call #1: This leaves us with the third and fourth statements as the true ones. So, Virginia is younger than Howard and Virginia is older than Dorothy. This means that Howard is the oldest (the father), Virginia is the middle (the daughter), and Dorothy is the youngest (the son). So, the first names of each member of the family are: Mr. Smith is Howard, Mrs. Smith is Virginia, their daughter is Dorothy, and their son is George.

API Call #2: So, the two true statements are: 2. Howard is older than George. 4. Virginia is older than Dorothy. Therefore, the first names of each member of the family are: Mr. Smith: George Mrs. Smith: Dorothy Child 1: Howard Child 2: Virginia

9. Midwestern Colors

Consider the four states of Nebraska, Iowa, Kansas, and Missouri:

As you can see, GPT-4 knows full well the adjacency relations that obtain among these four states:

Then let’s ask it to color these four states so that adjacent states receive different colors, using only 3 colors: red (R), blue (B), and green (G). Solutions are entirely trivial for humans to find, e.g. make Missouri red, both Iowa and Kansas green, and Nebraska blue. GPT-4 manages to produce an incorrect solution:

As you can see, GPT-4 makes the exact same mistake (giving Missouri and Nebraska the same color) not just twice in a row, but three times in a row, even after having the error explicitly pointed out to it twice in a row. Next, using programmatic access via the API, we send the exact same prompt to GPT-4 10 times: 4 times with a tiny temperature (0.0001), 3 times with a medium temperature (0.45), and 3 times with a high temperature (0.9). GPT-4 provides correct solutions 6 out of those 10 times. Regarding nondeterminism/randomness: a total of 4 distinct solutions are produced across those 10 API calls.
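For reference, the space of valid 3-colorings is tiny and can be enumerated directly. A minimal sketch (mine); the adjacency list below reflects the actual borders among the four states:

```python
from itertools import product

states = ["Nebraska", "Iowa", "Kansas", "Missouri"]
adjacent = [("Nebraska", "Iowa"), ("Nebraska", "Kansas"), ("Nebraska", "Missouri"),
            ("Iowa", "Missouri"), ("Kansas", "Missouri")]
colors = ["R", "B", "G"]

valid = []
for assignment in product(colors, repeat=len(states)):
    coloring = dict(zip(states, assignment))
    if all(coloring[a] != coloring[b] for a, b in adjacent):
        valid.append(coloring)

print(len(valid))    # 6 valid colorings with 3 colors
print(valid[0])      # e.g. Nebraska=R, Iowa=B, Kansas=B, Missouri=G
```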

10. Change to Spare

Here is another simple puzzle (from Alex Bellos’s book, “Perilous Problems for Puzzle Lovers”):

I have 26 coins in my pocket. If I were to take out 20 coins from the pocket at random, I would have at least one nickel, at least two dimes, and at least five quarters. How much money is in my pocket?

Solution: Let N, D, and Q be the numbers of nickels, dimes, and quarters in my pocket, respectively. If I am guaranteed to have at least one nickel whenever I randomly draw 20 coins from my pocket, then we must have
N >= 7 (for even with N = 6 it’d be possible to get no nickels at all in a random draw of 20 coins). Likewise, we must have D ≥ 8 and Q ≥ 11. But these inequalities must be equalities, because 7 + 8 + 11 = 26. So the total amount of money is 7 * 5 + 8 * 10 + 11 * 25 = 390 cents = $3.90. QED
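The same argument can be confirmed by brute force. Here is a minimal sketch (mine) that enumerates all coin distributions (allowing coins of any other kind as well) and keeps those consistent with the stated guarantees; the distribution, and hence the total, turns out to be forced:

```python
TOTAL, DRAW = 26, 20

def guaranteed(count, k):
    """At least k coins of this kind appear in every draw of DRAW coins out of TOTAL."""
    worst_case = DRAW - (TOTAL - count)   # coins of this kind in the worst-case draw
    return worst_case >= k

solutions = []
for nickels in range(TOTAL + 1):
    for dimes in range(TOTAL + 1 - nickels):
        for quarters in range(TOTAL + 1 - nickels - dimes):
            others = TOTAL - nickels - dimes - quarters    # coins of any other kind
            if guaranteed(nickels, 1) and guaranteed(dimes, 2) and guaranteed(quarters, 5):
                solutions.append((nickels, dimes, quarters, others))

print(solutions)                          # [(7, 8, 11, 0)] -- the distribution is forced
n, d, q, _ = solutions[0]
print(5 * n + 10 * d + 25 * q, "cents")   # 390 cents = $3.90
```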

Let’s see how GPT-4 does:

Experimenting with the API gives interesting results. When we ask GPT-4 to either deduce the exact amount of money OR to prove that this amount cannot be determined, it consistently says (10 times out of 10 API calls) that the amount cannot be determined, though it never actually gives a proof of that assertion. But when we only ask it to deduce the exact amount of money, period, it provides a number of different answers across 20 API calls, such as $1.68, $1.96, $6.00, $5.95, $2.65, and $1.62. It’s also noteworthy that even though the prompt just asks GPT-4 to deduce the exact amount of money, GPT-4 repeatedly and gratuitously tries to “maximize” the amount of money in the pocket, as if this were an optimization problem. For instance, from one run:

Since the problem doesn’t specify what these coins are, we should assume they are the coin of highest value (quarters) to maximize the total value.

And from another run:

Since we want to find the maximum amount of money possible, we should assume that these remaining coins are all quarters (the highest value coin).

Never mind that it was nowhere stated (or even implied) that “we want to find the maximum amount of money possible.”

Finally, when we ask GPT-4 to either deduce the exact amount or else to prove that the amount cannot be deduced by producing two coin distributions consistent with the given information but yielding different sums, GPT-4 consistently chooses the latter option, producing various sets of scenarios that are, in fact, inconsistent with the given information. GPT-4–32k does no better than the regular version with the 8K context.

11. TWO + TWO = FOUR

A cryptarithm is a puzzle involving a simple arithmetic operation <op> between two or more numbers n_1, …, n_k whose digits are represented by unique letters. The operator <op> is typically addition (+), subtraction (-), or multiplication (*). A result r for the operation is also specified. The objective is to assign digits to letters in a way that satisfies the equation
n_1 <op> n_2 … <op> n_k = r. Different letters must receive different digits, and leading letters cannot be assigned the digit ‘0’. Consider, for instance, the well-known cryptarithm TWO + TWO = FOUR. One solution here is
T = 9, W = 2, O = 8, F = 1, U = 5, R = 6, as can be easily verified (since
928 + 928 = 1856). There are a few more solutions.

Cryptarithms are essentially constraint-satisfaction problems. They range in difficulty from very easy to quite challenging. The particular cryptarithm in this problem, TWO+TWO=FOUR, is fairly easy (e.g., see a solution being derived in a couple of paragraphs here, and several solutions discovered by high school students here). (It is also worth noting that several solutions of this particular cryptarithm are discussed on the Internet.)
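Because the puzzle is such a small constraint-satisfaction problem, all of its solutions can be found by brute force in a few lines. A minimal sketch (mine):

```python
from itertools import permutations

solutions = []
for T, W, O, F, U, R in permutations(range(10), 6):   # distinct digits for distinct letters
    if T == 0 or F == 0:                               # leading letters cannot be 0
        continue
    two = 100 * T + 10 * W + O
    four = 1000 * F + 100 * O + 10 * U + R
    if two + two == four:
        solutions.append(f"{two} + {two} = {four}")

print(solutions)    # includes '928 + 928 = 1856' among the valid assignments
```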

There are many heuristics and shorthands that can be used to solve cryptarithms, widely shared on the Internet and hence featuring in GPT-4’s pretraining data. For example, in an addition problem like TWO + TWO = FOUR, the leading digit F of the result must be 1, because it can only come from the carry out of the hundreds column, and that carry is at most 1. Indeed, if we ask GPT-4 to list heuristics and tips for solving cryptarithms, it obliges with more than a dozen of them, such as:

Elimination Process: As you determine the digit for one letter, eliminate that digit as a possibility for other letters.

(Interestingly, however, some of the given heuristics are incorrect, e.g.:

If in an addition operation two different letters sum up to a third letter without any carry, one of the original letters must be zero and the other must be the value of the third letter.

This is clearly not true, since, e.g., 2 + 3 = 5. )

At any rate, the claim here is not that this is necessarily an easy problem to solve. We can reasonably expect GPT-4 to be unable to find a solution. But we want to examine its reasoning process and see what conclusions it derives along the way.

GPT-4 starts by deriving a number of incorrect conclusions, then starts experimenting with some value assignments based on those conclusions, only to finally declare with confidence that no solution exists. When it is asked to prove that no solution exists, it happily produces a flawed proof.

When GPT-4 is admonished and told that the puzzle does have solutions, it changes its tune and comes up with a patently invalid solution. It’s patently invalid because it represents FOUR as 862, when the number corresponding to FOUR should obviously have 4 digits (one for each of the 4 letters, ‘F’, ‘O’, ‘U’, and ‘R’).

This last screen is remarkable. GPT-4 is told explicitly that FOUR should be 8162 according to its own solution. While acknowledging the point, it proceeds to reiterate the exact same mistake. Let’s see what happens when we bring that up yet one more time:

Out of 40 API calls (with various temperatures), 29 resulted in GPT-4 “proving” that no solution exists, while the rest produced patently invalid solutions. It is worth repeating that GPT-4 is not necessarily expected to come up with a correct solution here. The objective of this particular exercise is to observe how GPT-4 goes about exploring the problem and to examine its reasoning process. The results are not encouraging. Here are some sample initial thoughts of GPT-4 (taken from 14 different API responses):

  1. Since TWO + TWO = FOUR, it means that the sum of two identical two-digit numbers gives a four-digit number. This is only possible if T=1 because it’s the only way a carry can occur from the addition of the two identical numbers TWO and TWO.
  2. First, since TWO + TWO equals to FOUR which is a four-letter word, it is obvious that T must equals to 1. Because, the only way a sum of 2 three-digit numbers can turn into a four-digit number is when there is a carry from the 2nd place to the 3rd. Now, we have ONE + ONE = FOUR.
  3. Since TWO + TWO equals FOUR, we can see that the result should be a 4-digit number. This means that T + T must be 10 or more. So, T must be 5 because T is a single digit number.
  4. We can start by noticing that we are adding two identical two-digit numbers (TWO + TWO) together to result in a four-digit number (FOUR). This implies that the resulting number must be in the 2000s (since it can’t reach 3000), meaning that T must equal 1.
  5. Firstly, since TWO + TWO = FOUR, it is clear that F must be equal to 1 and T must be equal to 2 (as it is the only way to make a four-digit number from the sum of two two-digit numbers). Now the sum looks like this: 2W2+2W2 = 1OUR.
  6. Secondly, since TWO and TWO gives a carryover to form ONE (1???) in FOUR (1OUR), it means that O > 4 for a carryover to occur and since we are adding two identical numbers, O could be 5 or 0. But from our rules, we stated that the leading digit cannot be 0 and since O is a leading digit in TWO, therefore, O = 5.
  7. From the problem, we can see that when we add TWO and TWO together, the sum could possibly be a three digit number. This means that “T” + “T” must be greater than 10. The only way this can happen if we’re working in base 10 is if “T” is 5 or greater. Furthermore, since TWO + TWO = FOUR is supposed to lead to a four letter word, T must be 1.
  8. First, we can see that T + T must equal to either 10 or 20 because it’s a two-digit number added to another two-digit number to get a four-digit number. This means that T must be 1 or 2. However, since FOUR is a four-digit number, T must be 1. So, T = 1.
  9. Since TWO + TWO = FOUR, it means that the sum of two identical two-digit numbers gives a four-digit number. This is only possible if T=1. So, T=1.
  10. Since TWO + TWO equals FOUR, the result must be a 4-digit number. This means that T + T must be equal to 10 or more. Therefore, T must be 5 because it’s the only number that when doubled remains a single digit (5 + 5 = 10).
  11. First, we can see that T + T must equal either 10 or 20 because adding the same two-digit number to itself will result in a number that is either in the tens or hundreds. Since we can’t have a number starting with zero, T must be 1, meaning F must be 2.
  12. Since TWO + TWO = FOUR, it means that the sum of two identical two-digit numbers gives a four-digit number. This is only possible if T=1. Because when we add the same two-digit numbers, the result can only be a three-digit number unless the first digit is 1. In that case, the result can be a four-digit number.
  13. First, we can see that T + T = F (or 2T = 10 + F if we carry over from the next column) which means T must be 5 because F can’t be more than 9 and T can’t be 0 (since it’s the first digit of a number). So, F = 0.
  14. Since we are adding two identical numbers together, the maximum value that ‘T’ can take is 4. If ‘T’ would be 5 or more, then ‘F’ would be 10 or more, which is impossible. So, possible values for ‘T’ are 1, 2, 3, or 4. But ‘F’ has to be an even number, so ‘T’ cannot be 1 or 3, because adding those to themselves would result in an odd number. The possible values left for ‘T’ are 2 and 4. But if ‘T’ is 4, ‘F’ is 8. This would leave only 9 for ‘O’, and nothing for ‘W’ because we can’t have 0 or a number we’ve already used. Hence, ‘T’ must be 2 and ‘F’ must be 4.

All of the above verge on incoherence. The problem is not that GPT-4 doesn’t possess elementary arithmetic facts, but rather that it cannot marshal those facts properly to work out a correct solution. Instead, while the responses at first glance appear to be rationally structured, on closer inspection they are shown to be a mishmash of grossly false claims and reasoning errors.

12. Sudoku for Beginners

Sudoku makes for a good reasoning testbed for a number of reasons. As the eminent psychology-of-reasoning researcher Philip Johnson-Laird puts it in his book “How We Reason”: “Sudoku puzzles are natural experiments in reasoning”. They are also “the first popular puzzles that depend on pure deduction”. Johnson-Laird points out that all three rules of Sudoku depend on quantifiers: “every”, “each”, and “exactly one”. Nevertheless, “Many of us, even though we have had no training in logic, are able to make deductions that depend on multiple quantifiers. We enjoy challenging deductions of this sort. This fact is contrary to the view that without logical training we are incapable of valid deductions based on quantifiers”.

As is well-known, Sudoku puzzles come in different levels of difficulty: easy, medium, hard, expert, and evil. One would expect an AI model that gives us “AGI glimpses” to be able to solve an easy Sudoku puzzle. The website sudoku.com can auto-generate puzzles at all levels of difficulty. Here is one at the easiest possible level:

This is indeed an easy puzzle to solve. More or less anyone should be able to complete this puzzle in a reasonable amount of time. A natural place to start is the most highly constrained area, which is the leftmost middle 3x3 box. That box only has two numbers missing, 4 and 5. Now 4 cannot go between 2 and 7, because that would clash with the 4 in the third column, so 5 must go there and 4 goes under 9 and 8. Now that entire square is filled. How we choose to proceed from there is up to us, but at that point the 6th row only has two numbers missing, 9 and 2, and the 9 in the 6th column means that the 2 must go in the 6th column of the 6th row, while the 9 goes into the 8th column of the 6th row. Continuing this way, the puzzle can be solved, depending on one’s level of expertise, from a couple of minutes for an expert to 20 minutes or so for a beginner. Let’s see how GPT-4 fares on it:

At this point GPT-4 goes on to produce yet another incorrect solution. But it is right about this much: “a correct solution requires careful step-by-step logic”. Indeed, and that is precisely what GPT-4 is unable to provide.
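The puzzle grid itself appears only as an image above, so it is not reproduced here, but the kind of careful step-by-step search required is easy to state precisely. Here is a minimal backtracking sketch (mine) that solves any standard 9x9 Sudoku given as a list of rows, with 0 marking a blank cell:

```python
def allowed(grid, r, c, v):
    """Check the row, column, and 3x3 box constraints for placing v at (r, c)."""
    if v in grid[r] or v in (grid[i][c] for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[br + i][bc + j] != v for i in range(3) for j in range(3))

def solve(grid):
    """Fill a 9x9 Sudoku grid in place (0 = blank); return True if a solution is found."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for v in range(1, 10):
                    if allowed(grid, r, c, v):
                        grid[r][c] = v
                        if solve(grid):
                            return True
                        grid[r][c] = 0    # undo and try the next value
                return False              # no value fits in this cell: backtrack
    return True                           # no blank cells left: solved
```

Feeding it the puzzle from the screenshot (encoded with 0s for the blanks) produces a valid completion almost instantly; the point is simply that the deduction GPT-4 is asked to carry out is entirely mechanical.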

Calling GPT-4–32k via its API, we see that not only does GPT-4 consistently fail to solve the puzzle, repeatedly coming up with bogus solutions that it claims to have verified, but it also has what appear to be perceptual problems. It often makes claims like “the third row already contains a 5” when in fact the third row has no 5. These are empirical hallucinations, but they can easily derail a chain of reasoning.

Readers can inspect the logs for full details, but I will illustrate how badly broken all the responses are by previewing the beginnings of the first two responses.

Response 1

Let’s start solving the puzzle row by row, column by column, and box by box.

First, we can fill in the 3rd row, 7th column with a 3 because it’s the only number missing in the 3rd row.

Next, we can fill in the 4th row, 5th column with a 5 because it’s the only number missing in the 4th row.

Then, we can fill in the 4th row, 6th column with a 8 because it’s the only number missing in the 4th row.

Now, we can fill in the 5th row, 3rd column with a 4 because it’s the only number missing in the 5th row.

A number of things are noteworthy here. First, note how counterintuitive this approach is. Very few humans would start solving row by row and column by column. Instead, their attention would naturally float over to the most constrained parts of the puzzle. Second, the rationales given for the moves are outrageously false (“because it’s the only number missing in such-and-such row”, when in fact the row in question has several numbers missing, or claiming to put a number in row r and column c when in fact (r,c) already has a number in it given by the problem).

The final solution is:

---------------------
1 9 2 | 4 6 7 | 3 5 8
7 3 6 | 2 8 9 | 1 4 5
4 1 8 | 5 2 3 | 4 6 9
---------------------
9 6 2 | 1 5 8 | 7 4 3
8 1 4 | 9 7 3 | 6 2 5
2 3 7 | 6 5 2 | 8 5 1
---------------------
5 8 6 | 7 3 4 | 9 1 3
1 7 9 | 3 2 8 | 7 6 4
3 2 4 | 8 1 5 | 6 7 2
---------------------

This solution adheres to all the constraints of a Sudoku puzzle. Each row, column, and 3x3 box contains all the digits from 1 to 9 exactly once.

Response 2

Let’s start solving the puzzle row by row, from top to bottom, and left to right.

In the first row, the missing numbers are 1, 2, and 3. The number 3 can’t be in the first or seventh position because there are already 3s in those columns. So, the first row must be: 3 9 1 | 4 6 7 | 5 2 8.

In the second row, the missing numbers are 1, 2, 3, 4, 5, 6, 8, and 9. The number 9 can’t be in the second or eighth position because there are already 9s in those columns. So, the second row must be: 7 4 5 | 2 8 9 | 1 3 6.

After a large number of outlandish “deductions,” the produced conclusion is this:

The final solution is:

---------------------
3 9 1 | 4 6 7 | 5 2 8
7 4 5 | 2 8 9 | 1 3 6
6 5 8 | 3 1 2 | 4 7 9
---------------------
9 6 2 | 1 8 7 | 3 4 5
8 1 9 | 5 7 3 | 6 2 4
4 3 7 | 6 5 2 | 8 9 1
---------------------
5 8 6 | 7 2 4 | 9 1 3
1 7 9 | 3 6 5 | 2 8 4
3 2 4 | 8 7 9 | 6 5 1
---------------------

The pattern continues with all remaining attempts.

Conclusions

These results show that when the model is fixed, GPT-4’s failures on simple reasoning problems are robust, even in the presence of pervasive nondeterminism.

While the discussion in the preprint and this article has focused on GPT-4, it is fair to ask whether any similar system could possibly crack reasoning, in the sense of being able to reliably solve nontrivial reasoning problems. That is an interesting and controversial question. I believe that this is unlikely, and the preprint already briefly cites some considerations (such as complexity-theoretic results) in support of that view. But there is a lot more to be said here. Why exactly is reasoning such a challenge for deep learning? I will be following up this article with another writeup providing an in-depth exploration of that question, so stay tuned.
