Can GPT-3 Make Analogies?

Melanie Mitchell
Aug 6, 2020

Copycat Letter-String Analogies

In the early 1980s, Douglas Hofstadter introduced the “Copycat” letter-string domain for analogy-making. Here are some sample analogy problems:

If the string abc changes to the string abd, what does the string pqr change to?

If the string abc changes to the string abd, what does the string ppqqrr change to?

If the string abc changes to the string abd, what does the string mrrjjj change to?

If the string abc changes to the string abd, what does the string xyz change to?

If the string axbxcx changes to the string abc, what does the string xpxqxr change to?

The purpose of this “microworld” (as it was called back then) is to model the kinds of analogies humans make in general. Each string in an analogy problem represents a miniature “situation”, with objects, attributes, relationships, groupings, and actions. Figuring out answers to these problems, it was claimed, involves at least some of the mechanisms of more general analogy-making, such as perceiving abstract roles and correspondences between roles, ignoring irrelevant aspects, and mapping the gist of one situation to a different situation. In Chapter 24 of his book Metamagical Themas, Hofstadter wrote a long, incisive analysis of human analogy and how the Copycat domain captures some key aspects of it. The letter-string domain is deceptively simple — these problems can capture a large range of complex issues in recognizing abstract similarity. Hofstadter and his students (myself among them) came up with thousands of different letter-string analogies, some of them extraordinarily subtle. A small collection of examples is given at this link.
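To make the domain concrete, here is a toy sketch (in Python, and emphatically not Copycat's architecture) of the most literal rule a solver might apply: “replace the rightmost letter with its alphabetic successor.” It handles the first sample problem but gives an answer most people reject on the second, which is exactly what makes the domain interesting.

```python
# Toy illustration (not Copycat): the literal rule
# "replace the rightmost letter with its alphabetic successor".

def successor(ch):
    """Next letter in the (non-circular) alphabet."""
    return chr(ord(ch) + 1)

def replace_last_with_successor(target):
    """Apply the literal abc -> abd rule to a target string."""
    return target[:-1] + successor(target[-1])

print(replace_last_with_successor("pqr"))     # pqs
print(replace_last_with_successor("ppqqrr"))  # ppqqrs, but humans prefer ppqqss
```

The failure on ppqqrr (and on mrrjjj and xyz) is the point: the human-preferred answers require perceiving groups and abstract roles, not just surface letters.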

The Copycat Program

I started working as a research assistant for Douglas Hofstadter at MIT in 1983. In 1984 I followed him to the University of Michigan and started graduate school there. My PhD project was to implement a program (called Copycat, naturally) that could solve letter-string analogy problems in a cognitively plausible manner. The goal wasn’t to build a letter-string-analogy-solver, per se, but to develop an architecture that implemented some of the general mechanisms of analogy-making, and to test it on letter-string problems. I won’t describe the details of this program here. I wrote a whole book about it, and also Hofstadter and I wrote a long article detailing the architecture, its connections to theories about human perception and analogy-making, and the results we obtained. In the end, the program was able to solve a wide array of letter-string analogies (though it was far from what humans could do in this domain). I also did extensive surveys asking people to solve these problems, and found that the program’s preferences for answers largely matched those of the people I surveyed. Later, Jim Marshall (another student of Hofstadter) extended my program to create Metacat, which could solve an even wider array of letter-string problems, and was able to observe its own problem-solving (hence the “meta” in its name). Other projects from Hofstadter’s research group, using related architectures, are described in Hofstadter’s book Fluid Concepts and Creative Analogies.


GPT-3

GPT-3 (“Generative Pre-Trained Transformer 3”) is a language model — a program that, given an input text, is trained to predict the next word or words. GPT-3 is one of the largest such models, having been trained on about 45 terabytes of text data, taken from thousands of web sites such as Wikipedia, plus online books and many other sources. GPT-3 uses a deep-neural-network “transformer” architecture, described in a technical paper written by its creators at the company OpenAI. (I wrote a non-technical essay about language models that focused on GPT-2, GPT-3’s predecessor.)

To interact with GPT-3, you give it a “prompt” — that is, some text — and it generates a continuation of your prompt. For example, I gave it the following prompt:

Q: What is the best horror movie ever made?

GPT-3 answered my question, and then asked and answered two more of its own questions (I put its generated text in italics):

A: The Exorcist.

Q: Who is the best horror actor?

A: Boris Karloff.

Q: Who is the best horror actress?

A: Fay Wray.

Many articles and social media posts have given examples of GPT-3’s extraordinarily human-like text, its seemingly endless knowledge of (mostly Western) culture, and even its ability to create computer programs just by being given a few input-output examples. My purpose in this article is not to review the success, hype, or counter-hype on GPT-3. Instead, I want to explore its ability to make Copycat letter-string analogies.

GPT-3 and Letter-String Analogies

OpenAI generously granted me access to the GPT-3 API (i.e., web interface), and I immediately tried it on a few Copycat analogy problems. I did this by giving the program this kind of prompt:

I tried GPT-3 on several problems like this, with varying numbers of examples, and found that it performed abysmally. I gleefully posted the results on Twitter. Several people informed me that I needed to put in spaces between the letters in the strings, because of the way GPT-3 tokenizes its input (without spaces, runs of letters can be merged into multi-character tokens). I tried this:

Suddenly, GPT-3 was giving the right answers! I tweeted these results, with the joking aside, “Copycat is toast.”

To my surprise, several prominent Twitter users cited my informal experiments as showing that GPT-3 had human-like cognitive abilities. One person commented “@MelMitchell1's copycat experiments gave me significant pause. Those appear to be cases where the machine is developing concepts on the fly.” Another person stated, “GPT-3 solves Copycat out of the box.”

Such conclusions weren’t at all justified by the scattershot tests I posted on Twitter, so I decided to give GPT-3 a more systematic test.


Below I’ll give the results of the more extensive and systematic experiments I did. GPT-3 is stochastic; given a particular prompt, it doesn’t always give the same response. To account for that stochasticity, I gave it each prompt five separate times (“trials”) and recorded the response. Before each trial I refreshed the website to clear out any memory of prior trials the system might be using.

GPT-3’s API has some parameters that need to be set: a temperature, which influences how much randomness the system uses in generating its replies (I used the default value of 0.7), and the response length, which gives approximately the number of “words” the system will generate in response to a prompt (I typically used 16, but I don’t think this parameter really affects the responses to my analogy questions). I only recorded the first line of the response (which was usually of the form “A: [letter string]”); I ignored any other lines it generated. For example, here is a screenshot of a typical experiment (prompt in boldface, GPT-3 response in roman face):

For this I would record answer i j l and ignore the rest of the generated text (which was most often GPT-3 generating the beginning of an additional analogy question to follow the pattern of the prompt).
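This recording rule (keep only the first “A: …” line and discard everything after it) can be sketched with a small hypothetical helper, not part of any API:

```python
def first_answer_line(response):
    """Return the letter string from the first line of the form 'A: ...',
    ignoring any further text the model generates; None if there is none."""
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("A:"):
            return line[2:].strip()
    return None

print(first_answer_line("A: i j l\nQ: If a b c changes"))  # i j l
```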

The next sections give all the results of my numerous experiments. Since these results are rather long, you might want to skim them and skip to the Conclusions section at the end. On the other hand, if you are an analogy nerd like me, you might find the details rather interesting.

Experiment 1: Simple alphabetic sequences

1. First, I tried a simple “zero-shot” experiment — that is, no “training examples”.


People’s preferred answer: p q s

GPT-3’s answers (it sometimes didn’t give a letter-string answer; I ignored these):

In general, I found that GPT-3 cannot perform zero-shot analogy-making in this domain.

2. Next, I gave it one “training example” — that is, one solved analogy problem.


GPT-3 answered i j l on every trial! It looks like one-shot learning works for this problem.

3. Let’s try to see if GPT-3 can generalize to strings of different lengths.


Humans will easily generalize, and answer i j k l n.

GPT-3’s answers:

So, with only one training example, GPT-3 cannot generalize to the longer string.

4. Okay, let’s give it two training examples of different lengths.


Humans, if they can remember the alphabet, will answer r s t u v x.

GPT-3 never got this answer; here are the answers from its five trials:

Even with two training examples, GPT-3 cannot generalize to the longer string.

5. Let’s be generous and give it three training examples of different lengths.


We’re looking for answer e f g h i j l.

And indeed, now GPT-3 gives answer e f g h i j l on all five trials!

Experiment 2: Alphabetic sequences with grouping

1. Let’s try a zero-shot experiment that requires grouping letters:


Humans will almost always say i i j j l l.

In the five trials of GPT-3, the answer was never a letter string. In one trial, GPT-3 amusingly replied “It’s a trick question.” Indeed.
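The mapping humans apply here, replace the letters of the rightmost run with their successor, can be sketched as a toy illustration (assuming, as in my prompts, that letters are separated by spaces):

```python
from itertools import groupby

def successor(ch):
    """Next letter in the (non-circular) alphabet."""
    return chr(ord(ch) + 1)

def grouped_successor_rule(target):
    """Parse a spaced string into runs of equal letters and replace
    the rightmost run's letter with its successor (toy illustration)."""
    letters = target.split()
    runs = [list(g) for _, g in groupby(letters)]
    runs[-1] = [successor(c) for c in runs[-1]]
    return " ".join(c for run in runs for c in run)

print(grouped_successor_rule("i i j j k k"))  # i i j j l l
```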

2. Let’s try giving GPT-3 an example.


This was enough: GPT-3 returns m m n n p p on each trial.

3. But what if we ask GPT-3 to generalize to a string of a different length?


GPT-3’s answers:

Not very reliable; it gets the human-preferred answer q q r r s s u u on two out of five trials.

4. Let’s try with two examples.


GPT-3’s answers:

Again, not reliable; the human-preferred answer, e e f f g g h h j j was returned only once in five trials.

5. What about giving GPT-3 three examples?


GPT-3’s answers:

Not once did it return the human-preferred answer of r r r r s s s s u u u u. It definitely has trouble with generalization here.

Experiment 3: “Cleaning up” a String

Another abstract concept in the letter-string domain is the notion of “cleaning up” a string.
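“Cleaning up” in the sense of the sample problem axbxcx → abc (remove the interspersed letter) can be sketched as follows; this toy version assumes “x” is the letter to remove:

```python
def clean_up(target, junk="x"):
    """Remove every occurrence of the junk letter from a spaced string,
    e.g. 'x p x q x r' -> 'p q r' (toy illustration; assumes junk = 'x')."""
    return " ".join(c for c in target.split() if c != junk)

print(clean_up("x p x q x r"))  # p q r
```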

1. Here I gave GPT-3 one example.


GPT-3 got this one correct on three out of five trials. Here are its answers:

2. Let’s try this with two examples.


GPT-3 nailed this one, answering x y z on all five trials.

3. Now, a trickier version of “cleaning up a string”. We’ll start by giving GPT-3 two examples.


Most humans would answer m n o p.

GPT-3 returned that answer once in five trials. Here are its answers:

4. Let’s try this again with three examples.


GPT-3 did better this time, getting the “correct” answer j k l m n on four out of five trials (on one trial it answered j l m n).

5. Finally, I tried an example where the character to remove (here, “x”), is at the beginning of the target string.


GPT-3 did not get this one at all; it answered x i j k on all five trials.

Experiment 4: Analogies involving abstract examples of “successorship”

In these experiments, we look at several analogies involving various abstract notions of successorship.

1. First we see if GPT-3 can generalize from letter-successor to abstract “number” successor.


While this is sometimes hard for people to discover, once it’s pointed out, people tend to prefer the answer j y y q q q q, that is, the number sequence 1–2–3 changes to 1–2–4.
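The intended mapping, take the successor of the rightmost run’s length rather than of any letter, can be sketched as a toy illustration (the target string j y y q q q is inferred from the preferred answer above):

```python
from itertools import groupby

def length_successor_rule(target):
    """Treat the run lengths as the 'sequence' (e.g. 1-2-3) and take the
    successor of the last run's length (1-2-4) by lengthening that run,
    e.g. j y y q q q -> j y y q q q q (toy illustration)."""
    letters = target.split()
    runs = [list(g) for _, g in groupby(letters)]
    runs[-1].append(runs[-1][0])  # one more copy of the rightmost letter
    return " ".join(c for run in runs for c in run)

print(length_successor_rule("j y y q q q"))  # j y y q q q q
```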

GPT-3 never gets this answer. Here are its responses.

This generalization seems beyond GPT-3’s abilities.

2. What about creating an abstract numerical sequence?


Here we’re looking for b o o c c c v v v v (1–2–3–4).

GPT-3’s answers:

GPT-3 doesn’t seem to get this concept. I tried this also with one additional example, but GPT-3 still never responded with the kind of “number sequence” displayed in the analogy.

3. Let’s try a different kind of abstract “successorship”.


Here the idea is to parse the target string as s — s t — s t u. The “successor” of the rightmost sequence is s t u v, so we’re looking for answer s s t s t u v (“replace rightmost element with its successor”).
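That parse, and the “replace rightmost element with its successor” rule, can be sketched as a toy illustration (a new group is assumed to start whenever the first letter reappears):

```python
def parse_prefixes(letters):
    """Split e.g. [s, s, t, s, t, u] into [s], [s, t], [s, t, u]:
    a new group starts each time the first letter reappears."""
    groups, first = [], letters[0]
    for c in letters:
        if c == first:
            groups.append([])
        groups[-1].append(c)
    return groups

def rightmost_group_successor(target):
    """Replace the rightmost group with its 'successor' by extending it
    one letter further along the alphabet (s t u -> s t u v)."""
    groups = parse_prefixes(target.split())
    last = groups[-1]
    last.append(chr(ord(last[-1]) + 1))
    return " ".join(c for g in groups for c in g)

print(rightmost_group_successor("s s t s t u"))  # s s t s t u v
```

The same sketch generalizes to the longer target in the next problem, e e f e f g e f g h, whose rightmost group e f g h extends to e f g h i.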

GPT-3 got this one: it answered s s t s t u v on each trial.

4. But did it really get the concept we have in mind? Let’s test this by seeing if it can generalize to a different-length target string.


The answer we’re looking for is e e f e f g e f g h i.

GPT-3 got this answer on four out of five trials (on one trial it responded e e f e f g h i). Pretty good!

Experiment 5: A letter with no successor

As a final experiment, let’s look at problems in which we try to take the successor of “z”.

1. A deceptively simple problem.


Most people will say x y a. This answer wasn’t available to the original Copycat program (it didn’t have the concept of a “circular” alphabet), and it would give answers like x y y or w y z. Fortunately, Copycat was able to explain its reasoning to some extent (see references above for more details).
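A circular-alphabet successor, the concept Copycat lacked, is a one-line change in a toy solver:

```python
def circular_successor(ch):
    """Successor in a circular alphabet: z wraps around to a."""
    return "a" if ch == "z" else chr(ord(ch) + 1)

def replace_last_circular(target):
    """'Replace rightmost letter with its successor' on a spaced string,
    using the circular alphabet (toy illustration)."""
    letters = target.split()
    letters[-1] = circular_successor(letters[-1])
    return " ".join(letters)

print(replace_last_circular("x y z"))  # x y a
```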

On this one, GPT-3 is all over the map.

GPT-3’s answers:

2. Let’s look at one more.


GPT-3 answered the reasonable x y z a on four out of five trials (on one trial it answered x y z a a).


Conclusions

I tested GPT-3 on a set of letter-string analogy problems that focus on different kinds of abstract similarity. The program’s performance was mixed. GPT-3 was not designed to make analogies per se, and it is surprising that it is able to do reasonably well on some of these problems, although in many cases it is not able to generalize well. Moreover, when it does succeed, it does so only after being shown some number of “training examples”. To my mind, this defeats the purpose of analogy-making, which is perhaps the only “zero-shot learning” mechanism in human cognition — that is, you adapt the knowledge you have about one situation to a new situation. You (a human, I assume) do not learn to make analogies by studying examples of analogies; you just make them. All the time. Most of the time you are not even aware that you are making analogies. (If you’re not convinced of this, I recommend reading the wonderful book by Douglas Hofstadter and Emmanuel Sander, Surfaces and Essences.)

One of the problems with having to provide training examples like the ones I described above is that it’s hard to tell if GPT-3 is actually making an analogy, or if it is simply copying patterns in the target strings (i.e., ignoring the “If string1 changes to string2” parts of the prompt). I didn’t test that, but it would be interesting to do so.

All in all, GPT-3’s performance is often impressive and surprising, but it is also similar to a lot of what we see in today’s state-of-the-art AI systems: impressive, intelligent-seeming performance interspersed with unhumanlike errors, plus no transparency as to why it performs well or makes certain errors. And it is often hard to tell if the system has actually learned the concept we are trying to teach it.

I don’t have a definitive answer to the question in the title — can GPT-3 make analogies? In some cases it appears to make analogies, but I don’t think it is doing anything like human analogy-making or conceptual abstraction. My intuition is that achieving robust humanlike performance on analogy will require something quite different from the kind of statistical language-model approach taken by GPT-3.

As usual in such essays, I’ll end by asking the question to GPT-3 itself:

Prompt: Can a language model like GPT-3 make analogies?

GPT-3: The news that Google had cracked the code to making a language model that can make analogies and do so quite well has been met with mixed reactions. On the one hand, it is a major breakthrough in natural language processing and artificial intelligence. On the other, it’s hard to see how this has anything to do with language.


Note: I have written a brief follow-up to this article with the answers to some commonly asked questions.



Davis Professor at the Santa Fe Institute. New book: “Artificial Intelligence: A Guide for Thinking Humans”.