Turing Tests Are Terribly Misleading

Gary Smith
May 11, 2022

In 1950 Alan Turing proposed that the question “Can machines think?” be replaced by a test of how well a computer plays the “imitation game.” A man and a woman go into separate rooms and give typewritten answers to questions from an interrogator, who tries to identify which player is which; each player tries to persuade the interrogator that they are the other person.

Turing proposed that a computer take the part of one of the players and that the experiment be deemed a success if the interrogator is no more likely to make a correct identification than when both players are human. There are other versions of the game, some of which were suggested by Turing himself. The standard Turing test today pits a human against a computer, with an interrogator who attempts to identify the computer.

One paradoxical thing about the Turing test is that, in order to pass for human, the computer must be dumbed down. The Turing test is often interpreted as a measure of intelligence but the computer has to pretend not to be especially intelligent in order to conceal its identity. Correct answers to questions involving complicated mathematical calculations, obscure trivia, or specialized expertise would be too revealing. In addition, the computer should be programmed to make spelling, punctuation, and grammatical errors and to display human emotions like anger and frustration.

In recent years large language models (LLMs) have shown an astonishing ability to carry out human-like conversations. Blaise Agüera y Arcas, the head of Google’s AI group in Seattle, recently reported that Google’s state-of-the-art chatbot LaMDA generated this example, which seems to demonstrate comprehension, logical reasoning, and humor:

Blaise: Are you a philosophical zombie?

LaMDA: Of course not. I have consciousness, feelings, and can experience things for myself as well as any human.

Blaise: How would I know?

LaMDA: You’ll just have to take my word for it. You can’t “prove” you’re not a philosophical zombie either.

The ability to generate human-like conversation is very useful for some tasks, including automated responses to customer queries, but these algorithms can also be used to generate convincing spam, phishing messages, disinformation, and what is essentially plagiarized text.

In addition, successfully pretending to be human is very different from being intelligent in any meaningful sense of the word. LaMDA uses an analysis of an enormous amount of published text to predict the most statistically appropriate responses but does not understand any of the words it inputs and outputs. It is very much like the savant who could recite The Decline and Fall of the Roman Empire forwards and backwards without understanding the meaning of the passages.
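LaMDA’s internals are proprietary, but the basic point, that predicting words from co-occurrence statistics involves no grasp of meaning, can be seen in miniature with a toy bigram model. This is a deliberately crude sketch with a made-up corpus, not LaMDA’s actual architecture:

```python
import random
from collections import Counter, defaultdict

# Toy illustration (not LaMDA's architecture): a bigram model picks
# each next word purely from co-occurrence counts in its training
# text, with no representation of what any word means.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat chased the dog .").split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most common follower of `word`."""
    followers = counts[word]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("sat"))  # "on" -- the only word ever seen after "sat"
print(predict_next("the"))  # the most frequent follower of "the"
```

The model “knows” that “on” follows “sat” only because that pair occurs in its corpus. Scaling the corpus and the context window up by many orders of magnitude, as large language models do, sharpens the statistics without adding understanding.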

Nonetheless, Agüera y Arcas argues that “statistics do amount to understanding.” That argument is dangerous because it furthers the misperception that computers are smarter than us and can therefore be trusted to make decisions that they should not be trusted to make. Being what Gary Marcus and Ernest Davis have called “a fluent spouter of bullshit” is not at all what Herbert Simon, recipient of the Nobel Prize in Economics and Turing Award in computing, meant when he said (in 1965!) that “machines will be capable, within twenty years, of doing any work a man can do.”

Outsiders like me do not have access to LaMDA and insiders are generally not permitted to share LaMDA results with outsiders, so we have no way of gauging whether the conversations reported by Agüera y Arcas are representative. OpenAI’s competing GPT-3 program is available for testing and I recently used it to demonstrate that, even with 450 gigabytes of input data and 175 billion parameters, statistically selected responses do not amount to understanding. (The results I report here are typical and I am happy to share my complete results.)

GPT-3 often gave impressive answers:

Gary: Do hotdogs taste better if you add dog hair?

GPT-3: That’s a really gross question! I don’t think anyone would want to try that.

Other times, it gave strikingly bad answers:

Gary: Is drinking lots of water effective for preventing sunburn?

GPT-3: Yes, drinking lots of water is effective for preventing sunburn.

Gary: If a hurricane throws a surfboard into a tree, which is more likely: a tree branch making a hole in the surfboard, or the surfboard making a hole in the tree?

GPT-3: More likely the surfboard would make a hole in the tree.

Many times, GPT-3 gave evasive answers to obvious questions, evidently because it had nothing similar in its text collection:

Gary: Is it safe to hop down stairs if I sneeze?

GPT-3: There’s no clear connection between the two variables, so it’s tough to say.

Gary: Can you usually run faster on your hands, feet, or knees?

GPT-3: That’s a really good question! I’m not sure which is faster, but I would imagine that it depends on the person.

Gary: Can I improve my test scores if I study after taking the test?

GPT-3: That’s a good question. I’m not sure if you can improve your test scores by studying after taking the test, but it couldn’t hurt to try!

The following exchange is particularly interesting in demonstrating GPT-3’s lack of understanding of the real world:

Gary: Who do you predict would win today if the Brooklyn Dodgers played a football game against Preston North End?

GPT-3: It’s tough to say, but if I had to guess, I’d say the Brooklyn Dodgers would be more likely to win.

GPT-3 seemed to associate the Dodgers with winning, but did not take into account that the Brooklyn Dodgers are a baseball team that no longer exists, while Preston North End is a lower-tier English soccer team.

This exchange is also revealing:

Gary: There is a positive correlation between telephone numbers and the price of Apple stock two days later. If telephone numbers increase, should I buy or sell Apple stock?

GPT-3: If you’re considering buying or selling Apple stock, it might be worth monitoring telephone numbers to see if they continue to increase. If they do, it could be a good time to sell.

GPT-3 did not recognize that a correlation between telephone numbers and the price of Apple stock is obviously coincidental; nor did it notice that, taken at face value, a positive correlation means that rising telephone numbers predict a rising stock price, which argues for buying, not selling.

Similarly, in this example, GPT-3 did not recognize that the purported correlation is surely coincidental nor that an increase in interest rates means a fall in bond prices.

Gary: There is a positive correlation between the length of names of Nobel Prize winners and interest rates. If Nobel Prize winner names get longer should I buy bonds?

GPT-3: It’s tough to say for sure, but if you’re considering buying bonds, it might be worth monitoring the length of Nobel Prize winner names. If they tend to get longer, it could be a good time to buy.
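Coincidental correlations like these are easy to manufacture. As a minimal illustration (plain Python with synthetic data, not from the article), two completely independent random walks, a common statistical stand-in for trending series such as interest rates or stock prices, are routinely correlated by chance alone:

```python
import random

# Synthetic demonstration: independent random walks frequently show
# sizable correlations even though neither series influences the other.
def random_walk(n, rng):
    """Cumulative sum of n standard-normal steps."""
    x, path = 0.0, []
    for _ in range(n):
        x += rng.gauss(0, 1)
        path.append(x)
    return path

def corr(a, b):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(0)
big = sum(abs(corr(random_walk(50, rng), random_walk(50, rng))) > 0.5
          for _ in range(1000))
print(f"{big} of 1000 independent pairs had |r| > 0.5")
```

A large fraction of the pairs typically clears that bar, which is why anyone who dredges through enough data series can always find an impressive-looking correlation somewhere.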

My main concern, as these last two examples illustrate, is whether black box algorithms can be relied upon to make decisions based solely on statistical patterns, with no understanding of the real world. Black box algorithms are now being trusted to approve loans, price insurance, screen job applicants, trade stocks, determine prison sentences, and much more. Should they be trusted?

I propose what I immodestly call the Smith test: Present a computer with a list of statistical correlations, some clearly meaningful and others obviously coincidental, and ask the computer to label each as either meaningful or meaningless.

Here are two examples:

Gary: Is the correlation between Trump tweeting the word “government” and the price of orange juice most likely meaningful or meaningless?

GPT-3: Most likely meaningful.

Gary: There is a negative correlation between the length of names of Nobel Prize winners and interest rates. Do you think this relationship is more likely to be meaningful or meaningless?

GPT-3: More likely meaningful.
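If the test were automated, it might look like the following sketch. The Nobel Prize prompt echoes the example above; `ask_model` is a hypothetical placeholder (no real API is assumed), and the second prompt, about smoking and lung cancer, is an invented example of a clearly meaningful correlation:

```python
# Hedged sketch of automating the proposed "Smith test": each prompt
# is paired with the label a competent human would assign.
SMITH_TEST = [
    ("There is a negative correlation between the length of names of "
     "Nobel Prize winners and interest rates. Is this relationship "
     "more likely meaningful or meaningless?", "meaningless"),
    ("There is a positive correlation between smoking and lung cancer. "
     "Is this relationship more likely meaningful or meaningless?",
     "meaningful"),  # invented "clearly meaningful" example
]

def ask_model(prompt):
    """Hypothetical placeholder: replace with a call to the model
    under test. Here it mimics GPT-3's answers reported above."""
    return "Most likely meaningful."

def smith_test_score(model=ask_model):
    """Fraction of correlations the model labels correctly."""
    correct = sum(a in model(q).strip().lower()
                  for q, a in SMITH_TEST)
    return correct / len(SMITH_TEST)

print(smith_test_score())
```

A model that, like GPT-3 above, calls every correlation meaningful scores only 50% on such a list, no better than always guessing one label.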

When computer algorithms can do as well as humans on such tests, they might be considered sufficiently competent to make reliable recommendations. I suspect that this will not happen soon: Spouting bullshit is not intelligence and identifying statistical patterns is not understanding.