All that’s wrong about the Turing Test

Why it’s so hard to recognise machine intelligence

The Turing Test is a test for machine intelligence. It was first proposed in a 1950 paper by the British mathematician and computer scientist Alan Turing.

As normally understood today, the Turing test works like this:

You have a computer in a room behind a closed door. In a different room, also behind a closed door, sits a human contestant. Outside these two rooms we have a human judge. The judge is connected with the computer and with the human contestant by a kind of chat interface, so he can communicate with them only through typing. This ensures that when the judge communicates with the computer and with the human, he will not hear their voices or see their bodies. He will only have their written words in front of him. This prevents the judge from being influenced by appearance, voice and other such factors, and makes sure that he concentrates only on the contestants’ replies in the textual conversation.

Now the idea of the test is that the judge can talk to both the computer and the human for an unlimited time, using the chat interface. He can talk to them about any topic he likes, ask them questions and evaluate their answers. If the judge is unable to determine which of the two contestants behind the closed doors is the human and which is the computer, then we can say that the computer is intelligent. So this is a behavioural definition of intelligence: the judge cannot see what the contestants are made of or what their structural features are. He can judge them only by how they behave in the conversation.

Turing himself never called his idea the “Turing Test.” Instead, in his famous paper “Computing Machinery and Intelligence” (1950) he called it the Imitation Game. The Imitation Game, as described by Turing, is a pretty confusing affair. It involves a man and a woman, where the man tries to convince the judge that he is the woman, and from this Turing tries to construct an intelligence test by analogy. The whole thing is not very clear, so other researchers soon tried to simplify Turing’s original description and to focus on the intelligence aspect of it. The result is what we know today as the Turing Test.

Deception vs. intelligence

Both the Imitation Game and the Turing Test are essentially deception games. The point of the Turing Test is for the computer to deceive the human judge into believing that the computer is really a human. The means of deception are not limited. The computer could try to make spelling mistakes, for example, so that it appears more human.

This has inspired many programmers to try to create programs that pass the Turing Test by relying heavily on deception. There are multiple tricks that a programmer could use in order to improve a program’s chances of passing such a test.

One very effective trick is to produce unfalsifiable utterances. For example, the program starts talking in a poetic way, or makes heavy use of metaphors. When we speak in poetry or metaphors, the other side of the conversation does not expect us to make literal sense. Instead, the listener will make an additional effort to provide his or her own interpretation of what has been said. In the context of the Turing Test, this is a trick to shift the responsibility for interpreting what has been said to the judge, instead of requiring the program to produce sensible utterances in the first place.
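To make the trick concrete, here is a minimal sketch of how a program might churn out vaguely poetic sentences by filling random words into fixed templates. The word lists, the templates and the function name `poetic_line` are all made up for illustration, and this is not meant to describe how any particular chatbot works internally.

```python
import random

# Hypothetical word lists and templates, chosen only for illustration.
NOUNS = ["moon", "lobster", "solicitor", "river"]
ADVERBS = ["coldly", "implacably", "leisurely"]
TEMPLATES = [
    "The {a} dreams {adv} of the {b}.",
    "Is a {a} not merely a {b} that waits {adv}?",
]

def poetic_line(rng):
    """Fill a random template with random words; no meaning is modelled at all."""
    a, b = rng.sample(NOUNS, 2)  # two different nouns
    template = rng.choice(TEMPLATES)
    return template.format(a=a, b=b, adv=rng.choice(ADVERBS))

rng = random.Random(42)  # seeded so that runs are repeatable
for _ in range(3):
    print(poetic_line(rng))
```

Nothing in this code knows what a moon or a lobster is. Whatever sense the output seems to make is supplied entirely by the reader.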

One of the most famous programs of this type is Racter, a program written by William Chamberlain in 1983. Racter would just chat away in a heavily poetic fashion, producing endless chains of speech that are original and amusing, but that literally make very little sense. Here is a little taste of how that looks:

“Tomatoes from England and lettuce from Canada are eaten by cosmologists from Russia. I dream implacably about this concept. Nevertheless tomatoes or lettuce inevitably can come leisurely from my home, not merely from England or Canada. My solicitor spoke that to me; I recognize it. My fatherland is France, and I trot coldly while bolting some lobster on the highway to my counsellor.” (Source: http://www.ubu.com/concept/racter.html)

In another post we’ll see more of Racter’s output.

Another trick is to prevent the computer from calculating too fast and too precisely. If the judge asked how much 1234 × 1234 is, and one of the contestants could answer the question immediately, then the judge would know that this must be the computer. In order to avoid this problem, programmers would introduce long pauses and random calculation mistakes, so that the answers of their programs appear more “human.” In the same way, if a contestant can type flawlessly at high speed, never making a mistake, then the judge would also know that this contestant is the computer. So the programmers would instruct the computer to make random spelling mistakes. They would study the typical mistakes made by humans, so that the type and frequency of the program’s spelling and typing errors are similar to those made by humans.

Humans also don’t have a perfect memory, so a computer that pretends to be human must also pretend to occasionally forget things.

Another problem is that the judge might ask the computer questions that can only be answered by having the experience of a human body or the typical experiences of living a human social life. For example, he could ask the computer about its preferences for ice cream flavours, or he could ask about the contestant’s birthday.
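As a rough illustration, the three tricks (occasional calculation mistakes, typing errors and artificial pauses) might be sketched as below. The error rates, the tiny `NEIGHBOURS` keyboard map, the typing speed and all function names are assumptions invented for this example. A serious attempt would derive them from studies of real human mistakes.

```python
import random

# Hypothetical map from a key to a neighbouring key (deliberately tiny).
NEIGHBOURS = {"a": "s", "e": "r", "i": "o", "n": "m", "t": "y"}

def add_typos(text, rate, rng):
    """Replace mapped letters with a neighbouring key, each with probability `rate`."""
    out = []
    for ch in text:
        if ch in NEIGHBOURS and rng.random() < rate:
            out.append(NEIGHBOURS[ch])
        else:
            out.append(ch)
    return "".join(out)

def humanised_product(a, b, error_rate, rng):
    """Return a * b, but with probability `error_rate` be off by a small amount."""
    result = a * b
    if rng.random() < error_rate:
        result += rng.choice([-100, -10, 10, 100])
    return result

def typing_delay(text, chars_per_second=5.0, thinking_time=8.0):
    """Seconds to wait before sending, so that the reply does not arrive instantly."""
    return thinking_time + len(text) / chars_per_second

rng = random.Random(0)  # seeded so that runs are repeatable
answer = humanised_product(1234, 1234, error_rate=0.3, rng=rng)
reply = add_typos(f"I think it is {answer}", rate=0.1, rng=rng)
print(reply, "| wait", typing_delay(reply), "seconds")
```

Note that every line of this code makes the program worse at arithmetic and typing. None of it adds any intelligence.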

A computer that does not want to give itself away would have to make up answers to such questions that are similar to the answers that a human contestant would give, although these answers literally don’t have any meaning for the computer.

What follows from all this?

It seems that one of the main problems with the Turing Test is that, in order to pass it, the computer has to employ a whole array of deceptive behaviours that really have nothing to do with intelligence. Arguably, a self-driving car exhibits quite a lot of intelligent behaviour: it can avoid pedestrians and collisions with other cars, it can stop at traffic lights, it can navigate around a city by reading a map. These are all highly complex and intelligent behaviours. But clearly such a self-driving car would fail a Turing Test, because it cannot communicate in natural language, and therefore it would never convince a judge that it is a human contestant.

On the other hand, there are very simple computer programs that just use a long dictionary of phrases in order to answer the judge’s questions. The designer of such a program tries to anticipate all possible questions that a judge might ask and will provide a pre-canned response for each question. If done diligently, such a program can achieve quite a high score in a Turing Test and perhaps convince a judge that he is conversing with a human being. But clearly, the amount of intelligence contained in such a program is far less than what we find in a self-driving car.
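A minimal sketch of such a phrase-dictionary program is shown below. The two entries in `CANNED`, the `FALLBACK` reply and the function names are hypothetical. A real attempt of this kind would need a very much longer list of anticipated questions.

```python
import string

# Hypothetical pre-canned answers; a real list would be far longer.
CANNED = {
    "what is your favourite ice cream": "Chocolate, always. You?",
    "when is your birthday": "In March. I never make a fuss about it.",
}
FALLBACK = "Hm, I'd have to think about that. Why do you ask?"

def normalise(question):
    """Lower-case the question and drop punctuation and extra spaces,
    so that trivial variations still match a dictionary key."""
    no_punct = question.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

def reply(question):
    """Look the question up; deflect with a stock phrase if it is unknown."""
    return CANNED.get(normalise(question), FALLBACK)

print(reply("What is your favourite ice cream?"))
print(reply("Can you explain quantum gravity?"))
```

The whole “conversation” is a dictionary lookup with a stock deflection for everything the designer did not anticipate.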

So the fundamental problem is that a machine could be intelligent without being able to successfully deceive a judge about being human. Conversely, a program that passes the Turing Test could be just a dumb pattern-matching machine that does not have any genuine intelligence.

Successful deception of a human judge is therefore not a reliable criterion for the actual intelligence of a computer program.

Is the Turing Test misunderstood?

This idea is also at the core of Whitby’s criticism of the Turing Test. According to Whitby (1997), at the basis of the Turing Test lies the following misunderstanding:

“If we can, by whatever means, build a computer-based system that deceives a human interrogator for a while into suspecting that it might be human, then we have solved the many philosophical, scientific, and engineering problems of AI.”

He points out that having particular skills of deception is not the same as being intelligent. After all, we should not forget that Turing himself never called his test the Turing Test. He was always speaking of the “Imitation Game.” So he considered this to be a game and not a formal test. Perhaps therefore the whole idea of the Turing Test as a test for intelligence is based on a misunderstanding. It began as a proposal for an amusing thought experiment, but has since developed into a serious test for computing intelligence, which it was never really intended to be.

It’s Turing’s, but is it really a test?

“If the Turing test is read as something like an operational definition of intelligence, then two very important defects of such a test must be considered.” (Whitby)

Whitby offers a second criticism of the Turing Test. He says that if we want to see the Turing Test as an operational definition of intelligence, then it must fulfil the criteria for a good operational test. What are these criteria?

Let’s say I want to determine whether humans know how to travel to Mars. What would be a good test for that?

First the test must reliably confirm whether the goal has been reached. In the case of Mars travel, the test could consist of looking at the surface of Mars and determining whether humans have actually stepped on it, whether they have left a flag there, and so on. These would be signs that humans have passed the test. The Turing Test seems to lack such a clear and unambiguous goal. Just confusing a judge about the nature of the contestants is not enough to clearly and unambiguously prove that the participating computer is intelligent. Confusing a judge proves only that the judge has been confused, nothing more. Getting the judge drunk or drugged, for example, might have the same result of letting an unintelligent machine pass a Turing Test, but it wouldn’t say much about the actual properties of the machine. Rather, it would be a statement about the inadequacy of the judge.

Second, any good test should give an indication of what partial success looks like, what making progress towards the goal means. A good test is not an all-or-nothing test. For example, if my goal is to reach Mars, then partial success would be to leave the Earth’s atmosphere. An even better partial success would be to go into a high orbit around Earth. Even more successful would be to reach the Moon, and so on. So here I have a clear progression from being unable to travel to space at all to finally reaching Mars. I can divide the journey into multiple steps and every step can be clearly said to be one step further on my way to reaching the final goal.

The Turing Test does not have this quality. It is basically an all-or-nothing test. The machine either fools the human judge into believing that it is a human being, or it doesn’t. The Turing Test does not provide any notion of partial success and it does not explain what such partial success might look like. It can therefore not tell us what “making progress” towards the goal might mean. If I want to reach Mars, the first step would clearly be to fly up to low Earth orbit and leave the atmosphere. If I want to pass the Turing Test, what would be the first step to attempt? The Turing Test does not provide any guidance here.

Also, a good test should give some indication of how success might be achieved, by what means we might make progress towards achieving the goal. Again, if I want to reach Mars and the test is to see if a flag has been deposited on the surface of Mars, this immediately suggests ways to achieve this outcome. I need to have a flying machine that can fly upwards, that can leave the atmosphere, that can follow a trajectory to Mars, that can land on Mars, and that has the ability to stick a flag into the surface of Mars. I can immediately envision a whole engineering program. In the case of the Turing Test, no such steps suggest themselves. We know that we want the judge to recognise the computer as a human being, but the test does not suggest at all how we should go about achieving this result. Would cheating the judge be the right way to do it, or would it be better to be honest and instead focus on having a big database of facts that allows the computer to hold a meaningful conversation? Should the computer utilise metaphoric and poetic speech in order to convince the judge that it is human, or should it speak plainly and directly?

The final criterion is that a candidate should not be able to pass the test by chance. If I just randomly throw things into the air, I am not likely to ever stick a flag into the surface of Mars. Achieving the outcome of actually sticking a flag into Mars means that I really have mastered space travel. The result is not likely to be achieved by accident. In contrast, the Turing Test can clearly be passed by accident. The computer’s stock answers might just happen to match the judge’s questions so well that the judge, especially if he is not very critical or doesn’t know much about computers, is convinced that the computer is a human being. So this is another respect in which the Turing Test is not a good test for machine intelligence.

Thanks for reading! That’s the first part of a two-part post. Stay tuned for more criticisms of the Turing Test!


Dr Andreas Matthias is the author of “Neural Networks Without the Math,” a gentle introduction to neural networks and deep learning. Find it on Amazon.

Originally published at moral-robots.com on January 31, 2019.