Is the Turing test still relevant? How about Turing time?
By Danko Nikolić
If you’ve found your way to this text, I have no doubt you’re aware of Alan Turing. In 1950, the father of modern computer science created a test designed to determine a machine’s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing described his test as “The Imitation Game” within his paper, “Computing Machinery and Intelligence”. The concept has since become more commonly known as the Turing test.
Turing’s idea was simple: at some point a machine may give such intelligible, human-like answers that we cannot tell the difference between man and machine. For example, if the human tester asks, “How did you like the game yesterday?”, and the test subject answers, “Oh, you mean the basketball game? I wasn’t watching; I’m not really a basketball fan.”, the answer may be judged by most observers to be human-like.
This is in stark contrast to the answers provided by today’s intelligent assistants (Siri, Google Now, Cortana, Alexa et al.). For the purposes of this article, I asked one of these assistants that exact question: “How did you like the game yesterday?”. The result was an internet search topped with links to the TV show Game of Thrones. Clearly, this response isn’t human. Forgetting for the moment that very few human responses to questions would exclusively consist of a web search, the machine is still not smart enough to understand cultural linguistics and contextualise a question to respond as a human assistant would.
Measuring your AI with the Turing test
There are several interpretations of the test, so we’ll only concern ourselves with the so-called “standard interpretation”. Player C, the assessor, determines which player — A or B — is a machine and which is a human. The assessor can only interact using written questions and responses, to eliminate visual and auditory clues that may help their assessment.
The foundation of Turing’s proposal is that only a human can test whether the intelligence of a machine is satisfactorily human-like. As the observer remains “blind”, this is in some respects quite a scientific approach.
The bottom line is that the Turing test only assesses whether a machine has reached the level of human intelligence. The test is not designed to determine how far the machine has progressed towards that goal. Are we halfway there? The Turing test cannot tell you.
Failings of the Turing test
One reason that AI developers choose to ignore the Turing test is that it is practically impossible to pass. Theoretically, the test has no set time limit, and is failed the moment a machine reveals a single sign of not being human. Under these conditions, it is unlikely that any machine will ever pass the test.
Further, it is possible for machines to be detected not by being dumb, but by being too intelligent. There are many things machines do better than humans, and this can be used to trick the AI into revealing itself. For example, if you ask someone to recite pi to fifty digits, only a human of rarefied genius could answer correctly and at speed. Therefore, to pass the Turing test, a machine would need to generate human errors.
The utility of the Turing test
For a subset of technology, we presume the ultimate goal is to beat the Turing test. That is, to create a machine so convincing in its communication that it is practically impossible to tell whether you are talking to an electronic device or a real person. Some occasionally claim to have already succeeded in this goal, but it is generally far from accepted that today’s machines can pass the Turing test. To realize how far away we are today, we need only look at the state-of-the-art intelligent assistants and the awkward, clearly non-human responses they often produce.
Nevertheless, the Turing test does make an impact on our society today; albeit more in the spheres of arts and philosophy than in AI development. In philosophy, the test is often part of the conversation around the ability of AI to become self-aware. When it comes to product development however, philosophy isn’t particularly concerned with whether Cortana produces a better user experience than Siri.
Introducing Turing time
Given the impossibility of passing the Turing test in absolute terms, why not measure Turing time instead? This is the minimal amount of time it takes for a human to determine that the test subject is a machine. The longer the Turing time, the more progress the machine has made towards simulating human interaction.
Longer Turing time has some important practical implications. I personally find it frustrating when an intelligent assistant cannot understand me. For these products, a longer Turing time would translate directly into better user experience.
Another benefit is that we can add Turing time into the testing criteria for such products, along with speed and accuracy, to provide direct comparisons between Google Now, Cortana, Siri, Alexa, and any other intelligent assistant that reaches the market in future. Which of these has the highest Turing time? Which can hold the illusion of being a real person the longest?
Notably, one can define a direct relation between the Turing test and Turing time. The relationship is simple: a fully passed Turing test corresponds to an infinitely long Turing time. It means the machine cannot be distinguished from a human: not in one hour; not in a month; not in a million years.
How do intelligent assistants perform today?
In my personal experience (admittedly subjective), the Turing times of the intelligent assistants available today are far more likely to fall below one minute than surpass it.
If objective measurements of Turing time showed that the most intelligent assistants and chat bots averaged only 30 seconds before failing the Turing test, what would this tell us about the state of our technology? Perhaps that there is a lot to be desired. What if these scores were less than 10 seconds?
Ultimately, a Turing time of as long as 100 years would be as good as infinite, as no individual could, in their lifetime, detect the machine-ness of the machine.
But even far shorter Turing times would be completely satisfactory for 99% of applications. For all practical purposes, we need not aim for anything like 100 years of Turing time. For example, a single year of Turing time would probably be sufficient for your smartphone. In fact, the odd, infrequent reminder (by failing the test) that fallible technology powers these services might actually be a good thing, making humans feel less obsolete by comparison.
The 13 rules for measuring the Turing time
Consistently measuring Turing time is a methodological challenge, as it relies on psychological measurement. Because a human judges whether the test subject is a machine, these judgements must be made as accurately as possible, objectively and without bias.
What follows are the rules that I propose every measurement of Turing time should satisfy.
Rule 1: No naïve tests
The judge must know they are taking part in a test. For example, someone running a chat bot might not tell visitors, and only later ask whether they noticed anything strange or suspected they were chatting to a machine. Such a naïve test would be biased towards giving bots incorrectly long Turing times.
Rule 2: Be yourself
This holds for every human taking part in the tests — be it the assessor or the test subject. Every participant should be instructed not to imitate a machine in any way. They simply need to be themselves — to be human. At the end of the test, they should be rewarded for not being mistaken for a machine.
Rule 3: 50–50
The probability that a human judge is talking to a machine in any single test should remain at 50%. That is, there should always be a 50% chance that a test subject is human, and 50% that it is a machine. Also, the judge should know this likelihood.
Rule 4: Average Joe
The judges should represent the education and culture of the general population, and should by no means be experts in AI or in human behaviour. Therefore, judges should be sampled from the general population.
Rule 5: Free choice of topics
There should be no limits on the topics discussed. This also means that no theme of conversation can be assigned, implicitly or explicitly. Similarly, there should be no limits on topic changes. For example, a test limited to conversations about medical issues should not be considered a proper Turing test.
Of course, it is legitimate to measure how long it takes for users of a medical AI to notice that they are not talking to a human physician. But that result cannot be called Turing time; it may be called something like limited-domain detection time. What Alan Turing meant by the imitation game was whether a machine can replicate human intelligence, and this is what we measure with Turing time.
Rule 6: No time or word limit
We don’t ask whether the AI can be uncovered within 30 minutes, or after 1,000 words have been exchanged. Nor are we interested in how often the presence of a machine is revealed in a 30-minute test. The imitation game is by its nature asymmetric; a machine imitates a human, not the other way around. This is why the test ends the moment the assessor can make a confident judgement on which of the two test subjects is the machine.
Rule 7: Human faults
The machine is expected to exhibit human faults, and can thus be detected through super-human performance. For example, if the machine displays encyclopaedic knowledge or quickly performs a complex calculation, the judge may legitimately conclude that it is a machine. Of course, there is nothing wrong with endowing AI with super-human abilities outside the imitation game, but here we are determining whether, and for how long, the machine can hold the illusion that it is human. A large part of humanity is its fallibility.
Rule 8: Multiple tests, minimal time
In order to derive a robust Turing time measurement, multiple tests must be taken using multiple judges and human test subjects, with the same AI. The Turing time is defined as the minimum of all those times. We do not consider the average or median.
The Turing time is not about how long it takes on average for the machine to be detected, but how long the machine is guaranteed to hold the illusion.
The reasoning behind this rule is the same as that of two-year product warranties. If you purchase a car — or any other product — you don’t want it to work with some likelihood during the warranty period, you want it to work, period.
Similarly, if you get a new state-of-the-art AI that imitates human intelligence, you don’t want that imitation to maybe work for a bit — it must be guaranteed to work for a period of time. Good user experience — i.e. good, human-like interaction with AI — is about how well the technology performs for everyone, across all types of applications. This is why Turing time is defined as the minimal time for the test to end. This also means that once Turing time is measured, the result is not fixed — it can be reduced later to a lower value.
Rule 9: Turing word count
An additional way of indicating the amount of interaction needed to detect that one is interacting with a machine is in the number of words used in the communication. This number may be published along with Turing time, and can be referred to as the Turing word count.
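Rules 8 and 9 together define a simple aggregation. As a minimal sketch (the function name and the trial record format are my own assumptions, not part of any standard), taking the minimum over all trials could look like this:

```python
def turing_measurements(trials):
    """Aggregate trials into (Turing time, Turing word count).

    Each trial is a (seconds_until_detection, words_exchanged) pair.
    Per Rule 8, the Turing time is the MINIMUM detection time across
    all trials, never the average or the median; per Rule 9, the word
    count of that same trial can be published alongside it.
    """
    seconds, words = min(trials)  # min by time; ties broken by fewer words
    return seconds, words

# Three hypothetical trials against the same AI:
trials = [(42.0, 120), (95.5, 310), (27.3, 80)]
print(turing_measurements(trials))  # the fastest detection sets the score
```

Note that tuple comparison in `min` orders by detection time first, so the single quickest detection sets the published score, exactly as Rule 8 requires.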
Rule 10: Valid justification
Each detection of a machine-like error must be documented with an explanation of why it was considered an error. One needs to record how the machine responded, and explain why this was not accepted as a human-like response. It would also be good to state how a human could have communicated in that situation.
For example, if the assessor asked, “I need to fly to London next week. Can you find a flight for me?”, and the machine responded with, “I cannot find a flight to destination: London Next Week.”, then one would document that the machine had misunderstood the destination and time frame. A human may ask for more clarity: narrowing the time frame with a follow-up question such as, “Which day next week would work for you?”. Through proper documentation, anyone should be able to see how the Turing time had been established.
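One lightweight way to keep such documentation consistent is a structured record per detection. This is only an illustration; the class and field names below are my own, not part of any standard:

```python
from dataclasses import dataclass

@dataclass
class DetectionRecord:
    """One documented detection of a machine-like error (Rule 10)."""
    judge_prompt: str        # what the assessor asked
    machine_response: str    # how the machine actually responded
    why_not_human: str       # why the response was judged non-human
    human_alternative: str   # how a human might have responded instead

# The flight example above, written up as a record:
record = DetectionRecord(
    judge_prompt="I need to fly to London next week. Can you find a flight for me?",
    machine_response="I cannot find a flight to destination: London Next Week.",
    why_not_human="The destination and time frame were misparsed as one string.",
    human_alternative="Which day next week would work for you?",
)
```

A collection of such records is exactly the audit trail the rule asks for: anyone reviewing them can see how the Turing time was established.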
Rule 11: Intermittent conversation
The judge should be permitted to break the interaction, returning to continue the test at a later point in time. The Turing time should count from the very beginning of the entire interaction with that AI. As AI advances significantly in the future, such measurements of Turing time will become a necessity.
Rule 12: Under measurement
If a machine has not revealed itself for longer than 50 hours, its Turing time can be reported as “under measurement” until the machine finally makes a mistake. Imagine the year 2117; a new state-of-the-art AI is already five years old. The measurement of Turing time began the moment the AI went live, but to date not a single machine-like error has been detected. The machine’s Turing time is reported as “under measurement”, with an indicator of how long it has been, e.g. “1540 hours without error”.
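Such reporting might be sketched as follows (the helper name and the exact wording of the status strings are my own illustration, assuming times are tracked in hours):

```python
def report_turing_time(hours, detected):
    """Format a Turing time report per Rule 12: once a machine has gone
    more than 50 hours without revealing itself, report the score as
    'under measurement' with a running counter rather than a final value."""
    if detected:
        return f"Turing time: {hours:g} hours"
    if hours > 50:
        return f"under measurement ({hours:g} hours without error)"
    return f"measurement in progress ({hours:g} hours without error)"

print(report_turing_time(1540, detected=False))  # the 2117 scenario above
```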
Rule 13: Don’t alter your AI
During the period of testing, the AI should not be altered by a third party. The machine is allowed to learn, but only in the same way a human would learn. For example, the AI could obtain information about recent world events, as such knowledge could come up in conversation with the assessor. However, the underlying technology of the AI should not be improved whilst the AI itself is being assessed.
If the AI is altered significantly outside of a period of testing, that AI should be considered as new, and a separate set of tests should begin.
Of course, on top of these thirteen rules, all the other known methodological factors of scientific measurement and experimentation must be considered and applied. This includes proper randomization, sampling, double-blind designs, replicability of results, and so on.
Flawed as it is, the Turing test could have practical applications today through the measurement of Turing time. Primarily, advancing the Turing time may be a great driver for the development of intelligent assistants. Perhaps the spirit of competitive computing — the GHz battles between processor manufacturers, GFLOPS competitions among supercomputers and storage wars between hard disk makers — could be mirrored as intelligent assistant service providers fight for the longest Turing time. We would all reap the benefits of such a skirmish.
We should not forget that imitating a human is only part of what AI can potentially do. In fact, the competitive advantage of the majority of AI applications lies in the ability to perform at an inhuman level. Hampering that AI to make it more human-like would be counterproductive. Nevertheless, there are areas where imitating human responses is extremely important. We don’t want to be frustrated with our electronic assistants — we want to be understood (by our machines).
My hope is that Turing time measurement adds fuel to the fire of the competition for ever-improved AI. Turing time not only gives us a number to beat, but the rules by which to play. Contenders for the competition are anything but lacking.