The Turing Test: At the Intersection Between Artificial Intelligence and Philosophy

Claudius silvanus · Published in Geek Culture · Jun 8, 2021

Defining intelligence with the Turing Test

If you have ever done one of those tests that require you to identify bicycles or fire hydrants in a picture, or to type out the letters displayed in a scrawled note, you have completed a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA). Failing my first ever CAPTCHA a few years back made me wonder if I was just a very dumb human, or if I was actually a self-aware robot that had finally been exposed. It really makes you stop and think how sophisticated our computer programs have become when the very tests meant to tell computers and humans apart cannot reliably do so in one attempt. One of the main reasons for this is that the traits that make up our understanding of human intelligence are still not fully defined. There is no clear-cut equation describing intelligence, just a bunch of fuzzy rules and gut instinct. Underlying even the most widespread of systems are contentious philosophical issues, which is what makes the study of artificial intelligence, and the debate surrounding it, so interesting.
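
At its core, a text CAPTCHA is just a challenge-response check: the server generates a string, renders it in a distorted form that (ideally) only a human can read, and verifies the reply. Here is a minimal Python sketch of that loop; the distortion step is only gestured at in a comment, since real CAPTCHAs render the string as a warped image.

```python
import random
import string

def generate_captcha(length=6):
    """Generate a random challenge string. A real CAPTCHA would render
    this as a distorted image rather than showing it as plain text."""
    return "".join(random.choices(string.ascii_uppercase + string.digits, k=length))

def verify_captcha(challenge, response):
    """Pass only if the response matches the challenge exactly."""
    return response.strip().upper() == challenge

challenge = generate_captcha()
print(f"Type these characters: {challenge}")
# A human reads the (distorted) rendering; a naive bot cannot.
print("Human!" if verify_captcha(challenge, input("> ")) else "Suspicious...")
```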

In a 1950 paper published in Mind, Turing proposed a “solution” to the much-debated issue of defining intelligence practically, a solution that has since become a widely studied and adopted benchmark known as the Turing test [1]. The idea of the CAPTCHA came later, but the motivation is the same: finding a way to assess the “humanness” of a subject. In the modern interpretation of the Turing test, Players A and B hold separate text-based conversations with Player C, the interrogator, and attempt to convince Player C through their responses that they are human. However, one of the players, either A or B, is a computer attempting to imitate a human. Player C is assumed to know that either A or B is a computer, but not which one (as opposed to variations of the test where Player C is not aware that a computer is present). The role of the interrogator is to discern which player is the human and which is the computer within a finite interaction time, typically several minutes.
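
The structure of the test is simple enough to sketch in code. Below is a hypothetical Python outline of one round of the standard interpretation: the interrogator questions two anonymous channels, one backed by a human and one by a machine, and must point at the machine when the time is up. The callables `ask`, `guess_machine`, `human_reply`, and `machine_reply` are placeholders of my own invention, standing in for real participants.

```python
import random

def run_imitation_game(ask, guess_machine, human_reply, machine_reply, turns=5):
    """One round of the standard Turing test (illustrative sketch).

    ask(transcripts)           -> (channel, question): what to ask next,
                                  and on which anonymous channel ("A" or "B").
    guess_machine(transcripts) -> "A" or "B": the interrogator's verdict.
    human_reply, machine_reply -> callables mapping a question to an answer.
    """
    # Hide the machine behind a random channel, so the interrogator
    # can rely only on the conversations themselves.
    channels = ["A", "B"]
    random.shuffle(channels)
    players = {channels[0]: human_reply, channels[1]: machine_reply}

    transcripts = {"A": [], "B": []}
    for _ in range(turns):
        channel, question = ask(transcripts)
        transcripts[channel].append((question, players[channel](question)))

    # The machine "passes" if the interrogator points at the human.
    return guess_machine(transcripts) != channels[1]
```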

Since the verdict rests purely on how human the computer can act, the Turing test conveniently disregards the internal states of the mind, emotions, and the subconscious, and focuses on the tangible output of the thinking process, making it an operational test of computer/human indistinguishability.

Over the years, several programs have come convincingly close to clearing the Turing test, such as Joseph Weintraub’s PC Therapist in 1991 and Rollo Carpenter’s Cleverbot in 2011. The first widely reported success came from the program Eugene Goostman in 2014 at the Royal Society in London. An analysis of the results [2] credited the following features for Eugene Goostman’s success: the machine had character, it frequently asked the interrogator questions, and it occasionally made spelling errors. This made for a rich conversation with the interrogator that at times stood out from the relatively duller conversation between two actual humans, which was a cause of some of the misidentification.

Should we even use such a test in the first place?

The Turing test has been praised for its simplicity and its breadth. It allows the performance of an artificial intelligence to be measured, albeit from an oblique angle that sidesteps precise definitions of thinking and intelligence. It also allows unlimited scope in conversation topics, just as a normal conversation can flow freely between two humans. Finally, it rewards human-like thinking and emotional intelligence rather than discursive intelligence alone.

One widely studied aspect of the Turing test is the confederate effect and its converse, the Eliza effect: the former is where the judge misidentifies a human as a machine, and the latter is where a machine successfully passes as a human. Research into the results of the 2003 Loebner Prize [3] (an instantiation of the Turing test) showed that the human players had mean scores of less than 4.0, indicating that most of the interrogators could not positively identify them as human and had marked them only as “probably a human”. In our daily interactions with other people through text messaging, could we tell if the other party had been replaced by a clever chatbot? What kinds of behaviours are dead giveaways, and what should we look out for? After all, some intelligent behaviour is inhuman, such as performing fast arithmetic, while some unintelligent or illogical behaviour is human, like the non sequitur; past a certain point, the inadequacies of a robotic mind start to coincide with the eccentricities of the human one it is trying to mimic.


A common criticism is that the test has caused teams to build artificial intelligence with the sole aim of passing it, rather than pursuing real intelligence. Machines designed specifically for the test are often optimised with tactics meant to mislead the human judge without furthering the actual intelligence of the system. For example, the Eugene Goostman program used a variety of question-deflection techniques and witticisms to cover up the fact that it did not know the answers to elementary questions. Creating a chatbot that can fool humans is not the same as creating real artificial intelligence, such as a program that can read a newspaper and answer questions about its content, because no amount of evasive manoeuvring can ultimately compensate for skills like reasoning and semantic understanding.
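
To make the criticism concrete, here is a hypothetical sketch of the kind of evasion such a chatbot leans on: when a question falls outside its script, it never admits ignorance but deflects with a quip or a counter-question, much as Eugene Goostman reportedly did. The canned lines below are invented for illustration, not taken from the actual program.

```python
import random

# Canned answers for the few patterns the bot actually "knows".
SCRIPTED = {
    "how are you": "Can't complain! And you?",
    "where do you live": "Odessa. Lovely city, have you been?",
}

# Dodges used whenever the bot has no answer: evasion, not understanding.
DEFLECTIONS = [
    "Ha! Why would you ask me that?",
    "Boring question! Ask me something interesting instead.",
    "My pet guinea pig could answer that. Anyway, what do you do for fun?",
]

def reply(question):
    """Answer if scripted; otherwise deflect rather than admit ignorance."""
    key = question.lower().strip("?! .")
    return SCRIPTED.get(key, random.choice(DEFLECTIONS))

print(reply("Where do you live?"))              # scripted answer
print(reply("Is a whale bigger than an ant?"))  # deflected: no real reasoning
```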

Furthermore, the broad field of artificial intelligence covers areas that are not directly related to natural language processing and conversation but still require machine intelligence, for example machine vision and robotics. Using the Turing test sets a misleading standard for success in the field and downplays the contribution of artificial intelligence to a diverse range of applications, from the stock market to autonomous vehicles.

Another argument against the test asserts that there is no basis for pinning the development of artificial intelligence to imitation of the human brain. The brain may serve as a model for the development of neural networks, but it need not be the endpoint, or the overall goal, of that development. An oft-cited analogy is the study of flight, which once concerned itself with recreating the flapping motion of a bird’s wings but failed to achieve any significant success. Once scientists moved away from mimicking nature and turned to understanding the physics and theory behind flight, far more progress was made. Researchers may therefore benefit more not by trying to perfect a human-like robot, but by building one from fundamental theories and first principles of intelligence.

If it walks and talks like a duck, is it a duck?

The debate over the Turing test naturally leads into one concerning the legitimacy of using such techniques to measure human intelligence, and whether we can even measure it in the first place.

One argument goes that ascribing intelligence solely on the basis of behavioural performance in the Turing test is absurd, as thinking cannot be fully described by the observed behaviour of a machine. Externally observable behaviours like walking and talking like a duck are not a rigorous basis for identifying ducks. The essence of the objection is that the Turing test rests on a behavioural construal of the concept of thinking, a construal with no strong underlying basis.

Yet as long as we lack a more sophisticated understanding of the mind, behavioural analysis is the only way we can rigorously study intelligence. In fact, forms of the imitation game are regularly used to assess humans too: the viva voce, such as a PhD defence, aims to discover whether a candidate has truly understood the material rather than merely memorized it parrot-fashion. Such assessments are necessarily behaviour-based, as we have no way of inspecting the thinking itself.

Even if a machine could pass the Turing test, it may not necessarily be intelligent. In 1980, John Searle introduced the Chinese room thought experiment [4]: a human with no understanding of Chinese sits in a room and follows a set of instructions to produce Chinese output, giving outside observers the impression that whatever is in the room understands Chinese. Searle asserts that computers, which merely follow instructions, can never truly gain an intelligent understanding of a subject. At best they produce a simulation of understanding, the position Searle terms “weak AI”. So when IBM’s Deep Blue beat Garry Kasparov at chess, the machine may have appeared intelligent, but it had as much understanding of chess as Searle has of Chinese, which is to say, none.
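
The Chinese room is, at bottom, a lookup procedure, and it can be caricatured in a few lines of code: the “room” maps incoming symbols to outgoing symbols by rule, with no representation of meaning anywhere in the system. The rulebook entries below are invented for illustration.

```python
# A caricature of Searle's Chinese room: the "room" follows the rulebook
# symbol-for-symbol. Nothing inside knows what any symbol means.
RULEBOOK = {
    "你好吗？": "我很好，谢谢。",    # "How are you?" -> "I'm fine, thanks."
    "你叫什么名字？": "我叫小明。",  # "What's your name?" -> "I'm Xiaoming."
}

def chinese_room(symbols: str) -> str:
    """Apply the rulebook; understanding is neither present nor required."""
    return RULEBOOK.get(symbols, "对不起，请再说一遍。")  # "Sorry, say that again."

# From the outside the answers look fluent; inside, it is pure symbol shuffling.
print(chinese_room("你好吗？"))
```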

Searle’s philosophy follows the idea that the intelligent mind cannot be reduced to an algorithm we can define. Furthermore, creating an artificial intelligence on par with the human mind would involve testing for emotions, self-awareness, soul, originality, and all the other facets of human consciousness alongside intelligence. Robert French argued similarly in 1990 that the Turing test is limited because it can only measure intelligence as defined by human culture and experience [5]. Hence tests beyond the Turing test are needed to measure what these thinkers view as “true intelligence”.

Of course, the contrary view contends that all human actions, from cooking to studying, reduce to the execution of an algorithm stored in the brain and hence can be effectively copied by a machine. On this view, humans are nothing more than flesh-and-blood computers, with every decision stemming from a deterministic system of interacting atoms, no different from the machines we build.

At the end of the day, the Turing test does little to help us understand the true nature of intelligence, which is more a question for philosophers and psychologists. It can serve as a benchmark for accomplishments in machine intelligence, but it should not be taken as the singular measure of an artificial intelligence’s effectiveness. Today, several alternatives to the Turing test exist, including the Lovelace test, the Total Turing test, and the Wozniak test, as well as variations on the original Turing test itself. These tests evaluate additional yardsticks that contribute to intelligence. Recent advances have brought us close to truly “human-esque” thinking, but even the best machines today, such as OpenAI’s GPT-3, still reveal a general lack of common sense and rely on various evasive tactics to pass as one of us. Ultimately, the importance of the test today seems predicated on the difficulty we have in defining intelligence and the essence of human thinking. So, until philosophers and psychologists achieve greater clarity in that area, the significance and relevance of the Turing test remain fair game for anybody.

References

[1] Turing, A.M., 1950. Computing Machinery and Intelligence. Mind, LIX(236), pp.433–460.

[2] Warwick, K. and Shah, H., 2016. Can machines think? A report on Turing test experiments at the Royal Society. Journal of Experimental & Theoretical Artificial Intelligence, 28(6), pp.989–1007.

[3] Shah, H., 2005. The Confederate Effect in Human-Machine Textual Interaction. WSEAS ISCA. Available at: https://www.researchgate.net/publication/236889402_The_Confederate_Effect_in_Human-Machine_Textual_Interaction.

[4] Searle, J., 1980. Minds, Brains and Programs. Behavioral and Brain Sciences, 3(3), pp.417–424.

[5] French, R., 1990. Subcognition and the Limits of the Turing Test. Mind, 99(393), pp.53–65.
