Turing Test for Testing
In 1950, Alan Turing proposed a test to assess a computer’s ability to imitate a human. In 2018, we need a Turing Test to determine if a computer can imitate a learner.
If the media is to be believed, education isn’t producing the types of young people that business and society need. We need literate, numerate, critical-thinking problem-solvers — and the education system isn’t producing them.
The Rise of the Machine
It’s certainly true that the nature of employment is changing. The job market has traditionally had three types of employment: (1) low-skilled; (2) administrative/managerial; and (3) high-skilled. Current employment trends are squeezing the first two groups. Predictions about future employment indicate that many low-skilled jobs will go (anything that can be automated will be automated) and that administrative and managerial jobs will decline with the rise of AI and machine learning. Two types of job will prosper: low-skilled service/caring jobs and high-skilled technical/professional jobs. Your children may face a stark choice — low-paid care for the elderly or high-paid work in data science. If this scenario plays out, it will have an impact on what society needs from education.
The education system hasn’t changed a great deal since the inception of mass education in the late 19th/early 20th centuries. Children aren’t beaten any more, but the fundamentals haven’t changed much — and the part of education that has changed least is assessment. The sorts of tests that school leavers faced in the 19th century bear a striking similarity to contemporary testing. This has been attributed to the “straitjacket of success” — it works, so don’t change it. And it does, sort of, work. Traditional assessment is a good fit for traditional teaching. The best way to assess what’s learned in the artificial environment of the classroom is the artificial environment of the examination room.
This explains why e-learning has performed so poorly. Numerous studies have shown that e-learning is less effective than traditional teaching and learning. But what these studies really show is that nothing beats experienced teachers preparing learners for a familiar examination (sometimes called “teaching to the test”). E-learning would perform better if it weren’t competing against the closed ecosystem of teacher->drill->predictable exam.
So, what’s the problem? The problem is that the disparity between the world inside and outside the classroom is becoming a sort of collective cognitive dissonance. Outside the classroom, we use computers, smartphones and other digital devices for almost everything. Inside the classroom, we’ve created a computer-free virtual world. Smartwatches were recently added to the growing list of items that kids aren’t allowed to take into examinations; a list that already includes calculators, laptops, tablets and smartphones. These devices are prohibited because they would help learners pass the test, which raises the question: what kind of test can be passed by a watch? This is going to get worse. The Internet of Things (IoT) is around the corner; it will make more things “smart” and lengthen the list of prohibited items. It’s an arms race that education will lose — and look silly in the process.
Turing Test
In 1950, Alan Turing, a British mathematician, proposed a test to assess a computer’s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Any device that can successfully imitate a human is said to have passed the Turing Test. No current commercial device has passed the test, but some are getting close. Inexpensive smart speakers — such as Google Home — understand natural language (in many different accents) and are capable of conducting limited conversations. The consensus is that, within a decade, conversing with these household appliances will become indistinguishable from conversing with a human — at which point they will have passed the Turing Test.
A few years after Alan Turing proposed his test, Benjamin Bloom introduced his “taxonomy of cognitive competencies”, which tried to formalise the language used by educationalists to describe different kinds of performance. Prior to the introduction of Bloom’s Taxonomy, teachers used the same words for different things — a teacher might have asked learners to “describe” something when she really meant “explain”. Words like “state”, “compare”, “analyse” and “evaluate” had no common meaning. Since its introduction, the taxonomy has become a de facto standard in education, so teachers now mean the same thing by the same words. Bloom didn’t only standardise language. He also placed those words in a hierarchy, with factual recall at the bottom and creativity at the top. So words like “state”, “describe”, “explain”, “compare”, “analyse” and “evaluate” not only have commonly understood meanings but are also understood to make different intellectual demands. For example, a “describe” question is less demanding than an “explain” question, which is less demanding than an “evaluate” question. The introduction of the taxonomy was a big step forward for education and had important implications for testing.
Difficulty and demand aren’t the same thing. Bloom’s Taxonomy defines demand — not difficulty. Difficulty is “how hard” something is; demand is “how clever” it is. A low-demand question (say, one that asks you to describe something) can still be difficult — the staple diet of TV quiz shows, which invariably ask low-demand, high-difficulty questions such as “What is Eskimo for ‘good morning’?”. These questions are low demand because they relate to factual knowledge — but they’re difficult to answer. The demand of a question is its intellectual challenge. Demand rises as you progress through Bloom’s hierarchy for a given domain. That last bit is important. Demand only rises relative to a given domain. Not all “describe” questions are easier than “evaluate” questions: describing nuclear fusion is more demanding than evaluating road conditions (crossing the road). But within a given domain, Bloom’s hierarchy works. Describing nuclear fusion is less demanding than explaining it, which is less demanding than evaluating it. One last thing. Everything is low demand when it’s learned by rote. When a learner regurgitates a memorised answer, they’re not demonstrating anything apart from a good memory. The predictability of some examinations means that apparently demanding questions are simply memory dumps.
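One way to keep the two axes apart is to picture demand as an ordinal scale (Bloom’s levels) and difficulty as a separate, independent score. Below is a minimal sketch in Python; the Demand enum borrows the revised taxonomy’s level names, and the difficulty numbers are invented purely for illustration:

```python
from enum import IntEnum

class Demand(IntEnum):
    """Bloom's cognitive levels as an ordered scale (revised taxonomy names)."""
    REMEMBER = 1    # state, describe, translate
    UNDERSTAND = 2  # explain, summarise
    APPLY = 3
    ANALYSE = 4
    EVALUATE = 5
    CREATE = 6

# Demand and difficulty are independent axes; the difficulty scores
# (0-10) below are made up for illustration only.
quiz_show = (Demand.REMEMBER, 9)     # "What is Eskimo for 'good morning'?"
classroom = (Demand.UNDERSTAND, 4)   # "Explain why the sky is blue."

assert quiz_show[0] < classroom[0]   # lower demand...
assert quiz_show[1] > classroom[1]   # ...but higher difficulty
```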
Which brings us back to watches. Technology is great at answering low-demand questions (the “describe”, “state”, “translate” type of question). It’s particularly good at answering low-demand, high-difficulty questions (“Ublaahatkut” is Eskimo for “good morning” — my watch told me). But it’s no good at higher levels of cognition. Try asking your watch to summarise original writing or solve a novel problem.
Turing Test for Testing
So I propose a Turing Test for Testing (TTT or “T3”). The original Turing Test measures a computer’s intelligence; the Turing Test for Testing measures an assessment’s stupidity. A test’s T3 value can be measured: T3 is the sum of the marks for all questions that can be correctly answered by a computer. So, for example, a particularly bad test might have a T3 value of 50 (out of 100), meaning that half of the available marks can be scored by a computer; a better test might have a T3 value of 20; a good test would have a T3 value close to zero.
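To make the arithmetic concrete, here is a minimal sketch of the T3 calculation in Python. The Question class and the machine_answerable flag are illustrative assumptions on my part; in practice, deciding whether a computer can answer a given question is the hard part of the measurement:

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    marks: int
    machine_answerable: bool  # could a watch or smart assistant answer this?

def t3_value(exam: list[Question]) -> int:
    """T3: the sum of marks for all questions a computer can answer correctly."""
    return sum(q.marks for q in exam if q.machine_answerable)

exam = [
    Question("State the boiling point of water at sea level.", 10, True),
    Question("Translate 'good morning' into Eskimo.", 10, True),
    Question("Evaluate the impact of automation on employment.", 30, False),
    Question("Propose and justify a redesign of this exam.", 50, False),
]

print(t3_value(exam))  # 20: a fifth of the 100 available marks fall to the machine
```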
Using the Turing Test for Testing would eliminate the worst aspects of contemporary testing. For one thing, all of the recall questions would go — including the “difficult” recall questions. Assessments would have to comprise more demanding questions that would, at least, get off the bottom of Bloom’s hierarchy. Tests would assess higher-order skills such as interpretation, analysis, problem solving and evaluation — which are precisely the sorts of skills we’re trying to foster. And the arms race between testing and technology would come to an end — meaning that learners could put their watches back on and reality could return to the classroom.
There would be resistance. Low-demand questions might be pointless, but they’re quick and easy to write. A great deal more thought would have to go into tests that pose questions computers can’t answer. Examinations would also have to become less predictable — and there would be resistance to that too.
Some people have argued that education is really about signalling — that it has little real value beyond telling university admissions officers and employers that learners can apply themselves (start a course), see things through (complete a course) and stand up to pressure (pass a course). While there’s some truth in that (it does show those things), it sells education short. It’s been said, with some justification, that assessment drives learning — and that to change the education system you have to start by changing the assessment system. Improving assessment is a start. Eliminating the worst aspects of assessment, such as getting rid of questions that your watch can answer, is a first step on that journey.
