Are standardised tests a sleight of hand?

Is it possible that the standardised tests students across the world have faced for decades could in fact turn out to be a mathematical sleight of hand? Todd Rose thinks so, based on work by Peter Molenaar. And it’s not just school testing that is in the firing line: admission tests, mental health tests, personality tests, brain models, IQ tests, depression treatments and hiring policies all suffer the same fate. Here’s how.

Rose’s book traces his argument through four historical characters, each marking a step in the development of the social sciences. The story begins with Adolphe Quetelet, born in 1796. With his career in astronomy thwarted by revolution, he dedicated his life instead to the science of man, in an attempt to counter the effects of social unrest.

A key technique used in astronomy in Quetelet’s time was designed to deal with the fact that ten astronomers measuring the speed of the same object, using the instruments available at the time, would invariably come up with ten different measurements. How to decide which one to use? The solution was the method of averages: the average of the measurements was claimed to be more accurate than any one of them. Quetelet’s legacy is that he took the method of averages and applied it to man. When measuring the chest size of 5,738 soldiers, possibly the first average calculation of any human feature, Quetelet concluded that each individual soldier’s chest size represented naturally occurring error, and so the average chest size represented the ‘true’ size of the soldier. In 1830 Quetelet’s averages were startling news. To be able to show that suicide, something seen as highly irrational that could not possibly conform to a pattern, was in fact reliable and consistent when the method of averages was applied across a population in any one year was quite simply revolutionary.
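The astronomers’ reasoning can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" speed, the noise level and the function name are invented for illustration, and the measurement errors are assumed to be random and unbiased, which is exactly the assumption the method of averages relies on.

```python
import random
import statistics

def simulate_measurements(true_value, n_observers, noise_sd, seed=0):
    """Return n_observers noisy readings of the same true quantity."""
    rng = random.Random(seed)
    return [true_value + rng.gauss(0, noise_sd) for _ in range(n_observers)]

TRUE_SPEED = 100.0  # hypothetical "true" speed, arbitrary units
readings = simulate_measurements(TRUE_SPEED, n_observers=10, noise_sd=5.0)

# The method of averages: use the mean of all ten readings.
average = statistics.mean(readings)
error_of_average = abs(average - TRUE_SPEED)

# For comparison: how far off a single reading is, on average.
typical_single_error = statistics.mean(abs(r - TRUE_SPEED) for r in readings)

print(f"error of the average:     {error_of_average:.2f}")
print(f"typical individual error: {typical_single_error:.2f}")
```

The error of the average can never exceed the typical individual error, because positive and negative errors partly cancel. That works for ten readings of one object; whether it carries over to one reading each of 5,738 different soldiers is the question the rest of the story turns on.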

Next, enter Francis Galton stage right, one of Quetelet’s early disciples, who called him “the greatest authority on vital and social statistics”. Galton subscribed to all of Quetelet’s ideas bar one: the notion that Average Man represented Nature’s ideal. Far from it; Galton believed the average to be mediocre and crude. He held instead that those above average were the “Eminent”, those who were average the “Mediocre”, and those below average the “Imbecile”. How did Galton reconcile the idea that the average defined a type while at the same time rejecting the idea that deviation from the average represented error? By changing ‘error’ to ‘rank’.

Galton went further, claiming that a person’s rank was consistent across all qualities and dimensions: mental, physical and moral. To help him prove the existence of rank, Galton developed new statistical methods, including correlation, that allowed him to show the relationship across different qualities.

Quetelet started from the idea that a person’s worth could be shown by how close they were to the average. Galton’s idea was that it was instead about how far they were from the average.

As a consequence, by the 1890s the notion that people could be sorted into distinct categories had infiltrated all the social and behavioural sciences. In Rose’s mind these two ideas combined to produce our current systems of education, hiring and employee performance evaluation right across the world.

The third character in the jigsaw is Frederick Winslow Taylor, who was educated in Prussia, one of the first countries to reorganise schools and the military around Quetelet’s ideas. When he returned home to Pennsylvania, rather than go to Harvard to study law like his father, he instead became an apprentice at a pump manufacturing business. The US was moving to an industrial economy and Taylor wanted to be part of this exciting new area; Rose compares Taylor to Zuckerberg dropping out of Harvard to create Facebook.

Quetelet solved societal problems with a science of society. Taylor looked to tackle industrial upheaval with a science of work by adapting averagarianism to the workplace, “the idea that individuality did not matter … In the past the man was first … in the future the system must be first”. In this vision of the workplace, creativity and individuality played no part: “An organization composed of individuals of mediocre ability … will in the long run prove more successful … than an organisation of geniuses led by inspiration”.

According to Rose, Taylor’s inspiration for standardising labour came from his maths teacher, who would time how long students took to complete a maths problem and then use this data to calculate how many questions to set so that the average boy finished all of the questions in 2 hours. Taylor applied the same principle to standardise industrial processes. According to Taylor, there is one best way for any given process, and only one: the standardised way. Of course, this left the question of who should create the standards in the first place, and thus the manager was born: the new class of planners.

With thinking and planning cleanly separated from making and doing, a new appetite for experts in thinking and planning emerged: management consultants, and Taylor was the first. By 1927 the League of Nations called Taylorism a “characteristic of American civilization”.

How does society decide who gets to be a worker and who gets to be a manager?

In 1900 in the US, just 6% of students graduated from high school, and only 2% from college. The need for mass education quickly grew as industrialisation took root, and when it came to casting national education policy, Taylorism won again. Education, it came to be seen, should be designed to prepare students en masse for Taylorist industry: “Taylorists argued that schools should provide a standard education for an average student instead of trying to foster greatness”. School hierarchies developed to reflect Taylorist management structures, with curriculum planners tasked with standardising curriculum and grades, and with the school day and pedagogies reflecting Taylorism in the workplace.

Enter Edward Thorndike, who embraced Taylorism to separate superior students from inferior ones.

Thorndike disagreed with Taylor that the goal of education should be to provide every student with the same standardised education. Thorndike believed that schools should instead sort young people according to their station. It was thus that Thorndike created standardised tests for handwriting, spelling and so on. The purpose of school for Thorndike was not to educate, but to sort, to rank: “Today Thorndike’s rank-obsessed educational labyrinth traps everyone within its walls”: teachers, schools, universities … all are ranked … businesses base their hiring decisions on the same ranks … educational systems of entire countries are ranked (PISA). And more: “we have perfected our educational system … a well oiled Taylorist machine … efficiently ranking students in order to assign them to their proper place in society”.

The flaw in averagarianism

Test theory was codified into its modern form in the 1968 textbook “Statistical Theories of Mental Test Scores” by Frederic Lord and Melvin Novick. In it, all mental tests are treated as attempts to ascertain an individual’s true score. According to classical test theory, the only way to do this would be to administer the same test to the same person over and over again, since there will be error involved in each of their attempts. However, in practical terms this cannot be done, since the student would learn how to respond and the test would cease to be objective.
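In symbols, the idea of a true score is usually written as below. The notation is the conventional one for classical test theory and is added here for illustration rather than taken from Rose’s book: X_k is the observed score on the k-th attempt, T is the person’s true score and E_k is the error on that attempt, assumed to average out to zero.

```latex
X_k = T + E_k, \qquad \mathbb{E}[E_k] = 0, \qquad
T = \mathbb{E}[X_k] \approx \frac{1}{n} \sum_{k=1}^{n} X_k
```

The approximation on the right is the "over and over again" step: it needs many independent attempts by the same person, which is precisely what cannot be collected in practice.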

The solution is to substitute a group distribution of scores for an individual’s distribution of scores. In the late 1800s, the same approach was being used in physics, where scientists were measuring the average properties of a group of gas molecules to predict the average behaviour of an individual gas molecule. The same assumption now serves as the basis for every field of research in social science: school testing, admission tests, mental health tests, brain models, depression treatments, hiring policy and so on.

This whole approach, however, has been brought into question by Peter Molenaar, who addresses the “paradoxical assumption that you could understand individuals by ignoring their individuality”. Rose calls this the ergodic switch. According to ergodic theory, you are allowed to use a group average to make predictions about individuals if two conditions are true: 1) every member of the group is identical, 2) every member will remain the same in the future. So here is the sleight of hand: standardisation, the practice of assuming that you can say something useful about an individual based on comparison to a group average, assumes that humans are ergodic. And obviously they’re not.

To illustrate where it all goes wrong, Rose uses a typing analogy. Imagine we want to work out how to reduce the number of errors from a typing pool. If you look purely at errors mapped against typing speed, you would be forgiven for thinking that typing faster will produce fewer errors. But that’s because a professional typist is likely to be both faster and more accurate. Typing faster doesn’t automatically make you more accurate, although it appears so at the group level. If instead you looked at the problem at the individual level, you would quickly see that for more novice typists, typing faster actually makes them much less accurate. You might say, though, that no-one would try to solve that problem by taking only one measure, and a pretty silly one at that. You might say that the most likely measure to ‘predict’ errors in this case is the degree of training. And of course that is the point. Surely it would be just as silly to give students a test, measure only their score and how long they took to complete it, and then say the score they achieve is representative of their ability, without taking into account the degree of training and any number of other factors.
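The typing analogy can be made concrete with a toy model. Everything in this sketch is hypothetical: the `errors_per_page` formula, the skill levels and the speeds are invented purely to show how a group-level pattern and an individual-level pattern can point in opposite directions.

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def errors_per_page(skill, speed):
    """Hypothetical model: more skill means fewer errors, but pushing
    any one typist above their comfortable speed adds errors."""
    comfortable_speed = 30 + 10 * skill
    return max(0.0, 12 - 2 * skill + 0.5 * (speed - comfortable_speed))

skills = [1, 2, 3, 4, 5]

# Group view: one observation per typist, each at their comfortable speed.
group_speeds = [30 + 10 * s for s in skills]
group_errors = [errors_per_page(s, v) for s, v in zip(skills, group_speeds)]
group_r = pearson(group_speeds, group_errors)

# Individual view: the same novice (skill 1) typing at several speeds.
novice_speeds = [35, 40, 45, 50, 55]
novice_errors = [errors_per_page(1, v) for v in novice_speeds]
novice_r = pearson(novice_speeds, novice_errors)

print(f"group correlation (speed vs errors):  {group_r:.2f}")
print(f"novice correlation (speed vs errors): {novice_r:.2f}")
```

In this toy pool the group correlation is negative (faster typists make fewer errors) while the novice’s own correlation is positive (typing faster makes that person worse). Advice drawn from the group average would be exactly wrong for the individual.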

And so we end up with “financial credit policies that penalize credit-worthy individuals, college admission strategies that filter out promising students and hiring policies that overlook exceptional talent”. It is not that averages are not useful; they are, when talking about inter-individual studies. It is when the data is used to say something intra-individual that it falls down.

What does all this mean for assessment and testing?

For Rose, the conclusion is that standardised tests are at best a rather blunt and crude instrument. The backwash effect of the tests on the curriculum is a one-size-fits-all approach targeting the average, not the individual. So what should we be doing instead? Rose likes the competency based approach to both learning and assessment. Competency based assessment is all about assessing capability and referencing assessment objectives or criteria. Right now in the United States there is a big push on the competency based approach, driven to a large degree by the White House, which, put simply, has become rather upset about the value for money the States are getting from colleges. Private colleges in particular have been abusing the weak validity of automated standardised tests in order to claim their credit hours from central government, and a number of high profile for-profit colleges have come under increasing scrutiny for their questionable enrolment practices and poor quality curriculum. In the land of the free, freedom to offer a private education without regulation has been a long standing principle.

Competency based education is an opportunity to reward prior learning, to remove the dependency on credit hours, which effectively ties a student to classroom time, on or offline, and to accredit students for mastery of specific skills rather than some aggregated, normalised score that measures performance on the course, whatever that course might be.

And that indeed is where Todd Rose’s story ends. But of course we have had the competency based approach in the UK for over 30 years. In fact, it was introduced to the UK to clean up some of the dubious practices undertaken here as well. Yet it hasn’t been the panacea that Rose might have expected. In fact, it has tended to dominate vocational education, and be rather looked down upon by those in academia. The reasons for this, I believe, come down in part to the question of how you create descriptors for competences that describe what are seen as subjective qualities: how do you write, for example, a competence for creative writing, for critical thinking, for collaborative work? Competences are great for knowledge and understanding, not so great for skills. And that’s why, on my own journey through assessment, I have been a great advocate for Comparative Judgement.

Comparative Judgement is an alternative to marking that celebrates the fact that humans find it much easier to make relative judgements than absolute judgements. There is evidence to suggest that this reflects the way that our brains work. In comparative judgement, judges, as the assessors become, make binary judgements between pairs of pieces of work, and their judgements are aggregated with those of their fellow judges into a scaled rank. This scaled rank has been shown to be much more consistent between judges than marking. This then means that rather than shying away from authentic assessments that challenge reliability in a way that a multiple choice objective test never does, we can instead embrace them wholeheartedly. The use of comparative judgement, applied to competency based assessment, would therefore allow the United States to perhaps short circuit some of the problems that the UK system has endured.
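To give a feel for how binary judgements can become a scaled rank, here is a minimal sketch. It assumes a Bradley-Terry model fitted with the classic iterative update, which is one common choice for paired comparison data; the article does not say which model any particular comparative judgement system uses, and the scripts "A", "B", "C" and their judgements are invented for illustration.

```python
from collections import defaultdict

def bradley_terry(judgements, iterations=200):
    """Estimate a strength for each script from (winner, loser) pairs.

    Uses the iterative (Zermelo / MM) update for the Bradley-Terry
    model. Every script needs at least one win and one loss for the
    estimates to stay finite.
    """
    wins = defaultdict(int)         # total wins per script
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    scripts = set()
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for i in scripts:
            # Sum over every script j that i was compared with.
            denom = sum(
                count / (strength[i] + strength[j])
                for pair, count in pair_counts.items() if i in pair
                for j in pair - {i}
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {s: v / total for s, v in updated.items()}  # normalise
    return strength

# Hypothetical judgements: each tuple is (preferred script, other script).
judgements = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("A", "C"), ("A", "C"), ("C", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
]

strength = bradley_terry(judgements)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

The strengths are on a relative scale, so they say how much better one script was judged to be than another, not just the order. No judge ever assigns a mark; the scale emerges from the pattern of wins and losses.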