Our priorities now, what we value now, are test results.
Ed Reform Starts with Assessment Reform
Howard Johnson


I tend to think this goes far beyond education and reaches all corners of society. Something I have been thinking about a lot is that, in complex systems, data, even a lot of data, is never sufficient to provide definite answers to anything. Promoting Bayesian (conditional-probability-centric) thinking helps, but this is difficult in an environment where people already think they know the questions and are looking only for “answers,” and where many others are eager to hand them ready “answers” churned out by algorithms.

To use a somewhat contrived example, if we observe a chicken crossing the road, there are many inferences one can draw from it (not necessarily mutually incompatible): the chicken saw something on this side, the chicken liked the shade (and presumably the other side was shadier), the chicken just liked the other side, etc. Choosing among them can only be done through additional interaction between data and theory. No amount of crunching the data on hand could definitively settle this, although it can provide some clues for future investigation. Some possibilities might be mushier (more complex, more moving parts, and most importantly, more uncertainty), but they are not necessarily any less valid, and if anything, investigating them might be even more valuable precisely because of the “mushiness.” But they tend to be discounted in favor of the simpler and more straightforward. Since we have simple answers, we reject the hard questions, if you will. I suspect I might have just repeated your description of the follies of the “rationalist project.”
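To put the chicken in Bayesian terms, here is a minimal sketch. The three hypotheses come from the example above, but every number, and the follow-up "cloudy day" observation, is a made-up assumption for illustration only. The point is that an observation every hypothesis explains equally well leaves the beliefs exactly where they started; only a new observation that the hypotheses predict differently moves them.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

# Illustrative prior: no reason to favor any explanation.
prior = {"saw_something": 1 / 3, "wanted_shade": 1 / 3, "likes_other_side": 1 / 3}

# Assumed likelihoods: each hypothesis predicts "the chicken crossed" equally well.
crossed = {"saw_something": 0.9, "wanted_shade": 0.9, "likes_other_side": 0.9}
after_crossing = posterior(prior, crossed)

# A hypothetical follow-up observation: the chicken also crosses on a cloudy day.
# The hypotheses now disagree, so this observation is informative.
crossed_when_cloudy = {"saw_something": 0.5, "wanted_shade": 0.1, "likes_other_side": 0.9}
after_followup = posterior(after_crossing, crossed_when_cloudy)

for name in prior:
    print(f"{name}: {prior[name]:.2f} -> {after_crossing[name]:.2f} -> {after_followup[name]:.2f}")
```

Which follow-up observation to collect, and what each hypothesis would predict about it, is exactly the part that theory has to supply; the arithmetic cannot choose it.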

The problem with mushiness, of course, is that it does not clue in the audience as to what the point of the investigation is. People want to know the purpose of the investigation beyond the simple pursuit of understanding. I thought about this when you brought up the linkage between student performance and “exam requirements.” Students want to know why they are being tested on what they are being tested on, and they try to meet that rationale. I came across this quite a lot in political science. A lot of students try very hard to gauge the political bias of the instructor and sprinkle flavor matching that bias into their answers, and I gave a lot of them a really hard time, since I absolutely hated (and came to hate even more) “politics.” An instructor whose first question (a serious question, not a rhetorical one) was “what are the Democrats?” must have thrown a lot of them off. I don’t think this is fundamentally a bad thing where the expected answers are at least objectively defensible, even if not desirable; we do have formulas for a reason. But where the answers are inherently mushy and given to biases, just giving in to oversimplifications and appeals to the “obvious” is fatal.

Your characterization of the statistical problems in standardized test design, I think, reflects a much broader range of problems in how data is used. Again, the problem seems to be a lack of nuanced thinking, and a binary approach with a bias towards “right” (or wrong) answers worsens it. In fact, this is built into the desire to rank the results along a unidimensional space. There are two related problems here. First, if there are questions that are answered correctly by 10% of the students, is it always the same 10%? It shouldn’t be, if different things are being measured, but if it is the same 10%, that certainly makes it easier to rank them. Second, the variability in the subset giving the “right” answers changes as a function of the percentage of right answers: with a 10–90 split between right and wrong answers, the set of students giving right answers will be a lot less variable than with a 50–50 split. But the desire to produce nice, straightforward “ranks” would reduce even the limited variability that can be found in 10–90 subsets. (NB. The variability is potentially manipulable, if one tries. Not all people are given to think the same way. Different questions will throw off different people. I violated the principle of not experimenting on students when I set up tests where I could throw off different students by playing to their biases, so in a sense, I really do deserve some of the resentment I got. But this was also a topic of serious political research. The linked study was a better survey than the one I was trying to set up: for real research questions, I used contrived/hypothetical candidates/votes for a lab audience, whereas that study incorporated the questions into a real survey and used actual votes in the Senate. It found that people who are more politically interested were more likely to get the facts wrong when well-known politicians voted unexpectedly.
Now, one might say that, in order to throw different people off on different questions, i.e. to produce a lot of different 10–90 combinations on a set of questions, a lot of contrivance would have to be incorporated. But that is the point: what kind of contrivance leads to how much variability? That seems more honest than contriving a test to produce a lot of predictable variation, i.e. the same 10% getting the right answers all the time, and pretending that the resulting straightforward ranks are natural. Yes, we contrived to make a completely different 10% get the “right answers.” How natural is our contrivance? The linked study shows that if senators vote against their party, “well-informed” people are less likely to know than the not-so-well-informed ones. Can we dismiss this by saying that senators don’t (usually) vote against their party?)
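The "same 10%" question can be checked on a toy simulation. This is only a sketch under loud assumptions: the student count, the noise level, and both test designs are invented for illustration and are not taken from any real test or from the linked study. Design A makes every question tap one underlying ability; Design B makes each question tap an unrelated skill. The sketch measures how much the sets of students answering correctly overlap across questions, and also counts how many distinct "right answer" subsets are possible at a 10–90 versus a 50–50 split.

```python
import math
import random
from itertools import combinations

def top_fraction(scores, fraction):
    """Indices of the students whose scores fall in the top `fraction`."""
    k = round(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def mean_overlap(correct_sets):
    """Average Jaccard overlap between every pair of correct-answer sets."""
    pairs = list(combinations(correct_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_students, n_questions, fraction = 200, 8, 0.10

# Design A (assumed): every question taps the same ability, plus a little noise.
ability = [rng.gauss(0, 1) for _ in range(n_students)]
same_skill = [
    top_fraction([a + rng.gauss(0, 0.2) for a in ability], fraction)
    for _ in range(n_questions)
]

# Design B (assumed): each question taps a different, unrelated skill.
different_skills = [
    top_fraction([rng.gauss(0, 1) for _ in range(n_students)], fraction)
    for _ in range(n_questions)
]

overlap_same = mean_overlap(same_skill)
overlap_diff = mean_overlap(different_skills)

# Number of distinct subsets of students who could be the "right" group.
subsets_10 = math.comb(n_students, round(n_students * 0.10))
subsets_50 = math.comb(n_students, round(n_students * 0.50))

print(f"mean overlap, one ability:      {overlap_same:.2f}")
print(f"mean overlap, unrelated skills: {overlap_diff:.2f}")
print(f"more possible subsets at 50-50 than at 10-90: {subsets_50 > subsets_10}")
```

Design A yields tidy ranks because the same students keep landing in the top group; Design B scrambles them. Neither is more "natural" than the other, which is the point about contrivance.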

PS. A rather longwinded comment, but I wanted to repeat the point about the desire to produce “ranks,” arising from a potentially flawed combination of theories and incentives, getting in the way of actually useful assessment, as I think this is a major underlying problem in a lot of (ab)use of data. A commonly encountered simultaneity problem in political science is that of inferring “ideology” from observed behavior, but, for all manner of reasons, “ideology” is construed along a liberal-conservative dimension (i.e. a unidimensional scale, if only because this makes things easy for everyone, while pro-local post office vs. anti-local post office makes no sense to people who are not local, to offer up another contrived example). The statistical techniques used are biased to produce a unidimensional answer, and the actions of survey respondents and even politicians are designed to make this seeming dichotomy easier to produce, e.g. scheduling “fake” votes with no consequences just so that Democrats will vote with Democrats against Republicans voting with Republicans. At the end of the day, we have a bunch of numbers, but they reveal very little about underlying “politics,” in part because of the poverty of theorizing about the possible sets of “odd” events that could unfold in politics, which, if found, are brushed under the rubric of “errors.” If the politicians and survey respondents were not gaming the system that much before, they are doing more of it now, because they know what they are “supposed” to say to match the standard “Republican/Democratic” answers. The analogue, of course, is how students try to game exams by gauging what they are expected to answer and giving the answers that are sought, which are the “right” answers because the “textbooks” (or whatever other trusted sources) say so. I’d wager that the equivalent of this sort of data-manipulating gamesmanship takes place everywhere. And why not (putting on my cynical hat)?
The “right answers,” which we know, because, data, are all that matters.

Like what you read? Give Henry Kim a round of applause.
