Low-stakes Testing Is Good

It helps to be very clear about what we actually seek to achieve by giving someone a test.

David Moore
Educate.
8 min read · Mar 10, 2021


There’s a somewhat frustrating series of arguments going around right now about standardized testing, admissions, anti-racism, and a few other related topics. There’s Matthew Yglesias arguing in a Washington Post op-ed that Ibram X. Kendi’s anti-racist response to testing ignores real academic disparities that can’t all be attributed to different forms of cultural knowledge. There’s Jamaal Bowman linking to an NEA report on the racist legacy of standardized testing and adding that standardized testing is a “pillar of systemic racism.” There are arguments about eliminating testing as a means of screening students for magnet programs. And on and on.

This is a confusing subject, not least because a lot of it depends on an “all of the above” base of knowledge to figure out what’s going on: on-the-ground experience as a K-12 educator (Bowman), historical thinking (Kendi), and some level of understanding the politics of education, which are complicated and sometimes don’t abide by the usual routes of polarization. So in these discussions, I think it helps to be very clear about what we actually seek to achieve by giving someone a test.

(1) In some cases, tests have an exclusionary screening function. This is the legacy of the development of some American standardized tests, inextricable from their racist and segregationist roots. Some contemporary tests in this category, for current magnet programs or college admissions, say, still serve a screening function that is probably inevitable: there are limited slots. It is debatable whether these tests are themselves drivers of racist exclusion or whether they reflect racist exclusion already occurring. Perhaps, on the margins, some of these contemporary tests reduce inequity relative to an alternative with no screening mechanism at all, whose benefits would redound to people with the right social connections.

(2) In other cases, tests have a diagnostic function. The goal of this sort of testing is, as accurately as possible, to assess progress in various areas for potential intervention. These kinds of assessments (including a whole alphabet soup of diagnostic programs and companies) are used all the time in schools, and they are often invisible to commentators who don’t have direct experience with teaching in educational settings. They are not what most people are referring to when they talk about “standardized testing culture” or “teaching to the test.” Although these diagnostic tools can have huge implications for student support and placement within schools, they are usually not used for external screening purposes. When such diagnostic tools are used for internal screenings, they sometimes contribute to structural concerns with “academic tracks” and other dubious ways of organizing students within classrooms. But again, this is different from tracking school progress. Which leads to…

(3) In some cases, supposedly diagnostic tests become tools for de facto screening and exclusion by tracking the progress of a school or district. These are the kinds of state-mandated tests that most teachers and principals (like Rep. Bowman) are referring to when they decry the legacy of “standardized testing” or “teaching to the test” in a K-12 context. These tests usually do not track student progress at the individual level for things like graduation requirements, though some places have such requirements. The ostensible purpose of these progress-tracking tests, which proliferated under No Child Left Behind in the 2000s, is to diagnose the efficacy of schools and school districts, and, within districts, sometimes to diagnose the efficacy of teachers. What they often wind up doing is reducing the autonomy, resources, and status of schools based solely on student performance, which itself is highly correlated with the socioeconomic status and resources available in the place the school is located.

Although these three types of testing (screening, diagnostic, and tracking) are all given while students are in K-12 schools, their roles, roots, and impacts vary a lot, and they’re used differently in different contexts. All three kinds of testing provide data about students, but we usually distinguish between high-stakes testing, of the sort used for school accountability, and other forms of testing, whether those tests are actually high stakes for the people who take them (for college entrance) or low stakes (for open-ended diagnosis and support).

When I was a student in Maryland, I took early versions of the MSPAP, the Maryland School Performance Assessment Program. These resembled high-stakes tests, but in fact they were low stakes when I took them: there was no impact on our academic progress, on our teachers’ job security, or on the standing of our school within the district. By 1997, the test was indeed being used as an accountability measure, but it was also under fire for not being standardized enough, since it lacked multiple-choice questions. The test was scrapped, and its 2002 replacement was targeted both to track schools (to comply with No Child Left Behind) and to track individual students, which the MSPAP never did:

Maryland educators said the test results will yield two scores: an individual score of “basic, proficient or advanced” that refers to how well a student mastered the state curriculum, and a numerical rank that educators can use to compare with other states.

Grasmick said Maryland is the first state to design a test under No Child Left Behind (NCLB) requirements that would produce a local and national measurement.

What made the new test high stakes in 2002 in a way it wasn’t in 1992 was that schools were being judged internally, by students’ individual performance, and externally, within the state and nationally. That combination of high stakes and low standardization (no national curriculum, no one test across all states, etc.) proved a volatile mix, exacerbating funding inequities and poisoning the general environment of schools for students and teachers, especially those that needed the most support. In the past decade, we have witnessed a growing backlash against this kind of standardized testing, exemplified by Diane Ravitch’s account, in books like Reign of Error, of how she shifted her own views on this kind of testing. Such testing fits Kendi’s description of a mechanism that is racist in effect regardless of intent, because it actively amplifies the racism of the American education landscape.

By comparison, a diagnostic test like the Program for International Student Assessment (PISA), an international test that compares student progress across different countries (not without its own complexities and problems), provides a snapshot of student progress without tying this project to explicit funding mechanisms for American schools. For example, a Pennsylvania school’s performance on PISA wouldn’t be on the radar screen of the PA legislature.

There are tests that are high stakes in the non-NCLB-jargon sense precisely because screening functions are necessary, and it’s just not that easy to screen lots of people. The SAT is by no means a perfect screening tool, for many reasons (something that Kendi writes about in detail in Stamped), but absent any standardized testing, most college admissions offices without a more proactive recruitment strategy would likely fall back on students’ grades, recommendations, and academic history, along with the broader social supports that help students match with colleges, apply, and craft a narrative and materials. It’s not clear that the specific inequities in college admissions would budge, whereas other supports, like a proposed “SAT bump” based on demographic information, could become a means of getting students who lack academic or social support, but who happen to score well on the test, matched to the right colleges.

Mixing up all of these different functions of standardized testing makes a hash of the areas in which standardized testing creates inequity (the historical roots of some such tests), sustains or fuels inequity (the use of these tests in a high-stakes testing environment), or potentially combats inequity (any testing that helps to replace or disrupt systems otherwise built on social connectedness and affluence). My sense is that low-stakes testing of various kinds is potentially helpful in combating inequity: diagnostics without consequences really do help teachers, administrators, and, at the broader level, districts identify the details on the ground: what students can do now, what they should be able to do, and what supports will help. Meanwhile, environments that seem to call for high-stakes testing could at least be designed or augmented to build equity into inequitable processes, but in those cases there is usually a much bigger question of why the resources are so scarce to begin with.

Accurate diagnostic tools are critical for identifying areas of need and for allocating resources that support improvement. This is most obvious at both the most local and the most universal scales of diagnosis: informal classroom assessments on one end, broad population studies on the other. Each of these is low stakes for both the person giving the test and the person taking it, but the right allocation of resources for educational interventions makes a huge difference, and it is one you can’t necessarily make without any data at all.

As for high-stakes screenings, they’re inevitable when there are scarce resources, such as available slots at prestigious universities. And here institutions should aim to adopt the kinds of standardized assessments that help reduce inequity when they can. But this kind of solution alone is not going to transform an unequal field, which is why increasing opportunities more broadly, like expanding college access and reinvigorating public college systems, has to be a major part of a solution, along with reimagining the broader goals of individual, highly elite institutions. Otherwise you end up in a position like the classical music world, where blind auditions did decrease inequality to a point, and then plateaued due to the hyper-competitive nature of the field and a lack of more active and nuanced recruiting measures to increase diversity.

So my sense is that we should champion low-stakes testing whenever possible, and treat the places where we find high-stakes testing environments as opportunities to reflect on the conditions in which scarcity, or punitive testing environments, have taken root. The increasing competitiveness of some colleges and the predatory practices of others suggest that helping kids get into Ivy League schools is only a tiny piece of the larger systemic problems in higher education, and that we should be asking what new models or thinking could move us away from that scarcity. “Standardized tests” as we now know them (as accountability mechanisms) probably need to go. But in public discussion of these different kinds of tests, more precision would help us figure out where testing is hurting and why, so that we can get rid of the tools that aren’t helping, and also imagine ways to build new tools, or use existing ones, that help us identify and address inequities.
