“How Much Should I Care?” Five questions policymakers and practitioners should ask when sizing up education research
Making sense of new research means asking, “How much should I care?” Faced with limited resources, political realities, and myriad other constraints, people who shape policy and practice need to understand the magnitude of the opportunity implied by a new set of findings. Without that understanding, they may find a study interesting (at best) but not useful.
Hence translations of research for nontechnical audiences often include comparisons to measures laypeople already know. I once saw a roomful of education leaders go from “meh” to “OMG” when a presenter switched from standard deviations to roughly equivalent months of additional student learning. Such conversions require caveats, but without them the layperson walks away with a shrug.
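The conversion that presenter used boils down to simple arithmetic. The sketch below makes the idea concrete; the benchmark figure (average annual learning gains of roughly 0.4 standard deviations, over a nine-month school year) is an illustrative assumption of mine, not a number from the article, and real conversions vary considerably by grade level and subject:

```python
# Back-of-envelope conversion from an effect size (in standard deviations)
# to rough "months of additional learning." The annual-gain benchmark is
# an illustrative assumption; published benchmarks vary by grade and subject.

ANNUAL_GAIN_SD = 0.4     # assumed average yearly learning gain, in SD units
SCHOOL_YEAR_MONTHS = 9   # assumed length of a school year

def sd_to_months(effect_sd: float) -> float:
    """Convert an effect size in SD units to approximate months of learning."""
    return effect_sd / ANNUAL_GAIN_SD * SCHOOL_YEAR_MONTHS

print(sd_to_months(0.10))  # a "small"-sounding 0.10 SD effect -> 2.25 months
```

Under these assumptions, an effect that sounds trivial as “0.10 standard deviations” becomes “about two extra months of learning,” which is exactly the kind of translation that moves an audience from shrug to attention.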
My thinking on this has recently broadened, thanks to Matthew Kraft, professor of education and economics at Brown University. Kraft and his collaborators have a knack for examining widely touted improvement strategies from new angles, and in doing so producing far more nuanced (and helpful) findings than simply “it worked,” or “it didn’t.”
Here’s an example: Eight years after the famous “Widget Effect” report showed that 98 percent of teachers received performance ratings of satisfactory or above, Kraft and colleagues found virtually the same pattern, despite a massive investment during the intervening years in policies meant to improve teacher evaluation.
While that finding is important by itself, their paper included insights from interviews with principals who explained why they gave satisfactory ratings to teachers they knew were ineffective. Among the reasons: avoiding conflict, and the belief that a satisfactory rating made it easier to support a teacher’s growth. Whatever you think of such reasoning, knowing it provides a more complete understanding of the problem than merely knowing the problem exists.
The most recent analysis by Kraft to catch my eye takes aim at how researchers write and talk about the importance of a finding, given the size of the effect. He makes his case in a working paper written for a scholarly journal. But his argument for considering a wider set of factors when making a claim as to the significance of a finding is relevant to anyone who uses social science research. Not surprisingly, his paper has drawn considerable attention on social media.
Kraft begins his paper with a helpful short history of conventions for interpreting effect sizes in the social sciences. Commonly used standards for calling a result “small,” “moderate,” or “large” are mostly based on studies conducted some 40 years ago, which entailed very different interventions, comparison groups, and outcome measures than those at the center of many studies today. To call an effect “small” based on comparison to results from studies with much narrower parameters is, he argues, misleading.
To prompt deeper discussions of what constitutes a small vs. large effect, Kraft presents a set of guidelines for considering the importance of a study’s results. Because I think this discussion deserves wide participation, I’ve translated his key points into a set of questions to ground conversation among researchers and research consumers. Anyone who uses research to inform decisionmaking should be asking these, and the answers should color any thinking about implications. (Many thanks to Kraft for providing feedback on these.)
How much should I care given that the study included (or didn’t include) randomization? We all know the mantra, “correlation is not causation.” Just because two things are generally related, that doesn’t mean one is causing the other. Taller students tend to read better than shorter ones, but that doesn’t mean height improves reading ability; older students are generally both taller and better readers.
As Kraft notes, however, correlations often get reported in ways that sound like causal effects. A research write-up will say that a difference of X amount of one thing is “associated” with a difference of Y amount of something else. To the untrained ear, that sounds a lot like the one is causing the other, and that given X amount of the first thing, you’ll get Y amount of the second.
Here’s why else this matters: Correlations reported as effect sizes often sound impressively large, when in fact they say nothing about how much one thing causes another, or even if a causal relationship exists at all. Conversely, true causal effects — revealed through randomized experiments — rarely seem so big, but we know the effect is real because randomization removes any bias in selecting who got the treatment.
We should take it with a grain of salt when a study based on a simple correlation shows an impressive association between factors. Meanwhile, when a study based on random assignment shows what seem to be small effects, that’s nothing to sneeze at; even a small effect that survives random assignment is worth noticing.
How much should I care given how the results were measured, and when? Which is more impressive: Hitting a big target from four feet away, or hitting a small one from 100 feet? We often look for the effect of an improvement strategy at a great distance from the point of implementation. To study the effects of instructional coaching, we look for changes in student test scores at the end of the school year. But any signal from what happens between a coach and a teacher has traveled a long way by the time that teacher’s students take their annual exams.
To push the above analogy further, a target is designed to do one thing: it’s very good at telling you how accurate you were. A standardized test is designed to measure what students know and can do, but only for a narrow range of skills and content. Changes in test scores will reflect a lot more than just the effects of coaching, while at the same time they won’t reflect everything that coaching was meant to improve.
Why does this matter? Often, it gets reported that a strategy for instructional improvement is more effective in changing teacher practice than it is at raising student learning. This creates the impression of failure, or at least disappointment, because the thing we care about most changed less than what we think of as a means to a more important end. But as Kraft points out, it makes perfect sense that teaching would change more than student performance.
We should take note whenever an improvement strategy shows an effect on some distant measure, even when that effect seems small. Likewise, we should not be surprised when things closer to the mechanisms for improvement show greater changes. To improve student learning requires working through intermediate steps that are always more likely to change as a result of an intervention than are the outcomes that matter the most to us.
How much should I care given who’s compared with whom? Some strategies have bigger effects on some populations than others. Kraft points out that growth mindset interventions are often more effective with students from marginalized communities than with students who haven’t faced as much hardship. Hence, studies of growth mindset strategies applied to the general student population will show smaller effects than studies that include only the students most likely to respond to such strategies.
A program’s effects will also seem smaller when the comparison group benefits from supports that mimic some of those given to the treatment group. Kraft explains that some studies of Head Start compare Head Start participants to a population of children that includes many who take part in some other type of early childhood education. Such studies don’t really tell you the difference between getting Head Start and getting nothing; a study making that comparison would likely show bigger effects.
This matters because it’s easy to reject an overall strategy based on research in which the comparison group is drawn from the general population and includes individuals who may be doing something similar to the program being studied. If kids in Head Start do only a little better than kids in the general population, one might conclude that early childhood education in general is not worth the investment. That’s a faulty conclusion.
To be sure, it’s often not possible to find a comparison group that isn’t doing anything akin to the strategy being studied. We can’t tell parents, “for the sake of our experiment, keep your kid out of any kind of preschool.” But when a study’s comparisons are necessarily muddied, we should view even a small effect as potentially important.
The above three questions serve a similar function: to prompt discussion about how a study’s design should inform how we think about its implications. In his paper, Kraft adds two more considerations that are more about context but equally important. Here I’ve translated them into additional prompts to help research consumers and communicators think about whether a new finding is a big deal.
How much should I care given the cost of implementation? Some strategies produce minimal effects, but they’re also cheap. Take programs that send people text-message reminders. Kraft notes such programs have been found to reduce student absenteeism and increase parental engagement in their children’s academic development. The effects aren’t big, but neither is the expense of implementation. At a cost of a few dollars per participant, why not give students the added advantage, as small as it might be?
Kraft doesn’t argue for dismissing more costly interventions. But he does say cost should be part of the conversation when results are discussed. Moreover, he points out that not all costs are monetary: demands on people’s time should also factor in when taking stock of a strategy’s observed effects. Sizing up the potential return on investment requires knowing what that investment is.
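The logic of weighing effects against costs can be expressed as a simple cost-effectiveness ratio: effect size per dollar spent per participant. The sketch below shows the arithmetic; the program names and all the numbers are hypothetical, invented only to illustrate how a small, cheap effect can compare favorably with a large, expensive one:

```python
# Compare hypothetical programs by effect size per $1,000 spent per student.
# Program names and figures are invented; only the arithmetic is real.

def effect_per_thousand_dollars(effect_sd: float, cost_per_student: float) -> float:
    """Cost-effectiveness: standard deviations of effect per $1,000 per student."""
    return effect_sd / (cost_per_student / 1000)

programs = {
    "text-message nudges": (0.05, 10),     # small effect, very cheap
    "intensive tutoring":  (0.30, 2500),   # large effect, expensive
}

for name, (effect, cost) in programs.items():
    ratio = effect_per_thousand_dollars(effect, cost)
    print(f"{name}: {ratio:.2f} SD per $1,000 per student")
```

In this made-up comparison, the nudge program delivers 5.00 SD per $1,000 per student versus 0.12 for tutoring; the point is not that one is better, but that the ratio, and not the raw effect size alone, belongs in the conversation.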
How much should I care given how likely it is that the strategy could be replicated elsewhere under similar conditions? Scaling up without losing effectiveness is a perennial challenge in American education. Kraft has seen this in his own analyses of instructional coaching: when coaching programs are deployed across many more school systems, their effectiveness decreases. In our country’s highly diverse and decentralized education system, resources and conditions vary significantly, and those variations may affect a program’s effectiveness, or the extent to which a program can be implemented as intended.
Kraft says we shouldn’t avoid improvement strategies that are difficult to scale. As he notes, school improvement requires behavioral changes that push teachers and administrators beyond their comfort zones. If we only did what’s easily scaled, we’d never tackle the hard but necessary work of organizational change. But doing the hard work requires attention to the conditions in which improvement strategies have been successful. If we can’t replicate the necessary conditions, we can’t expect the same effects.
An underlying message throughout Kraft’s paper is that moving the needle in education is hard, and necessarily so. And yet, in this field we’re quick to throw up our hands and say, “it didn’t work, let’s try something new.” That only makes the challenge harder. By understanding why some effects may be smaller than others — and when small effects may be a bigger deal than they initially seem — we’re more likely to know when to say, “hang on, maybe there’s something here.”
Then discussion can turn to the more important question: “So what do we do now?”
Jeff Archer is president of Knowledge Design Partners LLC. KDP supports education-focused foundations, nonprofits, and research groups with content development, knowledge-management planning, and communications. On Twitter at @KDPartners_LLC