Why ‘What Works’ Doesn’t In Education Research

Jay Lynch · Age of Awareness · Sep 27, 2018


If the quality and informativeness of education research is to improve, it will need to kick a bad habit — focusing on whether or not an educational intervention ‘works’.

And efforts to answer that question through null hypothesis significance testing (NHST), which asks only whether an intervention or product has an effect on the average outcome, undermine the ability to make sustained progress in helping students learn. NHST provides little useful information and fails miserably as a method for accumulating knowledge about learning and teaching.

How does NHST look in action? A typical research question in education might be whether average test scores differ between students who use a new math game and those who don’t. Applying NHST, a researcher would assess whether the difference in scores is large enough to conclude that the game has had an impact, or, in other words, that it ‘works’.

Left unanswered is why, how much, and for whom.
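
Here is a minimal sketch of that procedure in Python. The scores are simulated and the groups, means, and sample sizes are invented for illustration; nothing below comes from a real study.

```python
# A minimal sketch of NHST as typically applied in education research.
# All scores are simulated; the groups and sample sizes are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated final test scores: students who used the math game vs. a
# business-as-usual control group
game_group = rng.normal(loc=72, scale=10, size=60)
control_group = rng.normal(loc=70, scale=10, size=60)

# The standard NHST move: test the null hypothesis that the true mean
# difference between the groups is exactly zero
t_stat, p_value = stats.ttest_ind(game_group, control_group)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# The entire output of the exercise is a binary verdict
print("the game 'works'" if p_value < 0.05 else "no effect detected")
```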

This approach pervades education research. It is reflected in the U.S. government-supported initiative to aggregate and evaluate educational research, aptly named the What Works Clearinghouse, and frequently serves as a litmus test for publication worthiness in education journals. Yet it has been subjected to scathing criticism almost since its inception, criticism that centers on two issues.

False Positives And Other Pitfalls

First, obtaining statistical evidence of an effect is shockingly easy in experimental research. This is particularly true for educational researchers employing feeble controls, positing vague theories, comparing multiple variables, selectively reporting significant results, and using flexible data analyses. One of the emerging realizations from the current crisis in psychological research is that rather than serving as a responsible gatekeeper ensuring the trustworthiness of published findings, reliance on statistical significance testing has had the opposite effect of creating a literature filled with false positives, overestimated effect sizes, and underpowered research designs.

Assuming a proposed intervention involves students doing virtually anything more cognitively challenging than passively listening to lecturing-as-usual (the typical straw-man control in education research), a researcher is all but assured of finding a positive difference as long as the sample size is large enough. Showing that an educational intervention has a positive effect is a remarkably low hurdle to clear. Combined with widespread publication bias in favor of positive findings, it isn’t at all shocking that in education almost everything appears to work.
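
A quick simulation shows just how low the hurdle is. Here the ‘intervention’ is given a true effect of just 0.05 standard deviations, an amount with no educational significance; the effect size and sample sizes are illustrative assumptions. Once the sample is large enough, significance is all but guaranteed.

```python
# Sketch: a trivially small true effect becomes 'statistically
# significant' once the sample is large enough. The 0.05 SD effect and
# the sample sizes are illustrative assumptions, not real estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
tiny_effect = 0.05  # 5% of a standard deviation

for n in (50, 500, 5_000, 50_000):
    treatment = rng.normal(loc=tiny_effect, scale=1.0, size=n)
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p = stats.ttest_ind(treatment, control)
    print(f"n = {n:>6} per group: p = {p:.4f}")

# Typical output: p sits far above 0.05 for small samples, then drops
# below it as n grows, even though the effect is educationally meaningless.
```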

But even if these methodological concerns with NHST were addressed, there is a second serious flaw undermining the NHST framework upon which most experimental educational research rests.

Null hypothesis significance testing is an epistemic dead end. It spares researchers the need to specify and develop testable models of their theories that can predict and explain an intervention’s effects. In fact, the only hypothesis evaluated within the NHST framework is a caricature the researcher doesn’t believe: that the intervention has exactly zero effect. A researcher’s own hypothesis is never actually tested, nor even clearly articulated. And yet, almost universally, education researchers falsely conclude that rejecting the null hypothesis counts as strong evidence in favor of their preferred theory.

As a result, NHST encourages and preserves hypotheses so vague, so lacking in predictive power and theoretical content, as to be nearly useless. It has been described as a “sterile intellectual rake”, an activity that “retards the growth of scientific knowledge.”

And contrary to widespread belief, finding that the observed data are unlikely under the null hypothesis (e.g., p < 0.05) does not provide evidence for accepting any alternative hypothesis, because the null is the only theory under consideration.

Just because the data are improbable under the null of zero effect does not mean they are more probable under some alternative theory.
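
A toy calculation makes this concrete. Suppose a study observes a standardized effect of d = 0.15 with a standard error of 0.05, while the researcher’s theory predicts a large effect of d = 0.8 (all numbers invented for illustration):

```python
# Sketch: data can be 'significant' against the null yet count against
# the researcher's own theory. All numbers are invented for illustration.
from scipy import stats

observed_d, se = 0.15, 0.05

# Two-sided p-value against the null of zero effect
z = observed_d / se
p_null = 2 * stats.norm.sf(abs(z))
print(f"p against the null: {p_null:.4f}")  # about 0.003: 'significant'

# How likely is the observation under each hypothesis?
like_null = stats.norm.pdf(observed_d, loc=0.0, scale=se)
like_theory = stats.norm.pdf(observed_d, loc=0.8, scale=se)
print(f"likelihood under d = 0.0: {like_null:.3g}")
print(f"likelihood under d = 0.8: {like_theory:.3g}")

# The data are unlikely under the null, but vastly MORE unlikely under
# the theory that predicted d = 0.8. Rejecting the null is not evidence
# for the hypothesis the researcher actually cares about.
```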

As researchers in psychology are realizing, even well-regarded theories, ostensibly supported by hundreds of randomized controlled experiments, can begin to evaporate under scrutiny, because reliance on null hypothesis significance testing means a theory is never really tested at all. As long as educational researchers continue to rely on testing the null hypothesis of no difference as a universal foil for establishing whether an intervention ‘works,’ we will struggle to improve our understanding of how best to help students learn. And the field of education will continue to be dominated by “explanation-less collections of observations — that is, mere ‘stamp collecting’” (Ashton, 2013, p. 585).

As analysts Michael Horn and Julia Freeland have noted, this dominant paradigm of educational research is woefully incomplete and must change if we are going to make progress in our understanding of how to help students learn:

“An effective research agenda moves beyond merely identifying correlations of what works on average to articulate and test theories about how and why certain educational interventions work in different circumstances for different students.”

Yet for academic researchers concerned primarily with producing publishable evidence of interventions that ‘work,’ the vapid nature of NHST has not been widely acknowledged as a serious issue. And because the NHST approach to research is straightforward, intellectually undemanding, and relatively safe (researchers have an excellent chance of getting the answer they want), it isn’t surprising that there has been little incentive to change.

Moving Forward

Rather than being satisfied with answering the question of whether or not a product or intervention ‘works’, education researchers can improve the reliability of their findings, and contribute to a better understanding of how to help students learn, by modifying their approach in several ways.

  • Recognize the limited information NHST can provide. As a framework for advancing our understanding of learning and teaching, it is misapplied: it ultimately tells us nothing we actually want to know. Furthermore, it contributes to the proliferation of spurious findings in education by encouraging questionable research practices and the reporting of overestimated intervention effects.
  • Instead of relying on NHST, researchers should focus on putting forward theoretically informed predictions and then designing experiments to test them against meaningful alternatives. Rather than rejecting the uninteresting hypothesis of “no difference,” the primary goal should be to improve our understanding of the impact interventions have, and the best way to do this is to compare models that compete to describe the observations experiments produce (a sketch follows this list).
  • Rather than making dichotomous judgments about whether an intervention works on average, devote greater evaluative emphasis to exploring the impact of interventions across subsets of students and conditions. No intervention works equally well for every student, and it is the creative, imaginative work of trying to understand why and where an intervention fails or succeeds that is most valuable. We must learn to embrace uncertainty and accept variation rather than ignore it.
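
As a sketch of the model-comparison approach suggested above, the example below simulates a scenario in which an intervention helps students with weak prior knowledge more than strong ones, and compares a constant-effect model against a varying-effect model. The scenario and every parameter are assumptions made for illustration, not a prescribed method.

```python
# Sketch: compare competing models of an intervention's effect instead of
# testing a null of zero difference. Data and effect sizes are simulated;
# the scenario is hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 400
prior = rng.normal(size=n)            # standardized prior-knowledge score
treated = rng.integers(0, 2, size=n)  # random assignment to the intervention

# True data-generating process: the benefit shrinks as prior knowledge rises
score = 0.5 * prior + treated * (0.4 - 0.3 * prior) + rng.normal(size=n)
df = pd.DataFrame({"score": score, "prior": prior, "treated": treated})

# Model 1: the intervention has one constant average effect
m_constant = smf.ols("score ~ treated + prior", data=df).fit()
# Model 2: the effect varies with prior knowledge (adds an interaction)
m_varying = smf.ols("score ~ treated * prior", data=df).fit()

# Lower AIC indicates the better model; here the varying-effect model
# should win, telling us not just whether the intervention helps but
# for whom it helps most.
print(f"constant-effect model AIC: {m_constant.aic:.1f}")
print(f"varying-effect model AIC:  {m_varying.aic:.1f}")
print(m_varying.params)
```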

References

Ashton, J. C. (2013). Experimental power comes from powerful theories — the real problem in null hypothesis testing. Nature Reviews Neuroscience, 14(8), 585.

This piece originally appeared in modified form on EdSurge.
