Did the Marshmallow Test Really Get Debunked?

Apr 26, 2019

The test actually does predict later academic achievement. What that means is an open question.

Not too long ago you may have heard that yet another seminal finding in psychology had bitten the proverbial dust. This time it was the beloved “marshmallow test”. The classic finding is that kids who can resist the temptation to eat a marshmallow placed in front of them, while waiting for a second marshmallow to be delivered, are more likely to thrive later in life academically and behaviorally. According to the popular press, this result “failed to replicate” in a new study, and had thus been “debunked”, adding to the sad heap of replication failures in psychology.

Most readers didn’t rush to take a closer look at the source of these claims, which was an empirical paper published in the journal Psychological Science and authored by Tyler Watts, Gregory Duncan, and Haonan Quan. Perhaps this is because, given the “replication crisis” in psychology, the news was not terribly surprising. On the other hand, it was sensational, and so it prompted a lot of tweeting, retweeting, and retweeting of comments on tweets. The original work by Walter Mischel began in the 1960s at Stanford University and spawned decades of research. It can be counted among the most widely known findings in psychology. It’s literally the only developmental psychology research finding that I can bring up to anyone I happen to be talking with (at the coffee shop, in the park, at a family gathering) and expect that they will nod with familiarity. So perhaps people ran with this news because it fit their schema of the current state of the field of psychology and was also good gossip.

However, on a closer read one can see that the media got it wrong. As my co-authors and I argue in a new scientific commentary about the research, some of the authors’ conclusions (the ones the media went off the rails with) do not follow from their results.

It boils down to this. In some of their statistical models, the authors statistically controlled for things that likely captured variation in executive control processes, which are what allow us to do things like keep goals in mind (like waiting for two marshmallows) and inhibit urges (like eating a marshmallow sitting in front of us). As we discuss in the commentary, these executive processes likely play a supporting role in delay of gratification itself, so measures that capture executive processes should not have been included as control variables. Essentially the analyses were throwing the baby out with the bathwater.
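To make the logic concrete, here is a toy simulation. It is only a sketch: it does not use the authors' data or reproduce their models, and every number in it (the sample size, the 0.8 and 0.5 weights, the noise levels) is an assumption chosen for illustration. In this made-up world, a child's executive control drives both how long they wait and how well they later do in school. The code compares the slope for delay with no controls to the slope for delay after controlling for executive control.

```python
import random


def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def simple_slope(x, y):
    """Slope from regressing y on x alone (no control variables)."""
    return cov(x, y) / cov(x, x)


def adjusted_slope(x, z, y):
    """Slope on x when y is regressed on both x and a control variable z."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000               # hypothetical number of children

# Assumed data-generating story: executive control supports delaying,
# and the same executive control supports later achievement.
executive = [rng.gauss(0, 1) for _ in range(n)]
delay = [0.8 * e + rng.gauss(0, 0.6) for e in executive]
achievement = [0.5 * e + rng.gauss(0, 1) for e in executive]

raw = simple_slope(delay, achievement)
adjusted = adjusted_slope(delay, executive, achievement)

print(f"delay slope, no controls:          {raw:.2f}")
print(f"delay slope, controlling for exec: {adjusted:.2f}")
```

Under these assumptions the uncontrolled slope comes out near 0.4, while the controlled slope comes out near zero. Delay did not stop mattering in the second model. The control variable simply absorbed the very thing that makes delay predictive, which is the baby-and-bathwater problem in miniature.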

Likewise, in other statistical models reported in their paper, the authors controlled for child background and home environment, variables which similarly may have captured things like social norms and trust that may be at play in the development of delay of gratification skills and also in the moment that a child is attempting to resist the marshmallow in front of them. For example, if a child comes from a home where delaying gratification is not emphasized or valued (maybe because doing so may be too risky when resources are scarce), then they may not have the cognitive skills or strategies to support delaying in the marshmallow test even if they want to.

Children struggle to delay gratification and wait for a second marshmallow, but the ability to do so predicts academic achievement later in life. Exactly why is an open question.

Finally, the authors also reported simple correlational models (that is, models with no control variables included) that tested whether delay of gratification predicted later academic outcomes and behavioral problems. These models are the only ones that can be said to really “count” as attempts to replicate the original work because the original work did not include any control variables. And in these models the finding that delaying on the marshmallow test predicts later academic achievement indeed replicated. (The size of the effect was smaller than the original, but this is not surprising: estimates from small samples are noisy and can land well above the true effect, whereas larger samples estimate it more precisely.) The models predicting behavioral problems were not statistically significant. So a more accurate characterization of the findings that the press could have provided, at least with respect to the authors’ stated goal of replicating the original work, is that they represent a partial¹ replication rather than a failed one.
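The point about sample size and precision can also be sketched with a quick simulation. Again, this is an illustration under stated assumptions, not a reanalysis: the true correlation of 0.3 and the sample sizes of 30 and 900 are hypothetical values picked for the example, not figures taken from either study. The code runs many imaginary studies at each sample size and looks at how much the estimated correlation bounces around.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_r(n, true_r, rng):
    """Run one imaginary study of n children and return its estimated correlation."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    # Build y so that its correlation with x in the population equals true_r.
    ys = [true_r * x + (1 - true_r ** 2) ** 0.5 * rng.gauss(0, 1) for x in xs]
    return pearson_r(xs, ys)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
true_r = 0.3            # assumed true correlation (hypothetical)
studies = 500           # number of imaginary studies per sample size

small = [simulate_r(30, true_r, rng) for _ in range(studies)]
large = [simulate_r(900, true_r, rng) for _ in range(studies)]

spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"n=30:  estimates range from {min(small):.2f} to {max(small):.2f}")
print(f"n=900: estimates range from {min(large):.2f} to {max(large):.2f}")
```

With the small sample, individual studies scatter widely around the true value, so a single small study can easily report an effect much bigger than the real one. With the large sample, the estimates cluster tightly. A smaller effect in a larger follow-up study is therefore what one would often expect to see even when the underlying finding is real.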

None of this is to say that the new study by Watts and his co-authors is not a valuable contribution to science. On the contrary, it’s a rigorous and much-needed replication of a classic result. The new study was conducted with a much larger and more diverse sample than the original, which involved families affiliated with Stanford. As others have rightly pointed out, the original study wasn’t exactly representative of the general population. Developmental psychology needs more replication studies with larger and more diverse samples of children, especially when there are real stakes (e.g., federal grant money being awarded, interventions in schools being conducted). We are definitely headed in this direction, with many replication projects and workshops underway. I believe the field of psychology as a whole will emerge from the so-called replication crisis with better theories, more reliable and credible findings, and renewed public trust.

But for now, let’s keep in mind that although the “crisis” hit psychology (and other fields) hard, many research findings are reliable and important, and worthy of their place in the textbooks. The marshmallow test and its prediction of academic achievement many years later is one such finding, and future work can and should explore the implications for improving children’s academic outcomes.

¹ It should be noted that the measures of behavioral outcomes used by Watts and his co-authors were parent-reported internalizing and externalizing behaviors, which were quite different from what was used in the original work by Shoda, Mischel and Peake (i.e., an index of adolescent coping).

Written by Sabine Doebel