Image adapted from: Student (1908). The probable error of a mean. Biometrika, 6(1), 1–25. doi: 10.2307/2331554

Scenes from the Replication Crisis

by John Borghi

2004
I am a first year student at a small liberal arts college. The history and legacy of the college are so tied into that of psychology that a visit from Sigmund Freud is commemorated by a statue central to campus life while a similarly historic visit from the neuroscientist Santiago Ramon Y Cajal is recalled mostly as a footnote.

Eager to pursue my burgeoning interest in psychology, I enroll in a quantitative methods course. Struggling through the course’s lab section, I learn to use statistics software to apply statistical devices with names like Student’s t-test and ANOVA. Because it’s stated explicitly in my lectures and implicitly in my lab exercises, I learn that good things can only happen when p is less than 0.05.

Believing the class will involve a great deal of math, I purchase a calculator that I never use. Instead of calculating and comparing means, medians, and modes, I spend a great deal of time memorizing the instructions and user interfaces associated with running statistical tests in a program called SPSS. I do this so I can restate them verbatim on my exams.

In the midst of a late night study session, a classmate looks at me in frustration and says, “I think I know what we’re doing, I just don’t know why.”

1925
Ronald Fisher is a geneticist and statistician working at Rothamsted Experimental Station, an agricultural research institute located in the English countryside. Long-term experiments with wheat, grass, and roots abound at Rothamsted, giving Fisher a bumper crop of data to analyze. However, though the overall quantity of data is high, sample sizes are low. An influential study of the effects of rainfall on wheat incorporates data from just thirteen plots of land.

Concerned with generalizing the results of such experiments to broader populations, Fisher synthesizes several recent advances in “small sample statistics” into a framework known as significance testing. He expands the utility of Student’s t-test, a statistical device initially developed by statistician William Sealy Gosset to monitor the quality of Guinness, and develops a complementary test which he calls the Analysis of Variance (ANOVA).

To ensure these innovations are accessible to the research community beyond Rothamsted, Fisher publishes Statistical Methods for Research Workers. Central to the book, and significance testing more generally, is the null hypothesis- the position that there is no significant difference between groups of data. In Fisher’s conception, devices like t-tests and ANOVAs are tests of the null hypothesis. The results of such tests indicate the likelihood of observing a result when the null hypothesis is true. In quantitative terms, this likelihood is expressed as a p-value.

Fitting its origins in applied research, the utility of Fisher’s framework is best demonstrated with a practical example. Suppose Fisher and his colleagues want to study the effect of a particular method of fertilization on the growth of grass. To do this, they obtain yield measurements from ten plots that use the method and ten that do not. These numbers are small, but reflective of the time and effort that goes into harvesting good data. Before examining the two groups of data, Fisher reminds his colleagues that the null hypothesis stipulates that there is no difference between the fertilized and unfertilized plots. This is a decidedly abstract way of talking about something as exciting as the growth of grass, so he reiterates that the null hypothesis is essentially that the fertilization method has no effect. Then, he runs a t-test.

A resulting p-value of 0.50 indicates that, assuming the fertilization method has no effect, the probability of Fisher and his colleagues obtaining a difference in yield at least as large as the one they measured is fifty percent. A resulting p-value of 0.10 indicates that the probability is ten percent. In Statistical Methods for Research Workers, Fisher introduces an informal criterion for rejecting the null hypothesis: p < 0.05.

“The value for which p = 0.05, or 1 in 20, is 1.96 or nearly 2 ; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.”

This will become a point of contention in quantitative methods classes and psychology research papers for nearly a century, but Fisher’s p-value does not translate to a probability that the null hypothesis is true. To return to our practical example, a p-value of 0.05 does not indicate that there is a five percent chance that the fertilization method has no effect on grass yield. Rather, assuming the fertilization method has no effect, the probability of Fisher and colleagues obtaining a difference in yield at least as large as the one they measured is just five percent. Fisher is careful to tell his colleagues that the 0.05 p-value does not mean there is a 95% chance that the fertilization method has an effect on the growth of grass. It just means their observed yield measurements would be very unlikely if the method had no effect.
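The logic of the practical example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Fisher’s procedure: the yield numbers are invented, and the sketch swaps Student’s t-test for a simple permutation test, which shuffles the plot labels to mimic a world where the fertilization method has no effect. The function name and its parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Shuffling the pooled measurements simulates the null hypothesis:
    if the treatment does nothing, the group labels are arbitrary.
    """
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Hypothetical yields from ten fertilized and ten unfertilized plots
# (invented numbers, for illustration only).
fertilized = [4.8, 5.6, 5.1, 6.0, 5.4, 5.9, 4.9, 5.7, 5.2, 6.1]
unfertilized = [4.6, 5.0, 4.4, 5.3, 4.9, 5.5, 4.3, 5.1, 4.7, 5.2]

p = permutation_p_value(fertilized, unfertilized)
print(f"p-value under the null hypothesis: {p:.3f}")
```

The returned value is the share of shuffled, no-effect worlds that produce a difference at least as large as the observed one. It is not the probability that the fertilization method has no effect.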

Almost a decade after the publication of Statistical Methods for Research Workers, Jerzy Neyman and Egon Pearson address what they view as a fundamental asymmetry in Fisher’s framework. Namely, though it’s intended to help researchers evaluate the results of experiments, it doesn’t say anything about experimental hypotheses. To return one last time to our practical example, Fisher’s framework does not allow him to directly evaluate an alternative to the null hypothesis. Fisher is not able to tell his colleagues that their observed yield measurements support a hypothesis that the fertilization method is associated with greater grass yield. In the decades to come, Fisher will vehemently protest the addition of such experimental hypotheses to his framework and methods.

Though their “hypothesis testing” framework draws heavily from Fisher’s, Neyman and Pearson’s has a fundamentally different goal. Rather than giving researchers tools to evaluate the results of agricultural experiments, their goal is determining the optimal test for deciding between competing hypotheses. These hypotheses include Fisher’s null hypothesis, but also a variety of “alternative” or experimental hypotheses. To this end, they introduce three important concepts to the burgeoning field of research-oriented statistics: Type I Error- the probability of incorrectly rejecting the null hypothesis, Type II Error- the probability of incorrectly accepting the null hypothesis, and Power- the probability of correctly rejecting the null hypothesis.

Disagreements between Fisher and Neyman and Pearson soon escalate into open antagonism. However, all three watch in dismay as elements from significance and hypothesis testing are hybridized in books intended for psychology researchers. The distinct aims and different ways the two frameworks arrange and evaluate hypotheses are quickly obscured by the emergence of an enormously and immediately influential model of statistical testing. This model incorporates Fisher’s null hypothesis, Neyman and Pearson’s alternative hypotheses, and an almost blinding focus on observing p-values less than 0.05.

2005
I am a psychology major with a minor in biology. Because it is required for my major and because I have a vague interest in the laboratory side of things, I enroll in a research methods course. The course is built around groups of students independently designing, running, and writing up their own psychology experiments.

I, like the majority of my classmates, am an extremely novice investigator. This is my first exposure to the basics of research design and the multitude of difficult decisions involved in studying even the most robust of psychological phenomena. As the semester proceeds, it will also be my first exposure to the practical difficulties involved in collecting and analyzing data from human participants.

Because of the severe time constraints, almost every group decides to replicate a previously conducted experiment. Because the details of such experiments are easily accessible and the financial costs of running them are quite low, these replications cohere around influential social psychology phenomena like social priming and ego depletion.

The social priming replicators run versions of a wildly influential experiment conducted by social psychologist John Bargh and colleagues in 1996. Bargh and colleagues found that exposing participants to words associated with a particular trait caused them to exhibit behaviors associated with that trait: participants shown words related to rudeness were more likely to interrupt the experimenter than participants shown words related to politeness, and participants shown words related to the elderly walked at a slower pace than participants shown non-age-specific words. Later experiments would apply this principle very broadly to investigate everything from social norms and stereotypes to social behaviors and emotions. Back in my research methods course, I observe my classmates timing their friends as they walk down hallways and across the campus green.

I eagerly participate in my classmates’ attempts to replicate Roy Baumeister and colleagues’ 1998 study of ego depletion. In their initial study, Baumeister and colleagues found that participants who forced themselves to eat a radish rather than a tempting piece of chocolate were quicker to quit an impossible puzzle task than the participants who ate the chocolate. Later experiments would apply this design to study the limited nature of self control in contexts ranging from aggression and fatigue to decision making and social rejection. In the context of my research methods course, I eat a lot of chocolate and spend a lot of time trying to do impossible puzzle tasks.

I am disappointed to find that my group’s priming experiment doesn’t work. Like the initial 1996 study, we collect data from about thirty participants. Unlike the initial study, we find that participants shown words like “retired” and “Florida” walk, on average, at the same pace as participants shown words like “clean” and “private”. I run a t-test and the resulting p-value is well above 0.05.

My disappointment grows at the end of the semester. I learn that, though our failure to replicate could be due to our inexperience or to the fragility of the social priming phenomena, it is also not altogether surprising. I learn that many research laboratories include a file drawer full of data that failed to reject the null hypothesis or replicate a previous finding. I also learn that, while our results will have no impact on my final grade, several groups showing particularly novel findings and particularly small p-values have, in years past, published articles in real academic journals.

1959
T.D. Sterling, a statistician at the University of Cincinnati, combs through several hundred psychology articles and makes several startling observations. Not only do most use the hybridized model of statistical testing that so confounded Ronald Fisher, Jerzy Neyman, and Egon Pearson, but almost all report a rejection of the null hypothesis using Fisher’s p<0.05 criterion. Furthermore, he finds no published attempts to independently replicate a previously conducted study.

Sterling warns that this “publication policy” greatly amplifies the possibility that a Type I error will persist in the literature uncontested. Recall that, within Fisher’s framework, p-values do not reflect the probability that the null hypothesis is true. Rather, they reflect the probability of observing a given pattern of data under the assumption that the null hypothesis is true. Emphasizing positive results makes it more likely that an improbable event will erroneously be presented as a rejection of the null hypothesis. Disregarding independent replication makes it more likely that such an error will never be corrected. Though it is directed primarily towards psychologists, Sterling’s warning about the preponderance of “false positives” is not restricted to their field.

“It would be unfair to close with the impression that the malpractices discussed here are the private domain of psychology. A few minutes of browsing through experimental journals in biology, chemistry, medicine, physiology, or sociology show that the same usages are widespread through other sciences.”

Working independently from Sterling, Jacob Cohen, a quantitative psychologist at New York University, discovers power analysis. Recall that, within Neyman and Pearson’s framework, power refers to the probability of a test correctly rejecting the null hypothesis. Power is derived from three factors: sample size (the number of data points collected in a study), effect size (the magnitude of the phenomenon being studied), and the significance criterion (e.g. p<0.05). Combing through almost 80 articles published in a high profile psychology journal, Cohen determines that, while many show “statistically significant” results, they are almost universally underpowered.
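How the three factors combine can be sketched with a rough calculation. The function below is a hypothetical helper, not Cohen’s method: it assumes a two-group comparison with equal group sizes, a standardized effect size (the difference in means divided by the standard deviation), and a normal approximation that slightly overstates the power of a t-test when samples are small.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses a normal approximation (an assumption of this sketch):
    the test statistic is treated as normal with unit variance,
    shifted by effect_size * sqrt(n_per_group / 2) when the effect is real.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)         # cutoff set by the significance criterion
    shift = effect_size * sqrt(n_per_group / 2)  # grows with effect size and sample size
    # Probability that the statistic lands beyond either cutoff.
    return z.cdf(shift - critical) + z.cdf(-shift - critical)


# Illustrative values only: a medium-sized effect at several sample sizes.
for n in (15, 30, 64, 128):
    print(f"n per group = {n:3d}  approximate power = {approx_power(0.5, n):.2f}")
```

Holding the effect size and criterion fixed, power rises only as the sample grows, which is why a small study of a modest effect will often miss it.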

The most obvious consequence of a lack of statistical power is an increased probability of making a Type II error. If the probability of a test correctly rejecting the null hypothesis is decreased, it follows that the probability of incorrectly accepting the null hypothesis is increased. Such “false negatives” are a problem for a field that is often concerned with studying effects of relatively small or unknown sizes. However, as Sterling demonstrated, null results don’t often make it into the psychology literature, whether they are correct or not. Underpowered studies can thus result in a double disappointment for researchers: not only do they fail to find the phenomena they’re looking for, but they also have nothing to show for it.

A more insidious consequence of a lack of statistical power is that the lower a study’s power, the lower the probability that a “positive” finding actually reflects the presence of a true effect. Taken together, Sterling and Cohen’s findings indicate that the psychology literature is distorted in favor of studies with p-values less than 0.05 that may not even be demonstrating real effects.
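The link between low power and untrustworthy “positive” findings follows from simple arithmetic. The sketch below is an illustration, not a calculation from Sterling or Cohen: the function name is hypothetical, and the `prior` argument, the share of tested hypotheses that describe real effects, is an assumption that can rarely be known in practice.

```python
def positive_predictive_value(power, alpha, prior):
    """Share of significant results that reflect a true effect.

    power: probability of detecting an effect that is real
    alpha: probability of a false positive when there is no effect
    prior: assumed share of tested hypotheses that are actually true
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative values only: one in ten tested hypotheses is assumed true.
for power in (0.8, 0.5, 0.2):
    ppv = positive_predictive_value(power, alpha=0.05, prior=0.1)
    print(f"power = {power:.1f}  share of positives that are real = {ppv:.2f}")
```

Because the rate of false positives stays fixed at the significance criterion, lowering power shrinks only the pool of true positives, so a larger share of the p<0.05 results that reach print are errors.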

The research community quickly incorporates Cohen’s methods for assessing power. Conducting a power analysis to determine the appropriate sample size to study a phenomenon becomes a de rigueur part of designing an experiment. Unfortunately, this does not translate into a subsequent increase in power in the psychological literature.

Follow-ups to Sterling and Cohen’s literature analyses in the decades to come will find the situation largely unchanged. Two decades after Sterling’s initial analysis, Robert Rosenthal gives the situation its now familiar name: “The file drawer problem”. “Methodological paradoxes” and “exaggerated beliefs” persist in the psychological literature, but as the distortions described by Sterling and Cohen incite a crisis that will eventually engulf the field, a new phrase enters the lexicon of science: p-hacking.

2007
I am in my faculty advisor’s office. He’s a clinical psychologist but, as usual, I only want to talk about neuroscience and the biological basis of human behavior. I tell him I’m interested in research. He’s excited, but then concerned. Fascinated by the dazzling images it produced, I tell him I want to study the biology of memory with fMRI.

I have no concept of the difficulties involved in doing research with fMRI. As I sit there, I know only that it’s a technique that measures brain activity and that it produces cool looking images. It will be years until I learn that the signal analyzed in fMRI studies is actually several steps removed from the biochemical activity of the brain and that its images are largely the result of running thousands of simultaneous t-tests or ANOVAs. Regardless, my advisor presents a much more immediate obstacle: Nobody at the university does fMRI research. In fact, there are only two labs on campus that explicitly research the relationship between biology and behavior: A biology lab that investigates taste perception in fruit flies and a social psychology lab that studies the subjective experience of emotion.

After spending a memorable summer feeding, rearing, and cleaning up after the flies, I decide I’d rather work with the social psychologists.

Over the next year, I design a study to investigate the biology underlying placebo effects. It’s pretty far from researching memory, but I figure studying the biology underlying a phenomenon that results in chemically inert substances having real effects will look good on my grad school applications.

The lab is equipped with the hardware necessary to collect physiological measures like heart rate and blood pressure. My initial design is simple: To characterize the physiology associated with placebo effects, I propose to give research participants decaffeinated coffee, tell half of them it’s caffeinated, and then investigate if there are any detectable physiological differences between the two groups. Because the lab is primarily interested in the cues people use to define their emotional experiences, I add in a “facial feedback” manipulation where I ask people to smile, frown, and then record how they feel. At the suggestion of a graduate student in the lab, I decide to have participants watch a scary movie clip as I record their heart rate.

I watch the same scene from “Silence of the Lambs” dozens of times over the course of the experiment. The scene is meant to elicit fear, but what keeps me up at night is distress about my progress. The “facial feedback” manipulation is designed to separate my study participants into two groups- those who define their emotional experience using situational cues like a scary movie and those who define their emotional experience using internal cues like heart rate. This is a technique pioneered in the lab and they’ve applied it to study everything from social conformity to pre-menstrual syndrome. I am never able to get it to work properly and my already ramshackle experiment ends up completely unbalanced. I send a series of barely coherent e-mails to the head of the lab. In a room containing an ice-bath used to test pain perception, I feel like I’m drowning. I worry so much about the quickness of my heart rate that I repeatedly use the lab equipment to check it. I have what I’ll later come to understand is a panic attack.

The data collection phase of my troubled experiment ends around the same time I learn I’ve been accepted to a PhD program in biological psychology. My frustration and anxiety fade with the knowledge that I will soon finally be doing research with fMRI. Excited on my behalf, and eager to discuss its implications for the forthcoming presidential election, my advisor forwards me an article which purports to show differences in the brains of Democrats and Republicans using fMRI. I will soon learn that this article is rife with reverse inference- an error in reasoning that is both widespread and widely critiqued in the fMRI literature.

1962
Psychologists Stanley Schachter and Jerome Singer engage in a bit of deception to test a long standing theory about the nature of emotion. Under the guise of testing the effects of a new vitamin on vision, they inject research subjects with either “Suprexone” (actually adrenaline) or an inert placebo. Some of the participants injected with adrenaline are told the injection will result in a pounding heart rate and flushed face, some are told the injection may result in numbed feet and a general itchy feeling, some are told nothing.

Each participant then waits in a room with another person (actually a member of the research team) expressing either euphoria or fury. Participants not given another explanation for the pounding heart rate and flushed face brought on by the adrenaline injection tend to explain them away as a reaction to these outbursts.

James Laird, a PhD student at the University of Rochester, attempts to reconcile this work with an assertion made by Charles Darwin more than a century earlier, that facial expressions can amplify the feelings of an emotion. To do this, he too engages in a bit of deception.

Telling his research participants he wants to record their facial muscles under various conditions, Laird attaches electrodes to the corner of their mouths, the edges of their jaws, and the space between their eyebrows. Carefully, he then asks them to contract their muscles until they are either smiling or frowning. He then asks them to record their emotional experience as they view either a photograph or a cartoon. After analyzing the data, he finds a small but statistically significant effect. Participants tricked into smiling rated their experience more positively and those tricked into frowning more negatively. One participant articulates Laird’s hypothesis almost verbatim.

“When my jaw was clenched and my brows down, I tried not to be angry but it just fit the position. I’m not in any angry mood but I found my thoughts wandering to things that made me angry, which is sort of silly I guess. I knew I was in an experiment and knew I had no reason to feel that way, but I just lost control.”

Laird revisits this experiment throughout his career, using the manipulation of facial expressions to separate research participants into two groups: those who define their emotional experience based on personal cues and those who define it based on situational cues. His lab will eventually extend their investigation to measures of pain tolerance and physiological measures like heart rate and blood pressure.

Almost a decade later, Fritz Strack, a post-doc at the University of Illinois, conducts the canonical experiment on facial feedback. The key difference from Laird’s experiment is the use of a pen. Though Laird initially deceived his participants about the true nature of his experiment, almost a fifth figured it out midway through. By asking participants to hold pens between their teeth, Strack and his colleagues could trick participants into smiling without their knowledge. By asking participants to hold pens between their lips, they could trick them into frowning. Like Laird, they find a relatively small but significant effect. Participants tricked into smiling rated their emotional experience after viewing a cartoon more positively than participants tricked into frowning.

The body of work generated by Schachter, Singer, Laird, and Strack influences a huge amount of research into emotion, self control, and social behavior. Strack and colleagues’ work has even been proposed as a treatment for mood disorders like depression. Unfortunately, their results have not gone uncontested. In 1979, attempts to directly and conceptually replicate Schachter and Singer’s initial “Suprexone” study fail to find the same result. In 2016, an effort involving seventeen labs and over 1800 participants fails to replicate Strack’s “facial feedback” study.

2009
I am a second year PhD student at a large state university. I am finally working in a lab that studies memory using fMRI. Because the research methods courses offered through my program are again largely built around learning to apply statistical tests in SPSS, I primarily learn fMRI research design from my labmates. On the door of our advisor’s office is an image that appears to show activity in the brain of a dead salmon. It is less morbid than it sounds: I will eventually learn that this is a particularly dramatic illustration of what can happen when fMRI researchers fail to correct for multiple comparisons.

Like T.D. Sterling’s “publication policy” and Jacob Cohen’s discovery of a rampant lack of statistical power, the problem of multiple comparisons is another source of error in the psychological literature. Recall again that, in Fisher’s significance testing framework, a “positive” result does not necessarily indicate that the null hypothesis is false, but rather that the observed data would be highly improbable if the null hypothesis were true. Unfortunately, when multiple t-tests or ANOVAs are applied simultaneously, the probability of detecting improbable events is increased. For an incautious researcher, failing to correct for multiple comparisons can greatly increase the chances of making a Type I error. Because fMRI analyses involve running thousands of tests at the same time, such an error can appear as ludicrous as brain activity in a dead salmon.
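A rough calculation shows how quickly the problem grows. The sketch below assumes that every test is independent and that the null hypothesis is true for all of them, which is a simplification (neighboring measurements in an fMRI image are correlated), and both function names are hypothetical. The Bonferroni adjustment shown is one common correction, not necessarily the one used in any particular study.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests
    when the null hypothesis is true for every one of them."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test criterion that keeps the overall false positive rate near alpha."""
    return alpha / n_tests


# Illustrative test counts only.
for n in (1, 20, 1000):
    uncorrected = familywise_error_rate(0.05, n)
    corrected = familywise_error_rate(bonferroni_threshold(0.05, n), n)
    print(f"{n:5d} tests  uncorrected = {uncorrected:.3f}  corrected = {corrected:.3f}")
```

Under these assumptions, an analysis that runs thousands of uncorrected tests at p<0.05 is all but guaranteed to turn up something “significant”, whether or not anything is there.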

I attend an emergency journal club meeting convened to discuss a paper initially titled “Voodoo Correlations in Social Neuroscience”. Authored by Ed Vul and colleagues, the paper asserts that the correlations between brain activity and personality measurements reported throughout the neuroscientific and psychological literature are greatly inflated. As a novice graduate student, I eagerly take notes as we discuss the reliability of fMRI, personality questionnaires, and analyses that incorporate both. I close my notebook when the conversation grows more heated. The point of contention is not the problem described by the paper, which is related to the problems of multiple comparisons, but that it names names. A particularly controversial figure lists specific papers that include statistical “voodoo”. The most well known fMRI researcher in the program is an author on four such papers.

Though my own experiments are relatively unaffected, I become increasingly anxious about their validity. I spend weeks working with a set of fMRI data, then months worrying that my analyses are somehow fraudulent. Though the academic conversation around “voodoo correlations” is immediate, productive, and indicative of a field moving forward not backward, it casts a long shadow over how fMRI is perceived- both by people looking at the technique from the outside and by me.

2011
Daryl Bem, an influential social psychologist at Cornell University, authors a study positing the existence of extra-sensory perception. Beyond its doubly provocative conclusion that ESP exists and can be observed as research participants view pornographic images, this study is notable for its presentation by a respected researcher in a highly respected psychology journal. As Bem discusses his study on MSNBC and The Colbert Report, emergency journal clubs are convened to address not what it means for human perception, but what it means for psychology’s research methods. Though later analyses will show some irregularities in Bem’s p-values, his statistical practices are also far from unusual in the psychological literature.

As Bem’s study causes psychologists to reflect on their use of statistics, several highly visible incidents involving social priming cause them to question their hypotheses. After attempting to replicate the initial 1996 social priming study, social psychologist Stéphane Doyen and colleagues assert that, not only do participants shown words like “Retired” and “Wrinkled” walk at the same speed as participants shown words like “Empirical” and “Small”, but that social priming effects are largely driven by researcher expectations. It’s an incendiary statement, that a hugely influential psychological phenomenon owes its existence to sloppy research practices, and the ensuing debate will go on for years.

In parallel to debates about psychology’s statistical methods and research hypotheses, Diederik Stapel, a Dutch social psychologist whose work often invokes social priming, is found to have perpetrated an unprecedented amount of scientific fraud. Stapel will eventually admit to fabricating and inappropriately manipulating data reported in over 50 publications.

As doubts about the integrity of psychological research grow, Daniel Kahneman, an influential cognitive psychologist and one of the founders of behavioral economics, pens an open letter to priming researchers. Kahneman, whose book Thinking, Fast and Slow trumpets recent work on social priming, proposes a collaboration wherein a number of laboratories work together to propose, test, and validate experimental results. Such a model will quickly be implemented by the Reproducibility Project: Psychology, a large-scale effort to estimate the reproducibility of psychological science. The results are not reassuring. A substantial proportion of the attempts to replicate studies published in well regarded psychology journals fail, despite employing the same materials and a high level of statistical power. Despite these disheartening results, the publication describing them ends on a positive note.

“The present results suggest that there is room to improve reproducibility in psychology. Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should.”

As influential psychological models like ego-depletion also get swept up in the wave of failed replications, conversations about psychology in both the scholarly and popular press increasingly reference p-hacking, “false positive psychology”, and “researcher degrees of freedom”. Though these terms are not strictly synonymous, they all describe behaviors that increase the probability of erroneous results entering the psychological literature. After eighty years of debate about the utility of Fisher’s p<0.05 criterion, the effects of Sterling’s “publication policy”, and the overall integrity of psychological research, the trauma of the ongoing “replication crisis” causes the research community to begin re-evaluating its methods, priorities, and assumptions.

2016
I am a postdoctoral fellow, though in a library not a laboratory. Since finishing grad school, I have re-oriented my career away from placebos, memory research, and brain imaging and towards fostering better communication and reproducibility in science.

Sitting in my cubicle, I read a paper that details how psychology’s replication problems are nothing compared to what is seen in brain imaging. Immediately, I feel as though my whole academic career, from the first quasi-experiment I ever conducted all the way to my dissertation research, has been subsumed by the replication crisis.

In despair, I begin to research the history of p-values. I learn that the replication crisis is only the most recent and most dramatic manifestation of nearly a century of debate within psychology. I learn that debates about hacked p-values, file-drawer problems, and voodoo correlations are growing pains, evidence that the science of psychology is becoming more mature. Finally, after nearly a decade of anxiety, I feel vindicated in my inability to make that cursed “facial feedback” manipulation work properly.