About that Hart (2013) retraction…
Attention to detail is critical in peer review
Retraction Watch recently ran a story about the retraction of “Unlocking past emotion: verb use affects mood and happiness,” by William Hart, published in Psychological Science in 2013. I read this article early last year for a writing assignment that our undergraduates were asked to do. The students were asked to critique the article. My response to their assignments ended up as a blog post I titled “How to train undergraduate psychologists to be post hoc BS generators,” in which I discuss the dangers of asking students to critique work that they are not prepared to critique cogently.
When I read it, I immediately knew something was…off. Consider Experiment 1. The conclusion is that asking participants to write about negative experiences using the imperfective (“I was walking…”) or perfective (“I walked…”) aspect dramatically affects the participants’ moods, because — Hart claims — the imperfective is related to better memory for the negative event and hence more negative mood. How did Hart “measure” mood? By assessing global vs. local attention in a perceptual task. This supposedly yielded an effect on “mood” of Cohen’s d = 0.62 between the perfective and imperfective conditions for describing the negative experience (Table 1).
Now, let’s think about this. Even if we grant that the effect that Hart is looking for exists, the effect on mood is going to be very noisy, due to the variance in participants coming in, their choice of negative experience to write about, along with the many other inherent sources of noise in human experience. But then, global vs local processing is not mood; it is supposedly correlated with mood. Somehow, in the presence of all of this variability, the author managed to achieve an effect of d>0.6. One would not have expected this to work at all, but the paper is shot through with claims like these (see also Andrew Gelman’s take on the problem of noise in this sort of research; he’s had a spotlight on these sorts of issues for years).
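A quick simulation makes the point concrete. Suppose the true mood effect really were large, but mood is only measured through a proxy (global vs. local processing) that correlates imperfectly with it. Classical attenuation says the observable effect shrinks by roughly the proxy’s correlation with the construct. The numbers below (a true d of 0.6, a proxy–mood correlation of 0.4, 25 participants per cell) are my illustrative assumptions, not values from the paper:

```python
import random
import statistics

random.seed(1)

def simulate(d_true=0.6, r=0.4, n=25, reps=2000):
    """Simulate measuring a group difference of size d_true (on 'mood')
    through a proxy that correlates only r with mood.
    Returns the average observed Cohen's d computed on the proxy."""
    ds = []
    noise_sd = (1 - r ** 2) ** 0.5  # proxy = r*mood + independent noise, total SD = 1
    for _ in range(reps):
        g1 = [r * random.gauss(d_true, 1) + random.gauss(0, noise_sd) for _ in range(n)]
        g2 = [r * random.gauss(0, 1) + random.gauss(0, noise_sd) for _ in range(n)]
        pooled_sd = ((statistics.variance(g1) + statistics.variance(g2)) / 2) ** 0.5
        ds.append((statistics.mean(g1) - statistics.mean(g2)) / pooled_sd)
    return statistics.mean(ds)

print(round(simulate(), 2))  # roughly r * d_true, i.e. about 0.24 — well below 0.62
```

Under these (generous) assumptions, the observable effect on the proxy lands around 0.24, not 0.62. To see d > 0.6 on the downstream measure, the chain from manipulation to mood to attentional style would have to be nearly noise-free.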
I have never, though, seen logic that strains credibility as much as Experiment 3a. Hart had asked participants to unscramble either easy anagrams (a “positive” experience) or hard/impossible anagrams (a “negative” experience). Again, note that participants will have varying reactions to these stimuli, introducing noise into the responses. And how “negative” can unscrambling anagrams in a lab actually be?
Hart wants to establish that the effect is mediated by memory, so he introduces a measure of memory for the anagram task. Is it actually a measure of memory? No! He uses a lexical decision task. In the context of filler trials, each participant sees four trials with words related to the anagram task (“anagram,” “rearrange,” “sort,” “assemble”) and four trials with words unrelated to the anagram task (“keyboard,” “computer,” “key,” “spacebar”). The measure of memory for the anagram task is, for each participant, the difference in response time between the four trials in one condition and the four in the other.
That’s right: Four trials in each condition.
RTs in a task like this will vary — even for similar stimuli — on the order of hundreds of milliseconds. RTs are just noisy. Add to that the fact that words themselves have variable properties (e.g., low-frequency words are harder to distinguish from nonword lures) and that people will have varying associations with these words. Finally, consider that this is a lexical decision task, not a memory task! Using four trials per condition is already bad, but then we have to accept the logic that four trials per condition in a lexical decision task can supposedly stand as a measure of memory.
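The arithmetic here is worth spelling out. With only four trials per condition, the standard error of each participant’s difference score is enormous relative to the claimed effect. Taking a trial-to-trial RT standard deviation of 200 ms (my assumption; typical lexical decision values are in the 150–250 ms range):

```python
import math

# Illustrative numbers, not from the paper: assume trial-level RT noise
# of 200 ms and four trials per condition, as in Experiment 3a.
trial_sd = 200.0        # ms, assumed trial-to-trial SD
n_per_condition = 4

# Standard error of the difference between two condition means,
# each mean based on only four trials:
se_diff = trial_sd * math.sqrt(1 / n_per_condition + 1 / n_per_condition)
print(round(se_diff, 1))  # 141.4 ms
```

So each participant’s “memory” score carries roughly 140 ms of measurement noise — larger than the 100 ms effect being claimed. Difference scores that noisy should swamp any real signal unless sample sizes are very large.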
But, lo and behold, Hart found the result he was looking for. In fact, the effect on RTs is a massive 100 milliseconds! Consider that this is on the order of some Stroop effects (depending on the response format) and you’ll understand why this is simply unbelievable. Hart also claimed a significant moderated-mediation effect with memory (i.e., lexical decision time) as the mediator.
This should not have worked. Reviewers should have flagged this immediately as being ridiculous. But they didn’t, and if you understand peer review, you’ll know why. Peer review almost always happens after the results are in. What is a reviewer to say: that the author didn’t get the effects he says he did? The perverse Golden Rule of reviewing is that we don’t demand of others that which we wouldn’t want demanded of us, and certainly we wouldn’t want people to distrust us. So, we skip stuff like this in review. The experiment’s done, he says he found the effect, so… on to the Discussion section.
This is why we need to review methods before experiments are done.
The rest of the story
When I got done reading the paper, I immediately requested the data from the author. When I heard nothing, I escalated it within the University of Alabama. After many, many months with no useful response (“We’ll get back to you!”), I sent a report to Steve Lindsay at Psychological Science, who, to his credit, acted quickly and requested the data himself. The University then told him that they were going to retract the paper…and we never even had to say why we were asking for the data in the first place.
What was, and is, interesting is the reason Hart and the University gave for the retraction: an unnamed, uncredited graduate student manipulated the data. In light of the basic implausibility of the methods, this seems incredible. An unnamed, uncredited graduate student decided to take the data from several experiments with methods that should not have worked at all, and — with no apparent motive — manipulate the data in such a way that they support the theoretical claims of their supervisor? Right.
The basic problem here is not the results, but the basic implausibility of the methods combined with the results. Presumably, the graduate student did not force Hart to measure memory using four lexical decision trials per condition. If someone claims to have hit a bullseye from 500m in hurricane-force winds with a pea-shooter, and then claims years later that a previously-unmentioned assistant faked the bullseye, you’ve got a right to look at them askance.
The essence of the problem
It seems clear to me that peer review needs an overhaul. It is unreliable, ad hoc, post hoc, and lacking real teeth, and, troublingly, many scientists are opting out of performing reviews altogether, shifting the load onto those who remain. The upshot is that this sort of problematic work often gets past reviewers.
Finally, I think we need to look hard at the culture of psychological science. The culture of academic game playing — combining methods in new, but not empirically sound, ways, while trying to bolster one’s clever theory — is not conducive to good science. Ring (1967) expressed the problem well in describing mid-twentieth-century experimental social psychology:
“Experimental social psychology today seems dominated by values that suggest the following slogan: ‘Social psychology ought to be and is a lot of fun.’ The fun comes not from the learning, but from the doing. Clever experimentation on exotic topics with a zany manipulation seems to be the guaranteed formula for success which, in turn, appears to be defined as being able to effect a tour de force. One sometimes gets the impression that an ever-growing coterie of social psychologists is playing (largely for one another’s benefit) a game of ‘can you top this?’ Whoever can conduct the most contrived, flamboyant, and mirth-producing experiments receives the highest score on the kudometer. There is, in short, a distinctly exhibitionistic flavor to much current experimentation, while the experimenters themselves often seem to equate notoriety with achievement.” (Ring, 1967, pp. 116–117)
When this attitude is combined with noisy methods, the garden of forking paths, and a self-serving dearth of critical appraisal, it seems experimental psychology becomes a playground with infinite possibilities — but disconnected from the truth.