fMRI clusterf-…issues: What’s it all about??

Christopher R. Madan, PhD
Feb 18, 2017


My main training is in cognitive psychology, though I also use cognitive neuroscience approaches (including fMRI). You may have seen headlines like “Has a software bug really called decades of brain imaging research into question?” (The Guardian) or “Faulty Statistics Muddy fMRI Results” (The Scientist), or even “Do You Believe in God, or Is That a Software Glitch?” (Wired). Are things really that bad? No. But many of my cognitive psychology colleagues who don’t do fMRI work have been asking me about this, so I figured I’d write a short blog post about it.

Briefly, in the original PNAS paper, “Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates”, Eklund et al. demonstrate that there are problems with the statistical assumptions underlying cluster-extent inference in standard fMRI analysis packages. Their point is valid. I think the ‘bigger’ issue, though, is the scale at which the problem has been portrayed. The original version of the paper stated:

These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results

…and that’s where the media ran away with it. The authors have since issued an official correction, and prior to that wrote a blog post explaining the oversight behind this broad claim and offering a more accurate estimate:

However, there is one number I regret: 40,000. In trying to refer to the importance of the fMRI discipline, we used an estimate of the entire fMRI literature as the number of studies impinged by our findings. In our defense, we found problems with cluster size inference in general (severe for P=0.01 CDT, biased for P=0.001), the dominant inference method, suggesting the majority of the literature was affected. The number in the impact statement, however, has been picked up by the popular press and fed a small twitterstorm. Hence, I feel it’s my duty to make at least a rough estimate of “How many articles does our work affect?”. I’m not a bibliometrician, and this is really a rough-and-ready exercise, but it hopefully gives a sense of the order of magnitude of the problem.
[…]
So, are we saying 3,500 papers are “wrong”? It depends. Our results suggest CDT P=0.01 results have inflated P-values, but each study must be examined… if the effects are really strong, it likely doesn’t matter if the P-values are biased, and the scientific inference will remain unchanged. But if the effects are really weak, then the results might indeed be consistent with noise. And, what about those 13,000 papers with no correction, especially common in the earlier literature? No, they shouldn’t be discarded out of hand either, but a particularly jaded eye is needed for those works, especially when comparing them to new references with improved methodological standards.
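For readers outside of neuroimaging, the “cluster-defining threshold” (CDT) mentioned above is the voxelwise threshold used to form clusters of contiguous suprathreshold voxels; only clusters exceeding some critical size are then reported as significant. Here is a rough sketch of that logic, in Python with made-up numbers (a real analysis would estimate the critical cluster extent from the data, e.g., via random field theory or permutation tests, rather than picking it by hand):

```python
# Toy illustration of cluster-extent thresholding (not the authors' code).
import numpy as np
from scipy import ndimage, stats

rng = np.random.default_rng(0)
z_map = rng.normal(size=(40, 48, 40))      # stand-in for a voxelwise z-statistic map
z_map[18:24, 20:28, 18:24] += 3.0          # a fake "activation" blob

cdt_p = 0.001                              # cluster-defining threshold (voxelwise)
cdt_z = stats.norm.isf(cdt_p)              # ~3.09 for one-sided p = 0.001

suprathreshold = z_map > cdt_z             # step 1: threshold each voxel at the CDT
labels, n_clusters = ndimage.label(suprathreshold)  # step 2: group contiguous voxels into clusters
sizes = ndimage.sum(suprathreshold, labels, index=range(1, n_clusters + 1))

k_extent = 50                              # step 3: keep only clusters of at least this many voxels
surviving = [i + 1 for i, s in enumerate(sizes) if s >= k_extent]
print(f"{n_clusters} clusters at CDT p={cdt_p}; {len(surviving)} survive extent >= {k_extent} voxels")
```

The crux of Eklund et al.’s finding is that the parametric methods used to set that critical extent can be too lenient, especially at a CDT of P=0.01, so clusters are declared significant more often than the nominal 5% false-positive rate implies.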

So, not all fMRI studies are flawed. Studies that constrained their analyses to a priori anatomical regions of interest are unaffected. And even among the studies that are affected, the main results of a paper would likely persist. In other words, if the main changes in brain activity form large enough clusters, they would still be there under the correct, more stringent statistical threshold. Some additional clusters may not be, but in many cases those aren’t central to the hypotheses or narrative of the paper anyway. There are certainly papers that would not survive the corrected threshold, but a critical reader, such as a researcher within the relevant subfield of cognitive neuroscience, likely would have noticed that the cluster was not particularly large and viewed it with caution anyway. For a proper treatment of this argument, see “Which Findings from the Functional Neuroimaging Literature Can We Trust?”
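To make the region-of-interest point concrete, here is a minimal, hypothetical sketch (the mask file, subject list, and contrast images are placeholders, not from any real study). With an a priori anatomical mask there is one value per subject and a single test, so the cluster-extent correction at issue never enters the analysis:

```python
# Hypothetical a priori ROI analysis: average a contrast within a
# pre-specified anatomical mask and test that single value across subjects.
import nibabel as nib
from scipy import stats

roi_mask = nib.load("hippocampus_mask.nii.gz").get_fdata() > 0   # a priori anatomical ROI

subject_means = []
for subject in ["sub-01", "sub-02", "sub-03"]:                   # placeholder subject list
    contrast = nib.load(f"{subject}_contrast.nii.gz").get_fdata()
    subject_means.append(contrast[roi_mask].mean())              # one number per subject

t, p = stats.ttest_1samp(subject_means, popmean=0.0)             # a single test, no voxelwise correction
print(f"ROI mean contrast: t = {t:.2f}, p = {p:.3f}")
```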

Even more recently, Robert Cox and colleagues, the developers of a widely used fMRI analysis program (AFNI), published a response letter in PNAS that provides a number of follow-up analyses, and ends with the statement:

We strongly disagree with [Eklund et al.]’s summary statement, “Alarmingly, the parametric methods can give a very high degree of false positives (up to 70%, compared with the nominal 5%).” […] By concentrating on the highest observed [false positive rates], the conclusions of Eklund et al. were unnecessarily alarmist.

(For an even more detailed follow-up by Robert Cox et al., see here.)

In the future, what can we do better? Sharing unthresholded statistical maps would be a good step (e.g., in NeuroVault). And, as always, thinking critically about the assumptions being made is important.
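In practice, sharing an unthresholded map just means saving and uploading the full statistical image rather than only the thresholded clusters. A minimal sketch, assuming a group-level z map already exists as an array (the values, affine, and file name here are placeholders):

```python
import numpy as np
import nibabel as nib

# Placeholder group-level z map; in a real analysis the values and the
# affine would come from the group statistics in a standard template space.
z_values = np.zeros((40, 48, 40), dtype=np.float32)
affine = np.eye(4)

img = nib.Nifti1Image(z_values, affine)
nib.save(img, "group_zmap_unthresholded.nii.gz")  # this full map is what gets uploaded to NeuroVault
```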

Statistical methods are continuously improving. We aren’t marred by decades of bad research; we are informed by prior work with less precise equipment and less rigorous methods, and we move forward and advance the field, the same as in any other field of science.
