Seventy Teams of Scientists Analysed the Same Brain Data, and It Went Badly
To the outside observer, it can seem that fMRI research careens from one public crisis to another. Detecting brain activity in a dead salmon. Impossibly high correlations between brain activity and behaviour. Serious flaws in fMRI analysis software leading to claims of tens of thousands of papers being (partially) wrong. Finding wildly different active brain regions from the same set of fMRI scans by just varying the parameters in standard analysis pipelines. “Oof” says The Rest of Science, “glad that’s not us.”
And now a paper in Nature shows that a big group of experts all looking at the same brain imaging data agree on almost nothing.
But within it is a warning for all of neuroscience, and beyond.
The group behind the Nature paper set a simple challenge: they asked teams of volunteers to each take the same set of fMRI scans from 108 people doing a decision-making task, and use them to test nine hypotheses of how brain activity would change during the task. Their goal was simply to test how many teams agreed on which hypotheses had significant evidence and which did not. The Neuroimaging Analysis Replication and Prediction Study (NARPS) was born.
The task was simple too, cutting down on the complexity of the analysis. Lying in the scanner, you’d be shown the two potential outcomes of a coin-flip: if it comes up heads, you’d lose $X; if tails, you’d win $Y. Your decision is whether to accept or reject that gamble; accept it and the (virtual) coin is flipped, and your winnings adjusted accordingly. The clever bit is that the difference between the loss and win amounts is varied on every trial, testing your tolerance for losing. And if you’re like most people, you have a strong aversion to losing, so will only regularly accept gambles where you could win at least twice as much as you lose.
From this simple task sprang those nine hypotheses, equally simple. Eight were about how activity in a broad region of the brain should go up or down in response to wins or losses; one was a comparison of changes within a brain region during wins and losses. And pretty broad regions of the brain too: a big chunk of the prefrontal cortex, the whole striatum, and the whole amygdala. Simple task, simple hypotheses, unmissably big chunks of brain, so it should be simple to get the same answer, right? Wrong.
Seventy teams stepped up to take the data and test the nine hypotheses. Of the nine, only one (hypothesis 5) was reported as significant by more than 80% of the teams. Three were reported as significant by only about 5% of the teams, about as many as we’d expect by chance using classical statistics, so could be charitably interpreted as showing the hypotheses were not true. Which left five hypotheses in limbo, with between 20% and 35% of teams reporting a significant effect for each. Nine hypotheses: one agreed as correct; three rejected; five in limbo. Not a great scorecard for 70 teams looking at the same data.
Even worse were the predictions of how many teams would support each hypothesis. Whether made by the teams themselves, or by a group of experts not taking part, the predictions were wildly over-optimistic. The worst offender (hypothesis 2) was supported by the results of only about 25% of the teams, but its predicted support was about 75%. So not only did the teams not agree on what was true, they also couldn’t predict what was true and what was not.
What then was it about the analysis pipelines used by the teams that led to big disagreements over which hypotheses were supported and which were not? The NARPS group could find little that systematically differed between them. One detectable effect was how smooth the teams made their brain maps: the more they were smoothed by averaging close-together brain bits, the more likely the team would find significant evidence for a hypothesis. But this smoothing effect only accounted for (roughly) 4% of variance in the outcomes, leaving 96% unaccounted for.
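To make "smoothing by averaging close-together brain bits" concrete, here is a minimal sketch in Python. It is an illustration only, not what any NARPS team ran: real pipelines typically smooth in three dimensions with a Gaussian kernel, whereas this uses a plain moving average along a single row of voxels. The function name smooth_1d and the example values are made up for the purpose.

```python
def smooth_1d(values, width):
    """Moving-average smoothing: replace each value with the mean of
    itself and its neighbours up to `width` positions away.
    (A stand-in for the 3D Gaussian smoothing used in real pipelines.)"""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - width)                 # clip the window at the edges
        hi = min(len(values), i + width + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical row of voxel values: one sharp peak in an otherwise flat signal.
row = [0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0]

print(smooth_1d(row, 0))  # width 0 leaves the row unchanged
print(smooth_1d(row, 1))  # width 1 spreads the peak over its neighbours
```

The wider the window, the more a single strong voxel bleeds into its neighbours and the lower its own peak becomes. That changes both how big an apparently active region looks and how strong its peak is, which is one plausible route by which the choice of smoothing could nudge a result across a significance threshold.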
Whatever it was that differed between the teams, it came after the stage where each team built their initial statistical map of the brain’s activity, the map of how strongly each tiny cube of brain (each voxel) responded before any test of significance was applied. We know this because those initial statistical maps of brain activity correlated quite well across teams. So the NARPS people took a consensus of these maps across the groups, and claimed clear support for four of the hypotheses (numbers 2, 4, 5 and 6). Great: so all we need do to provide robust answers for every fMRI study is have 70 teams create maps from the same data then merge them together to find the answer. Let’s all watch the science funders line up behind that idea.
(And then some wag will run a study that tests if different teams will get the same answers from the same merged map, and around we go again).
Sarcasm aside, that is not the answer. Because the results from that consensus map did not agree with the actual results of the teams. The teams found hypotheses 1 and 3 to be significant as often as 2, 4 and 6, but hypotheses 1 and 3 were not well supported by the consensus map. So the polling of the teams provided different answers to the consensus of their maps. Which then are the supported hypotheses? At the end, we’re still none the wiser.
Some take pleasure in fMRI’s problems, and would add this NARPS paper to a long list of reasons not to take fMRI research seriously. But that would be folly.
Some of fMRI’s crises are more hype than substance. Finding activity in the brain of a dead salmon was not to show fMRI was broken, but was a teaching tool — an example of what could go wrong if for some reason you didn’t make the essential corrections for noise when analysing fMRI data, corrections that are built into neuroimaging analysis pipelines precisely so that you don’t find brain activity in a dead animal or, in one anecdotal case relayed to me, outside the skull. Those absurdly high “voodoo” correlations arise from double-dipping: first select out the most active voxels, and then correlate stuff only with them. The wrong thing to do, but fMRI research is hardly the only discipline that does double-dipping. And the much-ballyhooed software error turned out to maybe affect some of the results in a few hundred studies; but nonetheless was a warning to all to take care.
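The double-dipping problem is easy to reproduce with nothing but noise. The sketch below is a toy illustration, not a reanalysis of any real study: the numbers of subjects and voxels, the function names, and the use of Gaussian random numbers in place of brain data are all assumptions made for the example. Every "voxel" is pure noise, so none has any true relationship with the "behaviour" score.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def double_dip_demo(n_subjects=16, n_voxels=1000, n_selected=10, seed=0):
    """Correlate pure-noise voxels with a pure-noise behaviour score, then
    'double dip' by reporting only the voxels chosen for being most correlated."""
    rng = random.Random(seed)
    behaviour = [rng.gauss(0, 1) for _ in range(n_subjects)]
    voxels = [[rng.gauss(0, 1) for _ in range(n_subjects)]
              for _ in range(n_voxels)]
    all_r = [abs(pearson(v, behaviour)) for v in voxels]
    top_r = sorted(all_r, reverse=True)[:n_selected]  # the circular selection step
    return statistics.fmean(all_r), statistics.fmean(top_r)


overall, selected = double_dip_demo()
print(f"mean |r| across all voxels:      {overall:.2f}")
print(f"mean |r| across selected voxels: {selected:.2f}")
```

The selected voxels always show a higher average correlation than the full set, even though there is nothing to find. The usual remedy is to pick the voxels using one portion of the data and measure the correlation in a separate, independent portion.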
Everyone finds it inherently fascinating to see the activity deep within a living human brain. So fMRI studies are endlessly in the public eye, the media plastering coloured doodles of brains into their breathless reporting. But fMRI is a young field, so its growing pains are public too. Another “crisis” just broke — that when you re-scan the same person, the map of brain activity you get will likely differ quite a lot from the original scan. One crisis at a time please, fMRI. But this public coverage is not in proportion to its problems: there’s nothing special about its problems.
The fMRI analysis pipeline is fiercely complex. This is common knowledge. And because it’s common knowledge, many fMRI researchers look closely at the robustness of how fMRI data is analysed: at errors in correcting the maps of brain activity, at what happens if we don’t correct, at the robustness of results to choices of how the analysis is set up, and at the robustness of results to having different scientists trying to obtain them. Rather than crises, one could equally interpret the above list (dead salmon, voodoo correlations and all) as a sign that fMRI is tackling its inevitable problems head on. And it just happens that its researchers have to do it in public.
The NARPS paper ends with the warning that “although the present investigation was limited to the analysis of a single fMRI dataset, it seems highly likely that similar variability will be present for other fields of research in which the data are high-dimensional and the analysis workflows are complex and varied”.
The Rest of Science: “Do they mean us?”
Yes, they mean you. These crises should give any of us working on data from complex pipelines pause for serious thought. There is nothing unique to fMRI about the issues they raise. Other areas of neuroscience are just as bad. We can do issues of poor data collection: studies using too few subjects plague other areas of neuroscience just as much as neuroimaging. We can do absurdly high correlations too; for one thing if you use a tiny number of subjects then the correlations have to be absurdly high to pass as “significant”; for another most studies of neuron “function” are as double-dipped as fMRI studies, only analysing neurons that already passed some threshold for being tuned to the stimulus or movement studied. We can do dead salmon: without corrections for signal bleed (from the neuropil), calcium imaging can find neural activity outside of a neuron’s body. We can even do a version of this NARPS study, reaching wildly different conclusions about neural activity by varying the analysis pipelines applied to the same data-set. And the dark art of spike-sorting is, well, a dark art, with all that entails about the reliability of the findings that stem from the spikes (one solution might be: don’t sort them).
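The point above about tiny samples forcing absurdly high correlations can be checked with a little arithmetic. For a Pearson correlation r from n subjects, the test statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2), so the smallest correlation that reaches significance is r = t_crit / sqrt(t_crit^2 + n - 2). The sketch below assumes a two-tailed test at the 5% level and takes the critical t values from a standard t-table; the sample sizes are chosen purely for illustration.

```python
import math

# Two-tailed 5% critical values of Student's t from a standard t-table,
# keyed by degrees of freedom (n - 2 for a correlation).
T_CRIT_05 = {3: 3.182, 8: 2.306, 18: 2.101, 98: 1.984}


def critical_r(n_subjects):
    """Smallest |r| that reaches p < 0.05 (two-tailed) with n_subjects.
    Only defined here for the sample sizes listed in T_CRIT_05."""
    df = n_subjects - 2
    t = T_CRIT_05[df]
    return t / math.sqrt(t * t + df)


for n in (5, 10, 20, 100):
    print(f"n = {n:3d}: |r| must exceed {critical_r(n):.3f}")
```

With five subjects the correlation has to be close to 0.9 before it counts as significant; with a hundred, a correlation of about 0.2 is enough. So any significant correlation reported from a handful of subjects is large by construction, whatever the true relationship is.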
I come neither to praise fMRI nor to bury it. It’s a miraculous technology; but comes with deep limitations for anyone interested in how neurons do what they do — it records blood flow, slowly, at the resolution of millions of neurons. But all the above are crises of technique, of analysis, of statistics. They are likely common to many fields, and we should be so lucky that our field’s issues are not played out as publicly as those of fMRI. Indeed, in striving to sort out its own house, where fMRI research has gone, others should follow.