Can brain scans transform psychiatry?

Iris Proff
9 min read · Nov 17, 2020

Diagnosing mental disorders by looking at the brain is the dream of many neuroscientists. But is brain data reliable enough for medical use? Seven researchers share their perspective.

For centuries, researchers have contemplated how the brain works — which part does what? A mere 30 years ago, when functional magnetic resonance imaging (fMRI) was invented, they finally had a tool at hand that allowed them to map out the human brain. Scientists found specific brain regions involved in language understanding, fear responses, reward processing and looking at faces, just to name a few. The field grew at mind-boggling speed.

Soon, people started to raise new types of questions: what can brain activity tell us about how people differ? Does region X activate more strongly in individuals who are more impulsive? Such correlations were found, and this sparked a new hope. Maybe, the reasoning went, brain activity can predict whether someone is in peril of becoming psychotic, or which treatment is best for a depressed patient. Psychiatry has historically been plagued by a lack of biomarkers and still depends largely on subjective assessment. Functional neuroimaging seemed a promising candidate to change that.

In their hunt for psychiatric biomarkers, researchers kept using the same methods that had proven useful. However, their target had shifted from the average brain to individual brains. Only recently has it become evident that this approach is flawed. Classical tasks and measures were designed to trigger the same activity in everyone. But finding out how people differ requires strategies that surface individual differences in brain activity. Classical single-region analyses of the brain, researchers from the field agree today, will not do the job. But what will?

The statistical ingredients of a biomarker

Whoever wants to develop a biomarker for a mental or physical condition needs to ensure three test-theoretical properties: high between-subject variability, test-retest reliability and construct validity.

To get a grasp of what each of these terms implies, imagine the following scenario: after an environmental disaster, you are working for an organization that provides help for affected communities. You have a bunch of volunteers signed up and a long list of roles that need to be filled. Your task is to develop a questionnaire that predicts who will thrive in which role.

What do you need to pay attention to? First, you probably want to use questions that people answer as differently as possible. A question that everyone answers in the same way will not allow you to conclude anything about the individual. Second, you want to make sure that when doing the same test twice, a week apart, the same individuals score high on the same abilities. And third, you need to ensure that your questions actually capture the abilities you are interested in. You could ask people for their year of birth and you might get diverse and reliable answers. But it will not help you if someone’s age does not predict whether they are a great manager, meticulous bookkeeper or good with their hands.

Neuroimaging biomarkers require the same three properties as your questionnaire. First, the measured brain responses need to differ between individuals. Second, the brain response of one individual at different moments in time should be consistent. And finally, our measure needs to relate to the concept we are interested in — anxiety, depression, psychosis or the like.
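
The three properties can be illustrated with a toy simulation. This is a sketch with made-up numbers, not real fMRI data: each simulated person has a stable underlying trait, and the "measure" picks it up with some session-specific noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 100

# Each subject has a stable "true" trait plus noise that differs per session.
trait = rng.normal(0, 1, n_subjects)              # what we want to capture
session1 = trait + rng.normal(0, 0.5, n_subjects)  # measurement, week 1
session2 = trait + rng.normal(0, 0.5, n_subjects)  # measurement, week 2

# 1. Between-subject variability: scores should spread out across people.
print("between-subject SD:", session1.std().round(2))

# 2. Test-retest reliability: the same people should score similarly twice.
reliability = np.corrcoef(session1, session2)[0, 1]
print("test-retest r:", reliability.round(2))

# 3. Validity: the measure should track the construct it claims to capture.
validity = np.corrcoef(session1, trait)[0, 1]
print("validity r:", validity.round(2))
```

With noise half as large as the trait differences, both correlations come out high; crank up the noise term and reliability collapses, which is essentially what the studies below found for single-region fMRI measures.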

It turns out that classical fMRI measures do not have those properties.

Maxwell Elliott, Duke University.

Test-retest reliability of classical measures is poor

Maxwell Elliott is a researcher at Duke University. Earlier this year, Elliott and his colleagues published a meta-analysis that systematically investigates the test-retest reliability of single-region fMRI experiments.

In each of the 90 experiments the researchers evaluated, subjects performed the identical task in the scanner twice — days, weeks or months apart. The results were devastating. Test-retest reliability was poor across most studies. In other words, data from one individual’s brain was only slightly more similar to itself across time than it was to another person’s brain. In 2019, a team led by Stephanie Noble at Yale University found similar results regarding functional connectivity. The researchers tested whether the strength of single connections between brain regions is consistent over time. Here, too, test-retest reliability was poor.

These results sparked a hot discussion in the field: how can we claim to make predictions about an individual based on how their brain responds, if it responds differently from one day to the next? “We got very mixed reactions”, says Elliott. “People were very supportive, but some demanded to take down fMRI. That was more than we wanted.”

Ahmad Hariri, head of the research group at Duke University and Elliott’s mentor, is renowned for his research on individual differences with fMRI. In 2015, his group published a study reporting that amygdala activity in response to threat stimuli predicts someone’s vulnerability to life stress. Given the test-retest reliability issue, the relevance of these findings appears questionable today. “We cannot continue investigating individual differences using the same task-based fMRI measures, knowing what we know”, the researcher concludes. For the moment, he has decided to turn away from functional MRI toward structural scans, which are known to be more reliable.

Tor Wager, Dartmouth College.

Measurement error or true variability?

Not everyone is as fatalistic as Hariri. Tor Wager is a neuroscientist at Dartmouth College and an expert in fMRI methodology. His group argues that the conclusions of Elliott and Hariri’s study overgeneralize the problem of test-retest reliability. Often, fMRI measures are not consistent, especially not over periods of weeks or months. But this is not necessarily a problem, Wager states.

There are two possible sources of low test-retest reliability. First, fMRI measures might vary from one time point to the next, simply because they are noisy. But it might as well be that the measures are inconsistent due to true variability in the signal. A subject being tired or bored, what they had for breakfast, the time of the day — all of this might change the true activity in the brain when performing a task. “What fMRI measures are dynamic brain states”, Wager says. “Many of them are not inherently stable — but that doesn’t mean they are useless! An anxiety biomarker should be high only when you’re feeling anxious.”

It is thus tempting to conclude that fMRI is simply not suited to capture stable traits of a person, such as how vulnerable they are to depression. However, there is evidence that speaks against this idea.

Single regions are not enough

Elliott and Hariri’s study was limited to the simplest analysis of fMRI data: one takes the average activity in a predefined region of interest and compares it between two conditions. “This single region approach is not the state of the art when it comes to biomarkers”, Tor Wager is convinced. In fact, recent research into psychiatric biomarkers often considers information from many different parts of the brain. Such multivariate approaches often show a much higher test-retest reliability. A group led by Emily Finn found in 2015 that the functional connectivity pattern between 268 brain regions is remarkably consistent over time. Using this ‘fingerprint’ of the brain, the researchers could re-identify an individual among a crowd of 126 people with over 90 percent accuracy.
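
The fingerprinting idea can be sketched with synthetic data (the subject count, number of connections and noise level below are invented for illustration): if each person's connectivity pattern is stable enough, a second scan can be matched back to the first simply by correlating the patterns.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_edges = 30, 500   # toy stand-in for a 268-region connectome

# Each subject has a stable connectivity "fingerprint" plus session noise.
fingerprints = rng.normal(0, 1, (n_subjects, n_edges))
day1 = fingerprints + rng.normal(0, 0.6, (n_subjects, n_edges))
day2 = fingerprints + rng.normal(0, 0.6, (n_subjects, n_edges))

# Identify each day-2 scan as the day-1 scan it correlates with most strongly.
corr = np.corrcoef(day1, day2)[:n_subjects, n_subjects:]  # day1 × day2 similarity
predicted = corr.argmax(axis=0)        # best day-1 match for each day-2 scan
accuracy = (predicted == np.arange(n_subjects)).mean()
print(f"identification accuracy: {accuracy:.0%}")
```

Note what makes this work: any single connection is noisy, but summing evidence over hundreds of connections averages the noise away — the same logic behind Noble's finding below that the whole matrix is reliable even when its entries are not.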

Stephanie Noble, Yale University.

In 2017, Stephanie Noble compared the reliability of this whole-brain connectivity signature against the reliability of each single connection. Even though individual connections were unreliable, the complete connectivity matrix had good test-retest reliability. “The whole appears to be more than the sum of its parts”, Noble concludes.

So why are multivariate measures more consistent over time? From a purely psychometric point of view, two numbers are better than one at measuring a trait of a person, because they are less vulnerable to noise. But there is a more interesting explanation. Given all we know about the brain, complex states are probably not captured within single regions. Even though some specific functions are localized in one area, the brain works in a radically distributed way. Differences between a depressed and a healthy brain are thus likely not visible in single regions, but rather manifest in the pattern of activity or connectivity spanning many areas of the brain.

That means, if we want to develop biomarkers from brain data, we need to look at distributed neural systems. Opinions about the best way to do that diverge.

How to develop predictive biomarkers from brain data

One camp of researchers is in favor of the machine learning approach to biomarkers. The idea is simple: you collect a bunch of data from across the brain, throw them into a machine learning classifier and out comes a prediction. This approach seems to work fairly well. In 2018, a meta-analysis evaluated twenty studies that used machine learning on brain data to predict how effective a treatment will be for depressed patients. The overall prediction accuracy was 82 percent — which is impressive given that treatments for depression are largely prescribed based on trial-and-error.
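
As a toy illustration of this data-driven recipe (synthetic data, and a simple nearest-centroid classifier standing in for whatever model a given study actually used):

```python
import numpy as np

rng = np.random.default_rng(2)
n_patients, n_features = 200, 50   # e.g. connectivity values per patient

# Synthetic stand-in data: response depends on only a few brain features.
brain = rng.normal(0, 1, (n_patients, n_features))
responds = brain[:, :5].sum(axis=1) > 0          # True = responded to treatment

# Split into training patients and held-out patients.
train, test = np.arange(150), np.arange(150, 200)

# Nearest-centroid classifier: compare each new patient's brain pattern
# to the mean pattern of past responders vs. non-responders.
c_resp = brain[train][responds[train]].mean(axis=0)
c_nonresp = brain[train][~responds[train]].mean(axis=0)

d_resp = np.linalg.norm(brain[test] - c_resp, axis=1)
d_nonresp = np.linalg.norm(brain[test] - c_nonresp, axis=1)
predicted = d_resp < d_nonresp

accuracy = (predicted == responds[test]).mean()
print(f"held-out accuracy: {accuracy:.0%}")
```

The classifier never needs to know *which* features carry the signal or *why* — which is precisely the strength of the approach, and the source of the criticism below.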

Stefan Frässle, ETH Zürich.

The other camp of researchers is sceptical towards this purely data-driven approach. “Machine learning can make predictions, but it doesn’t say anything about the underlying mechanisms”, says neuroscientist Stefan Frässle. He works in the group of Klaas Enno Stephan in Zurich, which is pioneering the young field of Computational Psychiatry.

The researchers’ approach is strictly hypothesis-driven. They use computational models that mimic the neural or behavioral processes that they suspect to underlie mental disorders. A working hypothesis is that many disorders, like schizophrenia or depression, are caused by differences in the way brain regions communicate. Therefore, Frässle and his colleagues often use so-called Dynamic Causal Models (DCM), which allow researchers to estimate how strongly and in which direction signals are passed between predefined brain regions.
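
At the core of DCM is a set of differential equations describing how activity flows between regions. Below is a heavily simplified sketch — just the linear part of the neural state equation with invented coupling values; real DCM adds bilinear modulation terms, a hemodynamic forward model and Bayesian parameter estimation.

```python
import numpy as np

# Simplified neural state equation: dz/dt = A @ z + C * u, where the
# matrix A holds directed coupling strengths between regions and C routes
# the external input u into the network. All values here are made up.
A = np.array([[-1.0, 0.0],    # region 1 decays, gets no input from region 2
              [ 0.8, -1.0]])  # region 2 is driven by region 1
C = np.array([1.0, 0.0])      # the external stimulus enters region 1 only

dt, n_steps = 0.01, 1000
z = np.zeros(2)
trace = np.zeros((n_steps, 2))
for t in range(n_steps):
    u = 1.0 if t < 500 else 0.0          # stimulus on for the first half
    z = z + dt * (A @ z + C * u)         # Euler integration step
    trace[t] = z

# Region 2 responds only via the directed 1 -> 2 connection in A.
print("peak activity per region:", trace.max(axis=0).round(2))
```

Fitting such a model to data means estimating the entries of A — the directed connection strengths that, in this framework, are the candidate biomarkers.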

The test-retest reliability of DCM is yet to be fully established. “From a theoretical point of view, the model separates the true signal from the noise and therefore our hope is that connectivity parameters have higher test-retest reliability than the raw brain data”, says Frässle. However, reliability is not all that matters, the researcher emphasizes. “I can have a perfectly reliable measure that is useless as a biomarker when it does not yield valid results or does not capture disease-related processes.” Structural scans, for instance, are generally more reliable than functional scans. But it might well be that clinically relevant information is captured in the dynamic function and not in the stable structure of the brain.

A need for new tasks

Shifting to multivariate analyses seems a promising strategy — but it might not be enough to obtain reliable biomarkers. The German researcher Juliane Fröhner stumbled upon the problem of test-retest reliability in fMRI when evaluating a study on impulsivity in adolescents over the course of four years. The degree of delay discounting — how strongly someone devalues a reward the longer they have to wait for it — is a standard measure of impulsivity. On the behavioral side, this measure is consistent over time. However, the brain response associated with delay discounting proved not to be consistent at all in Fröhner’s study. A machine learning analysis of the data did not improve its poor test-retest reliability. “If we collect data that is not consistent in time, a fancy analysis strategy cannot change that”, says Vanessa Teckentrup, who was involved in the study.
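
Delay discounting is commonly quantified with a hyperbolic model, in which a fitted parameter k captures how steeply a person devalues delayed rewards. The choice scenario and the k values below are hypothetical.

```python
# Hyperbolic discounting: a reward of a given amount, delayed by D days,
# is subjectively worth V = amount / (1 + k * D). A larger fitted k means
# steeper devaluing of delayed rewards, i.e. more impulsive choices.
def discounted_value(amount, delay_days, k):
    return amount / (1 + k * delay_days)

# Choosing between 50 now and 100 in 30 days, for two hypothetical people:
for k, label in [(0.01, "patient chooser"), (0.2, "impulsive chooser")]:
    delayed = discounted_value(100, 30, k)
    choice = "waits for 100" if delayed > 50 else "takes 50 now"
    print(f"{label}: delayed reward feels worth {delayed:.0f} -> {choice}")
```

The behavioral k estimated from such choices is stable over time; it was the accompanying brain response that failed the retest in Fröhner's study.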

Left: Juliane Fröhner, TU Dresden. Right: Vanessa Teckentrup, University of Tübingen.

Fröhner and Teckentrup have a background in psychology, where there is a long tradition of evaluating the reliability and validity of tests and questionnaires. They advocate for the development of new fMRI tasks, using the tools they know from psychology. Just like a personality test, these tasks should be designed to evoke consistent responses that differ maximally between individuals.

What could these tasks look like? In their meta-analysis, Elliott and Hariri suggest that participants could “watch stimulus-rich movies during scanning instead of completing traditional cognitive neuroscience tasks”. When designing a test to diagnose depression, Elliott proposes, one could pick out those parts of a range of different videos that most reliably discriminate between depressed and healthy individuals.

We don’t get individual differences for free

Overall, developing valid biomarkers from fMRI will require more time, more data, more collaboration between labs and thorough work on the methods. It was tempting to think that the good old method of looking at single regions would help us understand how and why brains differ. However, given the complexity of the brain and the limitations inherent to fMRI, it might not be surprising that this is just not enough.

Or, as Maxwell Elliott puts it: “We thought we would get individual differences for free. But sadly, we don’t!”

Iris Proff

I’m a science writer. Passionate about constructive uses of technology and AI, worried about the planet, intrigued by the brain.