Theory-based Reinforcement Learning in the Brain
A summary of Tomov et al. (2023), Neuron.
How does your brain accomplish everyday tasks such as navigating a subway network, playing a video game, or driving a car? A longstanding answer from psychology and neuroscience is that it relies on two systems: a habitual system that chooses actions reflexively based on incoming stimuli (e.g., “if I see a stop sign, I need to stop”) and a goal-directed system that chooses actions based on their consequences (e.g., “if I don’t stop at this stop sign, I could get a ticket or get in an accident; if I stop, I’ll just be a bit slower, so I should probably stop”).
A tale of two systems
The study of habitual behaviors dates back to the seminal work of Edward Thorndike, who observed that cats trapped in “puzzle boxes” tend to repeat actions that lead to reward. The idea that associations between stimuli and actions are reinforced by their effects was termed the “Law of Effect” and underpinned much of behavioral psychology in the early twentieth century.
The study of goal-directed behaviors emerged as a kind of backlash against such stimulus-response theories. Famously, Edward Tolman observed that animals can learn about an environment even in the absence of reward, such as when rats explore an empty maze. Such “latent learning” is unmasked when a reward (e.g., cheese) is introduced: the animals very quickly learn to navigate to it, as if their behavior had been reinforced all along. Tolman proposed that the brain learns an internal representation of the environment — a kind of “cognitive map” — that supports this kind of flexible planning of behaviors to achieve different goals.
Reinforcement learning, model-free and model-based
These ideas were formalized by computer scientists into a framework known as reinforcement learning (RL). In recent decades, RL has become one of the greatest success stories in artificial intelligence and neuroscience. In the field of artificial intelligence, RL algorithms have allowed state-of-the-art systems to match and surpass humans in a number of domains, from board games to video games to algorithm discovery. In psychology and neuroscience, computational models based on RL have explained behavioral and neural phenomena across a wide range of species and experimental paradigms.
In RL, habitual behavior is formalized by model-free approaches that, through trial-and-error, learn which actions tend to lead to good outcomes in the long run and then choose actions that have been rewarding in the past. In contrast, goal-directed behavior is formalized by model-based approaches that learn which actions lead to which states; that is, they learn an internal model of the environment — a kind of cognitive map — which can be used to plan by simulating the outcomes of different courses of action.
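To make the distinction concrete, here is a minimal sketch of the two approaches in Python (the tiny tabular environment and all names are illustrative, not taken from the paper): the model-free agent updates a lookup table of action values from experienced rewards, while the model-based agent learns a transition and reward model and plans by simulating with it.

```python
import numpy as np

n_states, n_actions, gamma, alpha = 5, 2, 0.95, 0.1

# --- Model-free (habitual): learn action values directly from experience ---
Q = np.zeros((n_states, n_actions))

def model_free_update(s, a, r, s_next):
    # Q-learning: nudge the stored value toward the observed outcome
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error

# --- Model-based (goal-directed): learn a model, then plan by simulation ---
T = np.zeros((n_states, n_actions, n_states))  # estimated transition counts
R = np.zeros((n_states, n_actions))            # estimated immediate reward

def model_update(s, a, r, s_next):
    T[s, a, s_next] += 1
    R[s, a] += alpha * (r - R[s, a])

def plan(n_sweeps=50):
    # Value iteration over the learned model: simulate the consequences of actions
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        P = T / np.maximum(T.sum(axis=2, keepdims=True), 1)  # normalize counts
        Q_plan = R + gamma * P @ V
        V = Q_plan.max(axis=1)
    return Q_plan
```

The habitual agent only ever touches `Q`; the goal-directed agent can replan from its model whenever the rewards or goals change, without relearning from scratch.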
Most of the earlier work on both model-free and model-based RL focused on simple, discrete toy domains with relatively small state spaces (also known as “tabular” domains, as all associations can be stored in a simple look-up table). While this has answered some of the fundamental questions about RL in the brain, it has left open many questions regarding how the brain handles complex, high-dimensional domains with intractably large state spaces. Neuroscientists have recently capitalized on developments in deep RL and have shown that one way the brain might deal with complex environments is by using deep function approximators (or deep neural networks). While promising, this work has focused on model-free RL only, leaving open the question of how the brain might implement model-based RL in complex, high-dimensional domains.
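The sketch below illustrates, under the same simplified assumptions as above, what replacing the lookup table with a function approximator looks like: a deliberately tiny, purely illustrative neural network maps a state vector to action values, and the same kind of update nudges its weights instead of table entries. Real deep RL systems use much larger networks and many additional techniques.

```python
import numpy as np

state_dim, n_actions, hidden, gamma, lr = 8, 4, 32, 0.99, 1e-3
rng = np.random.default_rng(0)

# A tiny one-hidden-layer network mapping a state vector to one value per action,
# standing in for the lookup table used in tabular RL (illustrative only).
W1 = rng.normal(0, 0.1, (hidden, state_dim))
W2 = rng.normal(0, 0.1, (n_actions, hidden))

def q_values(s):
    h = np.maximum(W1 @ s, 0)          # ReLU hidden layer
    return W2 @ h, h

def semi_gradient_update(s, a, r, s_next, done):
    q, h = q_values(s)
    q_next, _ = q_values(s_next)
    target = r if done else r + gamma * q_next.max()
    td_error = target - q[a]
    # Backpropagate the TD error through the chosen action's output only
    grad_W2 = np.zeros_like(W2)
    grad_W2[a] = h
    grad_h = W2[a] * (h > 0)
    grad_W1 = np.outer(grad_h, s)
    W2 += lr * td_error * grad_W2
    W1 += lr * td_error * grad_W1
```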
Intuitive theories
One possible answer to this question comes from theory-based RL. Building on ideas from developmental and cognitive psychology, theory-based RL posits that the model in model-based RL is a kind of intuitive theory: a rich, rule-based, abstract model of the world based on physical objects, intentional agents, causal interactions, and goals. This theory is learned from experience and is used to simulate, explore, and plan in the world.
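The actual theory-learning models (developed by colleagues, as described next) are far richer, but a toy sketch can convey the general idea: a theory represented as explicit object types, interaction rules, and goals, which the agent can consult to simulate what would happen if it acted. Everything in the snippet below is illustrative and not the representation used in the paper.

```python
from dataclasses import dataclass, field

# A toy stand-in for an "intuitive theory": explicit object types, interaction
# rules, and a goal. (Illustrative only; the models in this line of work are far richer.)
@dataclass
class Theory:
    object_types: set = field(default_factory=set)         # e.g., {"avatar", "wall", "gem"}
    interaction_rules: dict = field(default_factory=dict)  # (type, type) -> effect
    goal: str = ""

theory = Theory(
    object_types={"avatar", "wall", "gem"},
    interaction_rules={
        ("avatar", "wall"): "blocked",    # walls stop movement
        ("avatar", "gem"): "pick_up",     # touching a gem collects it
    },
    goal="collect all gems",
)

def simulate_step(theory, avatar_pos, move, objects):
    """Predict the outcome of a move by consulting the theory's rules."""
    new_pos = (avatar_pos[0] + move[0], avatar_pos[1] + move[1])
    for obj_type, pos in objects:
        if pos == new_pos:
            effect = theory.interaction_rules.get(("avatar", obj_type))
            if effect == "blocked":
                return avatar_pos, None          # movement blocked, nothing happens
            if effect == "pick_up":
                return new_pos, (obj_type, pos)  # move and collect the object
    return new_pos, None                          # empty square: just move

# Planning then amounts to searching over simulated moves until the goal holds.
```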
In a number of studies, my colleagues Pedro Tsividis and Thomas Pouncy showed that theory-based RL can capture patterns of human learning and exploration in complex domains such as video games. In contrast, model-free deep RL agents failed on both counts, learning orders of magnitude more slowly and exploring more randomly than humans. Given that humans appear to use something akin to theory-based RL to learn and plan in complex environments, we wanted to know how it might be implemented in the brain.
A priori speculations
To answer this question, we asked human volunteers to play Atari-style video games while undergoing functional magnetic resonance imaging, or fMRI, a noninvasive brain imaging technique that measures the blood-oxygen-level-dependent (BOLD) signal across the entire brain. When neurons in a given brain region are firing, the flow of oxygen-rich blood to that region is increased, which is captured by the BOLD signal. By comparing how well different variables can explain the BOLD signal in different brain regions, researchers can test hypotheses about what the patterns of neural firing in those regions represent.
In our case, this allowed us to ask questions such as:
- What regions represent the learned theory?
- What regions respond to theory updates?
- How do theory-based representations compare to deep RL representations in explaining brain activity?
- How does information flow between these regions?
Before commencing data collection, we speculated about the answers to these questions. Previous studies have reported evidence of abstract cognitive maps in orbitofrontal cortex (OFC) — a region in the front of the brain, right above the eyes. Additionally, in previous work from our own lab, we found evidence of abstract rule learning in posterior parietal cortex (PPC) — a region in the back of the brain, right below the crown of the head.
This led us to hypothesize that the brain might store theories in OFC and compute theory updates in PPC, perhaps factored into updates for different theory components (objects, relations, goals) across different subregions. We also hypothesized that information would flow bottom-up, from early visual regions in occipital cortex (in the back of the brain), through theory-updating regions in PPC, to theory-coding regions in OFC. In fact, we were so confident in these predictions that we questioned whether the study was even worth doing in the first place.
Opening the black box
When we eventually conducted our study and analyzed the data, we were in for a surprise. Overall, it turned out that we got the high-level picture more or less right: we found evidence of theory representations in prefrontal cortex (the front of the brain) and of theory updates in posterior cortex (the back of the brain).
However, as we report in our paper, we were mistaken on the details:
- We found evidence of theory representations in inferior frontal gyrus (IFG), rather than OFC. In hindsight, IFG makes more sense in light of our prior work on causal inference in the brain, in which we identified representations of abstract causal rules in IFG.
- We found evidence of theory updates in occipital cortex and the ventral visual stream — a group of brain regions extending from the back to the bottom of the brain — rather than in PPC and the dorsal visual stream — a group of brain regions extending from the back to the top of the brain. In hindsight, this makes sense, as the ventral visual stream (also known as the “what” pathway) is associated with object recognition — i.e., what we are seeing — whereas the dorsal visual stream (also known as the “where” pathway) is associated with localization, motion, and action — i.e., where things are. Interestingly, deep model-free RL shows the opposite pattern, explaining brain activity in the dorsal stream rather than the ventral stream.
- We found evidence of “multiplexed” update signals for different theory components (objects, relations, goals) in the same regions, rather than separate update signals in different subregions.
- We found evidence that theory representations are transiently activated during updating but otherwise appear to be stored “silently” during gameplay, rather than being persistently active. In hindsight, this is also consistent with our prior work on causal inference.
- We found evidence that, during gameplay, information tends to flow top-down, from theory-coding to sensory regions, and during theory updating, the pattern flips, with information flowing bottom-up, from sensory regions to theory-coding regions. In hindsight, this would be expected under hierarchical predictive processing, according to which models of the world (in our case, the theory) in higher regions constrain sensory predictions in lower sensory regions, which in turn compute prediction errors that update the models (in our case, theory updates) in higher brain regions.
What does this mean?
Until recently, the study of model-based control in the brain was limited to small, simple toy domains that can answer particular scientific questions but generally lack ecological validity. Our work builds on state-of-the-art advances in cognitive science and artificial intelligence, and its findings can serve as a foundation for studying how the brain implements model-based control in complex, naturalistic environments. We also hope our approach can serve as a blueprint for combining sophisticated end-to-end cognitive models with realistic domains such as video games. We believe that findings in such naturalistic environments have greater ecological validity and therefore greater potential for application to real-world problems in clinical as well as educational settings.