Why does the brain have a reward prediction error?

Mark Humphries · Published in The Spike
10 min read · Feb 11, 2019


Dopamine, and the art of feedback


A deep success story of modern neuroscience is the theory that dopamine neurons signal a prediction error, the error between what reward you expected and what you got.

Its success runs deep. It has been supported by converging evidence from the firing of neurons, the release of dopamine, and the blood flow seen in fMRI. That evidence has been gathered across diverse species, including humans, monkeys, rats, and bees. There is even causal evidence: forcing dopamine neurons to fire sends error signals in the brain, with effects we can see in the behaviour of the animals whose dopamine neurons are being toyed with. The theory bridges data from the scale of human behaviour down to the level of single neurons. Unlike many theories of the brain, this one is properly computational, and makes multiple, non-trivial predictions that have turned out to be true. Dopamine and errors in predictions are intimately intertwined.

But this intimate link raises a bigger but rarely articulated question. It’s perfectly possible to build a brain that learns from errors without any explicit representation of that error in the brain. So why does the brain have an error signal for rewards at all?

To understand that question, first we need to know a little about the prediction error theory itself. The theory says that dopamine neurons fire to unexpectedly good things. If I suddenly tap you on the shoulder and hand you a sweet, your dopamine neurons go ping! for the sweet.

If I keep tapping you on the shoulder, and keep giving you a sweet, your dopamine neurons stop going ping for the sweet — getting a sweet is great but it is no longer unexpected (and frankly you’d rather I respected your personal space a bit more). Instead, the dopamine neurons go ping! for the tap itself. This is the clever bit: the neuron goes ping! because the tap on the shoulder now reliably predicts a sweet is coming (a good thing), but it’s unexpected because you don’t know when the tap is coming — so the tap on the shoulder becomes the unexpectedly good thing.

The theory also says that dopamine neurons, like people, are deeply upset by their routines being violated. Having established this relationship of trust — me repeatedly tapping you on the shoulder and you at least getting a series of sweets out of this violation of social norms — what happens if I tap you on the shoulder and then don’t give you a sweet? Your dopamine neurons then shut down completely, stop firing at all for a brief period.

In short, dopamine neurons send a rapid signal that covers all three possible errors in predicting a reward: that the reward was better than expected (a positive error); that the reward was exactly as expected (no error); or that the reward was less than expected (a negative error). We can label all this using one of those torturous compound nouns beloved of scientists: dopamine neurons send a reward prediction error.
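(If you like your errors as arithmetic, here is the same thing as a tiny Python sketch, with made-up numbers: a sweet is worth 1, and “expected” is whatever the tap has come to predict.)

```python
# Hypothetical numbers only: one sweet is worth 1.
expected, got = 0.0, 1.0
print(got - expected)   # +1: an unexpected sweet, better than expected (ping!)

expected, got = 1.0, 1.0
print(got - expected)   #  0: the sweet was fully expected, so no error (no ping)

expected, got = 1.0, 0.0
print(got - expected)   # -1: the promised sweet never came, worse than expected (the dip)
```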

This correspondence between dopamine and “reward prediction error” has its roots in the branch of AI called reinforcement learning (well, technically, it’s a branch of machine learning, but as everything is now labelled AI, including a FitBit which I’m pretty sure is just an accelerometer with a strap, then AI it is). Reinforcement learning is the accumulation of algorithms for how something can learn from being told only how wrong or right its own predictions were.

All of the classic algorithms of reinforcement learning have an explicit signal for the error in predicting how valuable a choice will be (the roll call of algorithms includes bandits, Temporal Difference learning, Q-learning, SARSA, and Actor-Critic). This signal is the difference between the predicted value of what happens next and the actual value of what happens next, where value is measured as the expected amount of future reward. The magic of reinforcement learning is that by simply minimising this error between the predicted and actual value of each next thing in the world, an artificial agent can learn remarkably complex sequences of events, like navigating across a world, or how to run.

And this is the computational part of the dopamine theory: that the rapid responses of dopamine neurons just are the prediction error of reinforcement learning algorithms. That they are the error between the predicted and actual value of what happens next. And that they are used to learn. The key to this theory is not just that the dopamine neurons signal the difference between what reward you got and what you expected. It is that they also transfer that signal to unexpected things that predict reward, exactly as reinforcement learning algorithms say they should.
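To see that transfer fall out of the algorithm, here is a minimal TD-style sketch in Python. It is a toy of my own making, not a model fitted to any dopamine data: the tap is the cue, the sweet is worth 1, and the moment before the tap predicts nothing, because the tap arrives at an unpredictable time.

```python
alpha = 0.2    # learning rate
V_tap = 0.0    # the value the tap (the cue) has come to predict
sweet = 1.0    # the reward

for trial in range(100):
    # error at the tap: the tap itself is unpredicted, so the moment before it predicts 0
    delta_tap = V_tap - 0.0
    # error at the sweet: what actually arrived, minus what the tap predicted
    delta_sweet = sweet - V_tap
    # learning: nudge the tap's predicted value towards what actually followed it
    V_tap += alpha * delta_sweet
    if trial in (0, 99):
        print(f"trial {trial}: error at tap = {delta_tap:.2f}, "
              f"error at sweet = {delta_sweet:.2f}")

# Trial 0: the error sits on the sweet (0.00 at the tap, 1.00 at the sweet).
# Trial 99: the error has moved to the tap (1.00 at the tap, 0.00 at the sweet).
# And if the sweet were now withheld, the error at that moment would be about -1: the dip.
```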

This is not to say that dopamine neurons are only encoding this prediction error. There are many nuances to what dopamine neurons themselves may be interested in, a super-set of things beyond prediction errors. And indeed errors in predicting reward are but a sub-set of the possible errors in predictions about the world that could exist in the brain (a story for next time). But that dopamine neurons encode an error in predicting reward seems a well-established part of what they do.

(And this proposed correspondence between the rapid response of dopamine neurons and a prediction error is true of more elaborate reinforcement learning algorithms too, such as the exciting revival and extension of Peter Dayan’s “successor representation” idea by Sam Gershman, Ida Momennejad, Kim Stachenfeld and colleagues. In the successor representation account, there is not one simple error between what you predicted and what you got, but a whole vector of errors about predictions for changes to different features of the world — one of which is reward. A recent paper from Gershman and colleagues shows how thinking of the rapid dopamine neuron response as the sum of those errors can explain some perplexing recent findings about dopamine neurons sending rapid signals to changes in the world that aren’t reward.)
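(For the curious, here is a rough sketch of that vector-of-errors idea, with sizes and numbers invented purely for illustration rather than taken from the published models.)

```python
import numpy as np

gamma, alpha = 0.9, 0.1
phi_now = np.array([1.0, 0.0, 0.0, 0.0])  # features of the world observed right now
M_here  = np.array([0.5, 0.1, 0.0, 0.2])  # predicted discounted future features, from here
M_next  = np.array([0.0, 0.6, 0.3, 0.9])  # the same prediction, from the state we move to

# not one error but a whole vector of them, one per feature of the world
# (one of those features being reward)
delta_vec = phi_now + gamma * M_next - M_here
M_here += alpha * delta_vec

# the rapid dopamine-like response, read as the sum of those feature errors
dopamine_like = float(delta_vec.sum())
print(delta_vec, dopamine_like)
```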

But there was no need for this correspondence twixt neuron and theoretical error signal to exist. The algorithms of reinforcement learning are based on observations of animal behaviour. And they can be very successful: animals, including humans, often really do behave as if they are using an error in predicting reward to learn about the world. But just because we can describe behaviour using an error in prediction about reward, it does not follow that there has to be an explicit error signal in the brain.

For it is perfectly possible to construct a system that learns about the world using feedback, yet has no explicit signal for the error in its predictions. One example of such a system is a Bayesian agent: one that learns about the probabilities of things in the future, rather than certainties.

Such a Bayesian agent might represent the uncertainty about what the value of taking action A will be. This uncertainty will be encoded by a probability distribution — which we might write P(value|action A) — for the possible values of taking action A. For example, there might be a high probability that taking action A will have a low value, and a low probability it will have a high value; or vice-versa; or something far more complicated.

We plonk our poor Bayesian agent in the dullest world imaginable. Its entire life consists of choosing which of three levers it should pull in order to win a coin, over and over again. As the chances of winning a coin are different between the three levers, so the agent has to work out which to pull in order to get the most coins in the long run. Three levers, so three possible actions, so three corresponding probability distributions for the value of each lever. Each round the agent picks a lever based on those probability distributions — perhaps it tends to pick the one which currently gives the highest probability of the largest reward — and watches for the coin.

Coin or not, the agent uses the outcome to update its probability distribution. A coin is evidence that the lever is good, so the agent increases the probability that pulling the lever has a high value; no coin is evidence that the lever is no good, so the agent increases the probability that pulling the lever has a low value. Either way, the agent now has more information about the action it chose, regardless of whether it was a good outcome or a bad outcome. The probability distribution for that action is updated to reflect that information by changing the parameters of the distribution.
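One minimal way to set this up in Python, choosing for the sake of the sketch coin-or-no-coin outcomes, a Beta distribution for each lever's belief, and a Thompson-sampling-style choice rule (sample a plausible value from each belief, pull the best):

```python
import random

true_p = [0.2, 0.5, 0.8]   # each lever's real chance of paying a coin (hidden from the agent)
coins  = [1, 1, 1]         # Beta parameters: 1 + coins won from each lever
misses = [1, 1, 1]         # Beta parameters: 1 + empty pulls of each lever

for trial in range(1000):
    # choose: sample a plausible value from each lever's belief, pull the best one
    samples = [random.betavariate(coins[i], misses[i]) for i in range(3)]
    lever = samples.index(max(samples))
    coin = random.random() < true_p[lever]

    # update: the outcome is folded straight into the distribution's parameters.
    # At no point is a prediction error computed.
    if coin:
        coins[lever] += 1
    else:
        misses[lever] += 1

print("believed chance of a coin:",
      [round(coins[i] / (coins[i] + misses[i]), 2) for i in range(3)])
```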

There is no error signal. The agent is learning from feedback about the world, and can use its learning to make decisions, but has no prediction error signal. Sure, we could construct one — by computing the difference between the probability distributions before and after the coin arrived — but we don’t need one. The error signal is implicit.

Again, this is behaviour, not yet the brain. But many believe the brain represents the world using probability distributions; and there are plausible theories for how to represent and update probability distributions using neurons. These boil down to adjusting the firing of the population of neurons representing a probability distribution. And you do that by adjusting the strengths of the inputs to those neurons (whether those inputs be from within the population or outside it). So the brain just needs a signal about whether or not a reward occurred, and to use that signal to adjust connections. No complicated signal about the error in predictions is needed.
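As a toy illustration, and a deliberately crude one of my own rather than any of those published theories: a handful of neurons whose normalised firing stands in for the distribution over an action's value, with the strengths of their inputs nudged by nothing more than whether a reward arrived.

```python
import numpy as np

np.random.seed(1)
values = np.linspace(0.0, 1.0, 5)   # the value each neuron in the population stands for
w = np.ones(5)                      # strengths of the inputs onto those neurons
rate = 0.05

def belief(w):
    firing = w * 1.0                 # firing driven by a unit input
    return firing / firing.sum()     # normalised population activity, read as a distribution

for trial in range(200):
    reward = np.random.rand() < 0.8  # this action pays off most of the time
    if reward:
        w += rate * values           # a reward strengthens inputs to neurons voting for high values
    else:
        w += rate * (1.0 - values)   # no reward strengthens inputs to neurons voting for low values

print(np.round(belief(w), 2))        # the mass has tilted towards high values; no error term anywhere
```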

So a brain could learn from reinforcement with or without an explicit signal for errors in predicting that reinforcement. But the brain does have an explicit error signal encoded by dopamine neurons. What does this tell us?

I think it suggests three interesting ideas for how the brain works. And I do mean “think”: I’m fully prepared to be wrong about this, and for there to be a water-tight argument for why you can’t build a brain without an explicit signal for errors in predicting reward.

The first idea is that the existence of an explicit error signal implies the existence of a simple representation of the world in the brain. A so-called “model-free” representation that does not represent every possible outcome of an action, and likely does not use probability either. A quickly accessible look-up table of the values of actions, used to choose actions when time is pressing or the world is unchanging. We already have some good ideas of where such representations live in the brain. And all forms of such simple representations we know about require an explicit signal for the error between actual and predicted values.

A second idea is that what is one concept in reinforcement learning is actually two processes in the brain. The one concept in reinforcement learning is that you use the error in your prediction to change your estimate of an action’s value. Why is this two processes in the brain? Because the brain might want to separately control short-term and long-term changes in the estimates of an action’s value. And having an explicit error signal carried by dopamine lets it do both with one signal.

To get long-term changes we could adjust our estimate of an action’s value by changing up or down the strength of connections onto neurons representing that action. Adjusting our estimate of value in this way changes long-term behaviour. And the rapid dopamine signal is indeed thought to control whether and in which direction some connections in the brain are allowed to change their strengths. Here you need the sign of the error signal to tell the connections which direction to change in.

But the brain doesn’t necessarily want each and every bit of feedback it gets to change a connection between neurons. For that locks it into a path from which it might be difficult to recover. Indeed, when we try and change the strengths of these connections ourselves, by stimulating the inputs to a neuron, some of them can prove remarkably difficult to shift. Which raises the possibility that, in the short-term, the brain may want to hedge its bets, by changing its estimates of an action’s value without changing any connection strengths. And it can do this by instead changing how responsive neurons are to their inputs. If you make the neuron for action A more likely to fire, then you’ve increased its predicted value; and vice-versa. Guess which transmitter in the brain has many hundreds of papers showing it changes the responsiveness of neurons that control action? Yep, dopamine.

Put together, the argument here is that the explicit error signal exists to allow the brain to control changes of predicted value on two time-scales, using one error signal coded by dopamine: changing connection strengths over the long term, and changing how responsive neurons are in the short term.
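In toy form, with made-up numbers and a single made-up neuron: the same dopamine-like error used twice, once to make a slow change to a connection strength, and once to make a fast but leaky change to the neuron's responsiveness.

```python
w = 0.2       # connection strength onto the neuron for action A (changes slowly)
gain = 1.0    # how responsive that neuron is to its inputs (changes quickly, then leaks away)
slow, fast, leak = 0.02, 0.3, 0.9

for trial in range(100):
    predicted = gain * w              # the action's current predicted value
    delta = 1.0 - predicted           # a dopamine-like error: this action always pays 1
    w += slow * delta                 # long-term: the synapse changes, in the direction delta points
    gain += fast * delta              # short-term: the neuron is made more (or less) responsive now
    gain = 1.0 + leak * (gain - 1.0)  # ...but that responsiveness drifts back towards baseline

print(round(w, 2), round(gain, 2))    # the weight creeps up trial by trial; the gain jumps at once but leaks back
```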

The third idea is that an explicit error signal in the brain is evolutionary happenstance. Building a system to learn from feedback is easier with an explicit error signal than with representations of probabilities across a group of neurons. Ancient animals likely had a neuron or two that spritzed dopamine, or something similar, as part of their control of movement. We can find plenty of invertebrates with just a few thousand neurons in which dopamine alters movement by changing the ways neurons respond to their inputs. With this dopamine system in place, perhaps the path of least resistance for evolution was to co-opt this broadcast signal to change the coupling between neurons following an error. Which seems potentially easier than, from the same crude beginnings, first evolving a distributed system for representing information that does not require an explicit error signal.

The contributions of theory to neuroscience are as much about showing what the brain doesn’t or can’t do as about what it can do. Yes, if we allow any arbitrary idea, the space of such theories is practically infinite: theories showing that the brain doesn’t use strawberry jelly as a neurotransmitter, or doesn’t compute on the back of an envelope with a blunt pencil, are not useful.

But here we find an explicit error signal in the brain, and that rules out a whole class of ways of learning from feedback, and rules some in. The reward prediction error theory of dopamine tells us as much by what it doesn’t do as by what it does. In the garden of forking paths, we should be glad of some help — and few garden paths are more forking complicated than the brain.

Want more? Follow us at The Spike

Twitter: @markdhumphries

Enjoyed this story? Then consider signing up to become a Medium member: $5 a month gives you unlimited access to all stories on Medium, and supports all their writers. If you sign up using my link, I’ll earn a small commission: https://drmdhumphries.medium.com/membership


Mark Humphries
The Spike

Theorist & neuroscientist. Writing at the intersection of neurons, data science, and AI. Author of “The Spike: An Epic Journey Through the Brain in 2.1 Seconds”