Oleg Solopchuk
7 min read · Jun 28, 2018

Intuitions on predictive coding and the free energy principle

This non-technical post is about theoretical neuroscience and aims to give simple visual intuitions without going into the underlying math. If you would like a more thorough treatment of free energy, check out this newer post:

Our perception largely depends on what we expect to perceive. We expect that coffee is hot, so we take the first sips slowly. We expect that subway trains move, so after entering the carriage we apply extra muscle force to counteract the acceleration we know is coming. Indeed, it would be inefficient (and dumb) for our brain to just passively encode the massive stream of sensory information. This intuition stands behind the predictive coding model, which appeared in the late 90s. A few years later, it turned out that predictive coding naturally emerges as a model of perception under a unified brain theory, the Free Energy Principle. Unfortunately, academic formalism often assumes familiarity with certain theoretical concepts, leaving a non-expert reader without an intuitive grasp of the core ideas. What follows is an informal introduction to predictive coding and the free energy principle, inspired by a recent tutorial article.

Suppose you've just bought a new camera and gone for a forest promenade. After some time, you suddenly see a wonderful bird flying above in the trees. But… right before you press the shutter button, the bird disappears into a cloud. As you start sinking into frustration, you hear some faint chirps, but strangely, from a slightly different location.

So where is the bird? You have to prepare the camera for an epic shot, so should you trust your initial estimate based on vision, or rather the new information provided by your ears? "I'd use both," you say, and get the final estimate (Posterior) by combining the initial belief (Prior) with the new observation (Likelihood). This is called Bayesian (read: probabilistic) inference, and it can be summarized as follows:

Posterior ~ Prior * Likelihood

Intuitively, it seems like the Posterior estimate should be somewhere in between the Prior (initial belief) and the Likelihood (observation), within a certain distance from both:

Posterior is a combination of Prior and Likelihood. Imagine that it's somewhere in between the two, with the distances denoted by angry smileys. So if the Posterior is the same as the Prior, the corresponding distance is 0. The technical term for these distances is 'prediction errors'.

Now imagine squeezing this scheme to a more compact form:

We overlaid our simplistic model on a formal predictive coding scheme. Each smiley corresponds to a hypothetical neuron defined in the formal scheme. Details like the weights between neurons and the direction of influences are not important at this point; we will return to them later. What matters is the conceptual meaning of each neuron and of the scheme as a whole.

So it turns out… predictive coding is just Bayesian inference, roasted and served with some mathematical mayonnaise.

We figured out what predictive coding is doing (combining initial guesses with new observations to get the final estimate), so now let's get some more details on how it works. First of all, we have to embrace uncertainty: both the Prior and the Likelihood are not single numbers but probability distributions (i.e. instead of being absolutely certain about one value, we accept that many are possible, each with a certain probability). On one hand, the Prior distribution tells us how probable it is that the bird is at a given position before we hear the chirp. On the other hand, the Likelihood tells us how probable this particular chirp is, given that the bird is at a given position. In other words, it describes how likely a bird at that position is to generate this particular chirp. Note: the Likelihood is a function of the bird's position, not of the chirp. A reasonable choice for both is the normal distribution, since many things naturally follow it:

The Prior is a single normal distribution (parametrized by a mean and a variance), describing the probability of different bird positions. The Likelihood, however, is a bit more tricky: we evaluate a fixed observation (a chirp coming from some location) under different possible causes (bird positions). Imagine that every possible bird position is the mean (top of the curve) of a separate normal distribution. What the Likelihood curve shows us is: if the mean of the distribution is at this position, how likely would we be to hear this specific chirp from the specific position where it occurred? If there is a weird transformation between a position and the mean of the corresponding distribution, the Likelihood can look very different from the canonical bell shape.
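To see the recipe Posterior ~ Prior * Likelihood at work, here is a minimal Python sketch; all the numbers (bird seen near position 3, chirp heard near position 2, spreads of 1) are made up purely for illustration:

```python
import numpy as np

# Made-up numbers: vision says the bird is near position 3, the chirp seems
# to come from near position 2, and both sources have a spread of 1.
x = np.linspace(-5.0, 10.0, 1501)              # candidate bird positions
prior = np.exp(-0.5 * (x - 3.0) ** 2)          # belief before hearing the chirp
likelihood = np.exp(-0.5 * (2.0 - x) ** 2)     # how well each position explains the chirp

posterior = prior * likelihood                 # Posterior ~ Prior * Likelihood
posterior /= posterior.sum()                   # normalize: beliefs sum to 100%

print(x[np.argmax(posterior)])                 # the peak lands at 2.5, between 3 and 2
```

On a one-dimensional grid this brute-force multiplication is easy; as the next paragraph explains, it stops being easy once the model gets more interesting.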

So we have to somehow combine two distributions (remember Posterior ~ Prior * Likelihood?). Unfortunately, this is really hard in most cases. First, in order to be legit, the Posterior should be normalized (our beliefs should sum up to 100%), and this is computationally challenging. Second, if there is some weird transformation between the chirp and our representation of position (imagine that sounds are corrupted by wind), the Posterior will no longer be a nice, easy-to-represent normal distribution. No despair though: instead of getting the complete picture (the whole Posterior distribution), it would suffice to find the point with the highest probability and take it as our best guess. So we can start with a completely random guess and then iteratively refine it. Imagine climbing a hill with an eye mask on: you cannot see exactly where the peak is, but you can probe the slope under your feet and make small steps uphill until you reach the top:

We initialize our guess at a random position, and then iteratively climb the hill by following the slope.

The cool thing is that this slope is mainly composed of the distances of our Posterior estimate from the Prior and the Likelihood (remember the angry smileys, a.k.a. prediction errors?). This rule for refining the Posterior estimate defines the arrows between neurons in the formal predictive coding scheme we saw above. You can check the full derivation (and try it yourself as an exercise).
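Here is what that blindfolded hill climbing could look like in code. It is only a sketch with the same made-up numbers (Prior mean 3, chirp at 2, both variances 1, identity mapping between position and sound); the update is a plain gradient step on the log of the unnormalized Posterior, in the spirit of the derivation in Bogacz's tutorial rather than a copy of it:

```python
# A minimal sketch with assumed numbers: Prior mean 3, chirp heard at 2,
# both variances set to 1, identity mapping between position and sound.
v_p, var_p = 3.0, 1.0      # Prior: mean and variance
u, var_u = 2.0, 1.0        # observation (chirp location) and its noise variance

phi = 0.0                  # start from an arbitrary guess of the bird's position
step = 0.05                # how far we move uphill at each probe

for _ in range(500):
    eps_p = (phi - v_p) / var_p     # prediction error w.r.t. the Prior
    eps_u = (u - phi) / var_u       # prediction error w.r.t. the observation
    phi += step * (eps_u - eps_p)   # the slope is built from the two errors

print(phi)                 # settles at 2.5, between the Prior (3) and the chirp (2)
```

The two error terms are exactly the angry smileys from the earlier figure: each one pulls the estimate toward its own anchor, and the estimate stops moving where the pulls balance.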

To be complete, we are assuming that the Prior and Likelihood distributions both have a spread (variance, the width of the bell) of 1, which simplifies the calculations and corresponds to the original model of predictive coding. But what if we are very certain about our Prior (say, we saw exactly where the bird flew), while the observations are uncertain (the chirps are very quiet), or vice versa? To this end, the updated model contains precision weights (representing how precise/confident our beliefs are), leading to "precision-weighted prediction errors".

In this case we are more confident about our Prior than about the Likelihood… so prediction errors related to the Prior are more "precise". Note: precision is just the inverse of the variance of a normal distribution: high precision means low spread, and vice versa.
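To get a feel for precision weighting, we can rerun the same toy sketch with a confident Prior and a noisy chirp (again, entirely made-up numbers):

```python
# Same toy setup, but now the Prior is precise (small variance) and the
# chirp is noisy (large variance); precision is just 1 / variance.
v_p, var_p = 3.0, 0.25     # confident Prior: we saw roughly where the bird flew
u, var_u = 2.0, 4.0        # faint chirp: the observation is unreliable

phi = 0.0
step = 0.05
for _ in range(2000):
    eps_p = (phi - v_p) / var_p        # prediction error weighted by precision 4
    eps_u = (u - phi) / var_u          # prediction error weighted by precision 0.25
    phi += step * (eps_u - eps_p)

print(phi)   # ends up near 2.94: the precise Prior dominates the estimate
```

Swap the two variances and the same loop settles near 2.06 instead, now dominated by the observation; these shifting pulls are exactly the precision-weighted prediction errors from the figure.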

To sum up, we are assuming that our Prior beliefs and the Likelihood of incoming data follow normal distributions, and we are iteratively trying to find the most likely value under the Posterior.

The Free Energy Principle

So how does this thing relate to Free Energy? Think about Free Energy as just a mathematical tool: a function that depends only on our model, the observations, and our guess about the Posterior, and that greatly simplifies the calculations (the name comes from physics and has nothing to do with energy as we know it). It turns out that predictive coding is nothing but Free Energy minimization applied to perception. In fact, if we approximate the Posterior distribution with just one value (as done above), the unnormalized Posterior curve we were climbing before (more precisely, its logarithm, which peaks at the same place) is equal to the negative Free Energy (plus some constant). Maximizing the negative Free Energy is the same as… minimizing Free Energy (pushing a quantity's negative up is the same as pushing the quantity itself down).

In this special case, the curve we were climbing before corresponds to the negative Free Energy plus a constant [the Posterior is written with a star because we haven't normalized it, so it's actually an "unnormalized Posterior"].
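To spell out this special case in the same style as the earlier formula (with 'guess' standing for our single-value approximation of the bird's position), one way to write it is:

negative Free Energy(guess) = log[ Prior(guess) * Likelihood(chirp | guess) ] + constant = log[ unnormalized Posterior(guess) ] + constant

so climbing the (log of the) unnormalized Posterior and pushing Free Energy down are literally the same walk.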

However, there is more to the story than this — there are two distinct ways to minimize Free Energy. We can do it by improving our guess about the true Posterior (shown above). But we can also do it by improving our model of the environment (shown below).

In fact, we can adjust our model to better predict the observations. In other words, we want to assign high probability to events… that are actually likely to happen. Formally, this reads as 'maximize the probability of observations under our model'. This can be done by adjusting the connections between neurons (synaptic weights, or just 'weights'). Since Free Energy depends on our model, we can tweak the weights to minimize it (maximize its negative). Let's look at the probability of observations in our bird example. We want to find the probability of the observation irrespective of the cause (the probability of hearing a chirp coming from a certain direction, regardless of the bird's position). To this end, we need to sum up the Likelihood over all possible causes (all possible positions), taking into account how likely each cause is in the first place, i.e. its Prior (think of summing the nasty Posterior curve we were climbing before over all possible bird positions), and that's really hard. However, we can maximize the probability of observations using Free Energy (see below).
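To make 'the probability of the observation regardless of the cause' concrete, here is what that sum looks like in the one-dimensional toy example, where brute force is still feasible (in realistic models it is not, which is exactly why the Free Energy workaround is needed); the densities and numbers are the same illustrative assumptions as before:

```python
import numpy as np

# Same made-up numbers: Prior is a normal density centred at 3, the chirp is
# heard at 2, and the mapping from position to sound is the identity.
x = np.linspace(-10.0, 15.0, 5001)                            # all possible bird positions
prior = np.exp(-0.5 * (x - 3.0) ** 2) / np.sqrt(2 * np.pi)    # proper Prior density
likelihood = np.exp(-0.5 * (2.0 - x) ** 2) / np.sqrt(2 * np.pi)

# Probability of the chirp regardless of the cause: Likelihood * Prior,
# summed (integrated) over every possible bird position.
dx = x[1] - x[0]
p_obs = np.sum(likelihood * prior) * dx
print(p_obs)   # about 0.22; a better-tuned model would make this number larger
```

Since the negative Free Energy is a lower bound on the log of this quantity, pushing Free Energy down nudges the number up without ever computing the sum explicitly.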

So by minimizing Free Energy we not only bring our guess closer to the true Posterior, but also improve our model of the environment and make the observations more likely. Imagine climbing the 'probability of observations' curve by iterating the steps shown in the previous two figures:

These steps alternate on different timescales: while Free Energy is minimized as much as possible through 'Posterior approximation', we make only slight changes to the model (i.e. the weights) at every observation. Intuitively, this is because we want our model to be 'good on average', not 'good for the current observation'.
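Here is a toy sketch of the two timescales, still using made-up numbers: for every chirp we let the Posterior guess settle in a fast inner loop, and only then nudge one model parameter, the mean of the Prior, by a tiny amount. Moving the Prior mean along its own prediction error is one simple update consistent with descending Free Energy; it is meant as an illustration, not the full learning scheme:

```python
import numpy as np

# Toy two-timescale sketch with assumed numbers: chirps keep arriving from
# around position 2, while the model's Prior mean starts at 3 and slowly learns.
rng = np.random.default_rng(0)
v_p, var_p = 3.0, 1.0          # the model parameter we will slowly adjust
var_u = 1.0                    # fixed observation noise
fast, slow = 0.1, 0.01         # inference step size vs. learning step size

for _ in range(1000):                      # slow loop: one pass per observation
    u = 2.0 + rng.normal(0.0, 1.0)         # a new chirp, scattered around position 2
    phi = v_p                              # start inference from the current Prior
    for _ in range(100):                   # fast loop: let the Posterior guess settle
        eps_p = (phi - v_p) / var_p        # prediction error w.r.t. the Prior
        eps_u = (u - phi) / var_u          # prediction error w.r.t. the chirp
        phi += fast * (eps_u - eps_p)
    v_p += slow * eps_p                    # tiny model update after each observation

print(v_p)   # has drifted from 3 to about 2, the value that explains chirps on average
```

The inner loop corresponds to perception (fast, run to convergence for each observation), the single line after it to learning (slow, a small nudge per observation), which is exactly the timescale separation described above.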

In the next post on Active Inference, we will consider in detail why it is important for an agent to have a model that assigns high probability to observations.

The great thing about the 'free energy trick' is that you can approximate the Posterior distribution with… a distribution, meaning that you can quantify uncertainty instead of getting only one value. This is a key, widely used advantage of the method. Similarly, you could also use a normal distribution to approximate the Posterior in predictive coding, which should work better but makes things considerably more math-heavy.

Connection to neural networks and backpropagation

Just in case you are interested in how the predictive coding model relates to neural networks, read on. We can extend the simple predictive coding model discussed above by adding more layers and more neurons in each layer. In such a network, the Posterior estimate from the layer above in the hierarchy becomes the Prior for the layer below, yielding Hierarchical Predictive Coding networks.
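As a rough illustration of how the layers stack, here is a toy two-level chain with assumed linear weights: the estimate at the top level, scaled by a weight, serves as the Prior mean for the level below, and every level is refined by the same kind of prediction errors as before. The numbers and weights are invented for the example:

```python
# A toy two-level hierarchy with assumed linear weights: the top estimate,
# scaled by w2, acts as the Prior mean for the level below, whose estimate,
# scaled by w1, predicts the observation. All variances are set to 1.
u = 5.0                     # observation at the bottom of the hierarchy
w1, w2 = 2.0, 1.0           # weights linking neighbouring levels
top_prior = 3.0             # Prior mean for the top level

phi1, phi2 = 0.0, 0.0       # estimates at level 1 (lower) and level 2 (upper)
step = 0.05

for _ in range(500):
    eps2 = phi2 - top_prior            # error at the top level
    eps1 = phi1 - w2 * phi2            # level 1 vs. what level 2 predicts
    eps0 = u - w1 * phi1               # observation vs. what level 1 predicts
    phi1 += step * (w1 * eps0 - eps1)  # pulled by the data and by level 2
    phi2 += step * (w2 * eps1 - eps2)  # pulled by level 1 and by its own Prior

print(phi1, phi2)   # roughly 2.56 and 2.78: each level balances its two pulls
```

Each level only talks to its immediate neighbours through prediction errors, which is one reason people compare such networks to backpropagation, as discussed next.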

Interestingly, it turns out that such a predictive coding network is not too far from the modern neural networks trained via backpropagation. The basic intuition goes as follows. When we think about neural nets, we usually imagine this:

Image from http://cs231n.github.io/neural-networks-1/

But after the information flows from the input to the output layer, it then goes backwards, carrying the errors in order to update the weights for better predictions in the future (hence 'backpropagation'). If we explicitly add the error units, the neural network looks very similar to a predictive coding network.

There is a whole paper describing the similarities between predictive coding networks and neural nets, showing that backpropagation can be viewed as a special case of predictive coding under some conditions. Furthermore, some deep learning architectures directly take inspiration from predictive coding, to a greater or lesser extent.

All in all, I would encourage you to go through Rafal Bogacz's tutorial, in which the mathematical details of predictive coding are presented in a highly accessible manner (and with MATLAB code). I am grateful to MOCOCO students, Andrea Alamia, Vincent Moens and Alex Zénon, whose feedback significantly improved this post.