# Introduction to the Free-Energy Theory of Mind

One of the most important and profound questions about the mind is how it establishes correspondences between its own contents and the real world. Folk psychology usually assumes that the mind does so through a number of basic tasks (which may or may not describe well how a real brain functions); we can broadly separate them into the “input direction” (perception and learning), the “reasoning process” (offline conditional simulation of possible worlds), and the “output direction” (planning and control). Through these functions, an organism’s mind perceives the world, fitting its mind to the world, and emits actions, fitting the world to its mind.

In recent times, probabilistic theories of the mind have become popular in both computational cognitive science and theoretical neuroscience. The two schools of thought sometimes reference each other, but largely focus on different aspects of the probabilistic lens on the functioning of the human mind (and of embodied minds in general). Computational cognitive science’s probabilistic paradigm has focused more on nonparametric Bayesian generative models, stochastic generalizations of Turing-complete computation (probabilistic programming), behavioral evidence from lab experiments, and Monte Carlo inference. The neuroscientific *predictive processing* paradigm has focused more on variational inference, neurophysiological and neuroanatomical evidence, and information theory. The division is not strict, however: many CoCoSci papers reference neuroanatomical data, and predictive-processing researchers have run plenty of behavioral experiments.

Here I will mostly be focusing on the neuroscientific *predictive processing* account of the brain, including the specific forms of that account referred to as *radical* predictive processing and *action-oriented* predictive processing. I will be doing so mostly because they are somewhat more general, encompassing action and potentially memory as well as perception. I also find their information-theoretic (precision-optimizing) account important for theorizing about the limitations imposed upon, and opportunities afforded to, an embodied mind by its body and its environment. Book-length descriptions of these theories can be found in Jacob Hohwy’s *The Predictive Mind* and Andy Clark’s *Surfing Uncertainty*.

To begin, variational Bayes methods are about taking probabilistic inference and turning it into something more like a conventional supervised learning problem, with parameters, gradients, and an error signal. The parameters are the variational parameters of the approximate posterior and predictive distributions; the error signal is the *free-energy*, which exceeds the surprisal −log p(D) by exactly the Kullback-Leibler *divergence* of the approximate posterior *from* the true posterior, and so upper-bounds it; the gradients are those of the free-energy functional itself.
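Written out in the standard way, with q abbreviating the approximate posterior q(H; V), the free-energy and its bound look like this:

```latex
\begin{aligned}
F[q] &= \mathbb{E}_{q}\!\left[\log q(H; V) - \log p(D, H)\right] \\
     &= \underbrace{D_{\mathrm{KL}}\!\left(q(H; V)\,\|\,p(H \mid D)\right)}_{\ge\, 0} \;-\; \log p(D)
     \;\ge\; -\log p(D).
\end{aligned}
```

Since the divergence term is nonnegative and does not depend on the true posterior in any way we would need to compute, minimizing F over the variational parameters V both tightens the bound on the surprisal and pulls q toward p(H | D).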

To turn probabilistic inference into a supervised-learning problem, we start with three probability densities: p(H), p(D | H), and q(H; V). p(H) and p(D | H) form p(D, H), our *generative model* which specifies the problem to be solved via approximate inference. As a part of variational Bayes inference, we assume that the sensory signal D is sampled from p(D, H) for some value of H, and that our job is to infer probabilities over the corresponding H.

q(H; V) is our *recognition density*: an easy-to-evaluate probability density with lots of extra parameters V (variational parameters) which we can use to shape it into as close an approximation of the true posterior p(H | D) as possible. We can then weight the likelihood by our beliefs, forming p(D’ | H) q(H; V), and marginalize out H, leaving only a predictive recognition density q(D’; V) which approximates the true p(D’ | D).
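To make the “inference as supervised learning” framing concrete, here is a minimal sketch in plain Python. The tiny conjugate Gaussian model is an illustrative assumption of mine, not anything from the predictive-processing literature; its exact posterior is known, so we can check that gradient descent on the free-energy recovers it.

```python
import math

# Toy variational inference (illustrative model, assumed for this sketch):
#   prior       p(H)     = N(0, 1)
#   likelihood  p(D | H) = N(H, 1)
#   recognition q(H; V)  = N(m, s^2), variational parameters V = (m, s)
d = 2.0  # observed sensory datum D

def free_energy(m, s):
    """F = KL(q || p(H)) - E_q[log p(d | H)]  (complexity minus accuracy)."""
    complexity = 0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s)
    accuracy = -0.5 * math.log(2 * math.pi) - 0.5 * ((d - m) ** 2 + s ** 2)
    return complexity - accuracy

# The "supervised learning" loop: gradient descent on the variational
# parameters, with the free-energy as the error signal.
m, s = 0.0, 1.0
for _ in range(2000):
    grad_m = 2 * m - d           # dF/dm
    grad_s = 2 * s - 1.0 / s     # dF/ds
    m -= 0.01 * grad_m
    s -= 0.01 * grad_s

# Exact Bayes gives the posterior N(d/2, 1/2); the fitted q recovers it.
print(round(m, 3), round(s, 3))  # m -> 1.0, s -> 0.707
```

Nothing about the loop is specific to Gaussians: swap in any differentiable free-energy and the same parameters-gradients-error structure carries over, which is exactly the point of the variational reformulation.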

In the variational-Bayesian literature in theoretical neuroscience, often referred to as predictive processing, we tend to refer to the free-energy or divergence from the posterior or predictive distribution as the *prediction error* signal. It provides our supervision signal, and minimizing it trains the model to represent the world accurately.

The idea is that by performing variational inference, we accomplish several things:

- We get a clear measurement, the prediction error, of how closely our model actually matches the available input data. Since Bayesian models with incorrect modelling assumptions or incomplete hypothesis classes can still predict new data wrongly, even with exact inference, it’s helpful to know, even approximately, how close a fit our model actually has.
- We obtain an explicitly probabilistic representation of the world, which can “consider” both the immediate sensory signal and counterfactual queries about possible worlds. This is a representational advantage over Monte Carlo approximations to Bayesian inference, as well as over discriminative machine-learning models.
- We approximate the probabilistically optimal posterior distribution as a whole, rather than at one or more sampled points. This is a computational advantage over Monte Carlo approximations to Bayesian inference, and matches the ability of discriminative models to be evaluated at arbitrary points in the feature space.
- Computationally, there are good reasons to believe that approximating probability distributions is strictly easier than evaluating them exactly.
- The free-energy cost functional allows us to consider complexity and accuracy as separate, measurable components of how we rate an approximate model. The complexity, being a divergence, is always positive or zero, while the accuracy, being a log-probability, is always negative or zero (its negation is the *surprisal*). Free-energy is then complexity minus accuracy (or, complexity plus surprisal); minimizing it thus only allows increased complexity that buys greater accuracy.
- Note how this free-energy functional, since it trades off complexity and accuracy, also directly measures our model’s bias-variance tradeoff. When the accuracy is too low, the bias is reduced by increasing complexity; when the complexity is too high, we sacrifice some fit to the data to regularize our model toward the prior distribution.
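In symbols, this complexity/accuracy split is just a second way of writing the same free-energy functional:

```latex
F[q] = \underbrace{D_{\mathrm{KL}}\!\left(q(H; V)\,\|\,p(H)\right)}_{\text{complexity}\;\ge\;0}
\;-\; \underbrace{\mathbb{E}_{q}\!\left[\log p(D \mid H)\right]}_{\text{accuracy}\;\le\;0}
```

The first term penalizes beliefs that stray from the prior; the second rewards beliefs under which the data are probable.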

We can thus think of the prediction-error signal as providing a “tether” between the mind’s representations and the world itself: the smaller the prediction error, the more tightly the mind is tethered to the real world.

The free-energy theory of the brain then generalizes variational inference to an active, embodied setting via its theory of *active* inference: not only does perception minimize the free-energy cost functional, but so does *action* or *control*. The hidden state H is modelled as depending on an *action* sampled from the agent’s *control state* U as a function of time. Control states U then act as variational parameters just like V, minimizing the same free-energy cost functional, which can itself be computed from the sensory signals.

Since that functional is a kind of almost-sorta-kinda divergence between the recognition (approximate) posterior q(H; V) and the generative joint probability p(D, H), it can be minimized either by adjusting the variational density towards hypotheses supported by the data, or by emitting actions which are believed to “simplify” the world relative to the agent’s generative prior p(H). This “prior” p(H) can then be called a *target distribution*, representing an agent’s intended goal by measuring the relative desirability of possible worlds against each other, and also allowing for subjective (Bayesian) uncertainty *about* the desirability of states.
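The two routes to minimization can be shown in a toy simulation. Everything below (the scalar world state, the gains, the Gaussian noise, the name `h_star`) is a hypothetical setup of my own for illustration, far simpler than the full active-inference scheme; the point is only that perception and action descend the *same* prediction-error gradient from opposite sides.

```python
import random

random.seed(0)

# Active-inference sketch (hypothetical toy, not the full scheme):
# the world has hidden state h; the agent senses d = h + noise, holds a
# recognition belief with mean m, and carries a target prior p(H) = N(h_star, 1).
h_star = 5.0   # mean of the target ("prior") distribution: where the agent wants the world
h = 0.0        # true hidden state of the world
m = 0.0        # mean of the recognition density q(H; V)

for t in range(500):
    d = h + random.gauss(0.0, 0.1)        # noisy sensory sample
    # Perception: move the belief toward the data AND toward the target prior
    # (gradient descent on the free-energy with respect to m).
    m += 0.1 * ((d - m) + (h_star - m))
    # Action: emit a control that drags the world toward the agent's belief,
    # reducing the same sensory prediction error from the other side
    # (gradient descent on the free-energy with respect to u).
    u = 0.1 * (m - d)
    h += u

# Both the belief and the world end up near the target: h ~ m ~ h_star.
print(round(h, 1), round(m, 1))
```

Because the belief m is anchored to the target prior, the action that makes sensations match the belief also steers the world toward h_star: the agent fits the world to its mind and its mind to the world with a single cost functional.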