# Free-energy, reinforcement, and utility

So given this paradigm in which the agent minimizes their prediction error using both first-order (observable variable) and second-order (precision parameter) probabilistic inference, how do we account for the actions our outfielder needs to run and catch the fly ball? Traditional views in cognitive science, neuroscience, and artificial intelligence usually defer the direction of action to reinforcement learning or optimal control theory: either a reward signal is maximized or a pre-specified cost function is minimized.

By the Complete Class Theorem, every admissible decision rule for optimizing some function is equivalent to optimizing the expectation of that function under some Bayesian posterior probability distribution. We are thus guaranteed that we can interchange freely between distributions and cost functions.

We already have a pair of perfectly decent cost functionals: the variational free-energy, and the prediction error. Neither of them appears naively to prescribe anything about action. However, we should notice that the complexity component of the free-energy functional measures only the divergence between the prior and approximate-posterior densities with respect to the latent hypotheses, not with respect to observable variables. The total free-energy cost can actually be decreased in two ways:

- Increase the accuracy (decrease the surprisal) by adjusting the variational parameters V to more closely fit the recognition posterior q(H; V) to the generative likelihood p(D | H).
- Change the underlying hidden state H somehow, thus reducing the complexity without affecting the accuracy at all.

This insight gives rise to the theory of *active inference*, in which the hidden state H is modelled as depending on an *action* A or *control state* U sampled from the agent as a function of time. Here, actions A can be optimized just like variational parameters V, both minimizing the same free-energy cost functional, which itself can be computed directly from the sensory information.

Since that functional is a kind of almost-sorta-kinda divergence between the recognition (approximate) posterior q(H; V) and the generative joint probability p(D, H), it can then be minimized by *either* adjusting the variational density towards hypotheses supported by the data, *or* by emitting actions which are believed to “simplify” the world relative to the agent’s generative prior p(H). This “prior” p(H) can then be called a *target distribution* (or target density), representing an agent’s intended goal by measuring the relative desirability of possible worlds against each other, and also allowing for subjective (Bayesian) uncertainty *about* the desirability of states.
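To make the complexity/accuracy split concrete, here is a minimal numerical sketch for a two-hypothesis latent variable. The numbers and the helper name `free_energy` are illustrative assumptions, not anything taken from the sources discussed here. It computes the free energy as complexity (the divergence of q from the prior p(H)) minus accuracy (the expected log-likelihood under q), and checks the standard identity that the free energy equals the surprisal -log p(D) when q is the exact posterior.

```python
import math

def free_energy(q, prior, likelihood):
    """Variational free energy for a discrete latent H, given one observed D.

    complexity = KL(q || prior); accuracy = E_q[log p(D | H)].
    """
    complexity = sum(qh * math.log(qh / ph) for qh, ph in zip(q, prior) if qh > 0)
    accuracy = sum(qh * math.log(lh) for qh, lh in zip(q, likelihood) if qh > 0)
    return complexity - accuracy

# Illustrative (assumed) numbers: two hypotheses, one observed datum D.
prior = [0.5, 0.5]        # p(H)
likelihood = [0.9, 0.2]   # p(D | H) evaluated at the observed D

evidence = sum(p * l for p, l in zip(prior, likelihood))       # p(D)
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
surprisal = -math.log(evidence)

f_at_prior = free_energy(prior, prior, likelihood)          # q has not moved yet
f_at_posterior = free_energy(posterior, prior, likelihood)  # q fitted to the data

print(f_at_prior, f_at_posterior, surprisal)
```

Moving q from the prior to the posterior lowers the free energy until it bottoms out at the surprisal; any further reduction has to come from changing what is observed, which is where action enters.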

So how does the outfielder’s brain generate actions which lead to catching the ball? The idea is that when the brain engages in active inference, it begins by probabilistically modelling the causal trajectories in which the outfielder catches the ball, and then minimizes the prediction errors generated by comparing that model to the real world via perception. Perception then fits the recognition density to the generative likelihood, while control over the motor cortex and the body fits the recognition density to the target density.

The distinction between perception and control in active inference lies in which variables are considered observed, which ones held constant, and which ones updated based upon the observation. In perception, we observe an incoming sensory signal, hold the outgoing motor signals constant, and update our model of the world. In control, we “observe” a goal state or target distribution within our model of the world, hold the incoming sensory signal constant, and update the outgoing motor signal. All updates serve to minimize the prediction error (expressed as variational free energy or as Kullback-Leibler divergence) made by the model of the world about the incoming sensory signal.
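The alternation between perception and control can be sketched as two gradient steps on one shared cost. This is a toy under loudly stated assumptions: a one-dimensional world, quadratic (Gaussian) prediction errors, noiseless sensing, and an action that moves the hidden state directly. The function `simulate` and all its parameters are hypothetical names chosen for illustration.

```python
def simulate(goal=5.0, x0=0.0, steps=2000, lr=0.05, sigma_s=1.0, sigma_p=1.0):
    """Toy loop minimizing F = (s - mu)^2 / (2 sigma_s^2) + (mu - goal)^2 / (2 sigma_p^2)."""
    x = x0    # true hidden state of the world (e.g. the agent's position)
    mu = x0   # the agent's belief about that state
    for _ in range(steps):
        s = x  # assumed noiseless sensory signal

        # Perception: hold the motor output fixed, update the belief.
        dF_dmu = -(s - mu) / sigma_s**2 + (mu - goal) / sigma_p**2
        mu -= lr * dF_dmu

        # Control: hold the belief fixed, act so the sensation matches the prediction.
        # Assumes action shifts x one-for-one, so ds/da = 1.
        dF_ds = (s - mu) / sigma_s**2
        x -= lr * dF_ds
    return x, mu

final_x, final_mu = simulate()
print(final_x, final_mu)
```

In this sketch the belief is pulled towards the target by the prior term, and the world is then dragged along behind the belief by action, so both end up at the goal.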

Why should we think in terms of Kullback-Leibler control, in terms of minimizing prediction errors, rather than in terms of standard optimal-control theory or utility theory? Aren’t these all, at some level, equivalent ways to phrase action problems? Friston says no:

> [T]here are policies that can be specified by priors that cannot be specified by cost functions. … A policy or motion that is curl free is said to have detailed balance and can be expressed as the gradient of a Lyapunov or value function (Ao, 2004). The implication is that only prior beliefs can prescribe divergence-free motion of the sort required to walk or write. This sort of motion is also called solenoidal, like stirring a cup of coffee, and cannot be specified with a cost function, because every part of the trajectory is equally valuable.

Under the active inference formulation, real behaviors that were difficult to model under utility-theoretic or control-theoretic frameworks become relatively easy to describe in terms of expectations over events, formulated as target densities, which control fulfills by just minimizing prediction errors. Since we can form a target density over anything we can model probabilistically, which we normally take to be just about everything, we can re-use the same cognitive language for both describing the world and *prescribing* it.

This means that our agent can use hierarchical overhypotheses to model the environment abstractly, and we can prescribe goals in terms of those abstractions. As we become able to describe intertheoretic reductions and bridge laws in terms of probabilistic modelling, we’ll also become able to translate target distributions between theories. Translating a target distribution from one theory to another just involves sampling from the original target distribution and passing the sample to the bridge law — just like that, we’ve constructed a probabilistic program describing the target distribution in the new theory. Randomly sampling the free parameters of the new theory then allows us to marginalize over them in the model, thus making our bridged target distribution “fully Bayesian” about uncertainty in the different theories while retaining all goal information.
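The translation recipe can be written as a tiny probabilistic program. Everything concrete here is a hypothetical stand-in: the Gaussian target in the original theory, the linear `bridge_law`, and the distribution over its free parameter `theta` are assumptions chosen only to make the sketch runnable.

```python
import random

def sample_target_a(rng):
    """Hypothetical target density in the original theory (a Gaussian over one quantity)."""
    return rng.gauss(20.0, 1.0)

def bridge_law(a_value, theta):
    """Hypothetical bridge law mapping a quantity of theory A into theory B."""
    return theta * a_value

def sample_target_b(rng):
    """Sampler for the bridged target distribution in the new theory."""
    a_value = sample_target_a(rng)   # sample from the original target
    theta = rng.gauss(2.0, 0.1)      # sample the new theory's free parameter
    return bridge_law(a_value, theta)

rng = random.Random(0)
samples = [sample_target_b(rng) for _ in range(20000)]
mean_b = sum(samples) / len(samples)
print(mean_b)
```

Because `theta` is redrawn on every call, the collection of samples marginalizes over it: uncertainty about the new theory widens the bridged target without discarding the goal information carried over from the original one.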

By using distributions like the Boltzmann Distribution in which some arbitrary function is “mixed” with a prior probability density to “weight” that density, we can describe various sorts of goal functions:

- Confining an observable or causally latent variable to some region of its state-space (optimal control),
- Maximizing the value of some observable or causally latent variable over time (utility theory),
- Goals with nontrivial causal or conditional structure (“if X then Y, else Z”),
- Goals described by propositional rules, constructed mathematically by conditioning the perceptual posterior distribution on some proposition, or
- No goal at all (a uniform or maximum-entropy target distribution).

Using active inference for KL-control also handles random noise in sensory and motor signals more or less automatically: noise comes through as just a little more prediction error to be minimized.

What does it look like, mathematically, to form target distributions based on Boltzmann distributions, and how does it potentially relate to the embodied needs and desires of real human beings? We don’t have a thoroughly vetted, empirically well-supported theory yet, but we can begin to sketch one based on the contents of this paper. The sketch amounts to mixing a “passive” prior density with embodied reinforcement signals and a Friston-style precision parameter for those reinforcement signals.

We can consider a prior density p0(x) over states of the environment, which encodes purely empirical information with no regard for reinforcement signals. We can then write R(x) to be a reinforcement signal, given as a function of the state and defined by the way the organism is embodied (for instance, how nerves passing pleasure and pain signals are wired). We can then also treat β as a “rationality parameter” (really a “temperature”, or precision, parameter) which affects how strongly any given reinforcement signal affects the target distribution. We then write the resulting Boltzmann distribution:

$$p(x) = \frac{p_0(x)\, \exp(-\beta R(x))}{\sum_{x} p_0(x)\, \exp(-\beta R(x))}$$

We can imagine that a very positive β (normative precision, we can call it) imposes more *pull* on the agent, prescribing more rational and precise action to reach the target state. A very negative normative precision does the opposite: it *pushes* the agent strongly *away* from an anti-target state. As β falls in absolute value towards zero, the “normative force” in the Boltzmann distribution falls, and the distribution turns back into the purely empirical prior p0(x).
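The behaviour of the Boltzmann target density as β varies can be checked on a small discrete state space. This is a sketch with made-up numbers; `boltzmann_target` is an illustrative name. It follows the equation exactly as written, minus sign included, so a positive β moves probability mass towards states with low R(x) (R behaves like an energy or cost here). Reading R as a reward to be maximized would mean flipping that sign.

```python
import math

def boltzmann_target(p0, R, beta):
    """p(x) proportional to p0(x) * exp(-beta * R(x)) over a finite state space."""
    weights = [p * math.exp(-beta * r) for p, r in zip(p0, R)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]

# Assumed toy values: three states, an empirical prior, and a reinforcement signal.
p0 = [0.25, 0.25, 0.5]
R = [0.0, 1.0, 2.0]

neutral = boltzmann_target(p0, R, beta=0.0)    # no normative force
pulled = boltzmann_target(p0, R, beta=2.0)     # strongly positive precision
pushed = boltzmann_target(p0, R, beta=-2.0)    # strongly negative precision

print(neutral, pulled, pushed)
```

At β = 0 the target is just p0; at large positive β the mass concentrates on the state with the smallest R, and at large negative β on the state with the largest R.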

From here, we just imagine an agent who starts with some typical probabilistic beliefs p0(x) and p(x | s) about their non-normative sensory signals, along with probabilistic beliefs p(β | x) and p(R | x) about the normative ones. The agent then observes s, β, and R: the non-normative sensory signals, the normative precision, and the normative reward-signal itself. These are combined via the above equation to form some p(x | R, β, s), with x being the hidden distal causes of the sensory signal (including the normative one) rather than just the sensory signal itself.

The observed values of R (and how tightly they correlate with x and s) can then weight events as favorable and unfavorable, while observations of β determine the normative precision (how spread-out the target density is in the neighborhood of x).

With a target distribution having been updated, active inference can then use this p(x | R, β, s) as its “prior”. This target density “prior” will thus make predictions, which will generate prediction errors, which will be quashed through action. To include all sensory signals, we should modify the likelihood function to be p(s, R, β | x). Thus, only reward prediction errors (prediction errors about R(x) and β) will be able to update the (learned) target density: the agent’s behavior will display an is-ought gap between the effects of its merely perceptual signals and those of its normative signals.

Furthermore, as the agent acquires an accurate causal model of the distal causes behind its normative signals, it will learn what amounts to a utility function. It will minimize the reward prediction error efficiently, so as time goes on, the agent’s normative prediction errors will be resolved less through updating the target density and more through action. The agent will *learn what is good for it*, as long as the distal causal structure behind the normative signals remains the same. The agent will thus appear to behave according to a utility function, as shown in a modified version of equation (4) from “The Anatomy of Choice”.

We treat x as being an observed state, x’ as a possible future state, R(x) as a reward function over states, and u as a control variable from which actions are sampled. β is the normative precision as before. Active inference then specifies that an agent should act according to the following:

$$\log P(u \mid x) = \beta\, H(x' \mid x, u) + \beta \sum_{x'} \left[ \mathbb{E}_{x' \mid x, u}\left[ R(x') \right] \right]$$

In words, this means that the log-probability of taking a particular action is, up to a normalizing constant, the sum of two terms: the entropy of prospective states reachable via that action (giving an intrinsic bonus to exploratory behavior) and the action’s expected reward with prospective states marginalized out. The normative precision β scales both terms, so it sets how sharply action selection concentrates on the highest-scoring actions.
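A small discrete example shows how the two terms interact. Two readings are assumed here and are not spelled out in the text: the log-probability is taken to hold up to a normalizing constant (so actions are drawn from a softmax over the scores), and the sum over x’ of the expectation is read as a single expected reward under P(x’ | x, u). The transition table, reward values, and function names are invented for illustration.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def action_distribution(transitions, reward, beta):
    """Softmax over beta * (entropy of next states + expected reward).

    transitions[u] is P(x' | x, u) for the current state x, as a list over x'.
    """
    scores = []
    for next_state_dist in transitions:
        expected_reward = sum(p * r for p, r in zip(next_state_dist, reward))
        scores.append(beta * entropy(next_state_dist) + beta * expected_reward)
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]

# Assumed toy problem: three possible next states, three actions.
transitions = [
    [1.0, 0.0, 0.0],   # action 0: stay put, certain and unrewarded
    [0.0, 0.5, 0.5],   # action 1: uncertain outcome, reward half the time
    [0.0, 0.0, 1.0],   # action 2: certain reward
]
reward = [0.0, 0.0, 1.0]

indifferent = action_distribution(transitions, reward, beta=0.0)
decisive = action_distribution(transitions, reward, beta=1.0)
print(indifferent, decisive)
```

With these particular numbers the uncertain action scores highest at β = 1, because its entropy bonus (log 2) outweighs the half unit of expected reward it gives up; at β = 0 all actions are equally likely.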

Of course, in active inference with a Boltzmann target distribution, the agent learns the reward function and normative precision as well as how to predict states and emit useful actions. That makes this form of active inference closer to reinforcement learning than anything else. However, since this would be probabilistic reinforcement learning in a setting that typically uses hierarchical Bayes models, the agent doesn’t just learn a reward function: it learns a *context-sensitive* reward function. The hierarchical overhypotheses governing both the agent’s recognition and target densities allow reward signals and actions to be conditioned on situation and context, and to “fill in the blanks” when certain information is missing. We can thus instead write the equation as:

$$\log P(u \mid x) = \beta\, H(x' \mid x, u) + \beta \sum_{x'} \left[ \mathbb{E}_{x' \mid x, u}\left[ R_{R \mid x}(x') \right] \right]$$