Oleg Solopchuk
Mar 21, 2019

Free Energy, Action Value, and Curiosity

The essential feature of Active Inference (AI) is the value that guides action selection: a sum of reward and information gain. The exploration-exploitation tradeoff is an old dilemma, but AI proposes a twist: rewarding observations are assumed to be likely under the agent's innate beliefs. This brings reward into the same units as information gain (negative log probability), which supposedly makes their tradeoff natural. The appealing claim is that all of the above follows from a single assumption, the minimization of surprise, better known as the free energy principle. However, this is easier to read than to understand, as the exact formulation changed over time and the lexicon is somewhat overloaded. So let's try to dissect this intriguing action value, reiterating some points from this more technical (and less intuitive) tutorial.

Imagine a simple agent that believes its environment can be in one of 2 states s, and that each of these states can cause one of 3 possible observations o, with a certain probability p(o|s). If you like, the states could stand for food/no food, and the observations would be 3 possible activations of a photoreceptor. So the agent's model consists of beliefs p(s) on how probable each of these states is a priori, as well as the state-observation mapping p(o|s).
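To make this concrete, here is a minimal numpy sketch of such a toy model (the numbers and variable names are mine, not taken from the tutorial):

```python
import numpy as np

# Prior belief over the 2 hidden states (e.g. food / no food).
p_s = np.array([0.5, 0.5])

# State-observation mapping p(o|s): one row per state, one column per observation.
# Each row sums to 1, since every state must cause some observation.
p_o_given_s = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.1, 0.8]])

assert np.allclose(p_o_given_s.sum(axis=1), 1.0)
```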

Note: the state-observation mapping on the right is the probability of an observation given a particular state, or, read the other way, the likelihood of states given a particular observation. While both are denoted by p(o|s), the probability refers to rows, and the likelihood to columns.

When the agent observes something (say, o = 2), this sensory information updates the current belief about the underlying cause s. More formally, the sensory likelihood p(o|s) is used to update the prior p(s) to the posterior p(s|o). Practically, this boils down to 2 simple steps: multiplying the sources of evidence to get the joint probability p(o, s), and then normalizing it so that the resulting probabilities sum to 1 (i.e. to 100%).
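Continuing the sketch above, the two steps look like this (picking o = 2 as the example observation):

```python
# Exact Bayesian update for a single observation, say o = 2.
o = 2
joint = p_s * p_o_given_s[:, o]   # step 1: p(o, s) = p(s) p(o|s), for the observed o
evidence = joint.sum()            # marginal / model evidence p(o)
posterior = joint / evidence      # step 2: normalize to get p(s|o)
```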

The normalization constant, the 'marginal' or 'model evidence', scores how likely the current observation is under this particular model. So if we had another model, we could again sum over all states s to get another marginal. We'd then combine the marginals of the 2 models with priors over models to get posterior beliefs over models given the observation. Visually, this would look exactly like the picture above, but with states s replaced by models m. Without prior preferences over models, we could simply use the ratio of marginals, known as the Bayes Factor.
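In the toy sketch, comparing our model against a second, made-up mapping would amount to a ratio of evidences:

```python
# A second hypothetical model m2 with a different (made-up) state-observation mapping.
p_o_given_s_m2 = np.array([[0.4, 0.3, 0.3],
                           [0.3, 0.4, 0.3]])

evidence_m1 = (p_s * p_o_given_s[:, o]).sum()     # p(o | m1)
evidence_m2 = (p_s * p_o_given_s_m2[:, o]).sum()  # p(o | m2)
bayes_factor = evidence_m1 / evidence_m2          # > 1 means the data favour model 1
```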

The problem is that if there are too many states s, the summation required to normalize the joint into the posterior can be hard to compute. Thus, AI proposes that instead of directly normalizing the joint, we could search for some (normalized) distribution q(s) that is as close as possible to the joint as measured by some 'distance', and then proclaim it the posterior because it: 1) looks like the joint, 2) is normalized. One such possible 'distance' is called 'variational free energy', and so by minimizing it we simply make some arbitrary normalized distribution very similar to the joint, i.e. to the unnormalized posterior.

Variational Free Energy (FE) is the average log-ratio of an arbitrary distribution q(s) and the joint. If we expand the joint in the denominator as the posterior times the marginal, we can split FE into 2 terms using log(ab) = log(a) + log(b) and log(1/a) = -log(a). These terms are 1) the KL divergence between the approximation q(s) and the posterior p(s|o) [Sum q(s) log q(s)/p(s|o)], and 2) minus the log of the marginal (surprise). If the KL divergence term is minimized, the only thing left is surprise, and since KL cannot be negative, we can conclude that FE is an upper bound on surprise (i.e. FE = surprise + KL). Check here for the derivation.
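In the toy example this decomposition can be checked numerically; q(s) below is just an arbitrary normalized guess:

```python
# F(q) = E_q[ log q(s) - log p(o, s) ] = KL[ q(s) || p(s|o) ] - log p(o)
q_s = np.array([0.6, 0.4])                   # an arbitrary normalized distribution
F = np.sum(q_s * np.log(q_s / joint))        # joint = p(o, s) from the update above
kl = np.sum(q_s * np.log(q_s / posterior))   # KL between q(s) and the true posterior
surprise = -np.log(evidence)                 # -log p(o)
assert np.isclose(F, kl + surprise)          # free energy = surprise + non-negative KL
```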

While free energy is a useful trick for probabilistic perception, the main focus of AI is on action. This means that in addition to the hidden states, we also have to infer the policy (sequence of actions) π. Crucially, an agent has some prior beliefs about policies p(π), and the key idea is that if perception minimizes free energy, so should action. This means that the prior probability of a policy should be proportional to the (negative) free energy expected under that policy (G) [negative, because a smaller G should correspond to a higher probability of the corresponding policy]. This assumption nicely links the future and the present, since the free energy of the future (G) is packed inside the current free energy F. Thus, in addition to perception, another way to minimize F is to minimize G through action. As often said, perception and action are just 2 sides of the same coin.
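A minimal sketch of this link, assuming the usual softmax form for turning negative G values into a normalized policy prior (the G values below are made up):

```python
# Lower expected free energy G -> higher prior probability of the policy.
G_policies = np.array([2.0, 1.0, 3.5])                  # made-up G values, one per policy
p_pi = np.exp(-G_policies) / np.exp(-G_policies).sum()  # p(pi) = softmax(-G)
```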

A policy is a sequence of actions, and each action is represented as a multiplication of the state vector by a transition matrix. For example, action 1 shown above (matrix B1) does not change the belief about the state, so the prior beliefs at the next time step are the same as the posterior at the previous one. The equation for variational FE (the bound on surprise) shown at the bottom is the same as in the previous picture, except that we add the policies to both the joint (denominator) and the approximate posterior (numerator and expectation). Moreover, sensory inference becomes conditioned on a policy π (so we write s|π). We could assume, for simplicity, that the likelihood p(o|s,π) is the same for all policies and keep it as just p(o|s). Note: technically, q(s) at previous time steps is also conditioned on policies. You could imagine that instead of the split above, we'd have parallel chains of inference for each policy.
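Continuing the numerical sketch, one action step is just a matrix-vector product (B2 is a made-up second action that swaps the two states):

```python
# Transition matrices: B[s_next, s_previous].
B1 = np.eye(2)                  # action 1: leaves the state belief unchanged
B2 = np.array([[0.0, 1.0],
               [1.0, 0.0]])     # a made-up action that swaps the two states
prior_next = B2 @ posterior     # prior belief about the next state under this action
```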

However, the form the expected Free Energy G should take is not obvious, so let's try to derive it from first principles. Let's dissect one time step in the imagined future, simulating our prediction if we were to follow one specific action of one specific policy. The obvious problem with planning is that we haven't observed the sensory data yet. So instead, we could imagine inference for all possible observations in parallel, performing the same steps as before: multiplying likelihood and prior to get the joint, and normalizing by the marginal to get the posterior.

Since we don't know the future observation, we average the free energy of the future over every possible observation [i.e. for every state, we take the expectation wrt p(o|s)]. This way we have the joint q(o,s|π) both in the denominator and in the expectation before the log. The denominator can be split using log(ab) = log(a) + log(b) and log(1/a) = -log(a). This results in 2 separate terms: negative information gain [Sum q(o,s) log q(s)/p(s|o)] and entropy [- Sum q(o) log q(o)]. Information gain (i.e. KL divergence, i.e. mutual information) tells us how much our uncertainty about the state decreases once we know the observation (or vice versa). For example, the information gain between a switch and a light bulb is 1 bit, because the light can be either on or off, and the position of the switch is a great predictor of it. Similarly, the information gain between a screen and a keyboard can be much larger. Check here for a more detailed derivation.
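In the toy example this one-step decomposition can be computed directly; the sketch below assumes, as above, that the likelihood p(o|s) is the same for all policies:

```python
# One future step under a policy: q(o, s|pi) = p(o|s) q(s|pi).
q_s_pi = prior_next                           # predicted state belief under the policy
q_os = p_o_given_s * q_s_pi[:, None]          # joint over future observations and states
q_o = q_os.sum(axis=0)                        # predictive distribution over observations
q_s_given_o = q_os / q_o                      # 'future posterior' for every possible o

neg_info_gain = np.sum(q_os * np.log(q_s_pi[:, None] / q_s_given_o))
entropy = -np.sum(q_o * np.log(q_o))
G_naive = neg_info_gain + entropy             # the two terms discussed above
```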

If we were to literally use the same free energy as before (but averaging over unknown observations), we would get an expression with 2 terms. First, negative information gain (the difference between prior and posterior) is a form of directed exploration, also called epistemic value, and would encourage us to seek observations that are informative about the hidden states. In formal terms, information gain represents the mutual information between states and observations. Since it enters with a negative sign, the higher the information gain, the smaller the free energy. Second, the entropy of the predictive distribution over observations q(o|π), quantifying our uncertainty over the future observation. The problem is that while information gain would make an agent curious, minimizing the entropy term would have exactly the opposite effect, encouraging agents to sit in a dark room (and be certain of their next observations). But actually, this objective seems a bit short-sighted: if we are planning into an unknown future (potentially over a long time horizon), why should we rely only on a predictive q(o|π) which is based solely on the recent inference [i.e. the current q(s)]? Maybe we could use a predictive distribution based on all the experience we have had, potentially throughout our entire life. And maybe we could even use the average predictive distribution based on the evolution of our species. This is exactly what is proposed in Active Inference: we replace q(o|π) with this 'evolutionary prior', called p(o). And since we survived until now, it means that we should prefer to observe things that have a high probability under this distribution, things which we may call… rewarding. So while some agents would love to stay in a dark room (cave), we don't like it because that's not the natural environment for our species.

The (log) prior preferences encoded through p(o) are proportional to the utility of each observation; they are defined in advance by the modeler and are assumed to be shaped by evolution in real agents.

Now, after replacing q(o|π) with p(o), we are left with the negative information gain and the expected 'reward surprise', negative log p(o), both measured in the same units (e.g. bits if the logarithm is binary). Minimizing expected free energy would thus lead to maximizing information gain and reward, so policy selection relies on both exploration and exploitation.
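Putting this together for the toy example, with made-up prior preferences p(o):

```python
# G = -(information gain) + expected surprise under the prior preferences p(o).
p_o_pref = np.array([0.7, 0.2, 0.1])             # made-up 'evolutionary' preferences
expected_surprise = -np.sum(q_o * np.log(p_o_pref))
G = neg_info_gain + expected_surprise            # lower G: informative and 'rewarding'
```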

This is not the whole story though. Above, we assumed that the parameters of the agent's model, the state-observation mapping p(o|s) and the state transition mapping p(s_t|s_t-1), are known and fixed. But what if there is also uncertainty over these, and in addition to inferring the unknown state s the agent also wants to reduce its uncertainty about (i.e. learn) the model parameters? Let's focus on p(o|s) and call this mapping matrix A, in which case the correct (complete) way to write this distribution would be p(o|s, A). Having uncertainty over A means that we should have some prior beliefs p(A), which we want to update to posterior beliefs given the observation, p(A|o).

For each state (each row of A) we have 3 numbers that sum to 1 and reflect the probability of each of the 3 observations given that specific state. So we could represent these 3 numbers on a simplex, in which each vertex corresponds to an observation, and any point within corresponds to 3 numbers that sum to one. For example, the point in the middle would correspond to [1/3, 1/3, 1/3], and the point on the first vertex to [1, 0, 0]. We could now assign a probability to every point on this triangle, and a useful distribution for doing that is the Dirichlet distribution. It is parametrized by 3 numbers that intuitively represent counts of how many times we encountered each particular observation. Some example distributions are visualized on the right. Note: we have 1 Dirichlet prior for every s (every row of A).
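A small sketch of such Dirichlet priors using scipy (the pseudo-counts below are made up):

```python
from scipy.stats import dirichlet

# One vector of pseudo-counts per state, i.e. per row of A.
counts = np.array([[2.0, 1.0, 1.0],
                   [1.0, 1.0, 4.0]])

row0 = dirichlet(counts[0])
print(row0.mean())                  # expected p(o|s=0): just the normalized counts
print(row0.pdf([1/3, 1/3, 1/3]))    # density at the centre of the simplex
```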

This way, the joint distribution reads p(o, s, A) = p(o|s, A) p(s) p(A). While the exact calculation of the posterior p(A|o) is intractable, we could use the same trick as above: define an arbitrary normalized distribution q(A) and make it as close as possible to the unnormalized joint, as scored by variational free energy. If we were to calculate which q(A) minimizes free energy, it would turn out to have the same form as the prior p(A) (a Dirichlet distribution), in which the parameters (counts) are updated with the sum of q(s) at the corresponding observation over all time steps [check here for an excellent explanation]. Thus, in addition to inference and action, learning too is expressed as free energy minimization (shown on the left below).
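In the toy example, the update is a one-liner; the sketch below shows the simplest, single-time-step case:

```python
# After observing o, the counts in the corresponding column grow by the posterior q(s).
def update_counts(counts, q_s, o):
    updated = counts.copy()
    updated[:, o] += q_s        # add the current state belief to the column of o
    return updated

counts = update_counts(counts, posterior, o)              # one step of learning
A_expected = counts / counts.sum(axis=1, keepdims=True)   # current estimate of p(o|s)
```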

Now, we can also augment expected free energy G that drives action selection, with model parameters A. Following the same logic as before, in the denominator, we would replace the joint q(o, s, A) with the joint of the future p(o) p(s|o) p(A|s, o), and get an extra term — negative information gain regarding model parameters.
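One way to sketch this extra term in the toy example is to ask, for each possible future observation, how much the Dirichlet beliefs about A would change if that observation arrived, and average this under q(o|π). This is a conceptual sketch of the expected information gain about parameters, not the particular matrix-form approximation used in the Active Inference papers:

```python
from scipy.special import gammaln, digamma

def kl_dirichlet(a, b):
    # KL[ Dir(a) || Dir(b) ] for one row of pseudo-counts.
    a0, b0 = a.sum(), b.sum()
    return (gammaln(a0) - gammaln(a).sum()
            - gammaln(b0) + gammaln(b).sum()
            + np.sum((a - b) * (digamma(a) - digamma(a0))))

# For each possible o, hypothetically update the counts with the corresponding
# future posterior over states, and measure how far the beliefs about A would move.
novelty_per_o = [
    sum(kl_dirichlet(update_counts(counts, q_s_given_o[:, o_], o_)[s_], counts[s_])
        for s_ in range(2))
    for o_ in range(3)
]
expected_novelty = np.sum(q_o * np.array(novelty_per_o))   # parameter information gain
```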

This way, the agent has 2 different forms of curiosity: one about reducing uncertainty over the hidden state (salience), and the other about reducing uncertainty over model parameters (sensory novelty). While we focused on the state-observation mapping A, we could also add uncertainty over the state-transition matrix (called B), both into the variational free energy F and the expected free energy G. While the first would be used for learning the approximate posterior over parameters q(B), the second would result in yet another form of curiosity: transition novelty.

Functionals of curiosity. Every green connection represents the mutual information between observations and the corresponding variable. All of the above forms of curiosity except empowerment can be included as parts of the expected free energy optimized by actions under the free energy principle. Also, empowerment does not maximize information gain in the sense of 'learning a better model', as a good model of the environment is assumed before its calculation (see more on page 9 here).

This unification is quite appealing, as we start from the assumption of minimizing free energy in the future, and end up with an action value that includes utility and different forms of curiosity. In contrast, traditional approaches pick one of the curiosities shown above and plug it into the action value with some curiosity parameter. Simpler yet, others use the mismatch between the predicted and the actual observation (i.e. the worse my prediction, the more curious I should be about that part of the environment), along with corrections against the noisy TV problem: infinitely 'interesting' unpredictable noise. (Note: this overview is by no means complete, as there are other measures of curiosity, based not directly on mutual information but on auxiliary measures such as learning progress.)

Clearly, an active inference agent can only be as curious as its modeled q(o|π) (the predictive observations in the figure above) is representative of the real probability of observations, so curiosity could be quite imprecise early in learning. But once the agent starts to figure out the statistics of its surroundings, having principled motivation functionals should make learning faster than exploring via random actions alone. You can play with the code, read about the empirical results simulating the importance of different forms of curiosity, and check a more detailed review on curiosity. Thanks to you for reading, to Alex Zénon and Thomas Parr for feedback, and to Pierre-Yves Oudeyer and Karl Friston for discussions.