The Free-Energy Theory and Precision: can a neural network catch a fly ball?

By now we’ve seen a whole lot of terminology and some links. We should look at how this view of the mind addresses a real problem, preferably one involving both perception and action that we know real organisms can fluently solve.

Luckily, there’s baseball. The “outfielder problem” is a popular scenario to consider in cognitive science and cognitive neuroscience: how does an outfielder in baseball actually catch a fly ball after the batter hits it? There have been three major theories as to how this is possible:

  • Trajectory Projection (TP): The fielder calculates the trajectory of a ball the moment it is hit and simply runs to the spot where it will fall (of course, taking into account wind speed and barometric pressure).
  • Optical acceleration cancellation (OAC): The fielder watches the flight of the ball, constantly adjusting her position in response to what she sees. If it appears to be accelerating upward, she moves back. If it seems to be accelerating downward, she moves forward.
  • Linear optical trajectory (LOT): The fielder pays attention to the apparent angle formed by the ball, the point on the ground beneath the ball, and home plate, moving to keep this angle constant until she reaches the ball. In other words, she tries to move so that the ball appears to be moving in a straight line rather than a parabola.

Before just announcing which, if any, of these theories turned out to be correct when the experiments were done, let’s discuss the resources the outfielder has at hand when solving the problem. They have two eyes, two ears, two hands, two feet, and one brain. Their brain has been trained on general perception, planning, and motor-control problems before this one, and has seen at most dozens to hundreds (if that many) of examples close enough to this one to really count as training data. It has to solve the problem in real time, with limited working memory. What can they do?

It seems unreasonable to propose that the brain solves a bunch of differential equations in real time, let alone that it generalizes them to stochastic differential equations to account for perceptual uncertainty. That puts Trajectory Projection in a bind: while it appears the most “rational” way to solve the problem, the one whose representation comes closest to reconstructing the causal chain by which the ball reaches its destination, using it takes too much computing power, and its parameter space has too much uncertainty (too much entropy in the prior distributions over parameters).

There’s just no evidence that humans can passively, consciously estimate baseball trajectories with that much accuracy. Worse, there’s little evidence that doing so is actually useful: small inaccuracies in parameter estimations for a high-dimensional model like that can cause large divergences between the projected trajectories and the real ones.

What does our probabilistic paradigm propose the outfielder can do? It proposes that the brain’s chief tool is the ability to minimize prediction errors between sensory signals and internal models, including the ability to perform model selection, to employ hierarchical models, and to use active inference to fit the world to the model.

The stars of our show on the perception side will be model selection and the learning of precisions (scale parameters, or estimates of entropy). Model selection lets a probabilistic brain pick out a simple-enough model to fit the (relatively) sparse, rare data coming in from the senses. Fitting precision parameters in the resulting models helps that brain to pick out the most reliable (low-variance) causal processes and sensory signals it encounters.

We can view this gambit from a neuroscientific/signal-processing perspective, from a computational one, and from an information-theoretic one. Information-theoretically, minimizing the innate uncertainty of the brain’s model ensures that the player won’t have to look at the ball so often that they can’t run to catch it. Neuroscientifically, the variational-Bayesian paradigm proposes that post-synaptic gain is modulated by the precision of predictions, thus weighting the prediction-error signals that drive updating. Computationally, we have reason to believe the compute resources needed for approximate Bayesian inference are proportional to the divergence of the posterior from the prior, so minimizing the model’s entropy implicitly minimizes that divergence “ahead of time”.
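The precision-weighting story above can be sketched as a conjugate Gaussian update, in which the gain on a prediction error plays the role the text assigns to synaptic gain (a minimal sketch; all numbers are illustrative):

```python
def precision_weighted_update(prior_mean, prior_precision, obs, obs_precision):
    """Combine a prediction with a sensory observation; the prediction
    error is weighted by the relative precision (inverse variance) of
    the senses against the prior."""
    error = obs - prior_mean
    gain = obs_precision / (obs_precision + prior_precision)
    posterior_mean = prior_mean + gain * error
    posterior_precision = prior_precision + obs_precision
    return posterior_mean, posterior_precision

# Reliable signal: the same error drives a large belief update.
m_hi, _ = precision_weighted_update(0.0, 1.0, 1.0, 9.0)    # gain = 0.9
# Noisy signal: the error is mostly discounted.
m_lo, _ = precision_weighted_update(0.0, 1.0, 1.0, 1 / 9)  # gain = 0.1
```

The same arithmetic, run at every level of a hierarchical model, is what lets reliable signals dominate updating while unreliable ones are explained away.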

A further reduction in entropy is provided by hierarchical modelling, which supplies overhypotheses that, colloquially speaking, put the outfielder and baseball in context, and provide our first links to action. Task-specific plans of visual saccades, head movements, and running motions are formed, and task-irrelevant sensory signals (like the crowd) are neatly predicted away with low precision but high accuracy (that is, they’re mostly ignored).

Putting it together, what we’ve got is a notion of how attention (in the probabilistic paradigm, attention is precision-learning and joint-entropy minimization) frames the problem, marshals cognitive resources to solve it, and selectively activates the bodily and environmental resources most useful for solving the problem. What we need is a way to consider action.

The precise form of the “generative priors” which specify goals under active inference is still an open research question, as is the distinction between active inference and reinforcement learning for many tasks. Suffice it to say, though, that the “goal distributions” in active inference can come equipped with their own precision parameters and predicted outcomes, allowing for task-based endogenous attention.

So what does the experimental evidence end up saying? Well, it mostly seems to favor Optical Acceleration Cancellation. Task-based probabilistic models, hierarchical overhypotheses, and learned precisions are all deployed to solve the problem in the fast, cheap, online style characteristic of embodied cognition. Here, though, embodied cognition doesn’t refer to arbitrary evolved heuristics: it refers to the optimal use of available resources. Our eyes and visual cortices are bad at estimating angles. However, they’re good at observing relative movement (acceleration). So the simple, accurate way to catch the ball, based on sparse but reliable sensory signals, is just to run so that the ball remains in a relatively constant place in the visual field, and to prepare your hands to catch as it comes closer (appears larger).
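The OAC strategy can be sketched as an idealized simulation: no air resistance, fielder and ball on the same line, and a fielder who simply moves wherever holding the optical acceleration at zero (i.e., keeping tan(elevation angle) rising at a constant rate) demands. All parameters are illustrative, and the “perfect compliance” controller is an assumption for the sketch:

```python
def simulate_oac(vx=15.0, vy=20.0, x_f=50.0, g=9.8, dt=0.01, t_watch=0.5):
    """Idealized OAC: after a brief watching period, the fielder moves so
    that tan(elevation angle of the ball) keeps rising at its initially
    observed rate, which holds the optical acceleration at zero."""
    t_land = 2 * vy / g            # projectile flight time
    x_land = vx * t_land           # where the ball actually falls
    t = t_watch                    # fielder watches, then starts moving
    x_b = vx * t                   # ball's horizontal position
    y = vy * t - 0.5 * g * t * t   # ball's height
    rate = (y / (x_f - x_b)) / t   # observed growth rate of tan(angle)
    while True:
        t += dt
        y = vy * t - 0.5 * g * t * t
        if y <= 0:                 # ball has landed
            break
        x_b = vx * t
        # Move so tan(angle) = rate * t, i.e. zero optical acceleration.
        x_f = x_b + y / (rate * t)
    return x_f, x_land

final, landing = simulate_oac()
```

No trajectory is ever computed: the fielder converges on the landing point (here, after running more than ten metres back) purely by cancelling an optical signal online, which is the point of the embodied story above.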

Note the trick used by active inference here: the mutual information between perception and control has been directly maximized, subject to the probabilistically-encoded constraint that the outfielder must catch the ball. If we don’t go whole-hog on Friston’s philosophical claim that no reinforcement signals exist (which I don’t think the available evidence supports doing), then even that constraint just involves maximizing the multi-way mutual information between perception, control, and reinforcement signals.

This becomes especially plausible when we take into account that hierarchical overhypotheses further reduce the effective entropy of the problem, and an agent performing active inference can take advantage of their environment’s passive dynamics, encoded in their mind as empirical priors. Hierarchical overhypotheses screen off signals that co-occur in this task from the vast realm of possible signals that occur in all other tasks. Empirical priors over passive dynamics help to make bounded-rational action more tractable, by giving the outfielder the option to let the body or environment do the work (as in the pendulum motions involved in running with momentum) while task-based salience allocates scarce processing resources towards controlling the most important variables for catching the fly-ball.

Now we’re left with some big questions: given that the humans we’ve studied solve the outfielder problem this way, how can other models of cognition solve similar problems? How does a deep artificial neural network, for instance, handle tasks relating to intuitive physics?

Well, when it comes to predicting whether a tower of wooden blocks will fall over or not, deep artificial neural networks can achieve human-level performance. However, they require many more samples in order to learn properly; once they do, they don’t generalize as easily; and having done so, their performance is more fragile. Both a stochastic physics engine and deep convolutional neural networks can achieve “super-human” predictive performance, but only the stochastic physics engine replicates the important qualitative features of human cognition: generalization and marginalization. The neural net reaches “super-human” performance on its specific problem because, as far as experimenters can tell, gradient descent guides it towards maximum-a-posteriori predictions, with the large training set making up for imprecise input signals. In contrast, humans appear to use a fully probabilistic representation that marginalizes over unknown parameters when we consciously predict, which is why we suffer “visual illusions”: we predict wrongly a little more often because we learn from fewer samples and thus carry more genuine uncertainty.
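The MAP-versus-marginalization contrast can be made concrete with a toy Beta-Bernoulli example (the analogy to the tower task is loose, and the numbers are illustrative): with little data, the plug-in MAP prediction is overconfident, while the marginalized predictive hedges; with lots of data, the two converge.

```python
def map_predict(k, n):
    """Plug-in prediction using the MAP parameter estimate
    (uniform prior over the coin's bias)."""
    return k / n

def marginal_predict(k, n):
    """Posterior predictive, marginalizing over the unknown bias
    (Laplace's rule of succession)."""
    return (k + 1) / (n + 2)

# 3 successes in 3 trials: MAP says "certain", marginalization hedges.
small = (map_predict(3, 3), marginal_predict(3, 3))        # (1.0, 0.8)
# 900 in 1000: the two predictions nearly coincide.
large = (map_predict(900, 1000), marginal_predict(900, 1000))
```

On the essay's reading, the big-data regime is where the neural net lives, and the small-data regime is where the human outfielder (and tower-watcher) lives.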

What would be the universally optimal way to predict whether the blocks fall or don’t? The typical answer for universal, optimal prediction is Solomonoff Induction: form an optimal prior over all computer programs, and use it for ongoing sequence prediction. Whenever we observe the tower, all programs inconsistent with the observation are eliminated, and their probability mass is redistributed according to Bayes’ law. A number of theorems in algorithmic probability demonstrate that this semi-algorithm will concentrate its probability mass around correct predictions faster than any computable predictor which uses programs as its hypothesis space.
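The eliminate-and-renormalize loop can be shown with a finite, computable toy (real Solomonoff Induction is uncomputable; the “programs” and their description lengths below are stand-ins I've invented for illustration):

```python
from fractions import Fraction

# Toy hypothesis space: each "program" is (description length in bits,
# deterministic predictor of the next bit given the history so far).
programs = {
    "all_zeros": (2, lambda h: 0),
    "all_ones":  (2, lambda h: 1),
    "alternate": (3, lambda h: len(h) % 2),
    "copy_last": (4, lambda h: h[-1] if h else 0),
}

def solomonoff_step(posterior, history, observed_bit):
    """Bayes' law over programs: eliminate the ones inconsistent with
    the observation, redistribute their mass by renormalizing."""
    surviving = {name: w for name, w in posterior.items()
                 if programs[name][1](history) == observed_bit}
    total = sum(surviving.values())
    return {name: w / total for name, w in surviving.items()}

# Prior mass 2^-length, normalized over the finite class.
prior = {name: Fraction(1, 2 ** bits) for name, (bits, _) in programs.items()}
z = sum(prior.values())
posterior = {name: w / z for name, w in prior.items()}

history = []
for bit in [0, 1, 0, 1]:
    posterior = solomonoff_step(posterior, history, bit)
    history.append(bit)
```

After four observations only the alternating program survives, carrying all the probability mass; the same logic, over all computable programs, is what the theorems about the universal prior formalize.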

The problem is that when Solomonoff Induction reasons about programs, it leaves out parameters. A physics simulator, after all, has to start from some initial state, but the same physics engine can accommodate arbitrarily many parameters and an infinite range of possible values for each. Thus, when representing the physics engine as a program within the Solomonoff Measure, each possible setting for each possible parameter is considered its own hypothesis with its own probability mass.

While this sounds nicely general (a hypothesis class consisting of all computable hypotheses), it also makes an important “metaphysical” assumption: that the “ground truth” process generating the data is pseudorandom rather than truly random, and that what a human observer would consider noise (even cryptographically strong noise!) is actually useful data for Bayesian inference. Jürgen Schmidhuber, a major proponent of modeling the real world with Solomonoff Induction, has made no secret of this belief.

The ubiquity of effective theories in the experimental sciences makes that claim both implausible and not entirely desirable to affirm. How much do we really care about learning some random parameter, compared to the causal structure in which that parameter operates?

Without debating metaphysics, we can also make a slightly different assumption: the world may contain some real randomness, while also containing pseudorandom behavior which decompresses a small random seed into a large stream of apparently random observations. In algorithmic information theory, from which Solomonoff Induction comes, every string x can be generated via some program of size K(x), and the incompressible (or random) strings are those for which K(x) is not much smaller than length(x). In the same field, the Kolmogorov structure function allows us to separate a string x into its structure and its noise: K(x) = K(structure) + K(noise) + O(1). Put back in the terms of a physics engine, K(structure) is the length of the physics engine itself, and K(noise) is the number of bits specifying the parameters and possibly a random seed. The closer the data comes to being truly noisy, the larger the value of K(noise), whereas pseudorandom data will have a small K(noise). K(structure) can include the code size for a pseudorandom number generator, as well.

We could then perform Solomonoff Induction with the execution traces of probabilistic Turing machines as the hypothesis space, learning both the structured program and the random parameters behind observed data. The universal prior can then be rebuilt in a way that accounts for how many bits of noise a program uses to produce its output: fewer is better, giving a higher prior probability and a more concentrated likelihood function. This would also integrate well with real-world probabilistic programming.

Why bother? Second-order (precision) inference is already a real problem to handle, because the lower the precision of our hypotheses, on average, the higher the irreducible prediction error (probability density belonging to the areas of the support set where different causal structures make the same predictions). In general, the less precise our marginal distribution over the observation, the greater the irreducible prediction error, and the less precise our likelihood functions, the weaker our posterior inferences. A world of uniform distributions is a terrible place to live.
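The link between precision and irreducible prediction error can be checked numerically. For two equally likely Gaussian causes, the Bayes-optimal error rate is the mass each likelihood puts on the wrong side of the midpoint threshold, and it grows as the likelihoods lose precision (a minimal sketch):

```python
import math

def bayes_error(mu0, mu1, sigma):
    """Minimum achievable error when distinguishing two equally likely
    Gaussian causes with shared noise level sigma: the probability mass
    falling on the wrong side of the midpoint decision threshold."""
    z = abs(mu1 - mu0) / (2 * sigma)
    # Standard normal CDF evaluated at -z.
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

sharp = bayes_error(0.0, 1.0, 0.2)   # high-precision likelihoods
blurry = bayes_error(0.0, 1.0, 2.0)  # low-precision likelihoods
```

With sharp likelihoods the irreducible error is well under one percent; with blurry ones it approaches chance, which is the sense in which a world of near-uniform distributions is a terrible place to live.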

Furthermore, precision inference plausibly guides the formation and deployment of different conceptual ontologies and epistemologies within the mind. Concepts are deployed based on how well they fit the precision of sensory signals; ways of knowing based on how well they allow action to reduce the noise of sensory signals. We can hypothesize that simple forms of counting and logic emerge directly from this need to increase precision and reduce uncertainty: where certain rules for reasoning and action seem to apply universally, the brain can use them as overhypotheses for everything else.

What would that allow us to say about such cognitive strategies as hierarchical overhypotheses, or scientific reductionism? Hierarchical overhypotheses seem to have an information-theoretic advantage: using them yields a lower joint entropy than not using them. Likewise, reductionism and effective theories work by compressing and decompressing parameter spaces: an effective theory makes more precise predictions with a smaller, lower-entropy parameter space, while lower-level reduced theories increase the parameter entropy but arm scientists with experimental procedures that allow very precise measurement of parameters in some situations.