Should Probabilistic Inference be Thrown Under the Bus?
So, what’s Yann LeCun talking about when he says “he’s ready to throw Probability Theory under the bus”?
LeCun spoke these words to get a reaction out of Josh Tenenbaum, the other speaker in the room. Tenenbaum in 2011 wrote a paper, “How to Grow a Mind: Statistics, Structure, and Abstraction,” that argues for a Bayesian-motivated approach to achieving Artificial Intelligence.
This article attempts to explore whether such an approach makes sense and to understand LeCun’s sentiment better.
The problem with Probability Theory (more specifically, Bayesian inference) has to do with its efficacy in making predictions. Take a look at the following animated GIF:
It’s obvious that the distributions are different; unfortunately, the statistical measures are identical! Said differently, if the basis of your predictions is expectations calculated from probability distributions, then you can very easily be fooled.
The method used to create these distributions is very similar to the incremental method we find in Deep Learning: it combines random perturbation with simulated annealing. As an aside, if you want to fool a statistician, then a Deep Learning method is a very handy tool.
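The core of that trick is easy to sketch. The following is a minimal, illustrative version of the accept/reject idea behind Matejka and Fitzmaurice’s “Same Stats, Different Graphs” technique (the method behind the animation); the function names are my own, and a full version would also anneal the points toward a target shape, which is omitted here:

```python
import random
import statistics

def summary(points):
    """Mean and population std of each coordinate."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (statistics.mean(xs), statistics.mean(ys),
            statistics.pstdev(xs), statistics.pstdev(ys))

def perturb_keeping_stats(points, steps=20000, tol=0.02, scale=0.1, seed=0):
    """Randomly jitter points, rejecting any move that lets the
    summary statistics drift more than `tol` from the originals."""
    rng = random.Random(seed)
    pts = [list(p) for p in points]
    target = summary(pts)
    for _ in range(steps):
        i = rng.randrange(len(pts))
        old = pts[i][:]
        pts[i][0] += rng.gauss(0, scale)
        pts[i][1] += rng.gauss(0, scale)
        if any(abs(a - b) > tol for a, b in zip(summary(pts), target)):
            pts[i] = old  # reject: a summary statistic drifted too far
    return pts

# Start from a boring regular grid; end with a visually different cloud
# whose means and standard deviations are (within tolerance) unchanged.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
cloud = perturb_keeping_stats(grid)
```

After enough accepted moves the scatterplot looks nothing like the original, yet any analysis based only on these aggregate measures cannot tell the two apart.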
An interesting paper “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” published in 2015 shows how you can employ perturbation methods from Statistical Mechanics to essentially recreate a specific distribution starting from random noise. A reverse diffusion method is learned to take noise back into an original distribution.
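The forward half of that process can be sketched in a few lines. Assuming the standard Gaussian diffusion step used in that line of work, repeatedly mixing each sample with a little noise destroys any starting distribution and leaves something close to a standard normal; the hard part, which the paper’s model learns, is the reverse chain, omitted here:

```python
import math
import random

def diffuse(samples, beta=0.1, steps=50, seed=0):
    """Forward diffusion: each step applies
    x <- sqrt(1-beta)*x + sqrt(beta)*eps, eps ~ N(0,1),
    which preserves unit variance while washing out the original structure."""
    rng = random.Random(seed)
    xs = list(samples)
    for _ in range(steps):
        xs = [math.sqrt(1 - beta) * x + math.sqrt(beta) * rng.gauss(0, 1)
              for x in xs]
    return xs

# A sharply bimodal starting distribution (clusters at -2 and +2)...
start = [-2.0] * 1000 + [2.0] * 1000
# ...ends up statistically indistinguishable from plain Gaussian noise.
noised = diffuse(start)
mean = sum(noised) / len(noised)
var = sum((x - mean) ** 2 for x in noised) / len(noised)
```

Learning the reverse of this chain, one small denoising step at a time, is what lets the model carry noise back to the original distribution.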
Incremental perturbation is an extremely powerful tool, and its effects are indeed intractable to analyze with statistical methods. One significant aspect of perturbation methods is that they operate in a non-equilibrium regime, that is, very far from where the Central Limit Theorem holds. This should establish the idea in everyone’s mind that incremental perturbative methods are effective in ways that can elude statistical detection.
Unfortunately, creating artificial probability distributions is not the real problem. The real problem is that the entire practice of Bayesian inference, and its Information Theory relative, is surprisingly and fundamentally flawed in non-linear domains.
James Crutchfield of the Santa Fe Institute recently delivered an extremely interesting lecture in Singapore demonstrating these inductive flaws in non-linear systems (the link starts at the time Crutchfield makes his relevant remarks):
Equations from Shannon Entropy (and, by extension, Bayesian inference) that relate past and current probabilities to future predictions are essentially worthless for prediction in non-linear entangled systems. Here is the relevant paper (http://csc.ucdavis.edu/~cmg/papers/mdbsi.pdf ) to study. In it, the authors explore a Bayesian network performing inference over a simple three-variable connected network. Here are the results, which should make Bayesians question their 18th-century beliefs:
In summary, we don’t know anything about these non-linear systems other than the fact that they work extremely well. The ramification of Crutchfield’s discovery (which can be verified through simulation rather than by logical argument alone) is that probabilistic induction does not work in non-linear domains.
Reality is of course complex and non-linear; however, we’ve been fortunate enough to find small patches of reality where the effects of non-linearity can be glossed over by aggregate measures. Probabilistic induction thus works analogously to approximating a curve with piecewise linear segments. It’s a bit of a kluge, but it does work in some cases. However, it’s not a foolproof method that should be applied everywhere.
The question that must be asked by researchers working on predictive systems, however, is: can we do better? Can we use purely perturbative methods without the requirement of probabilistic induction? The problem with probabilistic induction is that it is a case of ‘premature optimization’. What I mean by this is that the mathematics exists to take uncertainty into account, so when we implement our predictive machines using this kind of math, we implicitly bake in an uncertainty-handling mechanism.
The brain likely isn’t using Monte Carlo sampling to estimate probabilities. So how then does the brain handle uncertainty? It does so in the same way that “optimistic transactions” handle uncertainty. It does so in the same way that any robust and scalable system handles failures. Any robust system assumes failures will happen and thus must have mechanisms to adapt. The brain performs compensation when it encounters something it does not expect. It learns how to correct itself through perturbative methods. That’s what Deep Learning systems also do, and it’s got nothing to do with calculating probabilities. It’s just a whole bunch of “infinitesimal” incremental adjustments.
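That last point can be made concrete. A plain stochastic-gradient loop, offered here as a toy stand-in for what Deep Learning training does, fits a model through nothing but a long sequence of tiny error-driven corrections; no probability is computed anywhere:

```python
def sgd_fit(data, lr=0.05, epochs=1000):
    """Fit y = w*x + b by tiny incremental corrections (SGD)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # how wrong we are right now
            w -= lr * err * x       # nudge each parameter a little...
            b -= lr * err           # ...in the direction that reduces the error
    return w, b

# Points on the line y = 2x + 1; the loop recovers slope and intercept
# purely through accumulated "infinitesimal" adjustments.
data = [(x / 2.0, 2 * (x / 2.0) + 1) for x in range(5)]
w, b = sgd_fit(data)
```

Each individual update is a small compensation for an unexpected error, exactly the kind of corrective adaptation described above.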
Perturbative systems can be a bit nasty; they are, after all, like Iterated Function Systems (IFS). Any system that iterates into itself or has memory is either a candidate for chaotic behavior or can be a universal machine. These systems are simply outside the domain of what’s analyzable by probabilistic methods. This is a fundamental fact that we should be ready to accept. Unfortunately, Bayesians seem to hold an unassailable belief that their methods work universally.
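The logistic map is the classic one-line example of such a system: a pure deterministic iteration, no noise anywhere, yet two trajectories starting a billionth apart diverge completely. (This illustrates the chaotic regime of iterated systems generally, not IFS fractals specifically.)

```python
def logistic(x, r=4.0):
    """One step of the logistic map x <- r*x*(1-x); chaotic at r = 4."""
    return r * x * (1 - x)

a, b = 0.2, 0.2 + 1e-9   # initial conditions differing by one part in 10^9
max_gap = 0.0
for _ in range(60):
    a, b = logistic(a), logistic(b)
    max_gap = max(max_gap, abs(a - b))
# max_gap grows to order 1: the tiny initial difference is amplified
# exponentially, so no distribution over initial conditions stays informative.
```

This sensitivity is why aggregate statistical descriptions lose their predictive grip on iterative systems.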
Here’s a paper by Henry Lin and Max Tegmark (see: https://arxiv.org/pdf/1606.06737v3.pdf ) that explores the pointwise mutual information of various languages:
Note the fall off with Markov processes. In short, if your prediction engine has no form of memory, then there’s simply no way that it can predict complex behavior.
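The falloff is easy to verify for the simplest memoryless case. For a symmetric two-state Markov chain with flip probability p, the mutual information between symbols d steps apart has a closed form (obtained by diagonalizing the transition matrix), and it decays exponentially in d, whereas Lin and Tegmark measure a much slower power-law decay for natural language. The helper name below is my own:

```python
import math

def markov_mi(p, d):
    """Mutual information (bits) between symbols d apart in a symmetric
    two-state Markov chain with flip probability p.
    Diagonalizing the transition matrix gives
    P(same symbol after d steps) = (1 + (1-2p)**d) / 2."""
    same = (1 + (1 - 2 * p) ** d) / 2
    diff = 1 - same
    mi = 0.0
    for q in (same, diff):
        if q > 0:
            # two cells each with joint prob q/2, product of marginals 1/4
            mi += q * math.log2(2 * q)
    return mi

# MI shrinks by roughly a constant factor (1-2p)^2 per extra step:
decay = [markov_mi(0.1, d) for d in range(1, 9)]
```

An exponential curve like this is the signature of a memoryless predictor; long-range (power-law) correlations in language are precisely what it cannot reproduce.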
However, I hear arguments that probabilistic induction (Bayes’ rule) applies in certain domains. What domains are these? Bernhard Schölkopf tells you exactly which domains it applies to (see: http://ml.dcs.shef.ac.uk/masamb/schoelkopf.pdf ). That is, domains where anti-causality is present:
Said very simply, you can predict Y because Y is the cause of X (your input). So in fact, even for linear systems, you have to be very careful about where you apply probabilistic induction. When we apply probabilistic induction to try to differentiate between the dinosaur, star, ellipse, and cross, we discover that we cannot. Why is that? Because the observed input (i.e. X) is not directly caused by its source (i.e. Y); Y was not the cause of the distribution X. Rather, there is another perturbative machinery in between that performs the obfuscation.
What if, however, you have the information that is used as input to this perturbative machinery? Can you then predict the machine’s inputs from the generated distribution? That would be an obvious yes!
A new paper explores the “unreliability of saliency methods”. Saliency is used in Deep Learning networks as a means of highlighting which inputs contribute most to the network’s predictions, and it has been proposed many times as a way to explain the behavior of a network. Interestingly enough, the paper shows that a simple transformation of the input (i.e. a constant shift) can cause the attribution to fail:
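The input-invariance failure the paper describes can be reproduced in a few lines. Take gradient×input attribution on a linear model: shift every input by a constant and absorb the shift into the bias, so the model’s output on the shifted data is untouched, yet the attributions change completely. (This is a toy reconstruction of the paper’s point, not the paper’s own experiment.)

```python
def predict(w, b, x):
    """Linear model f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def grad_times_input(w, x):
    """Gradient x input saliency; for a linear model the gradient is just w."""
    return [wi * xi for wi, xi in zip(w, x)]

w, b = [1.0, -2.0], 0.5
x = [3.0, 1.0]

shift = [10.0, 10.0]
x_shifted = [xi + si for xi, si in zip(x, shift)]
b_shifted = b - sum(wi * si for wi, si in zip(w, shift))  # absorb the shift

# The model's behavior is identical on the shifted problem...
same_prediction = predict(w, b, x) == predict(w, b_shifted, x_shifted)
# ...but the attributions it produces are completely different.
attr_before = grad_times_input(w, x)          # [3.0, -2.0]
attr_after = grad_times_input(w, x_shifted)   # [13.0, -22.0]
```

Two functionally equivalent models thus hand the saliency method entirely different “explanations” of the same prediction.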
This is indeed an interesting discovery and shows that our understanding of causality in Deep Learning networks is in its infancy. Demanding that ‘Bayesian inference’ or ‘probabilistic induction’ be the guiding principle behind these networks is an assumption that stands on little evidence. Probabilistic induction has never been a fundamental feature of nature and therefore should be used with caution when trying to explain complex systems. It is a sad state of affairs when the only tool in one’s arsenal is ‘Bayesian intuition’ and every complex problem must be framed in that one perspective.
I leave you with two quotes from Judea Pearl:
In retrospect, my greatest challenge was to break away from probabilistic thinking and accept, first, that people are not probability thinkers but cause-effect thinkers and, second, that causal thinking cannot be captured in the language of probability; it requires a formal language of its own.
The first quote reflects one reality: humans don’t use probabilistic thinking. The second quote is about the nature of probabilities and reality:
I now take causal relations as the fundamental building block both of physical reality and of human understanding of that reality, and I regard probabilistic relationships as but the surface phenomena of the causal machinery that underlies and propels our understanding of our world.
This simply reflects how physicists think about the relationship between thermodynamics and statistical mechanics. Statistical mechanics is actually a field in physics that is incorrectly named: StatMech doesn’t use most of the statistical methods used by statisticians. Rather, the approach employs probability in its most basic form, simply the counting of a system’s degrees of freedom as derived from the physical constraints imposed on its particles. The question that is asked is whether, beginning from the laws of physics and assuming a massive number of particles, the observations at the macroscopic scale (i.e. thermodynamics) can be derived. Can the microscopic phenomena predict the macroscopic phenomena?
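The flavor of that counting is easy to demonstrate with the textbook toy model of N two-state particles (coin flips): count the microstates per macrostate and take the Boltzmann entropy S = ln Ω. The balanced macrostate so overwhelmingly dominates that macroscopic regularity emerges from nothing but microscopic counting. (This is a standard illustration, not an example from the article.)

```python
import math

def microstates(n, k):
    """Number of microstates with k heads out of n coins: C(n, k)."""
    return math.comb(n, k)

def boltzmann_entropy(n, k):
    """S = ln(Omega), with Boltzmann's constant set to 1."""
    return math.log(microstates(n, k))

n = 100
# The balanced macrostate (50 heads) has vastly more microstates
# than a moderately skewed one (25 heads)...
ratio = microstates(n, 50) / microstates(n, 25)
# ...so entropy is maximized exactly at the even split.
entropies = [boltzmann_entropy(n, k) for k in range(n + 1)]
```

Nothing here resembles guessing a prior; the macroscopic prediction falls out of the constraint-driven count of degrees of freedom.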
In stark contrast, Bayesian inference and statistics attempt to discover knowledge by looking at the distributions. That is, by observing the macroscopic phenomena, can I gain new knowledge? The assumption that is made (i.e. the prior) is that one can guess the best prior to arrive at a posterior (i.e. prediction). In many real-world contexts, these kinds of systems are notoriously difficult to get correct. A recent arXiv paper, “Better together? Statistical learning in models made of modules”, explains why it is insanely difficult to get these probabilistic graphical models right:
In principle, given data, the conventional statistical update then allows for coherent uncertainty quantification and information propagation through and across the modules. However, misspecification of any module can contaminate the estimate and update of others, often in unpredictable ways. In various settings, particularly when certain modules are trusted more than others, practitioners have preferred to avoid learning with the full model in favor of approaches that restrict the information propagation between modules, for example by restricting propagation to only particular directions along the edges of the graph.
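A toy version of that contamination is straightforward. Suppose two data modules share a location parameter μ (flat prior, known unit variance), and one module is misspecified, its data shifted far from the truth. The full joint update is dragged off by the bad module, while a “cut” that restricts information flow to the trusted module stays near the truth. (An illustrative conjugate-normal sketch of the idea, not the paper’s model.)

```python
# Two modules share a mean parameter mu. With a flat prior and
# unit-variance Gaussian likelihood, the posterior mean is simply
# the sample mean of whatever data is allowed to flow in.
trusted = [0.1, -0.2, 0.05, 0.0, 0.15]   # well-specified module, true mu ~ 0
broken  = [5.1, 4.9, 5.0, 5.2, 4.8]      # misspecified module, biased by ~5

def posterior_mean(data):
    """Posterior mean of mu under a flat prior and unit-variance likelihood."""
    return sum(data) / len(data)

full_model = posterior_mean(trusted + broken)  # contaminated by the bad module
cut_model = posterior_mean(trusted)            # information propagation cut
```

The “coherent” full update is the one that gets the answer badly wrong, which is exactly why practitioners resort to cutting the graph.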
This is an approach where it’s all too easy to contaminate the results with human bias. In fact, the use of statistical approaches is so abused that a recent project that tried to replicate 100 psychology studies found fewer than half to be repeatable. The issue here is the questionable practice of p-hacking, which is actually only a mild form of the broader problem of introducing human bias into probabilistic calculations.
The cognitive bias that many seem to have is the belief that these measures are an explanation of a system rather than simply an effect of the system. To make it clear: don’t use Bayesian inference as a means to explain complex non-linear phenomena like cognition. Even worse, don’t use Bayesian methods as your mechanism for creating artificially intelligent machines. If you have simpler, less complex problems, feel free to use the appropriate tools. Just because your saw can cut wood doesn’t mean that it can cut titanium.
At the core of intelligence is the existence of feedback loops, which implies a non-linear system with cyclic dependencies. Bayesian inference is just a tool and not a fundamental feature of either reality or intelligent systems. As a tool, it has limits to its domains of applicability. Therefore we should be cautious about using this tool as the motivation for understanding complex systems. Artificial Intelligence has struggled for decades, and perhaps the breakthrough lies in re-examining and questioning our own research biases.
Editor’s Note: I removed commentary about this paper comparing machine learning methods because it is a distraction from the real conversation I want to focus on: “issues of using probabilistic inference in non-linear systems”.