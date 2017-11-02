Why Probability Theory Should be Thrown Under the Bus

So, what is Yann LeCun talking about when he says he’s ready to “throw Probability Theory under the bus”?

This article attempts to explore this sentiment.

The problem with Probability Theory has to do with its efficacy in making predictions. Take a look at the following animated GIF:

It’s obvious that the distributions are different; unfortunately, their summary statistics are identical! Said differently, if the basis of your predictions is expectations calculated from probability distributions, then you can very easily be fooled.
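To make the point concrete without the GIF, here is a minimal sketch using two tiny samples invented for illustration (they are not the datasets in the animation). Both have mean 0 and variance 1, yet one is two spikes and the other is a central peak with two outliers; only a higher moment tells them apart.

```python
import math
from statistics import fmean, pvariance

# Two illustrative samples (made up for this sketch, not taken from the GIF).
two_spikes = [-1.0, -1.0, 1.0, 1.0]
peak_with_tails = [-math.sqrt(2), 0.0, 0.0, math.sqrt(2)]

def fourth_moment(xs):
    """Mean of (x - mean)^4, a statistic that is sensitive to shape."""
    m = fmean(xs)
    return fmean((x - m) ** 4 for x in xs)

for name, xs in [("two_spikes", two_spikes), ("peak_with_tails", peak_with_tails)]:
    # Columns: mean, population variance, fourth central moment.
    print(name, round(fmean(xs), 6), round(pvariance(xs), 6), round(fourth_moment(xs), 6))
```

A predictor that only consumes the mean and variance treats the two samples as interchangeable, which is exactly the trap described above.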

The method used to create these distributions is very similar to the incremental method we find in Deep Learning: it combines small perturbations with simulated annealing. As an aside, if you want to fool a statistician, then a Deep Learning method is a very handy tool.
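The general recipe can be sketched in a few lines. This is an assumption-laden illustration, not the procedure that produced the GIF: the ring-shaped target, the step size, the linear cooling schedule and the names `summary`, `ring_error` and `morph` are all hypothetical choices. Each step nudges one point, rejects the move if any summary statistic changes at two decimal places, and otherwise accepts it if it brings the cloud closer to the target shape or, with a probability that shrinks over time, even if it does not.

```python
import math
import random

def summary(points):
    """Mean and standard deviation of x and y, rounded to 2 decimals."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sx = math.sqrt(sum((p[0] - mx) ** 2 for p in points) / n)
    sy = math.sqrt(sum((p[1] - my) ** 2 for p in points) / n)
    return tuple(round(v, 2) for v in (mx, my, sx, sy))

def ring_error(points, radius):
    """Average distance of the points from a circle centred at the origin."""
    return sum(abs(math.hypot(x, y) - radius) for x, y in points) / len(points)

def morph(points, radius=math.sqrt(2), steps=5000, start_temp=0.4, seed=0):
    """Perturb points towards a ring while keeping the rounded statistics fixed."""
    rng = random.Random(seed)
    points = list(points)
    target = summary(points)            # statistics that must not change
    error = ring_error(points, radius)
    for step in range(steps):
        temp = start_temp * (1 - step / steps)   # cools linearly to zero
        i = rng.randrange(len(points))
        old = points[i]
        points[i] = (old[0] + rng.gauss(0, 0.1), old[1] + rng.gauss(0, 0.1))
        new_error = ring_error(points, radius)
        keeps_stats = summary(points) == target
        # Annealing: always take improvements, sometimes take worse moves.
        acceptable = new_error < error or rng.random() < temp
        if keeps_stats and acceptable:
            error = new_error
        else:
            points[i] = old             # undo the perturbation
    return points

rng = random.Random(1)
cloud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
ring = morph(cloud)
print(summary(cloud), summary(ring))
print(round(ring_error(cloud, math.sqrt(2)), 3), round(ring_error(ring, math.sqrt(2)), 3))
```

By construction the rounded statistics of the result match those of the starting cloud, whatever shape the points end up in.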

An interesting paper, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”, published in 2015, shows how you can employ perturbation methods from Statistical Mechanics to essentially recreate a specific distribution starting from random noise. A reverse diffusion process is learned that takes noise back to the original distribution.
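Only the forward half of that idea is easy to show in a few lines. The sketch below repeatedly shrinks each sample slightly and adds a little Gaussian noise until a clearly two-cluster dataset is indistinguishable from standard noise. The step size `beta`, the number of steps and the toy data are assumptions made for illustration; the learned reverse process, which is the paper’s actual contribution, is not implemented here.

```python
import math
import random
from statistics import fmean, pvariance

def forward_diffusion(samples, beta=0.1, steps=100, seed=0):
    """Repeatedly shrink each sample a little and add a little Gaussian noise."""
    rng = random.Random(seed)
    keep = math.sqrt(1.0 - beta)    # fraction of the old value retained
    spread = math.sqrt(beta)        # scale of the fresh noise
    xs = list(samples)
    for _ in range(steps):
        xs = [keep * x + spread * rng.gauss(0.0, 1.0) for x in xs]
    return xs

# Structured starting data: two tight clusters at -2 and +2 (illustrative).
rng = random.Random(1)
data = [rng.choice((-2.0, 2.0)) + rng.gauss(0.0, 0.1) for _ in range(4000)]
noised = forward_diffusion(data)

# Mean and variance of the data after diffusion.
print(round(fmean(noised), 2), round(pvariance(noised), 2))
```

Each individual step is a tiny perturbation, which is what makes learning to undo the steps one at a time feasible.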

Modeling Complex Data by Reversing Time

Incremental perturbation is an extremely powerful tool, and it is indeed intractable to analyze using statistical methods. One significant aspect of perturbation methods is that they operate in a non-equilibrium regime, that is, very far from where the Central Limit Theorem holds. This should establish the idea in everyone’s mind that incremental perturbative methods are effective in ways that can elude statistical detection.

Unfortunately, creating artificial distributions is not the real problem. The real problem is that the entire practice of Bayes Theory, and its Information Theory relative, is fundamentally flawed in non-linear domains.

James Crutchfield of the Santa Fe Institute recently delivered an extremely interesting lecture in Singapore demonstrating these flaws in non-linear systems (the link starts at the time that Crutchfield makes his relevant remarks):

Equations from Shannon Entropy or Bayes Theory that relate past and current probabilities with future predictions are essentially worthless for predictions in non-linear entangled systems. Here is the relevant paper (http://csc.ucdavis.edu/~cmg/papers/mdbsi.pdf) to study. Here’s a figure in this paper that should prompt Bayesians to begin questioning their 18th century beliefs:

In summary, we don’t know anything about these non-linear systems other than the fact that they work extremely well. The ramification of Crutchfield’s discovery (which can be verified through simulation rather than resting on logical argument) is that probabilistic induction does not work in non-linear domains.
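One way to check this kind of claim by simulation is to build two three-variable systems with different internal wiring and compare every Shannon entropy you can compute from them. The construction below is a reconstruction of a “dyadic” and a “triadic” pair of the sort the linked paper discusses; treat the exact encoding, and the helper names `entropy` and `shannon_profile`, as assumptions of this sketch.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(outcomes, keep):
    """Shannon entropy in bits of the chosen variables; outcomes are equally likely."""
    counts = Counter(tuple(o[i] for i in keep) for o in outcomes)
    total = len(outcomes)
    return -sum(c / total * log2(c / total) for c in counts.values())

def shannon_profile(outcomes):
    """Entropy of every non-empty subset of the three variables."""
    subsets = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
    return {s: round(entropy(outcomes, s), 9) for s in subsets}

# Dyadic: three independent bits, each shared by exactly one pair of variables.
dyadic = [((a, b), (b, c), (c, a)) for a, b, c in product((0, 1), repeat=3)]

# Triadic: one bit d shared by all three, plus bits tied by a three-way XOR.
triadic = [((a, d), (b, d), (a ^ b, d)) for a, b, d in product((0, 1), repeat=3)]

print(shannon_profile(dyadic) == shannon_profile(triadic))
print(set(dyadic) == set(triadic))
```

Because mutual information and conditional mutual information are sums and differences of these subset entropies, matching profiles mean those measures cannot separate the two systems either, even though one is built from pairwise links and the other from three-way links.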

Reality is of course complex and non-linear; however, we’ve been fortunate enough to find small patches of reality where the effects of non-linearity can be glossed over by aggregate measures. So probabilistic induction works analogously to how one would approximate a curve with piecewise linear segments. It’s a bit of a kluge, but it does work in some cases. However, it’s not a foolproof method that should be applied everywhere.

The question, however, that must be asked by researchers working on predictive systems is: can we do better? Can we use purely perturbative methods without the requirement of probabilistic induction? The problem with probabilistic induction is that it is a case of ‘premature optimization’. What I mean by this is that the mathematics exists to take uncertainty into account. So when we implement our predictive machines using this kind of math, we implicitly bake in an uncertainty handling mechanism.

The brain likely isn’t using Monte Carlo sampling to estimate probabilities. So how then does the brain handle uncertainty? It does so in the same way that “optimistic transactions” handle uncertainty. It does so in the same way that any robust and scalable system handles failures. Any robust system assumes failures will happen and thus must have mechanisms to adapt. The brain performs compensation when it encounters something it does not expect. It learns how to correct itself through perturbative methods. That’s what Deep Learning systems also do, and it’s got nothing to do with calculating probabilities. It’s just a whole bunch of “infinitesimal” incremental adjustments.
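As a toy illustration of “compensate when surprised”, the sketch below learns a single number by making a small correction after every mismatch between what it expected and what it saw. No probabilities are estimated anywhere. The data, the learning rate and the function name `learn_slope` are assumptions chosen for the example.

```python
def learn_slope(pairs, rate=0.01, passes=500):
    """Fit y = w * x by nudging w a little after every surprise."""
    w = 0.0
    for _ in range(passes):
        for x, y in pairs:
            surprise = y - w * x        # what happened minus what was expected
            w += rate * surprise * x    # small compensating adjustment
    return w

# Observations generated by the rule y = 2 * x (illustrative data).
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(round(learn_slope(pairs), 4))
```

A deep network does the same thing with millions of weights at once, but the character of each update is the same: a tiny correction in the direction that reduces the last error.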

Perturbative systems can be a bit nasty; they are, after all, like Iterated Function Systems (IFS). Any system that feeds back into itself or has memory is a candidate either for chaotic behavior or for being a universal machine. These systems are simply out of the domain of what’s analyzable by probabilistic methods. This is a fundamental fact that we should be ready to accept. Unfortunately, Bayesians seem to have some unassailable belief system that demands that their methods work universally.
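The simplest stand-in for a system that iterates into itself is the logistic map, used here as an assumed example rather than a true IFS. Two runs that start a hair apart stay together for a few steps and then bear no resemblance to each other.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), feeding each output back in as input."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit

a = logistic_orbit(0.2, 100)
b = logistic_orbit(0.2 + 1e-10, 100)    # same system, imperceptibly different start
gaps = [abs(p - q) for p, q in zip(a, b)]

# Initial gap versus the largest gap seen over the run.
print(gaps[0], max(gaps))
```

Any measurement error in the starting state, however small, is amplified until the forecast is useless, and that is the behavior that makes such systems hard to handle with aggregate statistics.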

Here’s a paper by Max Tegmark et al. (see: https://arxiv.org/pdf/1606.06737v3.pdf) that explores the pointwise mutual information of various languages:

Note the fall off with Markov processes. In short, if your prediction engine has no form of memory, then there’s simply no way that it can predict complex behavior.
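The fall off can be reproduced exactly for the simplest possible case. The sketch below assumes a symmetric two-state Markov chain with a 10% chance of flipping at each step (both the chain and the number are illustrative) and computes the mutual information between two symbols as a function of how far apart they are.

```python
from math import log2

def mat_mul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mutual_information(flip, distance):
    """Bits shared by two symbols `distance` steps apart in a symmetric 2-state chain."""
    step = [[1 - flip, flip], [flip, 1 - flip]]
    power = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(distance):
        power = mat_mul(power, step)
    # The stationary distribution is uniform, so joint = 0.5 * transition probability.
    joint = [[0.5 * power[i][j] for j in range(2)] for i in range(2)]
    # Both marginals are 0.5, so the product of marginals is 0.25 everywhere.
    return sum(p * log2(p / 0.25) for row in joint for p in row if p > 0)

values = [mutual_information(0.1, d) for d in range(1, 21)]
ratios = [later / earlier for earlier, later in zip(values, values[1:])]
print([round(r, 3) for r in ratios[-3:]])
```

The ratio between successive values settles to a constant below one, which is the signature of exponential decay: every extra step of separation costs the same fraction of the remaining information. A process with long-range structure would show ratios creeping towards one instead.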

However, I hear arguments that probabilistic induction (Bayes rule) applies in certain domains. What domains are these? Bernhard Schölkopf tells you exactly what domains it applies to (see: http://ml.dcs.shef.ac.uk/masamb/schoelkopf.pdf). That is, domains where anti-causality is present:

Said very simply, you can predict Y because Y is the cause of X (your input). So in fact, even for linear systems, you have to be very careful as to where you apply probabilistic induction. So, when we apply our probabilistic induction to figure out if we can differentiate between the dinosaur, star, ellipse or cross, we discover that we cannot. Why is that? That’s because the input that is observed (i.e. X) is not directly caused by its source (i.e. Y). Y was not the cause of the distribution X. Rather, there is another perturbative machinery in between that performs the obfuscation.

What if, however, you have the information that is used as input to this perturbative machinery? Can you then predict the machine’s inputs from the generated distribution? Well, that would be an obvious yes!

I leave you with two quotes from Judea Pearl:

In retrospect, my greatest challenge was to break away from probabilistic thinking and accept, first, that people are not probability thinkers but cause-effect thinkers and, second, that causal thinking cannot be captured in the language of probability; it requires a formal language of its own.

This is one reality: humans don’t use probabilistic thinking. The second quote is about the nature of probabilities and reality:

I now take causal relations as the fundamental building block both of physical reality and of human understanding of that reality, and I regard probabilistic relationships as but the surface phenomena of the causal machinery that underlies and propels our understanding of our world.

This simply reflects how physicists think about the relationship between thermodynamics and statistical mechanics. The cognitive bias that many seem to have is the belief that the measures are an explanation of a system rather than simply an effect of it. To make it clear, don’t use Probability Theory as a means to explain complex non-linear phenomena like cognition. Even worse, don’t use probabilistic methods as your mechanism to create artificially intelligent machines. If you have simpler, less complex problems, feel free to use the appropriate tools. Just because your saw can cut wood doesn’t mean that it can cut titanium.

Editor’s Note: I removed commentary about this paper comparing machine learning methods because it is a distraction from the real conversation I want to focus on: “issues of using probabilistic inference in non-linear systems”.

BTW, here’s a new paper from Google: “A Bayesian Perspective on Generalization and Stochastic Gradient Descent”. This raises the question: why are smart people invoking spells like “Bayesian intuition” to obfuscate that they are actually just doing alchemy? I suspect that the use of the term Bayesian or Gaussian in papers is more to play to the sentiments of the orthodoxy, and comes at the expense of more precise but equivalent language.
