How Cargo Cult Bayesians encourage Deep Learning Alchemy
There is a struggle today for the hearts and minds of Artificial Intelligence. It’s a complex “Game of Thrones” conflict that involves many houses (or tribes) (see: “The Many Tribes of AI”). The two warring factions I focus on today are the practice of Cargo Cult science in the form of Bayesian statistics and the practice of alchemy in the form of experimental Deep Learning.
For the uninitiated, let’s talk about what Cargo Cult science means. Cargo Cult science is a phrase coined by Richard Feynman to describe work that has the outward form of science but does not rest on fundamentally sound first principles. Here is Richard Feynman’s original essay on “Cargo Cult Science”. If you’ve never read it before, it is a great and refreshing read. I read it in my youth while studying physics. I am unsure whether it is required reading for physicists, but a majority of physicists are well aware of the concept. Feynman writes:
In the South Seas there is a Cargo Cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas — he’s the controller — and they wait for the airplanes to land. They’re doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn’t work. No airplanes land. So I call these things Cargo Cult Science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.
The question that Feynman brings up is whether a specific practice of science is based on experimental evidence or merely looks like scientific inquiry while resting on questionable foundations. IMHO, Bayesian inference is one of those questionable forms of scientific inquiry. It has its roots in an 18th-century conjecture.
Judea Pearl pretty much summarizes the issues with Bayesian thinking in an article published in 2001. He writes:
I [Pearl] turned Bayesian in 1971, as soon as I began reading Savage’s monograph The Foundations of Statistical Inference [Savage, 1962]. The arguments were unassailable: (i) It is plain silly to ignore what we know, (ii) It is natural and useful to cast what we know in the language of probabilities, and (iii) If our subjective probabilities are erroneous, their impact will get washed out in due time, as the number of observations increases.
Thirty years later, I [Pearl] am still a devout Bayesian in the sense of (i), but I now doubt the wisdom of (ii) and I know that, in general, (iii) is false.
Marcus Hutter in “Open Problems in Universal Induction & Intelligence” writes:
Strictly speaking, a Bayesian needs to choose the hypothesis/model class before seeing the data, which seldom reflects scientific practice.
So to summarize: it is doubtful whether knowledge is best represented by probabilities. Erroneous priors are not, in general, corrected by more observations, and strictly speaking you are not allowed to look at the data to guide your choice of hypothesis class. Bayesian inference is loaded with so many issues that its use is highly questionable.
Yet Tenenbaum, in the 2011 paper “How to Grow a Mind: Statistics, Structure, and Abstraction”, explains the essence of Bayesian inference:
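One simple way a poorly chosen prior or hypothesis class can fail to be corrected by data (not necessarily the case Pearl or Hutter has in mind) is a prior that rules out the truth. The sketch below is a minimal Python illustration under assumed numbers: three candidate coin biases, hypothetical counts of 80 heads and 20 tails, and a made-up helper named `posterior`. A prior that assigns zero probability to the best hypothesis can never recover it, however much data arrives.

```python
import math

def posterior(prior, hypotheses, heads, tails):
    """Discrete Bayes update for a coin: posterior is proportional to prior * likelihood.

    Works in log space so that large counts do not underflow.
    """
    log_weights = []
    for p, h in zip(prior, hypotheses):
        if p == 0.0:
            # A hypothesis with zero prior mass stays at zero forever.
            log_weights.append(float("-inf"))
        else:
            log_weights.append(
                math.log(p) + heads * math.log(h) + tails * math.log(1.0 - h)
            )
    top = max(log_weights)
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical setup: candidate biases and observed counts are illustrative only.
hypotheses = [0.2, 0.5, 0.8]
heads, tails = 80, 20

open_prior = [1 / 3, 1 / 3, 1 / 3]   # every hypothesis is allowed
closed_prior = [0.5, 0.5, 0.0]       # bias 0.8 was excluded before seeing data

post_open = posterior(open_prior, hypotheses, heads, tails)
post_closed = posterior(closed_prior, hypotheses, heads, tails)

print("open prior   ->", [round(x, 4) for x in post_open])
print("closed prior ->", [round(x, 4) for x in post_closed])
```

With the open prior, nearly all posterior mass lands on the bias 0.8. With the closed prior, the mass piles onto 0.5 instead, and collecting more flips only makes that wrong answer look more certain.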
At heart, the essence of Bayes rule is simply a tool for answering the question: How does abstract knowledge guide inference from incomplete data?
However, Bayesian inference offers no guidance on how to select an initial prior and has no mechanism describing how knowledge evolves given an initial prior. Under the covers, there is no engine to speak of. It’s like describing a car by observing only its body and wheels while completely ignoring the engine inside. That’s because statistical methods are only descriptive.
If this rule were indeed axiomatic, as its proponents contend, then what is the opinion of physicists on the matter? Physicists who are aware of the perils of Cargo Cult science should certainly be able to spot a questionable approach. The late David MacKay wrote a well-known book, “Information Theory, Inference, and Learning Algorithms”, in which he explores machine learning from the perspective of Information Theory. MacKay’s book should be required reading for every Deep Learning practitioner. MacKay was a physicist by training. He writes in his book:
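To make the dependence on the prior concrete, here is a minimal Python sketch using the textbook Beta-Bernoulli model (an illustrative assumption, not something taken from Tenenbaum’s paper). The function name `posterior_mean`, the two priors, and the data are all hypothetical choices. With only ten observations, two analysts reach very different conclusions from identical data, and Bayes’ rule itself does not say which prior was the right one to start with.

```python
def posterior_mean(prior_heads, prior_tails, heads, tails):
    """Posterior mean of a coin's bias under a Beta(prior_heads, prior_tails) prior.

    Conjugate update: the posterior is Beta(prior_heads + heads, prior_tails + tails).
    """
    a = prior_heads + heads
    b = prior_tails + tails
    return a / (a + b)

# Hypothetical data: 3 heads and 7 tails.
heads, tails = 3, 7

flat = posterior_mean(1, 1, heads, tails)        # uniform prior
confident = posterior_mean(20, 2, heads, tails)  # prior that strongly expects heads

print(f"flat prior:      {flat:.3f}")
print(f"confident prior: {confident:.3f}")
```

In this simple conjugate model the gap narrows as the sample grows, which is the behavior Pearl’s point (iii) describes; his claim is that it cannot be relied on in general.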
In this book it will from time to time be taken for granted that a Bayesian approach makes sense, but the reader is warned that this is not yet a globally held view — the field of statistics was dominated for most of the 20th century by non-Bayesian methods in which probabilities are allowed to describe only random variables. The big difference between the two approaches is that Bayesians also use probabilities to describe inferences.
He then devotes an entire chapter to “Bayesian Inference and Sampling Theory”. Here he writes:
This chapter is only provided for those readers who are curious about the sampling theory / Bayesian methods debate. If you find any of this chapter tough to understand, please skip it. There is no point trying to understand the debate. Just use Bayesian methods — they are much easier to understand than the debate itself!
The only people who understand Bayesian inference are the Bayesians themselves. The only way to understand them is to drink their Kool-Aid. All arguments are dismissed because you don’t understand what Bayesian means.
The statistical community has a habit of making arguments on the basis of obscurity. Here’s a 2014 speech by John Rauser that highlights the problem:
The practice of statistics is in fact closer to alchemy than to science. Take a look at this ridiculous taxonomy of univariate distributions:
The method of argument in statistics is to throw in some combination of the distributions above and use them as your assumptions (i.e. priors) on the way to a conclusion. It’s alchemy disguising itself in the language of mathematics. It is not enough to give names to different kinds of distributions and mix them all up in the cauldron of Bayesian inference to arrive at a conclusion.
It is nonsensical to those who grew up understanding computation. How is this practice any different from the multitude of theories proposed by linguists to understand language? I guess Fred Jelinek was on to something fundamental when he remarked:
Every time I fire a linguist, the performance of our speech recognition system goes up.
Perhaps there is an equivalent to this in deep learning? “Every time you fire a statistician or Bayesian, the performance of your deep learning system goes up.” ;-) The insinuation of Jelinek’s quote is that premature ideas about how a complex system works can be detrimental to its performance. We know this in computer science as premature optimization: if we prematurely optimize a subcomponent, it can become a performance bottleneck later.
The legendary Isaac Newton was in fact very involved in alchemy. Here’s an image of his manuscript on the subject of transmutation for gold:
Isaac Newton lived into the 18th century, overlapping with Thomas Bayes. When you don’t have a foundation of strong first principles, then everything is alchemy. It’s the human mind’s natural state to keep on making up stuff just because we observe patterns often enough. Repeating falsehoods often enough doesn’t make them true, yet humans are susceptible to this cognitive bias. (The last sentence looks awfully like Bayesian inference!) At best, Bayesian inference is a human heuristic that masquerades as seemingly logical mathematics.
Chemistry exists because we understand the first principles of how atoms can be combined (derivable from quantum physics). The first incarnation of the periodic table of elements actually came before quantum physics. It was derived experimentally, and only decades later did physicists formulate an elegant explanation in terms of the configuration of electrons in the valence shell of an atom.
The real reason why others don’t understand Bayesian inference is that they recognize Cargo Cult science and can’t believe seemingly intelligent people steadfastly believe in it.
Well, Bayesian methods are a belief system. It is not very different from Occam’s razor, the belief that explanations must be simple. To be perfectly fair, physicists also have their own belief system. One of its tenets is that there exists a Grand Unified Theory. This was Albert Einstein’s goal all the way to the end of his life. The difference, of course, is that in the quest for knowledge, one’s belief system should remain only a motivation for that quest and not the explanation of everything.
There was a time, before the advent of Deep Learning, when Bayesians were the rulers of the Machine Learning field. Max Welling captures this in his essay “Are ML and Statistics Complementary?”. Welling writes the following:
Also, the previous “hype” in machine learning (before deep learning) was about nonparametric Bayesian methods, clearly a core domain of statistics. At the same time, there are cultural differences between the two fields: where statistics is more focussed on statistical inference, that is, explaining and testing properties of a population from which we see a random sample, machine learning is more concerned with making predictions, even if the prediction can not be explained very well (a.k.a. “a blackbox prediction”).
Former rulers of the ML community do come from a Bayesian background, and this explains why many papers in Deep Learning are explained from a Bayesian viewpoint. I’ve argued elsewhere why it is an incorrect viewpoint. However, like many things in human discourse, orthodox thinking is very difficult to dislodge. The old guard will fight to the death to preserve their mysterious way of thinking.
This old guard would like one to believe that all inquiry should be framed in Bayesian terms. They borrow or steal ideas from other methods of inquiry and regurgitate these as being of Bayesian origin. One clear example is the use of variational methods. These methods originate in statistical mechanics, yet they have been recast as originating from Bayesian thinking. Yann LeCun, in a Facebook post, documents the history of these methods. He writes:
the main concepts were inspired by statistical physics, not by Bayesian statistics, AFAIK, the authors were unaware of the Bayesian inference literature at of the time.
He writes this in the context of a paper written by Yarin Gal (a student of the prominent Bayesian Zoubin Ghahramani). LeCun writes that Gal miscredits several papers as being of Bayesian origin, a claim LeCun refutes on personal historical grounds. Bayesians are indeed colluding to extend their influence on the nascent Deep Learning field. Work using statistical physics and information theory approaches is being deconstructed and explained as Bayesian when the authors never subscribed to that belief system.
My perspective of Deep Learning is that it is an experimental science. Our experimental apparatus is the massive computation that we currently have at our disposal. These computer systems serve as a way for us to discover emergent predictive behavior that arises from homogeneous simple computational elements (i.e. artificial neural networks).
Vladimir Vapnik, who comes from a different (and more formal) machine learning discipline (see: SVM), has the following beliefs about machine learning in general:
Vapnik posited that ideas and intuitions come either from God or from the devil. The difference, he suggested is that God is clever, while the devil is not. … Vapnik suggested that the devil appeared always in the form of brute force. Further, while acknowledging the impressive performance of deep learning systems at solving practical problems, he suggested that big data and deep learning both have the flavor of brute force.
This idea that discoveries arrive through brute force (computation) emphasizes the current experimental nature of the Deep Learning field. Vapnik’s arguments are more a testament to his belief system and a gut feeling that the current lack of theory is problematic. The theories that exist out there are extremely weak, and a majority of them are posed in questionable Bayesian terms. There are of course alternative theories that originate from the field of Information Theory (Tali Tishby), Statistical Mechanics (Surya Ganguli) or even from Cosmology (Max Tegmark).
Theoretical progress in Deep Learning should not be hindered by historical baggage like Bayesian methods. There are many more advanced models of reality that come from other fields such as Complexity Science, Critical Phenomena, Non-equilibrium statistical mechanics, Chaos Theory and Cybernetics that I would like to see applied to the explanation of Deep Learning.
The problem with Bayesians is that they don’t understand the domain of applicability of their belief system. The paper “Statistical physics of inference: Thresholds and algorithms” goes into great detail on this question. You are more than welcome to pour your intellectual energies into its study. To summarize that paper, the answer to Bayesian applicability depends on how much prior information you have. Unfortunately, reality is not so kind as to provide one with perfect information.
It is a travesty that a majority of Deep Learning explanations are framed in a dubious and antiquated belief system. One would think that there’s a conspiracy going on that favors Bayesian theories over unfamiliar theories using unfamiliar vocabulary and mathematics. We need more powerful mathematical tooling to analyze discoveries in Deep Learning; otherwise, it will forever remain in its current state of alchemy.
Editor’s Note: For all those who keep complaining about this post, let me be perfectly clear: “Bayesian inference is a human heuristic”. It is not a fundamental theory; it is by design a subjective form of logic and is therefore used disingenuously in many places where it should not be. See Pearl ( http://ftp.cs.ucla.edu/pub/stat_ser/r284-reprint.pdf )
Additional Reading