Bayesian School vs. Classical Statistics School

Chier Hu
Published in Mr. Translator
6 min read · Apr 5, 2020


Bayes' theorem, and the Bayesian networks built on it, are probably among the most widely used tools in artificial intelligence. They are core methods of machine learning and appear almost everywhere, from simple spelling correction and spam filtering to image recognition and machine translation.
The inventor of Bayes' theorem was the English Reverend Thomas Bayes.

Bayes was not a professional mathematician.
After studying at the University of Edinburgh, he followed in his father’s footsteps and became a Presbyterian minister in England.
As a Protestant minister, he spent his life trying to prove the existence of God.
But in trying to prove it, Bayes ran into a thorny problem: he could not obtain complete information.
How to reason with unknown information and conditions troubled him.
During the day he faithfully served his God, conscientiously performed his teaching duties, and meditated; at night, by candlelight, he sat at his desk writing, calculating, and deducing arguments.
Unfortunately, he had not proved the existence of God by the time he died, but his research established a principle of probability and statistics.
The formula he devised became the golden key to handling this kind of uncertainty and, unexpectedly, a cornerstone of artificial intelligence.
So, what is the Bayesian formula?
How does it solve the problem of uncertainty?
The Bayesian formula goes like this:
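A standard way to write the formula, assuming the events D_1, ..., D_n are mutually exclusive and together cover all cases, and that x is the observed evidence (the index i and the count n are notation introduced here for illustration), is:

```latex
P(D_j \mid x) = \frac{P(x \mid D_j)\, P(D_j)}{\sum_{i=1}^{n} P(x \mid D_i)\, P(D_i)}
```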

Here P denotes probability: P(Dj), for example, is the probability that event Dj occurs.
Event Dj is a randomly selected sample of total event x.
How to understand this seemingly complex formula?
We use an example to illustrate how this formula solves the problem of uncertainty.

Suppose we have a radio, and a radio station sends out one of two fixed signals after midnight. We call them signal A and signal B.

One night, we tune in to this station and want to guess which signal, A or B, it will send tonight.
Simple: let’s just listen.

But suppose the station is far away, and the received signal contains a lot of noise.
What should we do?
We know how often each signal was sent in the past. Can we infer from that whether tonight’s signal is A or B?

Let’s define the signal received tonight as y, and the signal actually sent as x (x can only be A or B).
Then the probability that the signal sent is A, given what we received, can be written as P(x = A | y).
The probability that the signal sent is B can be written as P(x = B | y).
A logical decision rule is: if, given y, the probability that A was sent tonight is greater than the probability that B was sent, then we judge that tonight’s signal is A, and vice versa.
But the problem is that we do not know these two probabilities directly.
This is where Bayes' formula helps: it lets us calculate the probability we want indirectly, from other probabilities that we can estimate or obtain.
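Applied to the radio example, and assuming A and B are the only two possibilities, the formula can be written out as follows (the same expression holds for B with the roles of A and B swapped):

```latex
P(x = A \mid y) = \frac{P(y \mid x = A)\, P(x = A)}{P(y \mid x = A)\, P(x = A) + P(y \mid x = B)\, P(x = B)}
```

Since both posteriors share the same denominator, comparing them only requires comparing the numerators.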

If we know from past nights the probabilities P(x = A) and P(x = B) with which each signal is sent, we only need to compute P(y | x) for x = A and x = B, multiply each by the corresponding prior, and compare the two products. The larger one tells us whether to decide that tonight’s signal is A or B. P(y | x) is the probability of receiving y given that x was sent (technically called the likelihood).
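The decision rule for the radio example can be sketched in a few lines of Python. The article does not say how the noise behaves, so this sketch assumes a hypothetical channel: A is transmitted as +1.0, B as -1.0, and the noise is Gaussian. The priors and all names (`likelihood`, `posterior`, `decide`) are likewise illustrative assumptions.

```python
import math

# Hypothetical channel model (an assumption, not from the article):
# A is transmitted as +1.0, B as -1.0, plus zero-mean Gaussian noise.
SIGNAL_LEVEL = {"A": 1.0, "B": -1.0}
NOISE_STD = 1.0


def likelihood(y, x):
    """P(y | x): density of receiving value y when signal x was sent."""
    z = (y - SIGNAL_LEVEL[x]) / NOISE_STD
    return math.exp(-0.5 * z * z) / (NOISE_STD * math.sqrt(2.0 * math.pi))


def posterior(y, prior):
    """P(x | y) for every signal x, computed with Bayes' formula."""
    joint = {x: likelihood(y, x) * p for x, p in prior.items()}
    evidence = sum(joint.values())  # P(y), the shared denominator
    return {x: j / evidence for x, j in joint.items()}


def decide(y, prior):
    """Pick the signal with the largest posterior probability."""
    post = posterior(y, prior)
    return max(post, key=post.get)


# Assumed historical frequencies: A was sent on 70% of past nights.
prior = {"A": 0.7, "B": 0.3}
print(posterior(-0.1, prior))
print(decide(-0.1, prior))
```

With these assumed numbers, a slightly negative reading such as y = -0.1 is marginally more likely under B, yet the stronger prior for A tips the decision to A. With equal priors the same reading would be classified as B.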

This is a bit mind-bending; otherwise Bayes would not have become a universally recognized mathematician, and his famous formula could not have become a cornerstone of artificial intelligence today.
To put it simply, Bayes' formula calculates the posterior probability from the known prior probability, and the magnitude of the posterior provides a basis for decision-making.
After more than two centuries of development and refinement, Bayes' formula has grown into a whole body of theory and methods, with its own school in the field of probability and statistics.
In 1742, Bayes was elected a Fellow of the Royal Society for his outstanding research achievements. His two works, rendered here as “An Essay towards solving a Problem in the Doctrine of Chances” (published posthumously in 1763) and an introduction to the doctrine of chances, have been widely valued and influential since his death.
Artificial intelligence has long been divided into two camps: the symbolic school and the statistical school.

The symbolic school explores artificial intelligence with rules and logical reasoning at its core, while the statistical school pursues it through big data and probability and statistics, with Bayes' theorem at its core.
Let’s look at how Bayes helps us through a small example from machine translation.

Suppose we want to translate the English sentence “John loves Mary” into the French “Jean aime Marie”.
We write e for “John loves Mary”, and the candidate translation f under examination is “Jean aime Marie” (French).
We want to know how probable f is as a translation of e, that is, P(f | e). By Bayes' formula, P(f | e) is proportional to P(e | f) × P(f), so the quantity we need to compute is P(e | f).
To do so, we consider the possible ways in which e and f can align.

What is alignment?
An alignment specifies which French word each English word corresponds to.
English “John” corresponds to French “Jean”, English “loves” corresponds to French “aime”, and English “Mary” corresponds to French “Marie”. This is one of the possible alignments (the most plausible one).
Why align?
Because once the words are aligned, it is easy to calculate how large P(e | f) is under that alignment: it is P(John | Jean) × P(loves | aime) × P(Mary | Marie).
Then we go through all the alignments and sum the translation probabilities under each one to obtain P(e | f).
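The sum over alignments can be sketched as follows. The word-translation probabilities in `T` are made up for illustration, and the sketch assumes that each English word pairs with exactly one French word and that all alignments are equally likely; the article does not specify either point, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical word-translation probabilities t(english | french).
# The numbers are made up for illustration only.
T = {
    "Jean":  {"John": 0.90, "loves": 0.05, "Mary": 0.05},
    "aime":  {"John": 0.05, "loves": 0.90, "Mary": 0.05},
    "Marie": {"John": 0.05, "loves": 0.05, "Mary": 0.90},
}


def prob_under_alignment(e_words, f_words, alignment):
    """Product of t(e_j | f_{a_j}) for one alignment.

    alignment[j] is the index of the French word paired with e_words[j].
    """
    p = 1.0
    for e_word, f_index in zip(e_words, alignment):
        p *= T[f_words[f_index]].get(e_word, 0.0)
    return p


def all_alignments(e_words, f_words):
    """Every way of pairing each English word with one French word."""
    return product(range(len(f_words)), repeat=len(e_words))


def translation_prob(e_words, f_words):
    """P(e | f): sum over alignments, each weighted equally (assumption)."""
    alignments = list(all_alignments(e_words, f_words))
    total = sum(prob_under_alignment(e_words, f_words, a) for a in alignments)
    return total / len(alignments)


e = ["John", "loves", "Mary"]
f = ["Jean", "aime", "Marie"]
best = max(all_alignments(e, f), key=lambda a: prob_under_alignment(e, f, a))
print(best)  # index of the French word chosen for each English word
print(translation_prob(e, f))
```

With this toy table, the highest-scoring alignment pairs John with Jean, loves with aime, and Mary with Marie, matching the alignment described in the text. Enumerating every alignment grows exponentially with sentence length, so this brute-force loop is only suitable for tiny examples.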

This is the method of statistical machine translation. Because it is simple and can be computed automatically (without adding rules by hand), it quickly became the de facto standard for machine translation.
The core of statistical machine translation is the Bayesian method.
In 1988, Professor Judea Pearl, an Israeli-American computer scientist and philosopher, invented Bayesian networks on the basis of Bayes' theorem.
A Bayesian network is a probabilistic inference network for reasoning under uncertainty. It represents the joint probability distribution of a set of variables and provides a way to represent causal information.
We already know that Bayes' formula provides a tool for predicting the probability of events, but probability prediction alone does not capture causality, and causality is very difficult to analyze.
For example, many people carrying umbrellas on the street suggests that it may rain, but the umbrellas are not the cause of the rain.
Understanding causality matters when you want to change the outcome of an event: you cannot stop the rain by asking everyone not to carry an umbrella.
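The umbrella example can be sketched as a two-node network in which rain influences umbrella use. The probabilities below are invented for illustration, and the function names are hypothetical. The sketch contrasts observing umbrellas, which changes our belief about rain, with forcing umbrella use from outside, which does not.

```python
# Hypothetical numbers for a two-node network: Rain -> Umbrella.
P_RAIN = 0.3                                      # assumed P(rain)
P_UMBRELLA_GIVEN_RAIN = {True: 0.9, False: 0.2}   # assumed P(umbrella | rain)


def p_rain_observing(umbrella):
    """P(rain | umbrella observed), by Bayes' formula."""
    def joint(rain):
        p_r = P_RAIN if rain else 1.0 - P_RAIN
        p_u = P_UMBRELLA_GIVEN_RAIN[rain]
        return p_r * (p_u if umbrella else 1.0 - p_u)
    return joint(True) / (joint(True) + joint(False))


def p_rain_forcing(umbrella):
    """P(rain) when umbrella use is imposed from outside.

    Forcing the value cuts the Rain -> Umbrella arrow, so the forced
    value carries no information about rain and the prior is unchanged.
    """
    return P_RAIN


print(p_rain_observing(True))   # belief in rain after seeing umbrellas
print(p_rain_observing(False))  # belief in rain after seeing none
print(p_rain_forcing(False))    # belief in rain after banning umbrellas
```

Under these assumed numbers, seeing umbrellas raises the probability of rain above its prior of 0.3, while banning umbrellas leaves it at 0.3, because the arrow runs from rain to umbrellas and not the other way.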

Professor Pearl’s Bayesian networks provide a probabilistic method that includes causal analysis, enabling computers to work in complex, fuzzy, and uncertain environments. They are among the most effective theoretical models for representing and reasoning about uncertain knowledge.
Bayesian networks have been widely used in many fields, such as natural language processing, fault diagnosis, and speech recognition. For this work, Professor Pearl won the Turing Award in 2011, becoming another artificial intelligence scholar to receive it.
Today, the Internet of Things is entering one industry after another, and countless sensors in smart homes, self-driving cars, smartphones, and smart cities are generating billions of bytes of data all the time. Artificial intelligence algorithms based on statistical models such as Bayesian networks have become the cornerstone of these applications.
