Mathematics for Bayesian Networks — Part 2
Introduction to Bayes Theorem
Welcome back to Mathematics for Bayesian Networks!
In the previous part (Link: https://medium.com/@mohanarc/mathematics-for-bayesian-networks-part-1-52bdf24829cc), we introduced all the terminology involved in this series. Today we will talk about the star of the show: Bayes Theorem. While we won’t be looking at any algorithms that use Bayes Theorem in this particular article, trust me, this is one of the most important concepts you should be comfortable with to understand the workings of AI/ML algorithms in depth.
Probabilistic modelling
Let’s say you’re playing a badminton game with your friend in an indoor court and you guys toss a coin to decide who gets to serve first. Normally, the probability of landing a head with a fair coin can be assumed to be 0.5, but there are a bunch of hidden assumptions behind that number. To assign 0.5 as the probability of landing a head/tail, we have to ignore a number of factors that can impact our toss: for instance, the angle of the toss, the height of the toss, the engravings on the coin that affect its weight distribution, and so on. There is a lot of uncertainty involved in even a simple coin toss, which makes the process probabilistic, meaning you can keep tossing the coin and observe different outcomes.
Statistical modelling (which can be loosely defined as a set of probabilistic models) has two approaches:
- Frequentist inference (or Classical inference)
- Bayesian inference.
Frequentist vs Bayesian
Going back to the badminton match: you’ve talked to your friend about all the implications of ignoring the external factors that affect the coin toss, but your friend was adamant about assuming 0.5 as the probability of heads, so that’s that.
Your plan was to play 10 matches, so you tossed the coin 10 times and ended up with heads 8 out of the 10 times. Weird, right? Wasn’t the probability of landing heads 0.5?
What would statistics say about this observation?
- If you go by the frequentist approach, you will conclude that you’ve picked a slightly odd sample from the population of infinitely many repeated throws, and if you toss the coin 10 more times you might get a different result because you’ll pick a different sample.
- If you go by the Bayesian approach, you don’t have to bother about infinitely many throws. In Bayesian inference, there is no infinite set of possible samples, and the results of the toss are not the outcome of a random sampling process. “In Bayesian statistics, the probability of landing a head simply quantifies our belief before tossing the coin, that on tossing the coin it will land as a head.”
That last line was confusing! But I can’t explain it properly before we get into Bayes theorem, so let’s get started with the theorem for now. I promise we’ll come back to that sentence!
Bayes Theorem
The expression for Bayes theorem with the names of all the terms is shown below,
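Written out in plain text, the expression is:

p(A|data) = p(data|A) × p(A) / p(data)

Here, p(A) is the prior, p(data|A) is the likelihood, p(data) is the marginal, and p(A|data) is the posterior; these are the terms we’ll unpack one by one below.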
Bayes Theorem and Coin Toss
Back to your badminton match now. Let’s say you and your friend felt extra enthusiastic one night and you decided to note down the results of your coin toss. Before tossing you had a prior belief that the coin is fair, i.e.,
p(heads) = p(tails) = 0.5
Here, p(heads) is the probability of landing a head and p(tails) is the probability of landing a tail.
This prior belief is the same prior as the one labelled in Figure 1.
Now you start tossing the coin and your observed result is identical to the previous example, 8 heads for you out of 10 tosses (assume it was merely a coincidence). This is our data, the same data mentioned in the expression of Figure 1. Using Bayes theorem, we combine our prior belief with our observations to update our belief that the coin is fair (probability that the coin is fair). This updated entity is our posterior.
Notice how the terms are close to their literal meanings: the prior is the term we have before the update (prior means before), and the posterior is the term we get after updating with the data (posterior means after).
In Bayesian inference, we use Bayes’ theorem to estimate a probability distribution for the unknown parameters after we observe the data. Going back to the sentence “In Bayesian statistics, the probability of landing a head simply quantifies our belief before tossing the coin, that on tossing the coin it will land as a head.”, we can simplify it by saying that, using Bayes theorem, we can calculate the probability of the toss landing a head using our observation that the coin toss did indeed result in a head.
Still not clear? That’s alright, Bayes theorem does have a backward approach but it will start clearing up once we discuss all the terms and look at more examples.
Unboxing the components of the Bayes Theorem
Parameters
Parameters are just characteristics that interest us — for instance, when my desired outcome is the toss resulting in a head, the parameter will be just that. In Figure 1 and everywhere else in the article, the parameter is denoted by A.
Prior
In Figure 1, p(A) is the prior. It is our belief about the different values of our parameter/parameters before we have made our observations.
Let’s take an example. Someone you know is visiting the doctor to get themselves tested for dengue. They have a few symptoms (exhaustion, fever, etc.), so they just want to be safe by getting tested early. They’ve informed the doctor about a recent visit to a tropical destination, and they remember getting bitten by mosquitoes there. So, based on the doctor’s experience in handling dengue cases, the information about the recent travel, and the visible symptoms, the doctor says that there’s a 60% chance that the reports (note: observed data) will be positive for dengue. The reports aren’t out yet, but the doctor is experienced enough to comment on the possibility. This is the prior belief p(A) that the patient will test positive for dengue, which is, quite literally, the probability prior to data collection.
The prior is a valid probability distribution, and using Bayes’ rule, we update it based on the collected data.
Likelihood
In Figure 1, on the right side of the expression, the term p(data|A) is known as the likelihood term. Once we establish that the parameter in our model is A (or heads, in the case of the coin toss example), the likelihood term quantifies the probability of generating that particular sample of data. In Bayesian inference, we keep the data constant, vary the parameter A, and then calculate the posterior probability for different values of A. For example, in the coin toss example, our initial assumption was that p(heads) is 0.5, but what if p(heads) was 0.2 or 0.7 and we still landed heads in 8 out of 10 tosses?
Let me explain this with a simple example. We flip a coin twice, with the probability of landing a head being unknown and denoted by A. We want to calculate the probability of each possible number of heads. In Bayesian inference, we toss the coin a few times and note the outcomes to estimate a posterior belief over the different values of A. We know that,
p(heads)=A
So, p(tails)=1-A
Let X be the number of heads. There are 3 possible values of X after the two tosses:
- We get tails twice, so X=0
- We get one head and one tail, so X=1
- We get heads twice, so X=2
For X = 0, i.e., we land tails twice,
p(X=0|A) = p(tails, tails|A)
= p(tails|A) × p(tails|A) [Substituting p(tails|A) = 1 - A]
= (1-A)² … equation 1a
For X = 1, i.e., our outcome includes one head, either on the first or the second toss,
p(X=1|A) = p(heads, tails|A) + p(tails, heads|A) [head on 1st or 2nd toss]
= (p(tails|A) × p(heads|A)) + (p(heads|A) × p(tails|A)) [Both terms are the same]
= 2 × (p(heads|A) × p(tails|A)) [Substituting p(heads|A) = A, p(tails|A) = 1 - A]
= 2A(1-A) … equation 1b
For X = 2, i.e., we land heads twice,
p(X=2|A) = p(heads, heads|A)
= p(heads|A) × p(heads|A) [Substituting p(heads|A) = A]
= A² … equation 1c
We’ll assume that the probability of landing a head is limited to one of the 6 values: 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0. We compute the values for X=0, 1, and 2 for each of these values of A and put them in a table.
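If you’d like to check the numbers yourself, here’s a minimal Python sketch that evaluates equations 1a, 1b, and 1c for each of the six candidate values of A and prints the resulting table, with a row sum added as a sanity check:

```python
# Candidate values for A = p(heads)
candidate_A = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

print(f"{'A':>4} {'p(X=0|A)':>10} {'p(X=1|A)':>10} {'p(X=2|A)':>10} {'row sum':>8}")
for A in candidate_A:
    p_x0 = (1 - A) ** 2       # equation 1a: two tails
    p_x1 = 2 * A * (1 - A)    # equation 1b: one head, one tail (in either order)
    p_x2 = A ** 2             # equation 1c: two heads
    print(f"{A:4.1f} {p_x0:10.2f} {p_x1:10.2f} {p_x2:10.2f} {p_x0 + p_x1 + p_x2:8.2f}")
```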
Along each row, the value of A is fixed. These are the 3 possible outcomes of the two tosses for a particular value of A. Each row is a valid probability distribution, as the sum of the probabilities of all the possible outcomes is 1.
Along each column, we have the likelihood: we vary the value of A while keeping the outcome (data) fixed. Think of it this way: out of 10 tosses, 8 turn out to be heads. We assume that the probability of landing a head can take one of 6 different values ranging from 0.0 to 1.0 and evaluate the likelihood for each of them. This is the likelihood term p(data|A). The sum of a column is greater than 1, so the likelihood cannot be called a valid probability distribution, and we need something to normalise it so that our posterior turns out to be valid.
Marginal
We just talked about how the likelihood term in the numerator isn’t a valid probability distribution and how we need something in the denominator to normalise it. The denominator term p(data) is the marginal probability (Link to part 1 describing Marginal: https://medium.com/@mohanarc/mathematics-for-bayesian-networks-part-1-52bdf24829cc) that we’ve been waiting for!
We compute p(data) slightly differently depending on whether the parameter A is discrete or continuous.
For the discrete case, we use summation and the expression for p(data) is:
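p(data) = Σ_A p(data|A) × p(A)

where the sum runs over all the possible values of the parameter A.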
For the continuous case, we use integration and the expression for p(data) is:
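p(data) = ∫ p(data|A) × p(A) dA

where the integral runs over the entire range of the parameter A.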
In most real-life applications, the likelihood will have multiple parameters and the expression can look pretty complex, as shown below,
Discrete:
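p(data) = Σ_{A₁} Σ_{A₂} … Σ_{Aₙ} p(data|A₁, A₂, …, Aₙ) × p(A₁, A₂, …, Aₙ)

(one sum per parameter A₁, A₂, …, Aₙ)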
Continuous:
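p(data) = ∫ ∫ … ∫ p(data|A₁, A₂, …, Aₙ) × p(A₁, A₂, …, Aₙ) dA₁ dA₂ … dAₙ

(one integral per parameter)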
Especially in the continuous case, calculating this integral can get very complicated: as we increase the number of parameters, the number of integrals increases as well. After a point it becomes too complex to calculate, but we still need the marginal term, or else our posterior will be invalid! We’ll return to this topic once we cover a few more concepts.
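To make the discrete case concrete, here’s a minimal Python sketch for our running coin example, under two assumptions that aren’t spelled out above: the data is 8 heads out of 10 tosses with a binomial likelihood, and the prior is uniform over the six candidate values of A:

```python
from math import comb

# Candidate values for A = p(heads), as in the likelihood table
candidate_A = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

# Assumption: a uniform prior over the six candidate values
prior = {A: 1 / len(candidate_A) for A in candidate_A}

# Observed data: 8 heads out of 10 tosses, binomial likelihood p(data|A)
heads, tosses = 8, 10
likelihood = {A: comb(tosses, heads) * A**heads * (1 - A)**(tosses - heads)
              for A in candidate_A}

# Discrete marginal: p(data) = sum over A of p(data|A) * p(A)
p_data = sum(likelihood[A] * prior[A] for A in candidate_A)
print(f"p(data) = {p_data:.4f}")  # roughly 0.0723 with these assumptions
```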
Posterior
All of this effort just to calculate this guy, i.e., p(A|data).
Using Bayesian statistics, we can take the data we have and extrapolate backwards to comment on the parameters that were responsible for generating that data. For example, we can use the results of the coin toss to infer whether the coin was biased or not.
An alternative way of viewing the posterior is by describing the events in terms of cause and effect. Using this, the expression for the posterior can be written as,
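p(cause|effect) = p(effect|cause) × p(cause) / p(effect)

In other words, we observe the effect (the data) and reason backwards about the cause (the parameter).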
A more commonly used version of Bayes theorem interprets the posterior as combining information from past events with information from the observed data. The posterior usually has less uncertainty than the prior because we have updated it using the observed data. In this case the expression looks something like this:
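posterior ∝ likelihood × prior

Under the same assumptions as the sketch in the Marginal section (8 heads out of 10 tosses, a binomial likelihood, six candidate values of A, and a uniform prior), a minimal Python sketch of the full update looks like this:

```python
from math import comb

# Candidate values for A = p(heads), and a uniform prior over them (an assumption)
candidate_A = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
prior = {A: 1 / len(candidate_A) for A in candidate_A}

# Observed data: 8 heads out of 10 tosses, binomial likelihood p(data|A)
heads, tosses = 8, 10
likelihood = {A: comb(tosses, heads) * A**heads * (1 - A)**(tosses - heads)
              for A in candidate_A}

# Marginal p(data) normalises the product of prior and likelihood
p_data = sum(likelihood[A] * prior[A] for A in candidate_A)

# Posterior: p(A|data) = p(data|A) * p(A) / p(data)
posterior = {A: likelihood[A] * prior[A] / p_data for A in candidate_A}

for A, p in posterior.items():
    print(f"A = {A:.1f}  ->  p(A|data) = {p:.3f}")

# The posterior is a valid distribution: its probabilities sum to 1
print("sum of posterior probabilities =", round(sum(posterior.values()), 3))
```

Notice how the posterior puts most of its weight on A = 0.8: our belief about the coin has been updated by the observed 8 heads out of 10.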
We will continue with Bayes theorem in the next article as well. Coming up: derivations and examples to solidify the foundations!
References:
1) Classification of mathematical models: https://medium.com/engineering-approach/classification-of-mathematical-models-270a05fcac4f
2) Introduction to probability theory and its applications by William Feller: https://bitcoinwords.github.io/assets/papers/an-introduction-to-probability-theory-and-its-applications.pdf
3) A student’s guide to Bayesian statistics by Ben Lambert: https://ben-lambert.com/a-students-guide-to-bayesian-statistics/
4) Frequentist vs Bayesian approach in Linear regression: https://fse.studenttheses.ub.rug.nl/25314/1/bMATH_2021_KolkmanODJ.pdf
5) Mathematics for Bayesian Networks Part 1: https://medium.com/@mohanarc/mathematics-for-bayesian-networks-part-1-52bdf24829cc