Mathematics for Bayesian Networks — Part 1

Mohana Roy Chowdhury
9 min read · Sep 18, 2024


Introducing the Basic Terminologies in Probability and Statistics

“What’s the most you ever lost on a coin toss?” — No Country for Old Men

I have a confession to make: I find it super difficult to remember terminology. In real life I am sort of bad with names; combine that with working in a field so full of terminology and corporate jargon, and things can get a bit … complicated. If you're like me, then you've stumbled onto the right article, because I have listed some of the more important probability and stats terms you may come across while reading an AI/ML research paper, especially one dealing with Bayesian statistics in some form.

The objective of this article is to consolidate these terms with easy-to-remember explanations and examples. On a personal front, this probability and stats series will serve as a build-up for my upcoming article on Variational Autoencoders.

Mathematical Models

In simple terms, a mathematical model is just an abstract description of real-world phenomena or actual objects using mathematical operations, equations, variables, etc. That's all it is.

There are different types of mathematical models, and how you classify one depends on the complexity of the problem, the goal, the available information (parameters), the processes (operations), and so on. One of the most useful distinctions for our purposes is deterministic vs probabilistic modelling.

Deterministic vs Probabilistic

Deterministic models are based on precise inputs and outputs: for the same set of inputs, the outputs of a deterministic model will always be the same. A calculator app is an example. If you add 15 and 10 on your phone's calculator app, it will give you 25 even if you perform the addition 1,000 times (if it gives anything other than 25, you should probably install a new calculator app).

Probabilistic modelling is all about embracing the chaos and randomness around us. Weather forecast, match forecast, election results forecast — all probabilistic. If you want to understand the math powering generative AI, you need to get a good grasp of probabilistic modelling.
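Here is a minimal sketch of the contrast; the function names are mine, purely for illustration:

```python
import random

def deterministic_add(a, b):
    # Same inputs always produce the same output
    return a + b

def probabilistic_forecast():
    # Each call may produce a different output
    return random.choice(["sunny", "rainy", "cloudy"])

print(deterministic_add(15, 10))  # 25, every single time
print(probabilistic_forecast())   # may differ from run to run
```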

Common terminologies used in Probabilistic modelling

Random process:

Any process (or, formally, "phenomenon") where we are unaware of the outcomes ahead of time. For example, the coin toss at the beginning of a cricket match: you don't know if the result of the toss will be heads or tails before the toss, right? You don't, unless you are a psychic, and psychics don't exist … they don't, right? But you're most likely not a psychic, and that's why you are here reading about math, so there's that.

Random variables:

This is what it's all about: the whole point of probability theory is to describe the behaviour of these pesky little guys. A variable in programming can hold different numeric values; similarly, a random variable is a variable whose numeric value is determined by the outcome of a random process. In the context of our example, let's say we have a random variable X that equals 1 if the toss results in a head and 0 if it results in a tail.
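In code, this mapping from outcome to number might look like the following sketch (the function name is mine):

```python
import random

def toss_coin():
    # The random process: a fair coin toss
    outcome = random.choice(["heads", "tails"])
    # The random variable X: 1 for heads, 0 for tails
    return 1 if outcome == "heads" else 0

X = toss_coin()
print(X)  # 0 or 1, unknown before the toss
```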

Discrete and continuous random variable

Now this toss can result in only 2 outcomes, a head or a tail, so we have 2 discrete, countable outcomes of our random process of tossing a fair coin. So, in this case, our random variable X is a discrete random variable. If instead we are measuring something that can take any value within a range, like someone's height, we call X a continuous random variable.

Probability distribution:

Under this heading we have two types of distributions — discrete and continuous.

Discrete probability distributions:

Let's say you are super bored on a Friday night and you decide to play a game of predicting numbers with your roommate. The rules of the game are as follows:

- Your roommate picks a random number between 1 and 100 in their mind and doesn't tell you

- You pick a random number between 1 and 100 and share your selection with your roommate

- If you pick the same number as your roommate, then they buy you a beer, or a soda if you don’t drink, whatever you prefer…

- And assume this game is fair, i.e., you guys don’t have any favourite numbers and you’re equally likely to pick any of them

If this game sounds boring and rather lame, then that makes two of us. But for the sake of proving a point, let's continue. Since your roommate is equally likely to pick any of the numbers, the probability of any particular number being picked is 1/100. Now you may want to create a graph displaying this information.

[Figure: Discrete probability distribution]

We still have a finite number of options in this case, so this is an example of a discrete probability distribution.
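You can convince yourself of the 1/100 figure with a quick simulation; this sketch simply plays the game many times:

```python
import random

def play_round():
    # Both players independently pick a number from 1 to 100
    roommate = random.randint(1, 100)
    me = random.randint(1, 100)
    return me == roommate

trials = 100_000
wins = sum(play_round() for _ in range(trials))
print(wins / trials)  # close to 1/100 = 0.01
```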

Continuous probability distribution:

Now let's say you did win this silly little game of predicting a number, and your roommate takes you out for a beer. You head down to the nearest beer shop, and your roommate, who is picking out a beer for you, suddenly turns around with a can of lager (1 pint, to be precise) and asks, "Guess how much this costs?". What will you guess? A pint of beer sold in cans ranges between Rs 100 and Rs 500 depending on brand, manufacturing location, ABV, hops, etc. If I start including alcohol tax rates on top of that, giving an exact number is going to be very difficult. Now your roommate makes it even more complicated and says you have to predict the price in Rupees and Paise, and you give up: there are infinite options to consider and you can't win this one. If you want to put this in a graph, with all the different possible values, the graph will look like the one below. This is an example of a continuous probability distribution.

[Figure: Continuous probability distribution]
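To make the example concrete, assume (purely for illustration) that the price is uniformly distributed between Rs 100 and Rs 500; the density is then flat at 1/400 per Rupee:

```python
from scipy.stats import uniform

# Illustrative assumption: price uniform on [100, 500]
price = uniform(loc=100, scale=400)  # scale = 500 - 100

print(price.pdf(250))                    # density: 1/400 = 0.0025
print(price.pdf(250) == price.pdf(450))  # flat density: True
print(price.cdf(500) - price.cdf(100))   # total probability: 1.0
```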

Discrete or continuous, we use probability distributions to represent our uncertainty about a particular event/outcome. In the discrete case, the sum of all the probabilities (of selecting each individual number) is 1. In the continuous case, we perform an integration instead of a summation.

A probability distribution is considered to be valid if (see the quick check after this list):

  • All the individual probabilities are non-negative
  • The sum of individual probabilities is 1 (or the area under the probability density curve is 1)
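A quick numeric check of both conditions, using the guessing game and the illustrative uniform price density from above:

```python
import numpy as np
from scipy.integrate import quad

# Discrete: the guessing game, 100 equally likely numbers
probs = np.full(100, 1 / 100)
print((probs >= 0).all())          # all non-negative: True
print(np.isclose(probs.sum(), 1))  # probabilities sum to 1: True

# Continuous: assumed uniform price density on [100, 500]
density = lambda x: 1 / 400
area, _ = quad(density, 100, 500)
print(np.isclose(area, 1))         # area under the curve is 1: True
```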

Probability densities:

For the discrete scenario mentioned earlier, if you want to compute the probability of a range of numbers, for example the probability that your roommate picks one of the first 5 numbers, you just have to add the probabilities of each of those numbers being picked.

P(1 ≤ X ≤ 5) = P(X=1) + P(X=2) + P(X=3) + P(X=4) + P(X=5) = 0.01 + 0.01 + 0.01 + 0.01 + 0.01 = 0.05

But for the continuous case, how do you calculate the probability that the price of the beer can is between Rs 100 and 200? You have an infinite number of prices to consider, and even assuming that each of these prices is equally probable doesn't help much in terms of calculation. This is where the concept of probability density comes in handy. We've already said that integration replaces summation in the continuous case. Similarly, to calculate the probability that the price of the beer can lies between Rs 100 and 200, we compute the integral over the desired range, which turns out to be the area under the curve between 100 and 200, as shown below.

[Figure: Probability density — area under the curve]
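Under the same illustrative uniform density, both range probabilities are easy to compute:

```python
from scipy.integrate import quad

# Discrete: P(1 <= X <= 5) is a plain sum of point probabilities
p_first_five = sum(1 / 100 for _ in range(5))
print(p_first_five)  # 0.05

# Continuous: P(100 <= price <= 200) is an integral of the density
density = lambda x: 1 / 400  # assumed uniform on [100, 500]
prob, _ = quad(density, 100, 200)
print(prob)  # 0.25, the area under the curve between 100 and 200
```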

Mean of a distribution:

The mean is a popular way of summarising a distribution. Also known as the expected value, the mean measures the centre of the distribution and is calculated as a weighted sum for discrete random variables and as an integral for continuous random variables:

E[X] = Σ x · Pr(X = x) (discrete)

E[X] = ∫ x · p(x) dx (continuous)

Here Pr denotes probability and p denotes probability density.

Using this, we can calculate the expected value for the game. If X is the number your roommate picks, each value from 1 to 100 has probability 1/100, so

E[X] = 1·(1/100) + 2·(1/100) + … + 100·(1/100) = 5050/100 = 50.5
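The same numbers, computed directly:

```python
import numpy as np
from scipy.integrate import quad

# Discrete: E[X] as a weighted sum over the guessing game
values = np.arange(1, 101)
probs = np.full(100, 1 / 100)
print(values @ probs)  # 50.5

# Continuous: E[X] as an integral, with the assumed uniform
# price density on [100, 500]
mean, _ = quad(lambda x: x * (1 / 400), 100, 500)
print(mean)  # 300.0
```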

Marginal probability:

My husband and I occasionally meet our friends for a game of badminton. We play separate singles matches but we’ve noticed that if I win a match, my husband also ends up winning a match later in the evening and vice versa. Based on our win-loss record, we created this probability distribution table,

Probability distribution for our badminton matches:

|        | He wins | He loses |
|--------|---------|----------|
| I win  | 0.5     | 0.1      |
| I lose | 0.1     | 0.3      |

From this table we can see that the probability of both of us losing is 0.3 and of both of us winning is 0.5; the probability that he loses and I land a win, or that I lose and he wins, is 0.1 each. The sum of the probabilities of all the outcomes = 0.3 + 0.1 + 0.1 + 0.5 = 1.

Let’s say we are playing a match with our friends after a long hard day and I just need a win no matter what. In this case, I will calculate the marginal probability distribution for my wins.

How do I calculate this?

Well, there are two ways I can win: either I win and my husband loses, or both of us win. The probability of me winning while my husband loses is 0.1. The probability of both of us winning is 0.5. So the marginal probability of me winning = 0.1 + 0.5 = 0.6.
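In code, marginalising is just summing the joint table over the variable you don't care about:

```python
import numpy as np

# Joint distribution from the badminton table
# rows: my result (win, lose); columns: husband's (win, lose)
joint = np.array([[0.5, 0.1],
                  [0.1, 0.3]])

# Marginal probability of my win: sum over my husband's results
p_i_win = joint[0].sum()
print(p_i_win)  # 0.6
```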

Note: An alternative way of doing this would be using Venn diagrams.

[Figure: Venn diagram approach for marginal probability. Image credits: A Student's Guide to Bayesian Statistics (Reference 1)]

Conditional probability and joint probability:

Formally speaking, this is the case where we have information about one variable and we want to use this to update our uncertainty about the other variable.

Going back to our badminton match example, let’s say that my husband wins the first match of the night and I want to use that information to calculate the probability that I win my match later.

How do we compute this?

The expression for computing the conditional probability of A given B is:

p(A|B) = p(A, B) / p(B)

In the equation above, p(A, B) refers to the joint probability that both A and B occur, and p(B) is the probability that B occurs regardless of A, which is the marginal probability term discussed earlier. Substituting the specifics of the badminton match, we get:

p(I win | he wins) = p(I win, he wins) / p(he wins) = 0.5 / 0.6 ≈ 0.83
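The same calculation from the joint table:

```python
import numpy as np

# rows: my result (win, lose); columns: husband's (win, lose)
joint = np.array([[0.5, 0.1],
                  [0.1, 0.3]])

# p(I win | he wins) = p(I win, he wins) / p(he wins)
p_he_wins = joint[:, 0].sum()  # marginal: 0.6
p_both_win = joint[0, 0]       # joint:    0.5
print(p_both_win / p_he_wins)  # ~0.83
```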

Dependent and Independent Events/Variables:

We call two variables dependent if there is a relationship between them, and the relationship doesn't necessarily have to be causal. A few examples:

- Let two variables be the colour and flavour of ice creams. If I tell you the colour of the ice cream is green, it's probably kesar pista, paan or matcha. If I tell you that the colour is brown, it's probably chocolate or coffee. Information about the colour gives away information about the flavour.

- Disjoint events also fall under this category. Disjoint as in, if one of the events happens, the other cannot. For instance, if the colour of the ice cream is white, there is no way that the flavour is chocolate or coffee.

When two variables have no relationship with each other and the occurrence of one doesn't impact the other in any way, we call them independent variables. For example, let's say we have two events: I win a badminton match, and I order a chocolate ice cream. These two events have nothing to do with each other; I might want a chocolate ice cream before a match, after a match, after winning or after losing. So, these two events are independent events.
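You can test for independence by checking whether the joint probability factorises into the product of the marginals; for our badminton table it doesn't, which confirms that our results are dependent:

```python
import numpy as np

joint = np.array([[0.5, 0.1],
                  [0.1, 0.3]])

# Independence requires p(A, B) = p(A) * p(B) for every cell
p_i_win = joint[0].sum()       # 0.6
p_he_wins = joint[:, 0].sum()  # 0.6
print(joint[0, 0])             # 0.5
print(p_i_win * p_he_wins)     # 0.36 != 0.5, so the events are dependent
```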

Note: Conditional probability has nothing to do with dependence or independence; it is applicable in both cases.

This is where we end part one, and I've covered all the essential terms needed to understand Bayes' Theorem. In Part 2, we will have a detailed look at Bayes' Theorem and start building up the concepts essential for understanding Generative AI algorithms, especially Variational Autoencoders!

References:

1) A Student’s Guide to Bayesian Statistics by Ben Lambert: https://sites.math.rutgers.edu/~zeilberg/EM20/Lambert.pdf

2) Mathematics for Machine Learning by A. Aldo Faisal, Cheng Soon Ong, and Marc Peter Deisenroth: https://mml-book.github.io/book/mml-book.pdf

3) Concrete Mathematics by Donald Knuth, Oren Patashnik, and Ronald Graham: https://github.com/djtrack16/thyme/blob/master/math/Concrete%20Mathematics%20A%20Foundation%20of%20Computer%20Science%202nd%20Edition.pdf

4) An Introduction to Probability Theory and Its Applications by William Feller
