3. Probability for Machine Learning — The Basics

Rithesh K
9 min read · Apr 20, 2024


This is a part of the Machine Learning series.

  1. An Introduction to the World of Machine Learning
  2. First Look into Machine Learning
  3. This post

Probability: A review

When talking about the data in the previous post, we saw how the data could be noisy due to measurement errors or noise in the environment itself, or it could be biased. In other words, the data we collect carries some uncertainty, no matter how well we clean it. The perfect tool for handling uncertainty is Probability Theory. So, let’s review it briefly and see where it applies in machine learning.

We start with a couple of (informal) definitions:

  • Outcome space or Sample space: A set containing all possible outcomes of an activity. For example, if you are trying to find some probability around a die, the sample space is the set of all the faces of the die — {1, 2, 3, 4, 5, 6}. If the problem involves two dice, the sample space = {(1, 1), (1, 2), (1, 3), …, (6, 6)}.
  • Event: What you are trying to find the probability of. It could be getting an odd face when you throw a die, or getting a royal flush in poker (which is super, super rare). Formally, an event is a subset of the sample space — the outcomes that you want. For example, the event “getting an odd face on a die” = {1, 3, 5}.

What we need to find is the probability of an event given a sample space. Probability values range from 0 — no chance the event will happen (like rolling a 7 on a die) — to 1 — the event will definitely happen (like getting hurt if you hit your head against a wall, unless you’re superhuman, of course). A simple way to calculate the probability is to count the number of outcomes in the event and divide it by the total number of outcomes in the sample space.

For example, the probability of getting an odd number on a die roll = count({1, 3, 5}) / count({1, 2, 3, 4, 5, 6}) = 0.5. This means we have a 0.5 chance (a 50% chance) of getting an odd number on a die roll. We conventionally denote probability with P. So, P(odd face on a die) = 0.5.
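The counting recipe above can be sketched in a few lines of Python (the `probability` helper is hypothetical, written just for this illustration):

```python
from fractions import Fraction

def probability(event, sample_space):
    """Probability of an event as |event| / |sample space|, assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

sample_space = {1, 2, 3, 4, 5, 6}   # the faces of a die
odd_face = {1, 3, 5}                # the event "odd face"

print(probability(odd_face, sample_space))  # 1/2
```

Using `Fraction` keeps the result exact instead of a rounded float.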

Let’s take an example where you want the probability of getting 5 Heads when you toss a coin 8 times. And you also want the probability of 4 Heads on 8 coin tosses, 3 Heads, and so on. Previously, we would write these as P(5 Heads on 8 coin tosses), P(4 Heads on 8 coin tosses), …, which is tedious. Instead, let’s use a variable X for the “number of Heads on 8 coin tosses”. It’s a bit abstract because we haven’t specified how many Heads. We can now write P(5 Heads on 8 coin tosses) as P(X=5). Much simpler.

This variable X is called a Random Variable. This has another advantage. We can now look at outcomes as just numbers the random variable can take. In the above example, if we change the value of X to 4, we get a different outcome — 4 Heads on 8 coin tosses. We are essentially mapping each outcome to a number using the random variable.

  • Random Variable (X): A function (or a mapping) that assigns a real value to each possible outcome of an event.

Random variables can be discrete, taking specific countable values, like the above example where X can only be 0, 1, 2, …, 7, 8 (you can only have 0 to 8 Heads in 8 coin tosses). Random variables can also be continuous: for example, if you model a man’s height with a random variable Y, it can take any of infinitely many values, and we might have to find P(Y = 170.2cm). Continuous random variables are not very intuitive at first, but they become clearer as we go deeper.
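The “mapping” view of a random variable can be made concrete with a small sketch: enumerate every raw outcome of 8 coin tosses and map each one to a number, the count of Heads.

```python
from itertools import product

# The full sample space: every sequence of 8 tosses, 2^8 = 256 outcomes
outcomes = product("HT", repeat=8)

# The random variable X maps each raw outcome to a real number (here: count of Heads)
X = [seq.count("H") for seq in outcomes]

print(len(X))      # 256 outcomes in total
print(X.count(5))  # 56 of them map to X = 5
```

This is exactly the counting we’ll rely on in the next section.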

Discrete Random Variables

Let’s stick with discrete random variables for now and try to find those values. Let’s find the value of P(X=5), where X is the number of Heads in 8 coin tosses. From permutations and combinations (which I won’t cover here, as that is a huge topic in itself), the number of outcomes with 5 Heads in 8 coin tosses is C(8, 5) = 56. And the total number of outcomes in 8 coin tosses is 2⁸ = 256. So P(X=5) = 56/256 = 7/32.

Similarly, we find the probability values of other outcomes of X. We get:

P(X=0) = P(X=8) = 1/256

P(X=1) = P(X=7) = 8/256 = 1/32

P(X=2) = P(X=6) = 28/256 = 7/64

P(X=3) = P(X=5) = 56/256 = 7/32

P(X=4) = 70/256 = 35/128
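The whole table above can be reproduced with Python’s built-in `math.comb`, since the count of outcomes with k Heads is C(8, k):

```python
from fractions import Fraction
from math import comb

n = 8  # number of coin tosses

# P(X = k) = C(n, k) / 2^n for each possible count of Heads k
pmf = {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}

print(pmf[5])             # 7/32, matching C(8, 5)/2^8 = 56/256
print(sum(pmf.values()))  # 1 — the probabilities of all outcomes sum to 1
```

The sum-to-1 check at the end is a useful sanity test for any probability table.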

Now that we have the probabilities of different counts of Heads in 8 coin tosses, you might ask a further question — if I know there will be 8 coin tosses, what can I expect the number of Heads to be? We cannot know the exact number of Heads beforehand, but we can find out what number of Heads to expect on average. This is known as the Expectation of a random variable.

  • Expectation (E(X)): The long-term average value of a random variable — the average value you would get if you repeated the experiment an infinite number of times.

We have defined the expectation of outcomes as an average, which is a math operation. This is possible because we have mapped the outcomes to numbers using a random variable. We also need to weight the outcomes by how likely they are to occur. We do that by multiplying each outcome by its probability and summing the results to get the expectation.

E(X) = Σₓ x · P(X = x)

Equation 1: The expectation E of random variable X

In our example of X: number of Heads in 8 coin tosses, the expectation of X comes out to be:

E(X) = 0*(1/256) + 1*(8/256) + 2*(28/256) + 3*(56/256) + 4*(70/256) + 5*(56/256) + 6*(28/256) + 7*(8/256) + 8*(1/256)

E(X) = 4

We got the expectation to be 4. This makes sense: in the long run, 4 Heads comes up more often than any other count, and the other probabilities are distributed symmetrically on either side of 4 (X=3 and X=5 have the same probability, X=2 and X=6 have the same, and so on), keeping the balance of probability at 4.
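The computation above can be checked in a few lines, rebuilding the pmf and applying Equation 1:

```python
from fractions import Fraction
from math import comb

n = 8
pmf = {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}

# E(X) = sum over all outcomes x of x * P(X = x)
expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 4
```

Exact fractions make the result come out as exactly 4, with no floating-point noise.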

So, if we have an average of a random variable as an expectation, do we also have a variance of a random variable? Recall that the variance of a set of values tells us how the values deviate from the mean: a high variance means the data points are spread far apart from the mean. In the same way, we can define the variance of a random variable to see how the other occurrences are spread apart from the expected occurrence (the “mean” of the occurrences).

We defined the variance in statistics as the average of the squares of differences between the values and their mean. In a similar way, we define the variance of random variables as the expectation of the squares of differences between the random variable and the expected occurrence.

Var(X) = E[(X − E(X))²] = Σₓ (x − E(X))² · P(X = x)

Equation 2: The variance Var of random variable X

A higher variance means the probability of occurrences far from the expected occurrence is high. So even though you have an expected occurrence, there’s a good chance you will observe a different one.
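A quick sketch of Equation 2 for our coin-toss example (the value 2 is what the formula yields here; it also matches the known variance of this kind of distribution):

```python
from fractions import Fraction
from math import comb

n = 8
pmf = {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}
expectation = sum(x * p for x, p in pmf.items())  # 4, as computed before

# Var(X) = E[(X - E(X))^2]: weight each squared deviation by its probability
variance = sum((x - expectation) ** 2 * p for x, p in pmf.items())
print(variance)  # 2
```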

We have seen the various probabilities for a random variable along with its expectation and variance. Can we see it as a graph? Yes, we can! In statistics, we have a frequency distribution graph and a histogram (bar chart) of values vs their frequency in the set. Here, we already have the probability of the occurrences, so we can plot them directly. This graph is called a probability distribution. And if the random variable X is discrete (like in our example case), we call this the probability mass function.

  • Probability Mass Function (pmf): A function that measures the probability of observing a discrete outcome x. Plotting the values of pmf for all the outcomes of a random variable gives us the probability distribution.
Figure 1: The probability distribution of the random variable “number of Heads in 8 coin tosses.” The sum of the probabilities should be equal to 1.

Figure 1 shows the probability distribution of our use-case “number of Heads in 8 coin tosses.” The probability of outcome 4 is the highest, and the distribution on either side is symmetrical. Also, there is a very nice dome-like shape in the distribution. This has a special name called Binomial distribution, and we will discuss this in future blogs.

Continuous Random Variables

So far, we have seen some properties of discrete random variables and their probability distribution. Can we do the same for continuous random variables? Yes, we can, but first we must ask ourselves a simpler but crucial question — what does probability even mean for continuous random variables? After all, if we consider real numbers, there are infinitely many values in any given range. The probability of getting one particular number out of an infinite collection is just zero. So… no probability?

Indeed, we cannot talk about the probability of a single value of the random variable. But we can talk about the probability over a range of values — the probability that the random variable falls within that range.

Let’s take an example of this. Suppose you want to guess the height of a man. When you make a guess, saying he’s 175cm, you don’t imply that he is 175.00000… cm with exact precision. You mean that he is around 175cm with some high precision (even if he is 174.98cm or 175.02cm, we consider him 175cm). Or when you say your own height, you measure to a certain degree of precision. So, we are already looking at a small range the random variable can take and not an exact continuous value.

In a similar way, we can find the probability that a random variable falls in a range of values. This is called the likelihood of the variable over that range. In the previous example, we were looking for the likelihood of a man’s height being around 175cm. You get more precise information about the probability distribution as you shrink these ranges, down to the smallest possible range around each value. If you find the likelihood for every value in the outcome space, you get the probability density function.

  • Probability Density Function (pdf): A function that measures the likelihood of observing a continuous outcome x.
Figure 2: The probability density distribution of heights of 14-year-old girls. The area under the curve must be equal to 1, even though the density values can be greater than 1.

Figure 2 shows a pdf of the heights of 14-year-old girls. The density at some points exceeds 1, but the total area under the curve is always 1 (as the total probability is 1). We can extend the definition of expectation to continuous variables with the help of integrals.

E(X) = ∫ x · p(x) dx

Equation 3: The expectation of a continuous variable. Instead of a summation, we use an integral.

With the expectation defined, we can use the same variance formula here, too.
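The continuous-variable claims above can be checked numerically. The sketch below assumes a normal (bell-shaped) pdf as a model of heights — the mean and spread are made-up values for illustration, not data from Figure 2. It shows that the density can exceed 1 while the area under the curve stays 1, and that the probability of a small range around a value is approximately density × width.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """P(Y <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# A narrow distribution: its peak density is greater than 1
mu, sigma = 160.0, 0.3
peak = normal_pdf(mu, mu, sigma)
print(peak > 1)  # True — density values may exceed 1

# Riemann-sum approximation of the area under the curve over [mu-5, mu+5]
dx = 0.001
area = sum(normal_pdf(mu - 5 + i * dx, mu, sigma) * dx for i in range(10000))
print(round(area, 3))  # ≈ 1.0 — total probability is still 1

# P(174.9 < Y < 175.1) for a hypothetical model of men's heights (cm)
mu2, sigma2 = 175.0, 7.0
p_range = normal_cdf(175.1, mu2, sigma2) - normal_cdf(174.9, mu2, sigma2)
print(abs(p_range - normal_pdf(175.0, mu2, sigma2) * 0.2) < 1e-6)  # ≈ density * width
```

Shrinking the range toward zero is exactly how the density value at a point should be read: probability per unit of the variable, not a probability itself.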

In future blogs, we will continue the discussion on probabilities, focusing on a few special probability distributions and Bayes’ estimates. For now, I feel that this is good enough as we finally enter the core of Machine Learning in the next blog. We will start with one of the simplest and most fundamental machine learning models, the Linear Regression model.

