Probability and Statistics for Computer Vision 101 — Part 1

Sho Nakagome
sho.jp
11 min read · Sep 7, 2018

I believe understanding fundamental concepts is crucial when it comes to learning something advanced. Why? Because the fundamentals are the base you build your advanced knowledge on. If you pile more things on top of a weak base, the whole thing can break apart in the end, meaning you end up not fully understanding any of the material you learned. So, let’s try to understand the fundamentals deeply.

In this series, I’m going to explain probability and statistics for computer vision, mainly following the textbook “Computer Vision: Models, Learning, and Inference” by Dr. Simon J.D. Prince. The textbook is excellent and to the point. The website below has all sorts of materials, from PPT slides for each chapter to PDFs. I strongly recommend you actually go through the textbook to understand the material. In these articles, I’m going to summarize it and add some extra ideas to help you grasp the concepts easily.

Materials covered in this article:

In the textbook, it’s mainly from Chapter 2.

  • Probability
  • Random Variable (RV)
  • Probability Density Function (PDF)
  • Joint Probability
  • Marginalization
  • Conditional Probability
  • Bayes’ rule

Introduction to Probability and Random Variables

I’m sure you have learned about probability at some point. We also use it unconsciously in real life when making decisions: if you think you are likely to succeed in the decision you are about to make, you go for it; otherwise, you don’t. It’s an interesting field to study, but it can also be tricky at times. So in this section, let’s recap what probability is and introduce the concept called a “random variable”.

Let’s say you have a card in your hand, and you are about to drop it. What’s the probability of the card facing up when it lands on the ground?

Probability is often represented by a percentage in real life (like an 80% chance of rain), but when we deal with probability in math, we often use decimals (e.g. 0.5 for 50%). Probability is a measure of the likelihood that an event will occur, represented by a number between 0 and 1.

A Random Variable (RV) is a variable representing the outcome we are interested in. Take the card-dropping example above: the RV x is used to represent the state of the card. It has only 2 states, either facing up (x = 1) or facing down (x = 0). If you are familiar with programming, it’s just like any other variable you would use. An RV can change its state every time the event occurs, so if you dropped a card 5 times, x could be [0, 0, 1, 0, 1]. This gives us the empirical probability of the system. Since we got facing down (x = 0) in 3 out of 5 trials, the probability of the card facing down is Pr(x=0) = 3/5 = 0.6. On the other hand, the card faced up (x = 1) in 2 out of 5 trials, so Pr(x=1) = 2/5 = 0.4. Note that the probabilities of the two states (x = 0 and x = 1) sum to 1, and this is always the case when you sum over all possible states.
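To make this concrete, here is a small Python sketch that estimates these probabilities from repeated trials. Note that the fair 50/50 card is my assumption for illustration; the 5-trial example above would of course give different empirical numbers.

```python
import random

# Simulate dropping a card n times.
# Assumption (not from the article): each face is equally likely.
# x = 1 means the card faces up, x = 0 means it faces down.
random.seed(42)
n_trials = 10_000
outcomes = [random.randint(0, 1) for _ in range(n_trials)]

pr_up = outcomes.count(1) / n_trials    # empirical Pr(x=1)
pr_down = outcomes.count(0) / n_trials  # empirical Pr(x=0)

print(pr_up, pr_down)

# The empirical probabilities over all states always sum to 1
# (up to floating-point rounding).
assert abs(pr_up + pr_down - 1.0) < 1e-12
```

With more trials, the empirical probabilities settle around the true ones, which is exactly the “sum over all states is 1” property the next section relies on.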

Introduction to Probability Density Function (PDF)

Now you know what probability and random variable are. Let’s talk about Probability Density Function (often abbreviated as PDF, not the file type you often use!).

Remember that I showed you that the sum over all the states is always 1? This is a very important thing to remember in probability, because it’s a property you can always rely on.

If I were to visualize the example of card dropping in a bar graph, it would be something like this:

This is a discrete version of the probability density function. Let’s say the width of each bar is always 1. Then the area for Pr(x=0) is 0.6 * 1 = 0.6, and likewise the area for Pr(x=1) is 0.4 * 1 = 0.4. Note that the sum of all the areas is always 1, and this doesn’t change when you move to a continuous case.

Let’s recap what’s continuous and discrete here.

Both plots represent the same distribution; the only difference is whether it’s continuous or discrete. In data science, especially when we are handling data with programming, you are more likely to deal with discrete data, where you have multiple rows and columns and each cell contains a single data point.

Introduction to Joint probability

We talked about probability and the probability density function, as well as continuous and discrete data representations. Now let’s step a little further and talk about “Joint Probability”.

Let’s take a look at a simple example. Suppose you have 2 random variables x and y, with x representing whether it rains or not and y representing whether you have an umbrella or not. Assume that you know the probability of each, so you have something like the following:

Currently, each of those two events is considered separately. But we want to know the probability of them combined. That’s where “Joint Probability” comes into play.

Let’s take an example. What’s the probability that it rains and you have an umbrella? (Thank god you have one!) This is our Case 1: the joint probability Pr(x=1, y=1), with x=1 meaning it rains and y=1 meaning you have an umbrella. Case 2 is the worst-case scenario: it rains and you don’t have an umbrella. That joint probability is Pr(x=1, y=0).

So by going through the examples above, I hope you got a rough idea of what joint probability is. In more general terms, a joint probability is the likelihood of two (or more!) events occurring together at the same point in time.
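As a sketch, a joint distribution over two binary RVs can be stored as a small table. The numbers below are hypothetical, chosen only so that they sum to 1; they are not the values from the figure:

```python
# Hypothetical joint probability table for the rain (x) / umbrella (y) example.
joint = {
    (1, 1): 0.3,  # Pr(x=1, y=1): it rains and you have an umbrella (Case 1)
    (1, 0): 0.1,  # Pr(x=1, y=0): it rains and you don't (Case 2, worst case)
    (0, 1): 0.2,  # Pr(x=0, y=1): no rain, but you carried an umbrella
    (0, 0): 0.4,  # Pr(x=0, y=0): no rain, no umbrella
}

# Summed over every possible (x, y) pair, a joint distribution is always 1
# (up to floating-point rounding).
assert abs(sum(joint.values()) - 1.0) < 1e-12

print(joint[(1, 0)])  # Pr(x=1, y=0) = 0.1
```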

Just to give you another perspective, let’s try to visualize what joint probability is. Suppose you have 2 random variables (x and y) and want to see their joint probability visually. It looks something like the example below.

Think of this like a contour map. The dark area (towards black) is at the bottom and as the color gets lighter (towards yellow), the altitude gets higher as well. Essentially, you are trying to understand the landscape (which is 3D) by looking at a 2D map.

So what does this tell us about joint probability? One important thing is that the total over the whole area surrounded by the black square box always equals 1. Remember when we talked about the probability density function (PDF), where the area always has to be 1? It’s the same here: since we are talking about probability, the sum of all the possible joint probabilities has to be 1. Another aspect is the altitude. You can see a bright yellow section near the center, slightly up and to the right. This is where the joint probability Pr(x, y) is highest, meaning that particular pair of x and y values is the most likely to occur.

You could also see different kinds of examples of visualization in Dr. Simon Prince’s website where he has tons of slides:

http://web4.cs.ucl.ac.uk/staff/s.prince/book/02_Intro_To_Probability.pptx


Introduction to Marginalization

OK, now you know joint probability. Let’s talk about “Marginalization”. Marginalization is a way to go from a joint probability back to the individual probabilities we handled in the beginning. Suppose you only had the joint probabilities:

What we are interested in now are the individual probabilities, such as Pr(x=1) or Pr(y=0). How can we calculate those from the joint probabilities?

The thing is, it’s quite easy. Here’s the formula for both continuous and discrete cases:
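For both cases, the idea is the same: you sum (discrete) or integrate (continuous) the joint probability over every state of the variable you want to remove. In LaTeX form, the standard marginalization formulas are:

```latex
% Discrete case: sum over all possible states of y
\Pr(x) = \sum_{y} \Pr(x, y)

% Continuous case: integrate y out
\Pr(x) = \int \Pr(x, y)\,dy
```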

It’s just summing over all the possible states of the random variable you are not interested in. Let’s see an example to understand this. Suppose you want to get Pr(x=0), but you only have the joint probabilities Pr(x=0, y=0) & Pr(x=0, y=1). To get Pr(x=0), you perform marginalization as follows:

So what’s the intuition behind this calculation? Marginalization accounts for all the possible situations of the state you are interested in. Think about getting Pr(x=0). The problem was that you only have the joint probabilities Pr(x=0, y=0) and Pr(x=0, y=1). But if you think about it for a moment, for the specific state we are interested in (x=0), there are only two situations: either y=0 or y=1. That’s why, by summing Pr(x=0, y=0) and Pr(x=0, y=1), we account for all the possible situations for x=0, thus getting the probability.
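Here is the same marginalization as a minimal Python sketch. The joint values are hypothetical (chosen only to sum to 1), not the ones from the figure:

```python
# Hypothetical joint probabilities over two binary RVs x and y.
joint = {
    (0, 0): 0.25,  # Pr(x=0, y=0)
    (0, 1): 0.25,  # Pr(x=0, y=1)
    (1, 0): 0.10,  # Pr(x=1, y=0)
    (1, 1): 0.40,  # Pr(x=1, y=1)
}

# Marginalize y out: Pr(x=0) = Pr(x=0, y=0) + Pr(x=0, y=1)
pr_x0 = sum(p for (x, y), p in joint.items() if x == 0)
print(pr_x0)  # 0.5
```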

Introduction to Conditional Probability

Now the topics will get a bit complicated, but at the same time, more interesting! Let’s talk about a very important concept called “Conditional Probability”.

Let’s go back to the rain and umbrella example once again. Previously, we considered the two events as independent, meaning one didn’t affect the other: it was just whether it rained or not and whether you happened to have an umbrella. However, in a real-world scenario, you would behave smarter. I mean, if you know it will rain in the afternoon, wouldn’t you bring an umbrella with you? If so, given the condition that it might rain in the afternoon, the probability of you bringing an umbrella increases. This is where conditional probability comes into play.

So the conditional probability looks like the following:
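Written out for a particular value y = y*, the definition is:

```latex
\Pr(x \mid y = y^{*}) = \frac{\Pr(x,\, y = y^{*})}{\Pr(y = y^{*})}
```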

where you divide the joint probability by the probability of the state you are conditioning on. But what does this equation mean? What does it mean to normalize?

Going back to the visualized joint-probability contour plot, we can explain why this equation makes sense. I already explained that the joint probability looks like the contour map on the left. Conditional probability restricts the situation to a certain condition (y = y*). In this scenario, you are not thinking about all the possible y’s, but only the particular y = y* on which you base your decision (like when you are deciding whether to bring your umbrella given that you already know it might rain in the afternoon).

Given y = y*, we could simply take that slice out of the contour map and use it as our conditional probability. But there’s one problem: the area of the slice is not 1. Why? Remember that all the possible joint probabilities over this area have to sum to 1, as I explained earlier? That’s why taking a slice won’t give you 1, since you are ignoring all the other possible joint-probability cases. Therefore, to make this slice a PDF, we need to normalize it so that its area becomes 1.

If it’s a continuous case, the area is represented using the integral as above. By using this, we can calculate the conditional probability like this:

So this is why the conditional probability looks like this in a simplified form:
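Putting the pieces together: the slice Pr(x, y = y*) is divided by its own area, and by marginalization that area is exactly Pr(y = y*).

```latex
\Pr(x \mid y = y^{*})
  = \frac{\Pr(x,\, y = y^{*})}{\int \Pr(x,\, y = y^{*})\,dx}
  = \frac{\Pr(x,\, y = y^{*})}{\Pr(y = y^{*})}
```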

I hope you got the idea!

Introduction to Bayes’ rule

One last topic of today’s article is yet another very important concept called “Bayes’ rule”. I’m pretty sure you have heard of it before. Let’s take a closer look to understand this concept.
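Written for the two random variables we have been using, Bayes’ rule reads:

```latex
\Pr(y \mid x) = \frac{\Pr(x \mid y)\,\Pr(y)}{\Pr(x)}
```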

So the formula itself is pretty easy. But what does this formula mean and how can we get this formula from what we already know?

First, let’s try to understand the formula itself. The well-known formula above can be rewritten based on what we have already learned.

Doesn’t this look familiar to you? Yes! It’s just like the conditional probability formula where we normalized to make the area 1 so that we could treat it as a PDF.

So to break down the Bayes’ theorem, it is a way to calculate the “posterior” using 3 terms: “likelihood”, “prior”, and “evidence”.

  • posterior: what we know about y after observing x
  • likelihood: propensity for observing a certain value of x given a certain value of y
  • prior: what we know about y before observing x
  • evidence: a constant to ensure that the left hand side is a valid distribution (normalization term)

It is useful because even when one term, say the posterior, is difficult to calculate directly, it is often easier to compute from the likelihood, prior, and evidence. The same goes for the other terms.
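As a sketch with made-up numbers (my assumptions, not values from the article): suppose y = 1 means it rains and x = 1 means you see someone carrying an umbrella. Bayes’ rule combines the likelihood and prior into a posterior:

```python
# Hypothetical numbers for the rain (y) / umbrella (x) example.
prior = 0.3               # Pr(y=1): prior probability of rain
likelihood = 0.9          # Pr(x=1 | y=1): umbrella carried when it rains
pr_x_given_no_rain = 0.2  # Pr(x=1 | y=0): umbrella carried on a dry day

# Evidence Pr(x=1): marginalize over both states of y.
evidence = likelihood * prior + pr_x_given_no_rain * (1 - prior)

# Posterior Pr(y=1 | x=1) = likelihood * prior / evidence.
posterior = likelihood * prior / evidence
print(round(posterior, 3))  # seeing the umbrella raises 0.3 to about 0.659
```

Note how the evidence term is just the marginalization from earlier: it normalizes the numerator so the posterior is a valid probability.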

I hope you understand the formula by now. Let’s see how we can derive this formula from what we already know.

Basically, we can derive Bayes’ theorem from the definition of conditional probability. This is an important concept, so if you are not sure about something, make sure to spend some time understanding it!
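The derivation takes only two steps: write the definition of conditional probability both ways, solve one for the joint probability, and substitute it into the other.

```latex
% Conditional probability, written both ways:
\Pr(y \mid x) = \frac{\Pr(x, y)}{\Pr(x)}
\qquad
\Pr(x \mid y) = \frac{\Pr(x, y)}{\Pr(y)}

% Solve the second equation for the joint probability:
\Pr(x, y) = \Pr(x \mid y)\,\Pr(y)

% Substitute into the first to get Bayes' rule:
\Pr(y \mid x) = \frac{\Pr(x \mid y)\,\Pr(y)}{\Pr(x)}
```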

Summary

  • Probability

The measure of the likelihood that an event will occur, represented by a number between 0 and 1.

  • Random Variable (RV)

Just like any other variable, but one that represents the possible values of a random outcome.

  • Probability Density Function (PDF)

The density of a continuous random variable, represented as a function where the area under the curve gives the probability of the RV falling within a range of values.

  • Joint Probability

The probability that two or more random variables occur at the same time.

  • Marginalization

A way to sum (or integrate) out certain random variables, yielding a probability over fewer random variables.

  • Conditional Probability

The probability of an event given that (by assumption, presumption, assertion or evidence) another event has occurred.

  • Bayes’ rule

The rule that ties posterior probability with likelihood, prior, and evidence.

I hope this helps! See you next time!


Sho Nakagome
sho.jp

A Neuroengineer and Ph.D. candidate researching Brain Computer Interface (BCI). I want to build a cyberbrain system in the future. Nice meeting you!