Probability theory. (Part 1/3)

14 min read · Jun 22, 2017

I had a very hard time understanding different aspects of probability theory, so I dug through various blogs and articles and compiled what I learned into a single blog. Hope it helps you get a kickstart.

What is Probability?

Probability is a mathematical way of quantifying uncertainty. In computer science we are used to discrete, deterministic computations, but the real world is very uncertain. For example, if you shoot a ball at a basket it might or might not go in, since various environmental factors decide the outcome. The study of probability is very helpful when we need to model complex systems. For example, let's consider two statements: “Most birds fly” and “All birds fly, except ostriches, kiwis, baby birds, penguins…”. Both statements are true, but the second is more complex because it has to list every possible non-flying bird! The first statement generalizes the model into a simple, manageable statement. Similarly, probability theory helps us describe complex systems with a simple model.

Here are the topics I am planning to cover in this blog.

Random Variable
How to compute probability (Box & Tree method): General/Conditional Probability
Bayes Theorem
Probability distribution functions.

Random Variable

Let x be a variable that can take any value from a set or a distribution. We call x a random variable because the value it takes is decided by chance.

Example: say I have a red ball, a green ball and a blue ball. Then x can be the red ball, the green ball or the blue ball, and it cannot be a ball of any other color, since our set only contains red, green and blue. Similarly, in the integer range from 0 to 10, x can be 2, 3, 4 and so on. We write x=2 when x is 2, or x=red ball when our random variable takes the red ball.

As I said, probability is about chance. Let's say I have 5 red balls, 3 green balls and 2 blue balls, and I put them all into a box. I shake the box for a few seconds and ask you to close your eyes and pick a ball. What are the chances that you picked a red ball? We have a total of 10 balls, and 5 of them are red. There are more red balls than green balls, and more red balls than blue balls. In fact the number of (#) red balls equals the sum of green and blue balls: #Rb = #Gb + #Bb. Red balls make up half of all the balls, so the chance of picking a red ball is 5/(5+3+2) = 5/10, which is 0.5, or 50%.

In terms of probability we say P(x=red ball) is 0.5, i.e. the probability that my random variable x is a red ball is 0.5. This is the same as saying the probability of picking a red ball is 0.5.
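To make the counting concrete, here is a minimal Python sketch of the same calculation. The names `counts` and `probability` are my own and only illustrate the idea; `Fraction` keeps the result exact.

```python
from fractions import Fraction

# Ball counts from the example above.
counts = {"red": 5, "green": 3, "blue": 2}
total = sum(counts.values())

def probability(color):
    """Favorable outcomes divided by total outcomes."""
    return Fraction(counts[color], total)

p_red = probability("red")
print(p_red, float(p_red))  # 1/2 0.5
```

Calling `probability("green")` or `probability("blue")` works the same way for the other colors.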

How to compute probability?

As an equation, probability can be written as

Probability = Total number of favorable outcomes / Total number of outcomes.

I.e. in the case of 5 red balls and 3 green balls, if we were to pick a ball randomly, the chance of picking a red ball is 5/8, since there are 8 ways of picking a ball and 5 of them are favorable (red).
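If you would rather check this by experiment, a quick simulation can approximate the same ratio. This is only a sketch: the seed and the number of trials are arbitrary choices of mine, and the estimate will only be close to 5/8, not exactly equal to it.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# The box from the example: 5 red balls and 3 green balls.
box = ["red"] * 5 + ["green"] * 3
trials = 100_000

# Pick a ball at random (putting it back each time) and count the reds.
hits = sum(random.choice(box) == "red" for _ in range(trials))
estimate = hits / trials
print(estimate)  # should be close to 5/8 = 0.625
```

The more trials you run, the closer the estimate tends to get to 0.625.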

Computing probability can be very easy if you represent your data in a structured way. Two methods we can use are the Box method and the Tree method.

Let's take an example. There are 3 bowls (Bowl A, Bowl B, Bowl C). In Bowl A (Ba) we have 4 blue marbles (Bm) and 2 red marbles (Rm), in Bowl B (Bb) we have 2 blue marbles and 1 red marble, and finally in Bowl C (Bc) we have 4 blue marbles and 5 red marbles.

This might look like a lot of data, so first let's represent it in a structured way.

|               | Blue marbles (Bm) | Red marbles (Rm) | Total marbles |
| Bowl A (Ba)   | 4                 | 2                | 6             |
| Bowl B (Bb)   | 2                 | 1                | 3             |
| Bowl C (Bc)   | 4                 | 5                | 9             |
| Total marbles | 10                | 8                | 18            |

Now our data is organized. If I ask how many marbles there are in Bowl A, we can quickly look and say 6. If I ask how many marbles in Bowl C are red, we can look and say 5.

Now if I ask for the probability of picking a red marble, P(X=Rm), from any bowl, we can see that the total number of outcomes is 10+8 = 18 (or 6+3+9 = 18), and 8 of them are favorable, so the answer is 8/18. Here every marble is assumed to be equally likely to be picked.
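The box method maps naturally onto a nested dictionary. The sketch below rebuilds the row and column totals and the 8/18 answer; the variable names are my own, and it assumes every marble is equally likely to be picked.

```python
from fractions import Fraction

# The box (table) from above: bowl -> marble color -> count.
table = {
    "Ba": {"Bm": 4, "Rm": 2},
    "Bb": {"Bm": 2, "Rm": 1},
    "Bc": {"Bm": 4, "Rm": 5},
}

# Totals per bowl (rows) and per color (columns).
row_totals = {bowl: sum(row.values()) for bowl, row in table.items()}
col_totals = {c: sum(row[c] for row in table.values()) for c in ("Bm", "Rm")}
grand_total = sum(row_totals.values())

# Favorable outcomes (red marbles) over all outcomes (all marbles).
p_red = Fraction(col_totals["Rm"], grand_total)
print(row_totals, col_totals, p_red)
```

`Fraction` reduces 8/18 to 4/9, which is the same number.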

Marginal probability

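In marginal probability there is no dependence on any other variable. For example, I can ask: what is the probability of picking a red marble? Here we don't have a specific bowl requirement; we are free to pick from all bowls. Red marbles make up 2/6 of bowl A, 1/3 of bowl B and 5/9 of bowl C, and each of these fractions has to be weighted by the chance of drawing from that bowl. If every marble is equally likely to be picked, those weights are 6/18, 3/18 and 9/18, so we get 2/18 + 1/18 + 5/18 = 8/18, the same answer as before. Marginal probability captures the probability of picking a red marble across all possible scenarios. We can represent it as an equation:

$$P(X = Rm) = \sum_{i=1}^{3} P(X = Rm \mid Y = B_i)\, P(Y = B_i)$$

Here B_i is the i-th bowl and i indexes from 1 to 3 for the three different bowls. This example was for the discrete case; for the continuous case it will be

$$p(x) = \int p(x \mid y)\, p(y)\, dy$$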

We will study continuous cases later when we visit the probability distributions section.
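Here is a small sketch of the discrete sum for the red marble. It assumes every marble is equally likely to be picked, so the chance of drawing from a bowl is just its share of the 18 marbles; the dictionary names are my own.

```python
from fractions import Fraction

# Fraction of red marbles inside each bowl.
p_red_given_bowl = {"Ba": Fraction(2, 6), "Bb": Fraction(1, 3), "Bc": Fraction(5, 9)}

# Assumed chance of drawing from each bowl: its share of all 18 marbles.
p_bowl = {"Ba": Fraction(6, 18), "Bb": Fraction(3, 18), "Bc": Fraction(9, 18)}

# Marginal: weight each bowl's red fraction by the chance of that bowl.
p_red = sum(p_red_given_bowl[b] * p_bowl[b] for b in p_bowl)

# Adding the three fractions without weights is not a probability at all.
unweighted = sum(p_red_given_bowl.values())
print(p_red, unweighted)  # 4/9 11/9
```

The weighted sum gives 4/9 (that is, 8/18), while the unweighted sum is larger than 1, which is why the weights matter.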

Conditional probability

Now let's jump to the next level. What is the probability of picking a red marble from Bowl A? It should be straightforward: we have 6 marbles in Ba and 2 are red, so the answer is 2/6.

Let's say Bowl A, Bowl B and Bowl C have different weights. If I put these bowls on a turntable and spin it, then the probabilities of ending up with Bowl A, B or C are not the same. In an ideal world, if every atom of Bowl A, B and C were the same, we could say the probabilities are the same, i.e. 1/3 for selecting each bowl. For this example let's say the probabilities of picking Ba, Bb and Bc are 0.4, 0.3 and 0.3.

Now you close your eyes and I pick a marble randomly from a bowl, and it happens to be red. I ask you to open your eyes and ask: what is the probability that this red marble came from bowl A? If you pay attention, you will notice that the answer depends not only on picking a marble but also on picking a bowl. We have three bowls; what are the chances of picking bowl A? And if we “do pick” Bowl A, what are the chances of picking a red marble from it? Probabilities which depend on some other event having happened are called conditional probabilities. In terms of an equation we can represent it as

P(X=Rm|Ba), i.e. the probability that my random variable is a red marble given that I pick from Bowl A. The symbol “|” is read as “given that”. Questions like this can be easily solved by drawing a tree.

Image1: Probability distribution tree for red, blue marbles and A,B,C bowls

Each branch of this tree represents a probability. 0.4 represents the probability of picking Bowl A, P(Ba). 0.3 represents both P(Bb) and P(Bc). P(Rm|Ba) represents picking a red marble from bowl A; it is 2/6 or 0.33, as we solved earlier. Now for the question asked: given that we picked a red marble, what is the probability it came from bowl A?

The probability of picking bowl A and then a red marble from it is P(Ba) * P(Rm|Ba) = 0.4 * 0.33. I.e. only after selecting bowl A (with probability 0.4) can we pick either a red or a blue marble from it, and the probability of picking red from that bowl is 0.33, so the combined probability is 0.4 * 0.33.

Note that P(Rm|Ba) is not the same as P(Ba|Rm)! The first, P(Rm|Ba), means the probability of picking a red marble given that we pick from Bowl A. The second means: if we have picked a red marble, what is the probability that it came from Bowl A?

We know probability is the ratio of the number of favorable outcomes to the total number of outcomes. Let's first find the total number of outcomes. We can pick a red marble from bowl A, B or C, and each of those bowls has its own probability.

So our answer should be the ratio of the probability of picking a red marble from bowl A (favorable) to the probability of picking a red marble from any bowl, P(Rm). P(Rm) is the probability of picking bowl A and a red marble from it + the probability of picking bowl B and a red marble from it + the probability of picking bowl C and a red marble from it. We have already worked out the bowl A term (0.4 * 0.33); let's solve the others using the same logic and add them up.

P(Rm) = P(Ba)*P(Rm|Ba) + P(Bb)*P(Rm|Bb) + P(Bc)*P(Rm|Bc)

P(Rm) = 0.4*0.33 + 0.3 * 0.33 + 0.3 * 0.55

= 0.396

Now the favorable event, P(Ba)*P(Rm|Ba), is 0.4*0.33 i.e. 0.132

So P(Ba|Rm) = 0.132/0.396 i.e. 0.33
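The same arithmetic can be checked with exact fractions. This is just a sketch of the calculation above, with dictionary names of my own choosing.

```python
from fractions import Fraction

# Probability of picking each bowl (the first level of the tree).
p_bowl = {"Ba": Fraction(4, 10), "Bb": Fraction(3, 10), "Bc": Fraction(3, 10)}

# Probability of a red marble given the bowl (the second level of the tree).
p_red_given_bowl = {"Ba": Fraction(2, 6), "Bb": Fraction(1, 3), "Bc": Fraction(5, 9)}

# Probability of picking a bowl AND a red marble from it, per bowl.
joint = {b: p_bowl[b] * p_red_given_bowl[b] for b in p_bowl}

# Probability of a red marble from any bowl.
p_red = sum(joint.values())

# Favorable (bowl A and red) over total (red from any bowl).
p_a_given_red = joint["Ba"] / p_red
print(p_red, p_a_given_red)  # 2/5 1/3
```

With exact fractions P(Rm) comes out as 0.4 rather than 0.396; the small gap comes from rounding 1/3 to 0.33 and 5/9 to 0.55. The final answer is still 1/3.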

Similarly, try to solve this question: given that I have picked a blue marble, what is the probability that it came from Bowl C? i.e. P(Bc|Bm).

Now let's represent it in the form of a neat equation.

To generalize, if we call picking bowl A event A and picking a red marble event B, then

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

This is nothing but Bayes' rule.

Bayes Theorem.

Each term in that equation captures special information about our events. When I asked you, “given that I have picked a red marble, what is the probability it came from bowl A?”, I actually gave you some evidence, and your task was to hypothesize which bowl it came from. The evidence that the marble is red means it could have come from bowl A, bowl B or bowl C. So you have three different hypotheses: the first says the marble came from bowl A, the second says it came from bowl B, and the last says it came from bowl C. Among these 3, which hypothesis do you believe is true? We can never be 100% sure that this red marble came from a particular bowl, but we can be certain to some degree that it came from a particular bowl.

I.e. as we solved in the previous section, you can be 0.33 or 33% sure that it came from bowl A.

So, I gave you the evidence that the marble is red, and you proposed a theory with three hypotheses, i.e. P(Ba|Rm), P(Bb|Rm), P(Bc|Rm). Each term in Bayes' rule captures one of these interesting pieces of information:

$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$

In P(Ba|Rm) I gave you the evidence that it's a red marble, and you hypothesized that there is a 33% chance it came from Bowl A. This is called the posterior probability, P(H|E). If you refer to the tree we drew, you can see where all these terms lie on it.

P(H) is called the prior probability. Let's see why it is called so. If I had not given you any evidence, i.e. if I had just said I picked a marble (without saying its color) and asked you for the probability that it came from bowl A, then with three bowls, each with its own probability, you would have hypothesized a 0.4 or 40% chance that it came from bowl A (refer to the tree). So P(H) captures the probability of your hypothesis before I give you any evidence.

P(E|H) captures the likelihood, i.e. if I had told you that the bowl was bowl A, what is the probability of picking a red marble? From our tree diagram it is P(Rm|Ba) = 0.33.

The denominator P(E) is the probability of picking a red marble, P(Rm), which came from solving P(Ba)*P(Rm|Ba) + P(Bb)*P(Rm|Bb) + P(Bc)*P(Rm|Bc). This is called the marginal probability, i.e. under all possible cases, what is the probability of picking a red marble? Let's summarize this equation:

Posterior = (Likelihood * Prior) / Marginal

From now on let's follow the convention of H for hypothesis and e for evidence.
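To see the prior, likelihood, marginal and posterior side by side, here is a hedged sketch of a small helper that applies Bayes' rule to every hypothesis at once. The function name `posterior` and its dictionary arguments are my own invention for illustration.

```python
def posterior(prior, likelihood):
    """Return P(H|e) for every hypothesis H using Bayes' rule."""
    # Marginal P(e): sum of likelihood * prior over all hypotheses.
    marginal = sum(prior[h] * likelihood[h] for h in prior)
    # Posterior: likelihood * prior / marginal, one value per hypothesis.
    return {h: prior[h] * likelihood[h] / marginal for h in prior}

# Prior P(H): chance of each bowl before seeing any marble.
prior = {"Ba": 0.4, "Bb": 0.3, "Bc": 0.3}

# Likelihood P(e|H): chance of a red marble given each bowl.
likelihood = {"Ba": 2 / 6, "Bb": 1 / 3, "Bc": 5 / 9}

post = posterior(prior, likelihood)
for bowl, p in post.items():
    print(bowl, round(p, 3))
```

The three posteriors add up to 1, and bowl C ends up as the most believable hypothesis for a red marble, at roughly 42%, ahead of bowl A at 33% and bowl B at 25%.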

Joint probability

Let's add two green marbles to each bowl in our example above and build a new tree.

Image2: Probability distribution tree for red, blue, green marbles and A,B,C bowls

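Now if I ask you for the probability of picking a red and a green marble from bowl A, all we have to do is multiply P(Rm|Ba)*P(Gm|Ba), treating the two picks as independent (for example, putting the first marble back before the second pick). This is how we calculate joint probability.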

Let's solve a problem to make our understanding concrete.

From our bowl and marble setup I have picked one green marble and one red marble. What is the probability that they came from bowl A?

As we can see, we have two random variables for every possible bowl, so it's a joint probability problem. Our evidence is a red and a green marble, unlike the previous case which was only a red marble, so P(e) = P(e=Rm,Gm) and my hypothesis P(H) is P(H=Ba). Here the prior P(H) doesn't change, as we haven't changed our bowl arrangement, but our evidence for a given hypothesis changes, as we have added two extra (green) marbles to all bowls. Here both our marginal and our likelihood need to consider both the red and the green marble.

So P(e|H) = P(X=Rm,Gm|Y=Ba) = P(Rm|Ba)*P(Gm|Ba), and the numerator of Bayes' rule is P(H)*P(e|H) = P(Ba) * P(Rm|Ba)*P(Gm|Ba).

Our marginal will be the sum, over all three bowls, of the probabilities of getting a red and a green marble.

We can write this as a short equation. Before we write it, let's point out the difference between these symbols: sigma (Σ) and the integral (∫) stand for summation in the discrete and continuous cases, while the pi symbol (Π) represents a product of n terms.

$$P(e) = \sum_{i=1}^{n} P(B_i) \prod_{j=1}^{2} P(m_j \mid B_i)$$

Here i indexes from 1 to n=3 representing the bowls B_i, and j indexes from 1 to 2 representing the red and green marbles m_j. We already have our P(H), i.e. 0.4, since the bowl configuration didn't change. So we can now answer the question P(H=Ba|e=Rm,Gm).

In fact you can solve it just by looking at the tree, without writing it in equation format. But writing equations condenses the written material, and it also feels fancy to write equations!! :)
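For anyone who wants the actual number, here is a sketch of the red-and-green problem in Python. It rests on two assumptions of mine: each bowl received two green marbles (so the bowls hold 8, 5 and 11 marbles), and the two picks are independent, as if the first marble is put back before the second pick.

```python
from fractions import Fraction

# Prior P(H): the bowl probabilities, unchanged from before.
prior = {"Ba": Fraction(4, 10), "Bb": Fraction(3, 10), "Bc": Fraction(3, 10)}

# Marble counts after adding two green marbles to every bowl (assumed reading).
bowls = {
    "Ba": {"Bm": 4, "Rm": 2, "Gm": 2},
    "Bb": {"Bm": 2, "Rm": 1, "Gm": 2},
    "Bc": {"Bm": 4, "Rm": 5, "Gm": 2},
}

def likelihood(bowl, evidence):
    """P(e|H): product of per-marble probabilities, assuming independent picks."""
    total = sum(bowls[bowl].values())
    result = Fraction(1)
    for marble in evidence:
        result *= Fraction(bowls[bowl][marble], total)
    return result

evidence = ("Rm", "Gm")

# Numerator of Bayes' rule for every bowl: P(H) * P(e|H).
numerators = {b: prior[b] * likelihood(b, evidence) for b in bowls}

# Marginal P(e): the sum over bowls of the product over marbles.
marginal = sum(numerators.values())

p_a = numerators["Ba"] / marginal
print(p_a, float(p_a))
```

Under these assumptions the posterior for bowl A is 3025/8929, or about 0.34, so seeing a red and a green marble leaves bowl A roughly as believable as seeing a red marble alone.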

Expectation.

Let's continue with the same example and play a game. This time let's take the original tree with only red and blue marbles. If you look at the tree, each line segment has some probability value. In other words, probabilities are distributed across the tree, covering various scenarios. When we say expectation, we are asking: what is the expected value for a given distribution?

In this game, if you randomly pick a red marble you get 2 points, and if you pick a blue marble you get 3 points. This mapping from picking a particular marble to getting points can be written as a function f(x), i.e. f(red) is 2 and f(blue) is 3. The expectation is the number of points we get on average per pick, taken over all possible scenarios of winning (whether it's red or blue), with each outcome weighted by its probability.

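We know how the red and blue marble probabilities are distributed from the tree in image 1, and we know f(red) and f(blue).

$$\mathbb{E}_{x \sim P}[f(x)] = \sum_{x} P(x)\, f(x) \qquad \mathbb{E}_{x \sim p}[f(x)] = \int p(x)\, f(x)\, dx$$

Expectation of the function value for a random variable distributed according to P, in the discrete and continuous cases.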

x~P means that our expectation value depends on the distribution, i.e. if we mess with the bowls and change the bowl probabilities, the expected number of points changes.
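Here is a small sketch of the expected points for this game, using the bowl probabilities 0.4, 0.3, 0.3 and the red and blue counts from the original tree. The dictionary names are mine.

```python
from fractions import Fraction

# Probability of picking each bowl.
p_bowl = {"Ba": Fraction(4, 10), "Bb": Fraction(3, 10), "Bc": Fraction(3, 10)}

# Probability of each color given the bowl (original red and blue marbles only).
p_color_given_bowl = {
    "Ba": {"red": Fraction(2, 6), "blue": Fraction(4, 6)},
    "Bb": {"red": Fraction(1, 3), "blue": Fraction(2, 3)},
    "Bc": {"red": Fraction(5, 9), "blue": Fraction(4, 9)},
}

# f(x): points awarded for each color.
points = {"red": 2, "blue": 3}

# P(x): overall probability of each color across all bowls.
p_color = {
    c: sum(p_bowl[b] * p_color_given_bowl[b][c] for b in p_bowl)
    for c in points
}

# Expectation: sum over x of P(x) * f(x).
expected_points = sum(p_color[c] * points[c] for c in points)
print(p_color, expected_points)
```

With P(red) = 2/5 and P(blue) = 3/5 the expectation is 2*(2/5) + 3*(3/5) = 13/5, i.e. 2.6 points per pick on average. Changing the values in `p_bowl` changes this number, which is what x~P is telling us.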

Probability distribution

The example we took was a discrete case, i.e. the ball can be red, green or blue; our random variable can take 3 discrete values. We can use probability in continuous cases too. To give you an intuition for the continuous case, where our random variable can take any value in a range, let us consider a dart board.

Now imagine this dart board is placed on flat ground and your task is to drop a ball on the exact center (bull's eye). If you are doing this task in a vacuum, there is a very high chance your ball will land on the center. But if you do this task in a windy environment, the chance of your ball landing on the bull's eye is very low. Let's quantify the results of your task by measuring the distance from the center to where your ball lands.

Let the radius of the board be 10 cm, and let us score your performance by this function: Score = 10 - distance from center. If your ball lands exactly on the center then you get 10 points, since the distance from the center is 0, and if your ball lands on the edge or outside the board then you get 0 points.

Now if I ask what is the probability that your score is exactly 10 (hitting the bull's eye), i.e. P(x=10), it is very, very low, because we need to hit the exact center. But if I allow a range around the center, say within 0.5 of it, then scores like 9.8 or 9.9 count as well, and that is within our reach. We can see that our random variable can take continuous values. For every value xi that x can take, P(x=xi) gives us a probability value. In the discrete case, i.e. the colored balls case, we have discrete values. In the colored ball example

P(x=red ball) is 0.5 i.e 5/10

P(x=blue ball) is 0.2 i.e 2/10

P(x=green ball) is 0.3 i.e 3/10

where 10 is total number of balls.

Notice how sum of all probabilities adds up to 1.

0.5+0.3+0.2 = 1. If you think about it, whichever ball you pick must be red, green or blue, so the probability of picking a ball of some color must be 1.

Now let's plot a graph for this, where the y axis holds the probability values and the x axis holds the discrete states.

In the case of the dart board we can plot a graph too, where the x axis holds the random variable (X = distance from center) and the y axis holds the probability of landing at that distance. In the real world, let's say we perform the dart board experiment on a medium windy day where the wind can blow from any direction.

Just like in the discrete case, where all probabilities added up to 1, here

$$\int_{-10}^{10} P(x)\, dx = 1$$

the area under the curve sums to 1.

Here P(x) describes how likely our ball is to land in the region around x. Since our dart board has a radius of 10 cm, we integrate from -10 to 10. The negative sign is a convention to indicate points on the left half of the board.
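To get a feel for the area under the curve, here is a rough numerical sketch. The bell-shaped curve and its spread `SIGMA` are purely my assumptions for illustration and are not taken from the plot; the point is only that the area over the board is 1, a range has some probability, and a single exact point has none.

```python
import math

SIGMA = 2.0    # assumed spread of landing positions in cm (hypothetical)
RADIUS = 10.0  # radius of the dart board in cm

def bell(x):
    """Unnormalized bell curve centered on the bull's eye."""
    return math.exp(-0.5 * (x / SIGMA) ** 2)

def integrate(f, a, b, steps=10_000):
    """Trapezoid rule for the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h

# Scale the curve so that the area over the whole board is 1.
norm = integrate(bell, -RADIUS, RADIUS)

def density(x):
    """Density scaled so the area from -10 to 10 is 1."""
    return bell(x) / norm

area = integrate(density, -RADIUS, RADIUS)   # whole board
near_center = integrate(density, -0.5, 0.5)  # within 0.5 cm of the center
exact_point = integrate(density, 0.0, 0.0)   # a range of zero width

print(round(area, 6), round(near_center, 3), exact_point)
```

With this assumed curve, roughly a fifth of the drops land within 0.5 cm of the center, while the probability of one exact point is 0 because its range has no width.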

The plot says that if you were to keep dropping the ball on the dart board, 40% of the time it lands in the center, and the rest of the time it falls around it due to the slight wind.

Probability distribution functions

Now that we have an intuition for what probability plots look like, we can move on to probability distribution functions, i.e. can we write a function which produces those curves? In the continuous case we can plot a continuous curve, but in the discrete case we plot a scatter plot.

In the continuous case we call the function that describes the probability distribution a “probability density function” or PDF, and in the discrete case we call it a “probability mass function” or PMF.

P(X=xi) means P is a probability distribution function distributed over X, where X can take any value xi. In the case of the dart board, P is a probability distribution function over the range -10 to +10, where xi can be any fractional number between -10 and +10.

Alright, this introduction should be good enough for us to start. In the next part we will dive deep into probability distributions and their applications.

Written by Naveen Mysore