Probability: (Almost) Everything You Need To Know

Intuition for common probability concepts in a continued example style

Lance Galletti
17 min read · Oct 7, 2021

Introduction

While statistics reasons about a population, knowing characteristics of a sample,

probability reasons about a sample, knowing characteristics of a population.

The makeup of a population is best described by what is called a Distribution which, at a high level, attempts to characterize the probabilities associated with all the possible subsets of the population. The context for which subsets are and aren’t relevant to the problem is captured by what is called a Random Variable.

For example, our population could be a group of people and a random variable might only be interested in the age of each person in the group. As such, the distribution of our random variable would be the list of all possible ages (1, 2, 3, …) along with the probability of a person being that age. This probability represents the proportion of people of that age we would expect to see were we to sample from our population a large number of times.

As the list of possible outcomes of our random variable grows, it becomes convenient to write the distribution as a mathematical function rather than a list.

Flipping a Coin

A coin flip has two possible outcomes: Heads or Tails. Formally, we use the Greek letter Omega (Ω) to refer to the sample space (all possible outcomes). With a coin that would be:

Ω = {Heads, Tails}

Although the number of possible outcomes here is very low, we would still like to formulate our distribution as a mathematical function (involving numbers). This is where our Random Variable comes in, translating abstract outcomes from Ω into numbers we can mathematically reason about (just like above from “person” to “age”).

In this case, we will define the Random Variable X by X(Heads) = 1 and X(Tails) = 0.

i.e. Heads translates to the number 1 and Tails to the number 0. Albeit a bit pedantic, this will help us in the long run. We can now define a distribution function f that gives us a probability for every possible value X can take (here 0 or 1). For a fair coin we would have the following distribution: f(1) = 1/2 and f(0) = 1/2.

More generally, we can capture any coin (fair or unfair) by its probability of Heads. If the probability of Heads is say p then the probability of Tails must be 1 − p. This describes the following distribution function:

f(1) = p and f(0) = 1 − p

Another way to write this is:

f(k) = p^k · (1 − p)^(1−k), for k in {0, 1}

We can represent this graphically as follows:

Formally, this is known as the Bernoulli Distribution.
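
Since every distribution that follows builds on this one, here is a minimal Python sketch of the Bernoulli PMF. The probability of Heads p = 0.6 is just an assumed value for illustration; the simulation shows the observed proportion of Heads approaching p.

```python
import random

def bernoulli_pmf(k, p):
    """P(X = k) for a coin with probability of Heads p; k is 1 (Heads) or 0 (Tails)."""
    return p if k == 1 else 1 - p

p = 0.6  # assumed probability of Heads, for illustration only
flips = [1 if random.random() < p else 0 for _ in range(100_000)]

print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))  # 0.6 0.4
print(sum(flips) / len(flips))                   # ~0.6, the long-run proportion of Heads
```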

Repeated Coin Flips

Now suppose we flip the coin multiple times and ask: what is the probability that it takes 5 coin flips to see the first Heads? How about 7, or 10, or any arbitrary number r of flips?

Let’s call X₁ the Random Variable describing the first coin flip, X₂ the Random Variable describing the second coin flip… etc. Let’s assume that these coin flips are made with the same coin and hence every flip has the same (identical) distribution described above for X.

Let’s further assume that X₁, X₂, … are independent, meaning the result of one flip will not affect the result of the next. Mathematically this means that the probability of X₁ seeing a particular outcome AND X₂ seeing a particular outcome will be the product of these probabilities. That is:

P(X₁ = a, X₂ = b) = P(X₁ = a) · P(X₂ = b)

Notice that asking for the first Heads to occur on the 5th coin flip is the same as asking for the first 4 coin flips to be all Tails and the 5th to be Heads. So we need to compute:

P(X₁ = 0, X₂ = 0, X₃ = 0, X₄ = 0, X₅ = 1)

Given the independence assumption above, this will be the product of these probabilities:

P(X₁ = 0) · P(X₂ = 0) · P(X₃ = 0) · P(X₄ = 0) · P(X₅ = 1)

And, given all these flips have the same distribution (identically distributed), we know the probability of Heads p is the same for each coin flip. So the probability that it takes 5 coin flips to see the first occurrence of Heads is:

(1 − p) · (1 − p) · (1 − p) · (1 − p) · p = (1 − p)⁴ · p

In the general case then, when we ask what the probability is that it takes r coin flips to see the first Heads, we need the first r − 1 flips to be Tails and the last one Heads:

(1 − p)^(r−1) · p

So the distribution function for the number r of flips it will take to get the first Heads is:

f(r) = (1 − p)^(r−1) · p, for r = 1, 2, 3, …

This is called the Geometric Distribution and has the following graphical representation:
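
As a quick numerical check, here is a sketch (assuming a fair coin, p = 0.5) that computes the probability of the first Heads landing on the 5th flip, both from the formula above and with scipy.stats.geom, which uses the same "number of flips until the first success" convention:

```python
from scipy.stats import geom

p, r = 0.5, 5
manual = (1 - p) ** (r - 1) * p   # (1 - p)^(r - 1) * p
library = geom.pmf(r, p)          # SciPy's Geometric PMF, support r = 1, 2, 3, ...

print(manual, library)            # both 0.03125
```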

Now suppose we only flip the coin 5 times and ask: what is the probability that the coin falls on Heads exactly twice?

The question is challenging not only because we have to distinguish the probability of getting 2 Heads from that of getting 0, 1, 3, 4, or 5, but also because we need to account for all the ways in which 2 Heads can appear out of 5 coin flips. Notice that we can have:

HHTTT or HTHTT or HTTHT

Recall from Combinatorics there are 5C2 (5 choose 2) ways to choose an unordered subset of 2 elements from a fixed set of 5 elements. Notice that the probability of each of these 5C2 ways is the same because of multiplication commutativity:

P(HHTTT) = p · p · (1 − p) · (1 − p) · (1 − p) = p²(1 − p)³

which is the same as

P(HTHTT) = p · (1 − p) · p · (1 − p) · (1 − p) = p²(1 − p)³

and so on …

So what is the probability that any of these combinations occur? Well, either HHTTT OR HTHTT OR …etc. occurs. Since all these events are mutually exclusive, the probability of their union is just the sum of their probabilities:

P(HHTTT) + P(HTHTT) + …

Since this is a sum of 5C2 identical terms, it is equal to:

5C2 · p²(1 − p)³

We can follow the same reasoning for the general case where we ask for the distribution of the number k of Heads in n coin flips, which gives the following Binomial Distribution:

f(k) = nCk · p^k · (1 − p)^(n−k), for k = 0, 1, …, n

It has the following graphical representation:

Notice that, by the Binomial Theorem, summing this over all possible values of k gives:

Σ (k = 0 to n) nCk · p^k · (1 − p)^(n−k) = (p + (1 − p))^n = 1

So this gives us a proper probability.
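
Here is a sketch (again assuming a fair coin) computing the probability of exactly 2 Heads in 5 flips and confirming that the PMF sums to 1 over k = 0, …, n:

```python
from math import comb
from scipy.stats import binom

n, k, p = 5, 2, 0.5
manual = comb(n, k) * p**k * (1 - p)**(n - k)         # 5C2 * p^2 * (1 - p)^3
library = binom.pmf(k, n, p)

print(manual, library)                                 # both 0.3125
print(sum(binom.pmf(j, n, p) for j in range(n + 1)))   # 1.0: a proper probability
```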

Over a Long Time

Now suppose we flip the coin over a long period of time and notice that on average it lands on Heads once per minute. What is the probability of the coin landing on Heads 3 times in any given minute?

Notice that if we knew exactly how many coin flips occur in any given minute then we could just use the Binomial Distribution above. In this case however we only know that on average the number of Heads per minute is 1 (let’s call this rate λ). Some minutes there could be 5 occurrences of Heads, some minutes there could be none.

What we haven’t specified is how often we flip this coin. Notice that flipping the coin many times with a low probability of Heads could lead to the same number of occurrences of Heads per unit of time (i.e. the same rate) as flipping the coin a few times with a high probability of Heads.

Surely we can’t physically flip a coin more than once per second. Since we only know the rate λ (here 1 per minute), we can assume the coin is flipped once every second, with a probability of Heads that works out to 1 occurrence per minute on average. Over a 1-second interval the rate is λ/60 = 1/60 per second (since there are 60s in 1min). So for the single flip made in any 1-second interval, we know its distribution: Bernoulli with p = λ/60 = 1/60.

So we can break up our 1min interval into 60 1-second intervals and ask our question again: what is the probability that 3 of these 1-second intervals result in Heads?

We know how to do this! It’s the Binomial Distribution with n = 60, k = 3 and p = λ/n = 1/60.

Well, not quite… Our assumption was that we can’t make more than one coin flip per second. And although that may be true in the physical world, this is not a good enough assumption for the theoretical math world.

In this theoretical world there is no limit to how close two coin flips can be, shy of being simultaneous. To capture this we need to further break up our 1 minute interval to the theoretical limit of infinitesimally small sub-intervals. That is, we want to take the limit as n (the number of sub-intervals) goes to infinity of the Binomial Distribution:

lim (n → ∞) nCk · (λ/n)^k · (1 − λ/n)^(n−k)

We won’t go through the derivation here but the Poisson Limit Theorem tells us that this converges to the following:

f(k) = λ^k · e^(−λ) / k!

This is called the Poisson Distribution and has the following graphical representation:

As you would expect, the likeliest number of Heads is around λ. Using this distribution we can compute the probability that the coin lands on Heads 3 times in any given minute. Recall that λ is 1 per minute:

P(X = 3) = 1³ · e^(−1) / 3! ≈ .061

Now suppose we want the number of occurrences of Heads in a different unit of time. For example, what is the probability that the coin lands on Heads 65 times in an hour? Or twice per day?

The Poisson Distribution assumes a number of occurrences λ per given unit of time. In order to convert to a new unit of time, we need to multiply λ by the number of old units contained in the new unit. For example to convert a λ of 1 per minute to hours you need to multiply by 60 since there are 60min in an hour.

Being able to move between units lets us answer questions like: what is the probability that there are no occurrences of Heads for 5 minutes? We can use the same time conversion trick to move from a λ of 1 per min to a λ of 5 per 5min, which then allows us to compute the probability that no Heads occur in that unit of time (5min):

P(X = 0) = 5⁰ · e^(−5) / 0! = e^(−5) ≈ .0067
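
Both answers can be checked with a short sketch using scipy.stats.poisson, with the rates used above (λ = 1 per minute, and λ = 5 per 5-minute unit after the conversion):

```python
from math import exp, factorial
from scipy.stats import poisson

lam = 1                                   # 1 occurrence of Heads per minute on average
p3 = lam**3 * exp(-lam) / factorial(3)    # λ^k e^(-λ) / k! with k = 3
print(p3, poisson.pmf(3, lam))            # both ~0.0613

lam_5min = 5 * lam                        # converted rate: 5 per 5-minute unit
print(poisson.pmf(0, lam_5min))           # e^(-5) ~ 0.0067: no Heads for 5 minutes
```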

Time in Between

Now suppose we continue with the same setup as above but ask: what is the probability that the time in between two occurrences of Heads is 2min?

There is a continuum of possible values for the time it could take between two occurrences of Heads. Exactly 2min is an infinitesimally small part of that continuum. Asking for its probability is like asking for the length of a point: clearly a point has no length, yet an infinite number of points can make up a line which itself has length. Strangely, 2min has a probability of 0, but it’s not impossible for the interval to be exactly 2min…

When it comes to a continuum of values it makes more sense to ask about the probability that the time T between consecutive Heads falls in some range. It’s further helpful to re-frame the question as the time T we need to wait for the next occurrence of Heads.

We can then ask: what is the probability that we need to wait more than t units of time for the next occurrence of Heads? Notice that this is equivalent to asking: what is the probability that there have been no occurrences of Heads for t units of time? Using our time conversion trick from above, we convert λ into λt and get the probability that there are no occurrences of Heads in that stretch of time:

P(T > t) = (λt)⁰ · e^(−λt) / 0! = e^(−λt)

In order for P to be a proper probability, notice that we must have:

P(T ≤ t) = 1 − P(T > t) = 1 − e^(−λt)

We call F(t) = P(T ≤ t) the Cumulative Distribution Function (CDF). CDFs are used to characterize distributions of continuous Random Variables (like time in this case). In contrast, distributions of discrete Random Variables (which we had seen so far) are characterized by Probability Mass Functions.

The CDF above is that of the Exponential Distribution. It gives us the probability of waiting at most t for the next occurrence of Heads.

Because the CDF is always increasing toward 1, it can become difficult to tell which regions of the continuum contribute more or less to this accumulation. We may want to know, for every t, how much instantaneous change to the CDF it provides.

Taking the derivative (representing instantaneous change) of the CDF, we can define the Probability Density Function (PDF):

f(t) = dF/dt = λ · e^(−λt)

From the PDF we can get back the CDF through integration:

F(t) = ∫₀ᵗ f(s) ds

In fact we can also get the probability of falling in an interval (t₁, t₂) by integrating the PDF from t₁ to t₂:

P(t₁ < T ≤ t₂) = ∫ f(t) dt (from t₁ to t₂) = F(t₂) − F(t₁)

So the area under the curve of the PDF over a region is the probability of falling in that region. This gives us a way to get the probability of falling in any interval of time.
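
As a sketch of this relationship (with the assumed rate λ = 1 per minute), integrating the Exponential PDF over the interval (1, 2) gives the same answer as the difference of CDF values:

```python
from math import exp
from scipy.integrate import quad

lam = 1
pdf = lambda t: lam * exp(-lam * t)   # f(t) = λ e^(-λt)
cdf = lambda t: 1 - exp(-lam * t)     # F(t) = 1 - e^(-λt)

t1, t2 = 1, 2
area, _ = quad(pdf, t1, t2)           # area under the PDF from t1 to t2
print(area, cdf(t2) - cdf(t1))        # both ~0.2325
```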

Now suppose we would like to know: if we’ve already waited a certain amount of time, should we expect to wait less? That is, if we’ve already waited for 1min, does that mean an occurrence of Heads is more likely in the next 10s?

Notice that P(T ≤ s + t | T > s) gives us the conditional probability of waiting at most t additional time given we have already waited s time. Which we can compute as follows:

P(T ≤ s + t | T > s) = P(T ≤ s + t, T > s) / P(T > s)

If the time is less than or equal to s + t and greater than s (as specified in the numerator), that’s just the interval (s, s + t]. So the numerator becomes:

P(s < T ≤ s + t)

From above, we know that is equal to F(s + t) − F(s), so we can simplify:

P(T ≤ s + t | T > s) = (F(s + t) − F(s)) / (1 − F(s))

We can now plug in the CDF of the Exponential Distribution:

(F(s + t) − F(s)) / (1 − F(s)) = (e^(−λs) − e^(−λ(s+t))) / e^(−λs) = 1 − e^(−λt) = P(T ≤ t)

Meaning the probability of waiting t units of time for the next occurrence is the same regardless of whether we’ve already waited a certain amount of time.

Recall above we assumed all the coin flips were independent, which means the outcome of previous coin flips does not affect the outcome of future coin flips. Even though we may have waited 1min, the probability that the next occurrence will happen in the next 10s is the same as if we had not waited at all. This property is known as the memoryless property.

This property affects not only the probability but also the expected waiting time, because the average remaining wait does not depend on when you started waiting.

Re-using our example above, if we expect 1 occurrence of Heads every minute it would make sense for us to expect to wait 1 minute in between occurrences. So if our rate of occurrence of Heads is λ, on average we expect to wait 1/λ for the next occurrence of Heads. And, because of the memoryless property, the average wait time doesn’t change if I’ve already waited 30s — I should still expect to wait 1min from then on.
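
Here is a small sketch (λ = 1 per minute, with s = 1 minute already waited and t = 10 more seconds) checking both claims numerically: the conditional probability equals the unconditional one, and the average simulated wait is 1/λ:

```python
from math import exp
import random

lam, s, t = 1.0, 1.0, 10 / 60          # already waited 1 min; 10 more seconds
F = lambda x: 1 - exp(-lam * x)        # Exponential CDF

conditional = (F(s + t) - F(s)) / (1 - F(s))
print(conditional, F(t))               # equal: the memoryless property

waits = [random.expovariate(lam) for _ in range(100_000)]
print(sum(waits) / len(waits))         # ~1.0 = 1/λ, the expected waiting time
```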

Sample Average Waiting Time

Now suppose we start recording waiting times between occurrences of Heads. Let’s call w₁ the first wait time we observe, w₂ the second wait time, etc. Here we are particularly interested in the average observed waiting time:

(w₁ + w₂ + … + wₙ) / n

Every time we record a new waiting time we update our average observed waiting time. Re-using our expected wait time of 1 minute from above, if we’ve observed 10 occurrences of Heads (or 10 waiting times): what is the probability that the average observed waiting time is greater than 2min?

At first glance this may seem like an impossibly difficult question to answer. Notice that this is not the same as asking that all our sample waiting times be more than 2min — we can get an average of over 2min without all samples being over 2min.

Let’s look at the simple case where we only have two sample waiting times W₁ and W₂, and let’s ignore the division for now. What is the distribution of the sum of the two waiting times, P(W₁ + W₂ ≤ t)?

Notice that if we know that the waiting time W₁ was observed to have a value of w₁, then we just need to compute P(W₂ ≤ t − w₁) so that W₁ + W₂ ≤ t. For each of the possible values w₁ of W₁, we want to know P(W₂ ≤ t − w₁ | W₁ = w₁). Taking the “weighted sum” of all the P(W₂ ≤ t − w₁ | W₁ = w₁), weighted by the probability of w₁, over all possible values w₁ will give us P(W₁ + W₂ ≤ t).

Weighted sum is in quotes because we are operating on continuous random variables. We would actually be taking the integral over all possible values w₁, and the weights would not be the probability of w₁ (since this is just 0 for continuous random variables) but the value of the PDF at w₁:

P(W₁ + W₂ ≤ t) = ∫₀ᵗ P(W₂ ≤ t − w₁ | W₁ = w₁) · fW₁(w₁) dw₁

where fW₁(w₁) is the PDF of W₁ (weighing P(W₂ ≤ t − w₁ | W₁ = w₁) for each w₁). Recall that waiting times are iid, meaning:

P(W₂ ≤ t − w₁ | W₁ = w₁) = P(W₂ ≤ t − w₁) = F(t − w₁), and fW₁ = fW₂ = f

So we can simplify the above equation:

P(W₁ + W₂ ≤ t) = ∫₀ᵗ F(t − w₁) · f(w₁) dw₁

So we now have a CDF for W₁ + W₂ and can get the PDF by taking its derivative.
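
Here is a sketch verifying that convolution step numerically for λ = 1 and t = 3: the integral of F(t − w₁) · f(w₁) matches the CDF of the sum, which SciPy exposes as a Gamma distribution with shape 2 (the Erlang Distribution introduced below):

```python
from math import exp
from scipy.integrate import quad
from scipy.stats import gamma

lam, t = 1.0, 3.0
f = lambda w: lam * exp(-lam * w)   # Exponential PDF
F = lambda x: 1 - exp(-lam * x)     # Exponential CDF

# P(W1 + W2 <= t) computed as the convolution integral from the text
conv, _ = quad(lambda w: F(t - w) * f(w), 0, t)
print(conv, gamma.cdf(t, a=2, scale=1 / lam))   # both ~0.8009
```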

We can use Z to denote the Random Variable W₁ + W₂. If we now have a third variable in our sum called W₃ and want to know the CDF of W₁ + W₂ + W₃, we can apply the same trick above to P(Z + W₃ ≤ t).

We can keep applying this trick in order to get the distribution of the sum of n iid Exponential Random Variables W₁ + W₂ + … + Wₙ:

f(t) = λ^n · t^(n−1) · e^(−λt) / (n − 1)!

which is the PDF of the Erlang Distribution. Going back to our original question: what is the probability that the average observed waiting time for a sample of 10 waiting times is greater than 2min? Given a rate λ of 1 occurrence of Heads every minute, we can compute:

P(average > 2) = P(W₁ + W₂ + … + W₁₀ > 20) = ∫ f(t) dt (from 20 to ∞) ≈ .005

Which should align with our intuition that it is unlikely to see an average greater than 2min. We can intuit that the average observed waiting time should more likely be around 1min.
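
That number can be reproduced with a short sketch: the sum of 10 iid Exponential waiting times with rate λ = 1 follows the Erlang (Gamma) distribution with shape 10, and an average above 2 minutes is the same event as a sum above 20 minutes:

```python
from scipy.stats import gamma

lam, n = 1.0, 10
# average of 10 waiting times > 2 min  <=>  their sum > 20 min
prob = gamma.sf(20, a=n, scale=1 / lam)   # survival function: P(W1 + ... + W10 > 20)
print(prob)                               # ~0.005
```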

Now suppose we ask: how big would an interval centered around 1min need to be in order for 99.5% of all samples of size 10 to yield an observed average in that range?

We want 99.5% of all samples of size 10 to yield an average in the range [1-x, 1+x].

This means for any given sample of size 10, the probability that its observed average falls in [1 − x, 1 + x] should be at least .995. So we need to solve the following equation for x:

P(1 − x ≤ (W₁ + W₂ + … + W₁₀) / 10 ≤ 1 + x) = .995

which is a very difficult equation to solve…

Luckily, from above we’ve already found a suitable value for x.

Indeed, when x = 1 (and since waiting times can’t be negative, the average is always at least 0):

P(0 ≤ (W₁ + W₂ + … + W₁₀) / 10 ≤ 2) = 1 − P(W₁ + W₂ + … + W₁₀ > 20) ≈ 1 − .005 = .995
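
A sketch of that check, using the same Erlang (Gamma) distribution: an average in [0, 2] is the same event as a sum in [0, 20], and its probability comes out to roughly .995:

```python
from scipy.stats import gamma

lam, n = 1.0, 10
# average in [0, 2]  <=>  sum of the 10 waiting times in [0, 20]
prob = gamma.cdf(20, a=n, scale=1 / lam) - gamma.cdf(0, a=n, scale=1 / lam)
print(prob)   # ~0.995
```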

Probabilities of Proportions

Now suppose we have a sample of 100 waiting times, where, as above, our expected wait time is 1min. We ask: what is the probability that the proportion of waiting times above 1min is greater than 50%?

The expected proportion of waiting times below 1min can be computed from the CDF:

P(T ≤ 1) = F(1) = 1 − e^(−1·1) = 1 − e^(−1)

Which evaluates to ~.632. So the expected proportion of waiting times above 1min is ~.368.

A proportion is a continuous random variable, so the distribution we’re looking for is given by a PDF or a CDF. What might the PDF look like at 50%?

The success rate we are asking for is 50% (or .5). But from above we know to expect a success rate of .368 and a failure rate of .632. For a sample of size 100, we then expect 36.8 successes and 63.2 failures.

So our PDF at 50% should be proportional to:

.5^36.8 · (1 − .5)^63.2

Look familiar?

Recall that the Binomial Distribution gives us the probability of a certain number of successes and failures given the proportion of successes we expect. Here the roles are flipped: we know the expected number of successes and failures, and we want the probability (density) of each possible proportion of successes.

So our PDF should look something like:

f(x) = C · x^36.8 · (1 − x)^63.2

The domain of f is [0,1] since we are asking about proportions. So the constant factor C we need in order for the PDF to integrate to 1 over [0,1] must be:

C = 1 / ∫₀¹ x^36.8 · (1 − x)^63.2 dx = 1 / B(37.8, 64.2), where B is the Beta function

So we can rewrite f as:

f(x) = x^36.8 · (1 − x)^63.2 / B(37.8, 64.2)

which is the PDF of the Beta Distribution.

Integrating f over [.5,1] we see that the probability that the proportion of waiting times above 1min is greater than 50% for our sample of size 100 is ~.0038.
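
For completeness, a sketch computing this tail probability with scipy.stats.beta, using the shape parameters 36.8 + 1 and 63.2 + 1 implied by the normalizing constant above:

```python
from scipy.stats import beta

a, b = 36.8 + 1, 63.2 + 1   # Beta shape parameters implied by x^36.8 (1 - x)^63.2
prob = beta.sf(0.5, a, b)   # P(proportion of waits above 1 min > 50%)
print(prob)                 # small: roughly the ~.0038 reported above
```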

In practice though we don’t know how the sample was generated. Statistics attempts to uncover characteristics of the unknown process that generated the sample we have access to — usually trying to estimate the distribution that is likely to have generated the sample we observed. This will be the topic of an upcoming post — stay tuned!
