Graphically understanding Bayes’ Theorem

Massimo Pierini
10 min read · Oct 17, 2022


Bayes’ Theorem is one of the most important theorems in probability theory. It states that

P(Hᵢ|E) = P(E|Hᵢ)P(Hᵢ) / Σⱼ P(E|Hⱼ)P(Hⱼ)

Bayes’ Theorem (general form, for a partition H₁…Hₙ of the event space and an event E).

But what does it mean? Can we graphically represent it?

To graphically understand Bayes’ Theorem, let’s analyze and solve a problem. The problem is:

“Let C₁ (coin one) and C₂ (coin two) be two different coins. We need to choose one of the two coins. The probability of choosing coin one is 10%. The chance of getting a head is 95% for coin one and 30% for coin two. If we got a head, what is the probability that C₁ was chosen?”

First of all, let us note that the probability of choosing coin two, P(C₂), is the complement to 1 of P(C₁), i.e. P(C₂) = 1 − P(C₁) = 1 − 0.1 = 0.9 = 90%, and that the two events are incompatible. So, they form a partition of the event space Ω.

Wait, what…!? The complement to 1? Incompatible? Partition? Event space Ω? What are these things?

Let’s start from the beginning. The event space Ω is the collection of all possible outcomes of a trial. The event space Ω of a coin toss is {head, tail}, the event space Ω of a 6-sided die roll is {1, 2, 3, 4, 5, 6}, and so on. From the 2nd Kolmogorov axiom, we know that P(Ω) = 1. This means that i) if we take an action, it is certain that something is going to happen, and ii) the sum of the probabilities of all possible outcomes must be equal to 1. We can initially visualize the event space Ω as an empty square.

The event space Ω
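For instance, here is a minimal Python sketch (my own illustration, not part of the original problem) that encodes these two event spaces and checks that the outcome probabilities sum to 1:

```python
# Event spaces as dictionaries mapping each outcome to its probability
coin_toss = {"head": 0.5, "tail": 0.5}            # a fair coin
die_roll = {face: 1 / 6 for face in range(1, 7)}  # a fair 6-sided die

# 2nd Kolmogorov axiom: P(Ω) = 1, i.e. the probabilities of all
# possible outcomes sum to 1
for omega in (coin_toss, die_roll):
    assert abs(sum(omega.values()) - 1) < 1e-12
```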

Two incompatible events, A and B, are defined as events whose intersection is empty, A⋂B = ∅ or P(A⋂B) = 0. They cannot happen together: if one happens, the other cannot. For example, “being a minor” and “having a driving license”: you can be a minor OR have a driving license, but these two events cannot happen together. An example of non-incompatible events is “being born in Italy” and “living in France”: these two events can also happen together; they are not mutually exclusive. We can visualize incompatible events as disjoint sets in the event space, and non-incompatible events as overlapping sets in the event space.

Example of incompatible and compatible events in the event space Ω

A partition H of the event space Ω is defined as a collection of n incompatible events whose union is the event space itself or, in formula, H = {H₁, …, Hₙ} with ⋃ᵢ Hᵢ = Ω and P(Hᵢ⋂Hⱼ) = 0 for i, j ∈ {1…n} and i ≠ j (so that Σᵢ P(Hᵢ) = P(Ω) = 1).
Even if it seems complicated, it is indeed very simple. We get a partition of the event space Ω if we chop it into many “subspaces”: each subspace has no intersection with the others (they are incompatible) and, if we collect all of them together, we get back the entire event space.

Example of partition of the event space Ω
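A partition is also easy to check in code. A small, illustrative Python sketch (the particular partition chosen here is just an example of mine) that verifies both defining properties on a die roll:

```python
# Event space of a 6-sided die roll
omega = {1, 2, 3, 4, 5, 6}

# A candidate partition: "low", "middle" and "high" outcomes
partition = [{1, 2}, {3, 4}, {5, 6}]

# i) the subspaces are pairwise incompatible (empty intersections)
for i in range(len(partition)):
    for j in range(i + 1, len(partition)):
        assert partition[i] & partition[j] == set()

# ii) their union gives back the entire event space Ω
assert set().union(*partition) == omega
```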

So… saying that C₁ and C₂ are a partition of Ω means that they are incompatible (indeed, we cannot choose both coins, but only one) and that their union is the entire event space (indeed, there are no other choices in our trial). How can we then visualize these two events in the event space Ω? Simply by drawing a vertical dashed line that splits the square into two parts, where one is 10% of the area and the other is the remaining 90%.

Partition of the event space Ω into two parts: C₁ and C₂ respectively to the left and to the right of the vertical dashed line.

Now, let us better understand what it means that “the chance of getting a head is 95% for coin one and 30% for coin two”. Let us first focus on coin one. The problem is saying that, if we choose coin one, then the probability of a head is 95%. So, choosing coin one is our condition and getting a head is our event. This is a conditional probability; we’ll write it as P(H|C₁) = 0.95 and read it as “the probability of getting a head, given coin one, is 95%”. How can we visualize this conditional probability in the event space Ω? Well, we have already split Ω into two parts: C₁ and C₂. We then only need to split the small C₁ rectangle into two parts: head H (the green one) and tail T (the red one), where H is 95% of the C₁ rectangle…! Yes, because, in the C₁ subspace, head and tail are a partition: they are mutually exclusive (incompatible) and their union is the entire subspace C₁. Complementarily, we can say that the probability of getting a tail given coin one is P(T|C₁) = 1 − P(H|C₁) = 1 − 0.95 = 0.05 = 5%, i.e. P(T|C₁) is the complement to 1 of P(H|C₁).

Partition of the subspace C₁ into the two conditional probabilities P(H|C₁) and its complement to 1 P(T|C₁)

We can do exactly the same for P(H|C₂) = 0.3 = 30% and P(T|C₂) = 1 − P(H|C₂) = 1 − 0.3 = 0.7 = 70% and split the big C₂ rectangle into two parts, where the green H one is 30% of the C₂ rectangle and the red T one is the remaining 70%.

Partition of the subspace C₂ into the two conditional probabilities P(H|C₂) and P(T|C₂). The event space Ω is now completely parted.
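At this point we can already encode the conditional splits. A tiny illustrative Python sketch (the variable names are mine) checking that, inside each subspace, head and tail are complementary:

```python
# Conditional probabilities of a head inside each coin's subspace
p_head_given = {"C1": 0.95, "C2": 0.30}

# Tail is the complement to 1 of head within each subspace
p_tail_given = {coin: round(1 - p, 10) for coin, p in p_head_given.items()}
print(p_tail_given)  # {'C1': 0.05, 'C2': 0.7}

# Inside each subspace, head and tail form a partition
for coin in p_head_given:
    assert abs(p_head_given[coin] + p_tail_given[coin] - 1) < 1e-9
```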

Good! Now we have all we need to solve the problem, but we first need to understand what the problem is asking. The problem asks “If we got a head, what is the probability that C₁ was chosen?”. What does this mean? Well, it is now clear that this is a conditional probability: given that someone already chose a coin, flipped it and got a head, what is the probability that the chosen coin was C₁? In formula, the problem is asking for P(C₁|H), which we read as “the probability of C₁ given H”.
How can we do this calculation? First of all, we need to identify our condition. The condition is “someone got a head”. What does this correspond to in our event space Ω? It is very simple: the entire green area!
So, the green area is our condition. Which part of it is the event we’re looking for? The part that “belongs” to coin one, i.e. the part to the left of the vertical dashed line.
So what is the problem graphically asking? It is asking: considering the green area only, what percentage of it lies to the left of the vertical dashed line?

The probability of a head, P(H), as the entire green area: the two green rectangles built from the conditional probabilities P(H|C₁) and P(H|C₂).

To compute this percentage (which is actually a probability) we first need to compute the percentages of the green subspaces, P(H|C₁) and P(H|C₂), with respect to the entire event space Ω. Yes, because we only know the percentages with respect to the single subspaces C₁ and C₂, not with respect to the entire space. In fact, if we sum up all four conditional probabilities, we get 0.95 + 0.05 + 0.3 + 0.7 = 2, which is not 1: these numbers cannot be the areas of the four rectangles relative to Ω, because the 2nd Kolmogorov axiom requires those areas to sum to P(Ω) = 1.
First, let’s compute the probability of “getting a head AND choosing coin one”, P(H⋂C₁). How can we do this? We know that P(C₁) = 0.1 and P(H|C₁) = 0.95, so we can simply multiply them together. Why? Let’s take an example: I have $100, I give 10% to John, and John gives 95% of what he receives to Jack; how much does Jack get? Giving 10% to John means that John gets $100 · 0.1 = $10. If John gives Jack 95% of what he got, John gives $10 · 0.95 = $9.5 to Jack. Putting it all together, Jack gets $100 · (0.1 · 0.95) = $9.5.
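The chain of percentages works exactly like this money example. Here is the same arithmetic as a tiny Python sketch (illustrative only):

```python
total = 100            # I have $100
john = total * 0.10    # I give 10% to John: $10
jack = john * 0.95     # John gives 95% of his share to Jack: $9.5
print(jack)                   # 9.5
print(total * (0.10 * 0.95))  # 9.5 (one multiplication, same result)
```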
So, the probability P(H⋂C₁) is simply equal to 0.1 · 0.95 = 0.095 = 9.5%. This means that the small green rectangle to the left of the vertical dashed line is 9.5% of the entire square (the event space)!
Similarly, we can now compute P(H⋂C₂). It is simply 0.9 · 0.3 = 0.27 = 27%, which means that the green rectangle to the right of the vertical dashed line is 27% of the entire square (the event space).

Thus, what is the overall probability of getting a head? It is simply the sum P(H) = P(H⋂C₁) + P(H⋂C₂) = 0.095 + 0.27 = 0.365 = 36.5%. This means that the green area is 36.5% of the entire square. Why can we simply sum them up? Because of the 3rd Kolmogorov axiom, which states that if A and B are incompatible, P(A⋂B) = 0, then P(A⋃B) = P(A) + P(B); that is, if two events are incompatible, then the probability of their union is the sum of their probabilities. When two or more events are not incompatible, we need to use the Inclusion-Exclusion Theorem, about which you can read in this story.

How do we finally get the requested probability P(C₁|H)? Well, we know that P(H⋂C₁) = P(C₁)P(H|C₁), but similarly we can also say that P(H⋂C₁) = P(H)P(C₁|H), from which we get that the requested probability is P(C₁|H) = P(H⋂C₁)/P(H) = 0.095 / 0.365 ≈ 0.26 = 26%. And this is exactly what we were looking for: the probability of having chosen coin one, given that we got a head, is approximately 26%. Graphically, this simply means that the green area to the left of the vertical dashed line is approximately 26% of the entire green area…!
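Here is the whole computation collected in a short Python sketch (the variable names are mine); it reproduces, step by step, what we just did graphically:

```python
# Priors: the probability of choosing each coin (a partition of Ω)
p_c1 = 0.10
p_c2 = 1 - p_c1                   # 0.90, the complement to 1 of P(C1)

# Conditional probabilities of a head, given each coin
p_h_given_c1 = 0.95
p_h_given_c2 = 0.30

# Joint probabilities: the areas of the two green rectangles w.r.t. Ω
p_h_and_c1 = p_c1 * p_h_given_c1  # 0.095
p_h_and_c2 = p_c2 * p_h_given_c2  # 0.27

# Law of Total Probability: the entire green area
p_h = p_h_and_c1 + p_h_and_c2     # 0.365

# Bayes' Theorem: the share of the green area left of the dashed line
p_c1_given_h = p_h_and_c1 / p_h
print(round(p_c1_given_h, 2))     # 0.26
```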

Now… let us “rename” our events. Let’s say that C₁ is “having a certain disease” D₁, and C₂ is “not having that disease” D₀. Say then that H (getting a head) is “getting a positive diagnostic test for the disease” R₁ and T (getting a tail) is “getting a negative diagnostic test for the disease” R₀ and let us substitute everything in our problem:

P(D₁) = 10% ⇒ P(D₀) = 90%
P(R₁|D₁) = 95% ⇒ P(R₀|D₁) = 5%
P(R₁|D₀) = 30% ⇒ P(R₀|D₀) = 70%

The same problem with the probability of having a disease and the result of a diagnostic test.

What is P(D₁)? It is the prior probability of having the disease; it could be the Prevalence of the disease in a specific population, for example, the prior probability of having COVID-19 in a population of young adults with cough and temperature higher than 37.5°C (note: this is only an example and not the actual probability in real life).
P(R₁|D₁), in biostatistics, is called the “Sensitivity” of the test: the probability of getting a positive test result if one actually has the disease.
P(R₀|D₀) = 70%, the complement to 1 of P(R₁|D₀), is called, in biostatistics, the “Specificity” of the test: the probability of getting a negative test result if one does not actually have the disease.
So the problem would ask: given a prior disease probability, and a test with a known Sensitivity and Specificity, what is the probability of having the disease if the test result is positive?
And the answer is, as we saw, P(D₁|R₁) ≈ 26%.
Thus, we could say that this is not exactly an optimal test to diagnose the disease, because the patient has only about a 26% probability of actually having the disease if the test result is positive.

But… what would happen if the test is negative? That is, what is the probability of not having the disease if the test is negative P(D₀|R₀)?

Visualization of the posterior probability of not having the disease given a negative test result.

To compute it, we need to do the same as we did before. First, let us compute P(D₀⋂R₀) = P(D₀)P(R₀|D₀) = (1 − Prevalence) · Specificity = 0.9 · 0.7 = 0.63, which is the percentage of the red area to the right of the vertical dashed line with respect to the entire square (the event space Ω). Then, P(D₁⋂R₀) = P(D₁)P(R₀|D₁) = Prevalence · (1 − Sensitivity) = 0.1 · 0.05 = 0.005 = 0.5%, which is the percentage of the red area to the left of the vertical dashed line with respect to the entire event space. And finally we can get the desired probability P(D₀|R₀) = P(D₀⋂R₀) / [P(D₀⋂R₀) + P(D₁⋂R₀)] = 0.63 / [0.63 + 0.005] ≈ 0.99 = 99%, which is the percentage of the red area to the right of the vertical dashed line with respect to the entire red area, and is the posterior probability of not having the disease given a negative test result. Thus, even if this is not a good test to diagnose the disease when the result is positive (it should be repeated, but this would change both Sensitivity and Specificity; you can read about it in this Part II), we can say that it is a fairly good test to exclude the diagnosis when the result is negative.
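In biostatistics, these two posterior probabilities are known as the Positive and Negative Predictive Values (PPV and NPV). A small Python helper (my own sketch, with hypothetical names, not from the problem statement) computes both from Prevalence, Sensitivity and Specificity:

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) = (P(D1|R1), P(D0|R0)) via Bayes' Theorem."""
    # Law of Total Probability: P(R1) over the partition {D1, D0}
    p_pos = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    p_neg = 1 - p_pos
    ppv = prevalence * sensitivity / p_pos        # P(D1|R1)
    npv = (1 - prevalence) * specificity / p_neg  # P(D0|R0)
    return ppv, npv

ppv, npv = predictive_values(prevalence=0.10, sensitivity=0.95,
                             specificity=0.70)
print(round(ppv, 2), round(npv, 2))  # 0.26 0.99
```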

To conclude, note that, for example, P(D₀⋂R₀) + P(D₁⋂R₀) is simply P(R₀). What did we do? We used the Law of Total Probability. It states that, if we have a partition H of the event space and an event E, then the probability P(E) of E is the sum of the probabilities of the intersections between E and the Hᵢ subspaces, that is P(E) = Σᵢ P(E⋂Hᵢ), with i = 1…n. Let’s visualize it, as usual.

The Law of Total Probability: an event E and a partition of the event space.

What does the Law of Total Probability say? It simply says that if we take each “portion” of the E circle resulting from the intersection between E and each Hᵢ subspace, and we sum them all up… we get E itself.
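In code, the Law of Total Probability is just a sum over the partition. A generic, illustrative Python sketch (hypothetical names), where the partition is described by the priors P(Hᵢ) and the conditionals P(E|Hᵢ):

```python
def total_probability(priors, conditionals):
    """P(E) = Σᵢ P(Hᵢ)·P(E|Hᵢ) for a partition H₁…Hₙ of Ω."""
    assert abs(sum(priors) - 1) < 1e-12, "the priors must sum to 1"
    return sum(p * c for p, c in zip(priors, conditionals))

# The coin problem again: P(H) = 0.1·0.95 + 0.9·0.30
print(round(total_probability([0.10, 0.90], [0.95, 0.30]), 3))  # 0.365
```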

But, from the Definition of Conditional Probability, we know that P(A⋂B) = P(A)P(B|A) = P(B)P(A|B). Taking the last two terms and dividing both sides by P(B), we get

P(A|B) = P(A)P(B|A) / P(B)

Bayes’ Theorem for two events A and B.

Applying the Law of Total Probability to P(B), we get P(B) = P(A⋂B) + P(A̅⋂B) which, thanks to the Definition of Conditional Probability, becomes P(B) = P(A)P(B|A) + P(A̅)P(B|A̅), and so we get

P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(A̅)P(B|A̅)]

Bayes’ Theorem for two events A and B, expanded with the Law of Total Probability and the Definition of Conditional Probability.

Finally, generalizing everything we said to a partition of the event space into n incompatible subspaces, we get the general formulation of Bayes’ Theorem:

P(Hᵢ|E) = P(E|Hᵢ)P(Hᵢ) / Σⱼ P(E|Hⱼ)P(Hⱼ), with j = 1…n
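This general formula translates directly into a few lines of code. A minimal illustrative Python sketch (my naming), which normalizes the joint probabilities P(E|Hᵢ)P(Hᵢ) by their sum:

```python
def bayes_posterior(priors, likelihoods):
    """P(Hᵢ|E) = P(E|Hᵢ)P(Hᵢ) / Σⱼ P(E|Hⱼ)P(Hⱼ), for every i."""
    joints = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joints)  # P(E), by the Law of Total Probability
    return [j / evidence for j in joints]

# The coin problem: posterior over {C1, C2} after observing a head
posterior = bayes_posterior([0.10, 0.90], [0.95, 0.30])
print([round(p, 2) for p in posterior])  # [0.26, 0.74]
```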

You can play around with the graphical representation of Bayes’ Theorem for a diagnostic test in this DESMOS graph.
