Bayes’ Theorem Unbound
E.T. Jaynes’ reformulation of Bayes’ theorem is as beautiful as it is useful, and it gives deep insight into the impact of evidence on the probability of just about everything.
Bayes’ theorem tells us how to update probabilities in the light of new data. When we’re assessing the probability of an event with a binary outcome, something that either happens or doesn’t, there is a particularly elegant formulation due to the high priest of Bayesianism, E.T. Jaynes, that richly deserves a much wider audience than it has today. It has consequences for how we look at success and failure in all branches of life.
Vanilla Bayes’ Theorem
(The traditional practical example for introducing Bayes’ theorem is a test for a medical condition, but I figure we’re all pretty tired of talking about that kind of test, so here’s another example drawn from the annals of lockdown.)
What is the probability my son Theo is currently playing Minecraft on the internet?
It’s 1600. He plays for an hour between 1500, when he finishes home schooling, and 1900, when he comes down for dinner, but exactly when is pretty random. In the absence of further information, we’ll say the probability is 25%.
Now I notice that our internet connection has ground to a halt. I know this because my other son Carl is trying to watch YouTube and he’s yelling at his brother to stop hogging the internet connection. But what is the probability Theo is actually playing Minecraft online, given that Carl’s connection has crashed?
My hypothesis is that Theo is playing Minecraft. My datum is that Carl is complaining the internet connection has crashed. Bayes’ theorem is usually written like this:

P(H|D) = P(D|H) P(H) / P(D)    (1)
H is hypothesis, D datum, P probability, and the little vertical line means that the datum after the line is to be accounted for when we’re trying to calculate the probability distribution of the hypothesis before the line.
So Bayes tells us
- The probability that the hypothesis is true (i.e. that Theo is playing Minecraft) given that we have observed the datum (the internet has crashed and Carl is yelling)
in terms of
- The probability that the internet has crashed given that Theo is playing Minecraft (the probability the datum is observed if the hypothesis is true). Notice the switch in order here; this is why it’s sometimes called Bayesian inversion.
- The probability the internet has crashed (the probability of observing the datum — this is usually a right pain, but we’re going to get rid of it shortly)
- The probability that Theo is playing Minecraft in the blissful case when we are utterly ignorant as to whether or not the internet has crashed. This is sometimes called the prior; it’s the probability without the information we’re trying to account for.
The genius of Bayes’ theorem is the inversion that gives the probability of hypothesis given datum, which we want to find, in terms of the probability of datum given hypothesis, which we can usually measure by doing experiments or find by processing historical data.
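To make this concrete, here is the vanilla form in a few lines of Python. The likelihoods are invented for illustration: suppose the connection almost always crashes when Theo is playing online (90%) and only rarely otherwise (10%).

```python
# Vanilla Bayes: P(H|D) = P(D|H) * P(H) / P(D)
# The two likelihoods below are invented numbers, for illustration only.
p_h = 0.25             # prior: Theo is playing Minecraft
p_d_given_h = 0.9      # assumed P(crash | playing)
p_d_given_not_h = 0.1  # assumed P(crash | not playing)

# The "right pain": total probability of the datum, summed over both hypotheses
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(playing | crash) = {p_h_given_d:.3f}")  # 0.750
```

With these made-up numbers, the crash lifts the probability that Theo is playing from 25% to 75%.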
We’re used to seeing probabilities as numbers between 0 and 1 (or 0% and 100%), but when there are only two outcomes, we can also use odds: the ratio of the probability of one outcome to that of the other.
Before Carl started shouting, the probability Theo was playing Minecraft, P(H), was 25%. The probability he wasn’t playing Minecraft, which I’ll write P(~H), was 75%. The odds Theo was playing Minecraft were 1:3 (or three to one against), which we can also just write as odds = 1/3. He was three times as likely not to have been playing as to have been playing.
P(~H|D) = P(D|~H) P(~H) / P(D)    (2)

This is Bayes’ theorem for Theo not playing Minecraft. We’ll use it to work out how the odds of the hypothesis that Theo is playing Minecraft change in the light of the datum that the internet has crashed.
The odds of Theo playing Minecraft given crashing are the probability of Theo playing Minecraft given crashing divided by the probability of Theo not playing Minecraft given crashing. We can find these odds just by dividing equation 1 by equation 2.
When we do this, jolly mathematical fortuities begin to emerge. The troublesome probability of the internet crashing cancels, and the blissful-state-of-ignorance prior probabilities also combine into an odds. We get the following:

odds(H|D) = [P(D|H) / P(D|~H)] × odds(H)    (3)
This is rather fine. It says that the odds of Theo playing Minecraft given Carl shouting about the internet connection increase by a factor that is exactly the ratio of the probability the connection crashes when he is playing to the probability the connection crashes when he isn’t playing. If the datum is more likely when the hypothesis is true than when it isn’t, the odds increase (the hypothesis becomes more likely), and vice versa.
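Here is the odds form as a minimal Python sketch, again with invented likelihoods for illustration (say the connection crashes 90% of the time when Theo is playing and 10% of the time when he isn’t). Notice that the troublesome P(D) never appears.

```python
# Odds form of Bayes: posterior odds = likelihood ratio * prior odds
# The 0.9 and 0.1 likelihoods are invented numbers, for illustration only.
prior_odds = 0.25 / 0.75      # 1:3 that Theo is playing
likelihood_ratio = 0.9 / 0.1  # P(crash | playing) / P(crash | not playing)

posterior_odds = likelihood_ratio * prior_odds   # 3.0: now 3:1 in favour
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_prob)  # 0.75 -- no P(D) required
```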
I promised deep insight into the impact of evidence on probability in all branches of life; well, here it is. What this equation tells you is that to understand how data change the odds of the truth of a hypothesis, you have to know both how likely you are to see that datum when the hypothesis is true and how likely you are to see it when it’s false.
Hypotheses can also be about things that are going to happen (or not); we’d usually call these outcomes.
So when successful people tell you that the secret to their success is, for example, meditating in the morning, this tells you nothing by itself. You have to look at unsuccessful people and see if they’ve also been meditating in the morning. The hypothesized outcome here is that you will be a success, and morning meditation is being offered as a datum that supports that outcome. But if the incidence of meditation among unsuccessful people is similar to that among successful people, then meditating doesn’t change your odds. (We’re not even getting started on the fact that even if the odds do change, that in no way supports the inference that the datum is causally related to the hypothesis.)
If a management consultant tells you that you have to implement such-and-such a governance system or organizational principle because many of the most successful companies in your industry have done just that, ask them about the unsuccessful companies.
A much more serious example is the space shuttle Challenger, which exploded when O-ring seals on a solid rocket booster failed due to the extreme cold on the morning of the launch. When the decision to launch was taken, it was noted that failures had occurred across a range of temperatures. The hypothesized outcome here is that the seals will fail, but only data relating to cases of that outcome, the failures, were discussed. Had the launch team seen the data for the converse of that outcome, the successful missions, they would have seen that the O-rings only ever held when it was warm.
Without this information, looking only at failures, it looked as if temperature wasn’t a factor: the probability it was cold when a seal failed was similar to the probability it was warm when a seal failed. But looking at the cases where the seals held, it was clear that the probability it was cold in a success case was very small, so the odds of a failure when it’s cold are very high.
Apotheosis: Jaynes’ Final Formulation
So the odds form of Bayes’ equation is incredibly insightful (and easy to use), but Jaynes, eminent mathematician that he was, wasn’t entirely happy with odds as a scale. Outcomes that are more likely than not spread themselves out between 1 and infinity, but outcomes that are less likely than not are crammed in between 0 and 1.
Jaynes realized that the logarithm of the odds is much more elegant and symmetrical. The logarithm of odds stretches from negative infinity for something that’s never going to happen to positive infinity for a sure thing. Odds of 1, exactly as likely as not, have logarithm 0. No information.
If you know about logarithms, you’ll know that the equation above quickly gives us the following. I’ve followed Jaynes in the choice of logarithm base and the factor of 10, but they aren’t important; they just give handy numbers and the quiet nerdy satisfaction that the units of evidence are decibels.

10 log₁₀ odds(H|D) = 10 log₁₀ [P(D|H) / P(D|~H)] + 10 log₁₀ odds(H)    (4)
If we define what Jaynes calls the evidence for an event H

J(H) = 10 log₁₀ odds(H) = 10 log₁₀ [P(H) / P(~H)]    (5)

then we have

J(H|D) = 10 log₁₀ [P(D|H) / P(D|~H)] + J(H)    (6)
This is an absolute pearl of an equation. The impact of data on evidence is linear.
All we’ve done is to take that big mess of 10 log odds etc. in equation 4 and call it J (in honour of Jaynes). But this evidence J is just another way of writing a probability. If you give me a probability P(H), I can give you an evidence J(H) and vice versa; the relationship follows directly from the definition, J(H) = 10 log₁₀ [P(H) / (1 − P(H))], an S-shaped curve when evidence is plotted against probability.
What Jaynes’ equation says is that to update an evidence in the light of data, you just add the middle term in equation 6 to your starting evidence. If your datum is more likely in the case that your hypothesis is true than in the case that it is not, the evidence for your hypothesis goes up (the likelihood ratio is greater than 1, so its logarithm is positive). If your datum is less likely in the case that your hypothesis is true, the evidence goes down.
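The bookkeeping is easy to sketch in Python. The likelihood ratio of 9 below is an invented number (a 90% crash probability while playing against 10% otherwise); the 25% prior is the one from the Minecraft example.

```python
import math

def evidence(p):
    """Jaynes' evidence in decibels: J = 10*log10(p / (1 - p))."""
    return 10 * math.log10(p / (1 - p))

def probability(j):
    """Invert: recover a probability from an evidence in dB."""
    return 1 / (1 + 10 ** (-j / 10))

# Updating is just addition: add 10*log10 of the likelihood ratio.
j_prior = evidence(0.25)               # about -4.77 dB
j_update = 10 * math.log10(0.9 / 0.1)  # invented likelihood ratio: ~9.54 dB
j_post = j_prior + j_update            # about +4.77 dB
print(probability(j_post))             # back to a probability: 0.75
```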
If you start with a very low probability, the evidence has fallen off the bottom of the very steep part of the curve of evidence against probability. A strongly supportive datum (say 10 dB) will lift you up quite a bit, but because the curve is so steep there, it won’t move you very far in probability, i.e. horizontally.
This is because if the probability of a hypothesis is very low, it’s much more likely that any supporting evidence is a false positive. This is why, if the background incidence of an illness is low, a positive test has to be extremely reliable to move the probability of being sick: the true-positive probability P(D|H) must be very much higher than the false-positive probability P(D|~H). The evidence formulation automatically takes care of this.
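The effect is easy to demonstrate numerically. Here is a small sketch with an invented one-in-a-thousand prior: a datum worth a full +10 dB of evidence (a likelihood ratio of 10) barely moves the probability.

```python
import math

def evidence(p):
    return 10 * math.log10(p / (1 - p))

def probability(j):
    return 1 / (1 + 10 ** (-j / 10))

# An invented 1-in-1000 prior sits at about -30 dB.
j = evidence(0.001)
# A strongly supportive datum worth +10 dB (likelihood ratio 10)...
j_after = j + 10
# ...lifts the evidence a lot, but the probability hardly moves.
print(probability(j_after))  # about 0.0099 -- still roughly 1%
```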
Unfortunately there is no equivalent of the evidence formulation for events with several outcomes. (Jaynes spectacularly leaves this as an exercise for the reader.) But in the binary case, the evidence formulation is incredibly powerful, both for the clarity it provides with respect to the impact of data and the effects of background rates, and for the separation it provides between the characterization of tests and the probability of the things being tested.