Bayesian inference in 1760

R Andrew Cocks
Published in The Startup · Jul 4, 2019

Derivation of the Beta Distribution

The year is 1760 and Reverend Bayes observes a lottery drawn 11 times. There are a total of 10 blanks and one prize observed, a 10:1 blank:prize ratio. Intuitively the true rate is probably somewhere between 9:1 and 11:1. As a scholar of probability, Bayes wondered how accurate this intuition is. It can be quantified as a number: what is the chance that the true rate of the lottery falls between 9:1 and 11:1? Take a moment before reading on to take a guess.

Graphically we can represent this as a probability density function (PDF):

What’s the area under the curve?

Thought of a number? Most people intuitively feel a number between 60–90% is right, but the actual chance is only ~8%! This leads to two questions:

  1. How is this 8% derived?
  2. What difference would more draws make? Say 10 prizes and 100 blanks.

Let’s start with the first question.

Here’s what we know before the first lottery draw: nothing. The true rate lies somewhere between 0 and 1 on the horizontal axis but we have no information as to where so we evenly distribute the 100% chance such that the area under the “curve” is 1 thus creating a uniform distribution. This is the Prior before a single lottery draw has occurred.

Uniform prior, no knowledge

Bayes observed 1 prize and 10 blanks. We’ll assume the prize was the first observation so now let’s look at how to incorporate this new information.

The x axis is the true (unknown) rate of the lottery. The far left, x=0, suggests a true rate of 0% or a lottery with no prizes. But we've observed a prize, so we know that the probability of this being a no-prize lottery (x=0) is zero (y=0). If the true rate of a prize is 0.5 (on the x axis) then the probability of observing a prize is 0.5, and so on up to 1, creating a straight line:

1 prize

To update our knowledge to include this information and create the Posterior we need only multiply the Prior by the Likelihood.

Posterior = Likelihood × Prior

Since the Prior is uniform (a constant) the 1st Posterior is simply the first Likelihood scaled to a total area of 1. To perform the scaling the “curve” is divided by the area under the curve — the total probability under the prize curve. The first case is easy as it is a triangle. Later on the area under the curves can be approximated using the Riemann sum.
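This multiply-then-scale step can be sketched on a numeric grid (a minimal illustration; the grid size is my choice, not from the article):

```python
import numpy as np

x = np.linspace(0, 1, 1001)     # grid over the unknown true rate
prior = np.ones_like(x)         # uniform prior, total area = 1
likelihood = x                  # chance of seeing a prize if the rate is x
unscaled = prior * likelihood   # multiply pointwise
area = np.trapz(unscaled, x)    # area under the curve (the triangle, = 0.5)
posterior = unscaled / area     # scale so the posterior's area is 1
```

After scaling, the posterior is the straight line 2x, with area exactly 1.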

0 blanks, 1 prize

The next piece of information is a blank which is just a mirror of a prize.

First we multiply the Prior by the Likelihood. This means multiplying each point on the Prior by the corresponding point on the Likelihood.

Before scaling

Then we scale. The area under the curve is ⅓ so after scaling the y axis:

1 blank, 1 prize

The new Posterior has a maximum value at x=0.5, which is what we'd expect for the one blank and one prize observed so far. Also note that the all-prizes possibility (x=1) is eliminated by this new piece of information (a blank).

We continue adding the blanks through the same process and find the curve moves to the left.

2 blanks, 1 prize
5 blanks, 1 prize
10 blanks, 1 prize
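The whole sequence of updates above can be sketched as a loop (a sketch; the grid resolution is arbitrary):

```python
import numpy as np

x = np.linspace(0, 1, 2001)
posterior = np.ones_like(x)                  # start from the uniform prior

for obs in ["prize"] + ["blank"] * 10:       # Bayes' 11 draws
    likelihood = x if obs == "prize" else 1 - x
    posterior = posterior * likelihood       # multiply by the likelihood...
    posterior /= np.trapz(posterior, x)      # ...then rescale to area 1

peak = x[np.argmax(posterior)]               # close to 1/11, i.e. 10:1
```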

As we add in the 10th blank the final curve peaks exactly where we would expect, at 1/11 ≈ 0.09, i.e. 10:1. Now let's go back and find the answer to the original question: what is the chance that the true rate of the lottery falls between 9:1 and 11:1? In decimal that's the area under the curve between 0.083333 and 0.1.

Approximation using Riemann sum of two trapezoids under the curve

Sum the area of the green trapezoids
(1/120) × [(4.608 + 4.626)/2 + (4.626 + 4.603)/2] ≈ 7.7%

(each trapezoid has width (0.1 − 0.083333)/2 = 1/120)

We’re done! We have solved the original problem posed and we know that given observation of 10 blanks and 1 prize the chance that the true rate of the lottery is between 9:1 and 11:1 is approximately 7.7%.
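The 7.7% can be double-checked by building the same curve numerically and summing trapezoids over a much finer grid (a sketch, not the article's code):

```python
import numpy as np

x = np.linspace(0, 1, 100_001)
post = np.ones_like(x)
for like in [x] + [1 - x] * 10:            # 1 prize then 10 blanks
    post = post * like
post /= np.trapz(post, x)                  # normalize to area 1

inside = (x >= 1 / 12) & (x <= 1 / 10)     # between 11:1 and 9:1
prob = np.trapz(post[inside], x[inside])   # ≈ 0.077
```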

Ref: An Essay towards solving a Problem in the Doctrine of Chances, p. 19

Integrating functions

Or are we done? The above is a numerical approximation and is a lot of work. But the steps involved are simple: the entire process is repeatedly multiplying by one of two linear functions (the Bernoulli distribution), either:

Prize = x
or
Blank = 1−x

Then normalizing by the area under the curve to ensure an area of 1 in the Posterior. So for 1 prize and 3 blanks:

Prize × Blank^3 × normalizing constant
x × (1−x)^3 × normalizing constant

and we can see the general form of the equation is:

x^(count Prizes) × (1-x)^(count Blanks) × normalizing constant

To get the area under the curve we can integrate between 0 and 1. Let’s try that for 1 prize and 0 through 4 blanks.

Area under the curve for 1 prize and 0,1,2,3,4 blanks

This gives us the area under each curve. It also gives us the value of the probability density function at every point, the last one here is:

unscaled PDF = 5! / 3! × x × (1-x)⁴

Dividing by this area of ⁴⁄₆ (i.e. multiplying by 6/4) gives the Posterior:

PDF = 6! / 4! × x × (1-x)⁴

The factorial pattern (3!/1!, 4!/2!, 5!/3!, 6!/4!, …) of the normalizing constants can be extrapolated to the original situation of 1 prize and 10 blanks:

12! / 10! × x × (1-x)¹⁰

As a sanity test we can put in x=0.1, which we previously found was 4.603:

12 × 11 × 0.1 × (0.9)¹⁰ = 4.6025554
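That arithmetic is a one-liner to confirm:

```python
def pdf(x):
    # extrapolated posterior PDF for 1 prize, 10 blanks: 12!/10! = 132
    return 12 * 11 * x * (1 - x) ** 10

value = pdf(0.1)   # 4.6025554...
```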

As expected, approximately the same as the numerical value.

But the objective wasn't to find the peak Posterior, it was to calculate the exact value of the area under the curve between 0.083333 and 0.1. To do that we can integrate the PDF to yield the cumulative distribution function (CDF):

CDF = 12 × 11 × (x²/2 − 10x³/3 + 45x⁴/4 − 24x⁵ + 35x⁶ − 36x⁷ + 105x⁸/4 − 40x⁹/3 + 9x¹⁰/2 − 10x¹¹/11 + x¹²/12)

Simply plugging the numbers into the CDF at each relevant point:

CDF(0.1) − CDF(0.083333) = 0.341 − 0.264 = 0.077 or 7.7%
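The polynomial CDF is easy to evaluate directly (coefficients exactly as derived above):

```python
def cdf(x):
    # integral of 132 * t * (1 - t)**10 from 0 to x
    return 12 * 11 * (x**2/2 - 10*x**3/3 + 45*x**4/4 - 24*x**5 + 35*x**6
                      - 36*x**7 + 105*x**8/4 - 40*x**9/3 + 9*x**10/2
                      - 10*x**11/11 + x**12/12)

answer = cdf(0.1) - cdf(1 / 12)   # ≈ 0.077
```

As a sanity check, cdf(1.0) comes out to exactly 1 since the PDF was normalized.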

We’ve gotten the same result as the numerical method using algebra.

Deriving the Beta distribution

In the previous section we’ve derived a simple function to define the probability density:

x^(count Prizes) × (1-x)^(count Blanks) × normalizing constant

and we know the normalizing constant is the area under the curve. For the simple case of a single prize we found the pattern of this constant is:

(count prizes + count blanks + 1)!
──────────────────────────────────
          (count blanks)!

The case where prizes is not equal to 1 is:

 (count prizes + count blanks + 1)!
────────────────────────────────────
(count blanks)! × (count prizes)!

which can also be determined through integration (not shown).
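A quick numerical spot-check of that constant (a sketch; the counts 3 and 5 are arbitrary picks, not from the article):

```python
from math import factorial

def claimed(p, b):
    # the normalizing constant from the pattern above
    return factorial(p + b + 1) // (factorial(p) * factorial(b))

def area(p, b, n=100_000):
    # Riemann sum of x**p * (1 - x)**b over [0, 1]
    return sum((i / n) ** p * (1 - i / n) ** b for i in range(n + 1)) / n

p, b = 3, 5
# 1/area(p, b) should match claimed(p, b) = 9!/(3!·5!) = 504
```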

These functions comprise the beta distribution which takes two parameters α and β which are defined as:

α = number of prizes + 1
β = number of blanks + 1

Giving the completed beta distribution, defined in terms of the Gamma (Γ) function, where Γ(n) = (n−1)! for positive integers:

f(x; α, β) = Γ(α + β) / (Γ(α) × Γ(β)) × x^(α−1) × (1−x)^(β−1)

Beta distribution

In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli distribution.
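Conjugacy is what makes the update so cheap: a beta prior plus Bernoulli observations yields a beta posterior with the counts simply added in. A sketch, assuming scipy is available:

```python
from scipy.stats import beta

# A uniform prior is Beta(1, 1); after observing p prizes and b blanks
# the posterior is simply Beta(1 + p, 1 + b) -- no integration needed.
p, b = 1, 10
posterior = beta(1 + p, 1 + b)

density_at_01 = posterior.pdf(0.1)                      # ≈ 4.6026, matching 132·x·(1−x)¹⁰
answer = posterior.cdf(1 / 10) - posterior.cdf(1 / 12)  # ≈ 0.077
```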

Wrapping up

What about question 2:

What difference would more trials make? Say 10 prizes and 100 blanks.

The number of terms in the series becomes so numerous that it would require immense labour to obtain the answer, but now that we've discovered the beta distribution we can simply use functions built into Python for the larger tests:

Trying various sized lotteries 1:10, 10:100, 100:1000, 1000:10000 prize:blank while remembering from earlier:

α = number of prizes + 1
β = number of blanks + 1

I have rendered α and β as a and b in the Python code.

Beta distributions using python’s scipy.stats
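The original code image isn't reproduced in this copy; a sketch of what it likely computed with scipy.stats.beta (lottery sizes and the 9:1 to 11:1 interval are from the text):

```python
from scipy.stats import beta

# P(true rate between 9:1 and 11:1) as the lottery grows
results = {}
for prizes, blanks in [(1, 10), (10, 100), (100, 1000), (1000, 10000)]:
    a, b = prizes + 1, blanks + 1      # α and β rendered as a and b
    dist = beta(a, b)
    results[(prizes, blanks)] = dist.cdf(1 / 10) - dist.cdf(1 / 12)

for key, prob in results.items():
    print(key, f"{prob:.1%}")
```

The probabilities climb with the amount of evidence, which is exactly the tightening described below.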

Thus we can see that with increasing evidence the PDF tightens up, showing the true rate of the lottery falling into increasingly narrow ranges. With 10 prizes and 100 blanks the true rate is 23.9% likely to fall between 9:1 and 11:1, and thus we have answered the second question.

Hope this has helped you to understand the relationship between Bayesian inference and the beta distribution.

Next: Naive Bayes and disease detection and Bayesian Ranking System
