Why science is beautifully human

And very frail

Dr. Marc Jacobs
Dec 6, 2021

For most, the concept of ‘probability’ is not strange. The concepts ‘distribution’, ‘p-value’, and ‘statistically significant’ may also evoke some memories. For some, those memories were likely less pleasant, as statistics is not easy and quite counterintuitive. Hence the world-famous statement: “There are lies, damned lies, and statistics.”

For several years now I have wondered aloud whether this statement would still be made if statistics and probability theory had been explained in a more natural way. A way in which we abandon the binary thinking that is inherent in statistical significance (p≤0.05) and focus much more on uncertainty. Especially on how to reduce uncertainty. After all, science is about increasing knowledge by decreasing uncertainty.

Although I have a PhD, it’s my kids that have shown me what true unbiased experimentation looks like. With each repetition of whatever it is that they are doing, their knowledge base visibly expands. Hence, knowledge is not solid, but rather fluid, and certainly not linear. This makes sense, as most, if not all, processes in the world are invisible.

This is one of the reasons why experiments are done in controlled environments (using old-fashioned Randomized Controlled Trials), which allow you to focus on a particular question but often translate very badly to the real world. Here, the strength of an RCT is also directly its main limitation: a measure of efficacy (the maximum theoretical effect of an intervention) is not the same as a measure of effectiveness (its effect under real-world implementation).

A nice example is the fuel consumption that car manufacturers state for a car under optimal conditions, which are never the same as in a real-world situation. That makes the catalogued fuel-consumption figure close to meaningless.

The most common solution for reducing uncertainty is repetition. I doubt that THIS is why kids love repetition, but I do know that science lives by the grace of repetition. However, repetition is not sexy, not in science at least. Nobody wants to do a PhD by replicating another PhD, despite its importance, if only to get an idea of the influence of invisible processes and to acknowledge their existence.

In the beginning I talked about the p-value and statistical significance. The latter is a binary truncation of a continuous value (the p-value), and both try to ascertain whether an observation is due to a ‘real effect’ or a product of ‘chance’ (to be a product of chance is complete nonsense, but we will venture there another day). Entire books have been devoted to the word ‘probability’, as well as to the words ‘p-value’ and ‘statistical significance’. The latter two belong to the Frequentist school of statistics, which is the method everyone is taught (in the Netherlands via Mathematics A). That is why it is often called ‘classical statistics’. The name is not only unfortunate but also misleading, because the method I am going to talk about now unites statistics with the way science is done, or at least should be done. The childish method: Bayes’ theorem.

Before I delve deeper into Bayes’ theorem, I want to talk about the Bayesian paradigm. After all, as the word paradigm already betrays, it is a way of looking at probabilities, the burden of proof, and uncertainty. Uncertainty in particular plays a key role in Bayesian doctrine. In contrast to the Frequentist method, the Bayesian method does not give a ‘yes’ or ‘no’ as an answer, but a whole range of answers. That range reflects uncertainty. The answer to the question of whether an intervention has an effect is therefore not ‘yes’ or ‘no’ in the Bayesian paradigm, but ‘it depends’. For most policymakers, that answer automatically creates an uneasy feeling.

The aim of science and experimentation is to obtain a result that confirms (or refutes, although we prefer not to see that) the hypothesis. I have just stated that this answer cannot be binary, nor is it conclusive. There are simply too many unknown, invisible factors to confirm a hypothesis in a single study. Randomization alone cannot solve this. Repetition is required, and it is therefore better to think in terms of probability. That probability is determined by the knowledge you already had (which prompted you to do the experiment) and the knowledge gained through the experiment itself. So with your own experiment you add a bit of knowledge by removing a bit of uncertainty.

Now that I’ve explained what the Bayesian paradigm looks like, let’s delve into Bayes’ theorem. The theorem shows the relationship between two conditional probabilities that are two sides of the same coin. It is named after the Reverend Thomas Bayes. Bayes’ theorem gives the probability of A after B has been observed. That probability is calculated from the probability of A, the probability of B, and the probability of B given A.

In formula form, Bayes’ theorem looks like this: P(A | B) = P(B | A) × P(A) / P(B). Its parts are the following (a small numerical sketch follows the list):

  1. P(A | B) is the conditional probability: the probability of A given that B is true. This is also known as the posterior probability.
  2. P(B | A) is also a conditional probability: the probability of B given that A is true. It is also called the likelihood of A given a fixed B, because P(B | A) = L(A | B).
  3. P(A) and P(B) are the probabilities of A and B on their own, without conditioning on each other; these are marginal probabilities. P(A) is called the prior probability.
  4. P(B) acts as the normalizing constant: it makes sure the posterior sums to one.
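
To make these four pieces concrete, here is a minimal numerical sketch in base R. The numbers are made up purely for illustration (think of a rare condition A and a positive test result B); nothing hinges on them except the mechanics of the theorem.

# Assumed, purely illustrative numbers
p_A      <- 0.01                            # P(A): prior probability of the condition
p_B_A    <- 0.90                            # P(B | A): probability of a positive test given the condition
p_B_notA <- 0.05                            # P(B | not A): probability of a positive test without it

# P(B): the normalizing probability, via the law of total probability
p_B <- p_B_A * p_A + p_B_notA * (1 - p_A)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_A_B <- p_B_A * p_A / p_B
p_A_B                                       # roughly 0.15

Despite a fairly accurate test, the posterior is only about 0.15, because the prior P(A) is small. The prior does real work, which is exactly the point of the paradigm described above.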


Let’s take an example. Suppose we want to estimate the proportion of water versus land on Earth, and let’s assume we can find that proportion by repeatedly tossing a globe that is small enough to throw. Every time it comes down on water we write down a W, and every time it comes down on land we write down an L. This example comes from the book “Statistical Rethinking” and is a nice departure from the worn-out heads/tails example. We assume that all of this happens in a vacuum. Let’s calculate the example using the statistical program R.

# Grid approximation: 20 candidate values for the proportion of water
p_grid     <- seq(from = 0, to = 1, length.out = 20)
# Prior: what I believed beforehand (2 water out of 9 tosses)
prior      <- dbinom(2, size = 9, prob = p_grid)
# Likelihood: what the data show (6 water out of 9 tosses)
likelihood <- dbinom(6, size = 9, prob = p_grid)
# Posterior: prior times likelihood, standardized so it sums to one
unstd.posterior <- likelihood * prior
posterior       <- unstd.posterior / sum(unstd.posterior)

plot(prior, ylim = c(0, 0.5), main = "Probability of water given Prior and Likelihood",
     ylab = "Posterior Probability", xlab = "Probability of Water")
lines(prior, col = "black")
points(likelihood, col = "blue")
lines(likelihood, col = "blue")
points(unstd.posterior, col = "darkgreen")
lines(unstd.posterior, col = "darkgreen")
points(posterior, col = "red")
lines(posterior, col = "red")
text(locator(4), labels = c("prior", "likelihood",
                            "unstandardized posterior", "posterior"))  # click four spots to place the labels

What I did above is the following: I toss the globe nine times. Beforehand, I assume that the proportion of water to land equals 2/9; this is my prior probability. When I observe the data, I see that the proportion of water to land in my experiment is 6/9; these are my experimental observations. I then calculate the conditional probability, given my conviction and the data I have observed. Plotted with ggplot2, the outcome looks as follows:

library(ggplot2)

bayesdata <- as.data.frame(cbind(p_grid,
                                 prior,
                                 likelihood,
                                 unstd.posterior,
                                 posterior))

ggplot(bayesdata, aes(x = p_grid)) +
  geom_area(aes(y = prior), fill = "black", alpha = 0.5) +
  geom_area(aes(y = likelihood), fill = "blue", alpha = 0.5) +
  geom_area(aes(y = unstd.posterior), fill = "darkgreen", alpha = 0.5) +
  geom_area(aes(y = posterior), fill = "red", alpha = 0.5) +
  ggtitle("Probability of water given Prior and Likelihood") +
  xlab("Probability of Water") +
  ylab("Posterior Probability") +
  theme_bw()

As you can see, the conditional probability depends on your prior belief and your observed data. Below are two examples in which I take the same likelihood but a different prior. They give completely different results.

library(rethinking)

# A flat prior: every proportion of water is equally plausible beforehand
x <- seq(0, 1, 0.01)
plot(dunif(x, 0, 1))

globe.qa <- quap(
  alist(
    W ~ dbinom(W + L, p),   # binomial likelihood
    p ~ dunif(0, 1)         # uniform prior [0,1]
  ),
  data = list(W = 6, L = 3))
precis(globe.qa)

mean sd 5.5% 94.5%
p 0.67 0.16 0.42 0.92

# An informative prior: beta(2,7), as if 2 W and 7 L had already been seen
x <- seq(0, 1, 0.01)
plot(dbeta(x, 2, 7))

globe.qa2 <- quap(
  alist(
    W ~ dbinom(W + L, p),   # binomial likelihood
    p ~ dbeta(2, 7)         # beta prior [2W & 7L]
  ),
  data = list(W = 6, L = 3))
precis(globe.qa2)

mean sd 5.5% 94.5%
p 0.44 0.12 0.24 0.64

We can extend the above to make it visual. For the two examples it looks like this:

# Uniform prior
x <- seq(0, 1, 0.01)
plot(dunif(x, 0, 1))

globe.qa <- quap(
  alist(
    W ~ dbinom(W + L, p),   # binomial likelihood
    p ~ dunif(0, 1)         # uniform prior [0,1] --> I do not care
  ),
  data = list(W = 6, L = 3))
precis(globe.qa)

prior      <- extract.prior(globe.qa, n = 1e4)
likelihood <- rbinom(1000, 9, 6/9)   # simulate data with a 6/9 proportion of water
post       <- extract.samples(globe.qa, n = 1e4)

par(mfrow = c(1, 3), oma = c(0, 0, 2, 0))
hist(prior$p, main = "Distribution of Prior")
hist(likelihood, main = "Data")
hist(post$p, main = "Distribution of Posterior")
mtext("Posterior Probability of finding water based on a uniform prior and 6/9 data", outer = TRUE, cex = 1)
# Beta(2,7) prior
x <- seq(0, 1, 0.01)
plot(dbeta(x, 2, 7))

globe.qa2 <- quap(
  alist(
    W ~ dbinom(W + L, p),   # binomial likelihood
    p ~ dbeta(2, 7)         # beta prior [2W & 7L]
  ),
  data = list(W = 6, L = 3))
precis(globe.qa2)

prior      <- extract.prior(globe.qa2, n = 1e4)
likelihood <- rbinom(1000, 9, 6/9)
post       <- extract.samples(globe.qa2, n = 1e4)

par(mfrow = c(1, 3), oma = c(0, 0, 2, 0))
hist(prior$p, main = "Distribution of Prior")
hist(likelihood, main = "Data")
hist(post$p, main = "Distribution of Posterior")
mtext("Posterior Probability of finding water based on a 2/9 prior and 6/9 data", outer = TRUE, cex = 1)

The above example is something that many people have trouble with, because you can use the prior as a self-fulfilling prophecy. It is therefore important to base the prior on knowledge obtained from good research. Furthermore, you can weaken a prior by not hanging too much weight on it. You do this by spreading out its distribution: the more values the prior considers plausible, the less any single value stands out. It all depends on how much you value uncertainty.
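
As a minimal sketch of what ‘spreading the distribution’ means, compare two beta priors that encode the same prior guess (about 2/9 water) but with very different weight. The specific numbers are assumptions chosen only for illustration, and the update uses the conjugate beta-binomial shortcut rather than quap.

x <- seq(0, 1, 0.01)

# Same prior mean (about 0.22), very different spread
strong_prior <- dbeta(x, 20, 70)   # narrow: as if 20 W and 70 L had already been seen
weak_prior   <- dbeta(x, 2, 7)     # wide: as if only 2 W and 7 L had been seen

# Conjugate update with the same data (6 water, 3 land)
strong_post <- dbeta(x, 20 + 6, 70 + 3)
weak_post   <- dbeta(x, 2 + 6, 7 + 3)

par(mfrow = c(1, 2))
plot(x, strong_prior, type = "l", ylim = c(0, 10), main = "Strong prior", xlab = "Probability of Water", ylab = "Density")
lines(x, strong_post, col = "red")
plot(x, weak_prior, type = "l", ylim = c(0, 4), main = "Weak prior", xlab = "Probability of Water", ylab = "Density")
lines(x, weak_post, col = "red")

With the wide prior, the posterior (in red) is pulled almost entirely toward the data; with the narrow prior, it barely moves.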

With this I would like to emphasize once again how important it is to recognize the subjectivity of science. Apart from the fact that subjectivity is an inherent part of the scientific process, and that ignoring prior information (when it is explicitly available from previous research) would be downright unscientific, the main point to be made here is that this choice is not arbitrary.

By looking at statistics in this way, you can connect statistics much better with science. It provides a natural and principled way to combine prior information with data, within a solid theoretical framework for decisions. Information obtained in the past (by others) can thus be incorporated into a parameter and form a distribution for future analysis. When new observations become available, the previous posterior distribution can be used as a prior. All inferences follow logically from Bayes’ theorem.

Let’s continue the above example and use the result (the posterior) as the new prior. For the sake of convenience, let’s assume that the data shows exactly the same result (unbelievable!). Where do we end up then? The new prior looks like this:

par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
prior <- rbinom(10000, 9, globe.qa2@coef[[1]])   # posterior mean of p: 0.4375002
hist(prior, main = "Histogram of binomial(4, 9)")
abline(v = (0.4375002 * 9), col = "red", lty = 2, lwd = 1)
hist(rbeta(100000, 4, 5), main = "Histogram of beta(4, 5)")
abline(v = (0.4375002), col = "red", lty = 2, lwd = 1)
mtext("New priors based on old posterior as simulated via the binomial or beta distribution", outer = TRUE, cex = 1)

Let’s extend the example and keep calculating.

globe.qa3 <- quap(
  alist(
    W ~ dbinom(W + L, p),   # binomial likelihood
    p ~ dbeta(4, 5)         # beta prior based on the old posterior
  ),
  data = list(W = 6, L = 3))
precis(globe.qa3)

mean sd 5.5% 94.5%
p 0.56 0.12 0.36 0.76

prior      <- extract.prior(globe.qa3, n = 1e4)
likelihood <- rbinom(1000, 9, 6/9)
post       <- extract.samples(globe.qa3, n = 1e4)

par(mfrow = c(1, 3), oma = c(0, 0, 2, 0))
hist(prior$p, main = "Distribution of Prior")
hist(likelihood, main = "Data")
hist(post$p, main = "Distribution of Posterior")
mtext("Posterior Probability of finding water based on a 4/9 prior and 6/9 data", outer = TRUE, cex = 1)

You can see that the probability distribution for finding water shifts as a function of what we have discovered before and the new data. We could do this many times over, and every time we do we gain a little more knowledge. Or a little more uncertainty, but uncertainty is ALSO knowledge.

par(mfrow = c(1, 1))
post_old <- extract.samples(globe.qa2, n = 1e4)
post_new <- extract.samples(globe.qa3, n = 1e4)
dens(post_old$p, xlim = c(0, 1))               # posterior from the previous model
dens(post_new$p, xlim = c(0, 1), add = TRUE)   # posterior after using the old posterior as prior
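
To make ‘many times over’ concrete, here is a minimal sketch of repeated updating. Instead of refitting with quap each round, it uses the conjugate beta-binomial shortcut, and it assumes (purely for illustration) that every new round of nine tosses again yields 6 water and 3 land.

a <- 2; b <- 7                      # start from the beta(2,7) prior used above
for (i in 1:5) {
  a <- a + 6                        # add the water counts of this round
  b <- b + 3                        # add the land counts of this round
  cat(sprintf("round %d: posterior mean = %.2f, sd = %.2f\n",
              i, a / (a + b), sqrt(a * b / ((a + b)^2 * (a + b + 1)))))
}

Every round the posterior mean creeps toward the observed proportion of 6/9 and the standard deviation shrinks: a little more knowledge, a little less uncertainty.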

Through Bayesian analysis, we do not guess at specific values as in the traditional setting, but rather understand the limits of our knowledge and gain a healthy sense of the uncertainty of those guesses. A Bayesian logistic regression is still a logistic regression. The Bayesian part comes into play with the probability perspective one uses to interpret the results, and with how the estimates are made.
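
To illustrate that last point, here is a minimal sketch of a Bayesian logistic regression with quap. The data are simulated and the priors are placeholders (assumptions for the example, not recommendations); the structure, however, is exactly the familiar logistic regression, plus priors, with a posterior distribution instead of a point estimate and a p-value.

library(rethinking)

# Simulate a binary outcome that depends on one predictor (assumed values)
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * d$x))

logit.qa <- quap(
  alist(
    y ~ dbinom(1, p),         # Bernoulli likelihood
    logit(p) <- a + b * x,    # the familiar logistic regression line
    a ~ dnorm(0, 1.5),        # weakly informative priors (placeholders)
    b ~ dnorm(0, 1)
  ),
  data = d)
precis(logit.qa)              # posterior mean, sd, and credible interval for a and b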

Dr. Marc Jacobs

Scientist. Builder of models and enthusiast of statistics, research, epidemiology, probability, and simulations for 10+ years.