Priors to P-Values: Bayesian vs Frequentist Perspectives on Probability

by Dr. Anahita Talwar

Trusted Data Science @ Haleon | Apr 8, 2024


In the domain of statistical inference, two prominent paradigms have been developed: the frequentist and Bayesian frameworks. Whilst these approaches diverge in their techniques for statistical inference, their key differences stem from fundamentally distinct interpretations of the kind of uncertainty that probability is being used to describe. In this comprehensive exploration, we delve into the foundational principles and philosophical underpinnings of both, shedding light on their respective strengths, weaknesses, and the ongoing debate surrounding their merits.

A British pound coin balancing on its edge, heads side facing the viewer.

The Two World Views

At the heart of frequentist and Bayesian perspectives lie fundamental differences in philosophical interpretations of what probability is measuring. Let’s explore this with a classical coin toss example.

Imagine you’ve got a coin in your hand, and you’re about to flip it.

Now, a frequentist would say, “Hey, there’s uncertainty here because this is a random event.” From a frequentist perspective, uncertainty is about true or inherent randomness, and crucially, it is only uncertainty due to random events that probability can be used to describe. “The probability of getting heads on a coin flip is defined as the proportion of heads observed if you were to flip the coin countless times. If you flipped this coin over and over again and it fell heads on half of those flips, the probability of getting heads is 0.5.” More generally, a frequentist views probability as the long-run relative frequency of a random event occurring over many trials. Importantly, regardless of who is flipping the coin or who is asked about the outcome, the experimental results would be expected to converge to a consistent probability value. When probability is used to describe random events, as in the frequentist approach, it is called objective probability.

In stark contrast, referring to the same coin flip, a Bayesian might say, “The reason we are uncertain about the outcome is not because of inherent randomness (that doesn’t exist), but because we lack complete information. If only I could meticulously analyse the coin tosser’s muscle movements in real time. With that additional information, I might estimate a 70% chance that the coin will land heads. And if I could factor in other forces like air resistance, I might claim a 90% chance of heads.” In the extreme Bayesian view, everything in the universe operates according to deterministic principles, and uncertainty arises because we don’t have enough data, models, and understanding to perfectly predict outcomes. Further, if I then assure the Bayesian that it’s a fair coin, they might take this as true and incorporate it into their probability estimate, or, given my track record of embellishing the truth, they might suspect foul play. Either way, their confidence in predicting the outcome shifts again. When probability is used as a measure of uncertainty stemming from incomplete information or degrees of belief, as in the Bayesian approach, it is called subjective probability.

At this point, it is worth considering which of these views aligns with various descriptions of uncertainty in the real world, and whether one or the other feels more intuitive or more scientific to you.

The Frequentist Approach to Statistical Inference

Maximum Likelihood of Misinterpretation

As we saw, the frequentist views uncertainty through a lens of objectivity, only using it to describe inherently random events such as the outcome of a coin toss. Under this view, we therefore cannot assign probabilities to things that aren’t random, such as parameters or hypotheses. So, under this framework, and maintaining consistency with this definition of uncertainty, how do we arrive at an estimated parameter value, such as θ, the probability of getting heads on a single coin toss?

If we take our coin, toss it ten times, and get three heads, we can calculate something called the likelihood function, L(θ). For each potential value of θ, the likelihood function computes the probability of obtaining this data, p(D|θ), using the binomial probability distribution in the coin toss example. In this way, we generate a curve that illustrates how likely the observed data is for each possible parameter value in a statistical model. The value that maximises this likelihood curve (the maximum likelihood estimate) is the parameter value that makes the observed data most probable under the assumed statistical model. Crucially, maximum likelihood estimation does not provide us with the probability of a parameter value given our data, p(θ|D), but rather offers a point estimate of the parameter value that maximises the likelihood of observing the data.

The likelihood function for ten coin tosses, with three heads observed. The x-axis represents the probability of getting heads in a single toss, while the y-axis represents the likelihood of observing the given outcome (3 heads in 10 tosses) for each probability value. The star at the peak shows the maximum likelihood estimate (0.3), the parameter value most consistent with the observed data.
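If you want to reproduce this curve yourself, here is a minimal sketch in Python (using NumPy and SciPy; the grid resolution is an arbitrary choice):

```python
import numpy as np
from scipy.stats import binom

# Observed data: 3 heads in 10 tosses
n_tosses, n_heads = 10, 3

# Evaluate the binomial likelihood L(theta) = p(D | theta) on a grid of candidate values
theta_grid = np.linspace(0, 1, 1001)
likelihood = binom.pmf(n_heads, n_tosses, theta_grid)

# The maximum likelihood estimate is the theta that makes the observed data most probable
theta_mle = theta_grid[np.argmax(likelihood)]
print(f"Maximum likelihood estimate: {theta_mle:.2f}")  # ~0.30, i.e. 3/10
```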

Although under the frequentist framework probabilities are not assigned to non-random quantities like parameters, we can still gauge our confidence in the estimated parameters using confidence intervals. These signify the proportion of intervals, constructed under repeated sampling, that would contain the true parameter value. For example, a 95% confidence level means that if we were to construct 100 confidence intervals from 100 different experiments, we would expect about 95 of them to contain the true parameter value.
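To see this repeated-sampling interpretation in action, here is a minimal simulation sketch; it uses the common normal-approximation (Wald) interval, and the sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)
true_theta, n_tosses, n_experiments = 0.5, 100, 10_000

# Simulate many repeated experiments and build a 95% Wald interval for each
heads = rng.binomial(n_tosses, true_theta, size=n_experiments)
theta_hat = heads / n_tosses
se = np.sqrt(theta_hat * (1 - theta_hat) / n_tosses)
lower, upper = theta_hat - 1.96 * se, theta_hat + 1.96 * se

# Coverage: the fraction of intervals that contain the true parameter (close to 0.95)
coverage = np.mean((lower <= true_theta) & (true_theta <= upper))
print(f"Empirical coverage: {coverage:.3f}")
```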

In a similar vein, hypotheses aren’t random, so we can’t talk about the probability of a hypothesis being true. Rather, frequentists use Null Hypothesis Significance Testing (NHST) to make binary decisions about hypotheses based on observed data. Stemming from the same concept of repeated sampling, the approach defines a test procedure that controls long-run error rates, such that in using this procedure we will not often be wrong. This approach largely depends on calculating p-values, which quantify the probability of obtaining the observed data, or more extreme results, under the assumption that the null hypothesis is true. Comparing the p-value to a predetermined threshold (usually 0.05) indicates whether the null hypothesis should be rejected or retained.
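For the running coin example (3 heads in 10 tosses), the exact binomial p-values can be computed directly; the two-sided calculation below relies on the symmetry of the fair-coin null:

```python
from scipy.stats import binom

n_tosses, n_heads, theta_null = 10, 3, 0.5

# One-sided p-value: probability of observing 3 or fewer heads if the coin is fair
p_one_sided = binom.cdf(n_heads, n_tosses, theta_null)

# Two-sided p-value (by symmetry of the fair-coin null): 3 or fewer, or 7 or more heads
p_two_sided = p_one_sided + binom.sf(n_tosses - n_heads - 1, n_tosses, theta_null)

print(f"one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
# Both are well above 0.05, so we fail to reject the null hypothesis of a fair coin
```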

This NHST framework forms the basis of statistical inference in academic scientific research, as well as in drug development trials in the pharmaceutical industry, where extensive analysis plans are written in advance of any data collection. Despite its widespread use, the frequentist approach is often criticised for being counterintuitive and difficult to teach or use correctly (so don’t worry if you felt a bit lost reading that section). It is particularly common for users to incorrectly talk about the probability of their inferred parameter being the true value, even though this is inconsistent with the frequentist take on what probability can and can’t measure, as previously discussed. That more intuitive quantity, the conditional probability distribution of the parameters given the observed data, may feel like a more natural conclusion of inference, but it is not what frequentist methods provide. To calculate it properly, you have to use Bayes’ theorem.

The Bayesian Approach to Statistical Inference

I’ll believe it before I see it

Named after the Reverend Thomas Bayes, Bayes’ theorem is a fundamental principle in probability theory that describes how to update our beliefs about the probability of an event as new evidence or information becomes available. Its mathematical expression is given below.

P(θ|D) = P(D|θ) × P(θ) / P(D)

Bayes’ theorem, where P(θ|D) is the posterior probability, P(D|θ) is the likelihood, P(θ) is the prior probability, and P(D) is the marginal likelihood, or constant of integration.

- The prior, P(θ), is a probability distribution that represents the initial belief or uncertainty about the parameters of a statistical model before observing any data. It is based on existing knowledge, experience, or assumptions.
- The likelihood, P(D|θ), is the function described in the frequentist approach: it represents the probability of observing the data given a particular parameter value, and so captures the consistency between the evidence (data) and parameter values.
- The constant of integration, P(D), is the marginal probability of observing the data across all possible parameter values. It serves as a normalisation factor to ensure that the conditional probability is properly scaled.
- The posterior, P(θ|D), is the updated probability distribution over the parameters of a statistical model after observing the data. This is the key conditional probability that may feel like a more natural conclusion of statistical inference, and one we can only get from using Bayes’ theorem. It is calculated by combining the prior probability with the likelihood of the observed evidence, so the prior that we use will influence the resulting posterior distribution over our parameters.

Three plots depicting prior, likelihood, and posterior distributions. Each plot shows the effect of a different prior (pink) combined with the same likelihood function (green) on the posterior distribution (blue). The variations in the prior distributions lead to distinct posterior distributions, demonstrating the impact of prior beliefs on the final inference. The distributions have not been normalised, for visual clarity.
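To make the update concrete, here is a minimal grid-based sketch of Bayes’ theorem for the running coin example; the Beta(2, 2) prior is purely an illustrative choice:

```python
import numpy as np
from scipy.stats import beta, binom

# Running example: 3 heads observed in 10 tosses
n_tosses, n_heads = 10, 3

# Grid of candidate values for theta, the probability of heads
theta_grid = np.linspace(0.001, 0.999, 999)

# Prior: Beta(2, 2), a mild belief that the coin is roughly fair (illustrative choice)
prior = beta.pdf(theta_grid, 2, 2)

# Likelihood: probability of the observed data for each candidate theta
likelihood = binom.pmf(n_heads, n_tosses, theta_grid)

# Posterior ∝ prior × likelihood; dividing by the sum plays the role of
# the constant of integration, so the posterior sums to 1 over the grid
unnormalised = prior * likelihood
posterior = unnormalised / unnormalised.sum()

# The posterior mean sits between the prior mean (0.5) and the MLE (0.3)
print(f"Posterior mean: {np.sum(theta_grid * posterior):.3f}")
```

With a conjugate Beta prior this posterior is also available in closed form (Beta(5, 9) here), but the grid version makes the prior-times-likelihood mechanics explicit.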

Bayes’ theorem provides a systematic framework for updating beliefs or probabilities based on new evidence, making it a powerful tool in various fields, including statistics and machine learning. In the Bayesian framework, probability represents uncertainty due to lack of information or degrees of belief, and so it can be used to describe almost everything, including specifying probability distributions over parameters and hypotheses (in contrast to the frequentist framework). This makes it a more intuitive and flexible approach, which can be useful when you have a strong reason for wanting to incorporate a prior, such as background knowledge or restricting parameter estimates within certain spaces. The prior can also help to regularise extreme parameter values that can result from maximum likelihood estimates when using small samples. Finally, the Bayesian framework allows for the use of more complex hierarchical models with which researchers can explicitly model the variability and uncertainty at different levels of the data hierarchy. For example, in a study involving patients nested within hospitals, a hierarchical model can capture not only the individual variability among patients but also the variability between hospitals. This approach is also applied in marketing strategies at Haleon, where the interplay between various brand and sub-brands or national and regional market dynamics necessitates a nuanced understanding. To delve deeper into how hierarchical models are leveraged in marketing contexts, check out our blog here.
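As a flavour of what such a model can look like in code, here is a minimal sketch of the patients-within-hospitals example using the PyMC library; the simulated data, priors, and variable names are all illustrative assumptions rather than a model used in practice at Haleon:

```python
import numpy as np
import pymc as pm

# Illustrative simulated data: an outcome measured for patients nested within hospitals
rng = np.random.default_rng(0)
n_hospitals, patients_per_hospital = 8, 30
hospital_idx = np.repeat(np.arange(n_hospitals), patients_per_hospital)
hospital_effects = rng.normal(0.0, 1.0, size=n_hospitals)
outcomes = rng.normal(hospital_effects[hospital_idx], 2.0)

with pm.Model() as hierarchical_model:
    # Population level: overall mean and between-hospital variability
    mu_global = pm.Normal("mu_global", mu=0.0, sigma=5.0)
    sigma_hospital = pm.HalfNormal("sigma_hospital", sigma=2.0)

    # Hospital level: each hospital's mean is drawn around the global mean
    mu_hospital = pm.Normal("mu_hospital", mu=mu_global, sigma=sigma_hospital,
                            shape=n_hospitals)

    # Patient level: individual outcomes vary around their hospital's mean
    sigma_patient = pm.HalfNormal("sigma_patient", sigma=2.0)
    pm.Normal("y", mu=mu_hospital[hospital_idx], sigma=sigma_patient,
              observed=outcomes)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```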

The Bayesian approach is often thought of as relatively new, but it actually predates frequentist methodology. Yet the frequentist approach became more mainstream for two key reasons:

  1. As a backlash against the prior. It really bothered people that you have to come up with a prior, because it isn’t clear how to accurately write down subjective beliefs as a fully defined probability distribution. They also couldn’t get on board with the idea that you could get different results by choosing different priors; it didn’t feel scientific.
  2. The constant of integration in Bayes’ theorem is something that you don’t hear about a lot, but it is often very hard to calculate. In these cases, the posterior distribution over your parameters is mathematically intractable, and using the Bayesian approach isn’t an option.

In 1990, a landmark paper popularised Markov chain Monte Carlo (MCMC) sampling for Bayesian inference, a method that provides a way to estimate the posterior distribution without having to calculate the constant of integration, alleviating the second problem. However, the first problem, of choosing or coming up with a prior, still remains a huge source of debate between statisticians today.
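To get a feel for why this works, here is a minimal sketch of the Metropolis algorithm (one flavour of MCMC) sampling the coin-toss posterior using only the unnormalised prior times likelihood; the Beta(2, 2) prior and tuning choices are illustrative:

```python
import numpy as np
from scipy.stats import beta, binom

n_tosses, n_heads = 10, 3

def unnormalised_posterior(theta):
    """Prior x likelihood; the constant of integration is never needed."""
    if not 0 < theta < 1:
        return 0.0
    return beta.pdf(theta, 2, 2) * binom.pmf(n_heads, n_tosses, theta)

rng = np.random.default_rng(1)
theta_current = 0.5
samples = []

for _ in range(20_000):
    # Propose a small random step and accept it with the Metropolis ratio,
    # in which the unknown normalising constant would cancel out anyway
    theta_proposed = theta_current + rng.normal(0, 0.1)
    ratio = unnormalised_posterior(theta_proposed) / unnormalised_posterior(theta_current)
    if rng.uniform() < ratio:
        theta_current = theta_proposed
    samples.append(theta_current)

posterior_samples = np.array(samples[5_000:])  # discard burn-in
print(f"Posterior mean ≈ {posterior_samples.mean():.3f}")  # close to the grid result
```

Because the acceptance step only ever compares two unnormalised posterior values, the intractable denominator of Bayes’ theorem drops out entirely, which is exactly what made this approach so liberating for Bayesian practice.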

A colourful abstract illustration of squiggly lines, suggestive of different probability distributions.

The Debate

The ongoing debate between Bayesian and frequentist statisticians encompasses a range of nuanced arguments and perspectives that shape the practice of statistical inference. As previously alluded to, frequentists often raise concerns about the subjective influence of priors in Bayesian analysis, emphasising the importance of objectivity in scientific inquiry. They argue that the arbitrary selection of priors can introduce bias and undermine the credibility of statistical results. In response, Bayesians highlight the inherent subjectivity present in various aspects of experimentation, such as hypothesis selection and data collection methods. They advocate for embracing the inevitable influence of subjectivity and formalising it within the mathematical framework of Bayesian statistics.

Within the frequentist camp itself, debates also arise regarding the most robust methods for interpreting statistical results. Hardcore frequentists emphasise the necessity of adhering to predetermined significance thresholds to ensure the rigour and reproducibility of analyses. Conversely, other frequentists argue for a broader consideration of factors beyond statistical significance, such as effect sizes, which may offer more meaningful insights into the underlying phenomena of interest. In particular, the latter group often blames some of the perceived flaws of institutional science — high levels of bias in the publication system, and frequent failures to replicate results — on an overreliance on stringent p-values and the NHST framework.

Similarly, among Bayesians, debates centre on the choice of priors and the objectivity of Bayesian inference. Objective Bayesians advocate for the development of priors that are more objective and less influenced by subjective beliefs, aiming to enhance the reliability of Bayesian analyses. In contrast, subjective Bayesians contend that Bayesian statistics inherently involve the use of probability to represent beliefs or knowledge about the data. They maintain that subjectivity is an integral aspect of Bayesian inference and should be embraced as such.

These ongoing debates demonstrate the complexity and richness of statistical methodology, as researchers grapple with fundamental questions about the nature of probability, the role of subjectivity, and the interpretation of statistical results. By engaging in constructive dialogue and exploring diverse perspectives, statisticians continue to refine and advance the practice of statistical inference in pursuit of more robust and reliable scientific conclusions.

Further Resources

Bayesian and Frequentist Approaches

Book | Understanding Psychology as a Science by Zoltan Dienes

Talk | All About that Bayes: Probability, Statistics, and the Quest to Quantify Uncertainty by Kristin Lennox

Frequentist Debates and the Replication Crisis

Book | Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars by Deborah Mayo

Article | The ASA Statement on p-Values: Context, Process, and Purpose

Bayesian Tools

WebPage | Interactive Sampling Demonstration
