Bayesian Musings from a Non-Statistician
Ah-ha! moments in making sense of Bayesian statistics
This post is aimed at those who have already dipped their toe into Bayesian statistics without any rigorous theoretical training whatsoever i.e. myself.
1. “Probability and statistical inference have inverse objectives”
With probability modelling we set the parameters for a specified model and simulate potential outcomes. Statistical inference on the other hand ingests measurements from the world and backtracks to estimate the parameters of a chosen model.
This bidirectional relationship underpins the way we validate a statistical technique’s ability to capture the essence of reality: we synthesise data from a known distribution, blur the signal with noise that represents random sampling error, and then check how well point estimates recover the true parameters.
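The simulate-then-infer loop can be sketched in a few lines. This is a minimal illustration rather than anything from the original post: the straight-line model, the parameter values and the function name `simulate_and_recover` are all assumptions chosen for the example.

```python
import numpy as np

def simulate_and_recover(slope=2.0, intercept=1.0, noise_sd=1.0, n=500, seed=0):
    """Probability direction: set parameters and simulate data.
    Inference direction: estimate the parameters back from the noisy data."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 10.0, n)
    # Known signal plus Gaussian noise standing in for sampling error
    y = slope * x + intercept + rng.normal(0.0, noise_sd, size=n)
    # Least-squares point estimates; polyfit returns the highest power first
    slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
    return slope_hat, intercept_hat

slope_hat, intercept_hat = simulate_and_recover()
print(f"slope: true 2.0, estimated {slope_hat:.3f}")
print(f"intercept: true 1.0, estimated {intercept_hat:.3f}")
```

If the estimates land close to the values we planted, the technique has passed a basic sanity check on data where we know the truth.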
Similarly, Bayesians criticise and revise their models by analysing a posterior predictive distribution (cf. the point estimates above). These “posterior predictive checks” examine how far data simulated from the fitted posterior model deviate from the observed data.
2. “Conditional probabilities can be visualised”
It can be helpful to consider probabilities as having geometric representations — as areas within a complete probability space. Conditional probabilities “collapse” the probability space from the “universe” to that which the hypothesis is conditioned on.
Hypotheses are uncertain descriptions of what the world looks like. They can either be discrete (innocence or guilt, which die did I roll) or continuous (mean level of gene expression). In either case Bayesians think of assigning probability distributions (prior and posterior) to hypotheses. There is nothing fundamentally different between the prior and the posterior: they represent your state of knowledge at any point in the process with the data you’ve observed so far.
More specifically we throw data into the likelihood (by initialising the random variable to a fixed value) and allow it to “tug” on the prior distribution of the likelihood’s parameters (both are functions of the unknown parameters, see Appendix A).
Primary Frequentist Differences
Frequentist statistics is characterised by changing your mind about default actions based on the “unsurprising-ness” of the data (the [in]famous p-value). Bayesians change their minds about prior beliefs instead. Frequentists need not form a belief to have a default action — this is purely something one commits to without analysing any data.
2. “Frequentists see the data as random and parameters fixed; Bayesians see parameters as random variables and data therefore as fixed random variables”
Inference from a frequentist’s perspective says that sampled data is drawn from a large or infinite sample space that we may take expectations over. The randomness of data gives the standard error in regression coefficient estimates, i.e. the uncertainty in what would happen to the estimates if we observed more data sets that look like our own. This sampling error is the variance in a classifier’s predictions as discussed in more explicit machine learning contexts (regression is a part of the ML toolkit after all). A high variance means you’ve overfit on your one sample and have claimed to make sense out of the randomness present.
Bayesians reason instead that a probability distribution represents the unknown-ness of a quantity, not the randomness of the quantity from the data. The uncertainty in our state of knowledge of the system is described by our strength of “belief” (in the epistemological sense) about the true and unknown parameter values for the observed data thus far. Being pedantic, these true values are inherently still fixed values, not chosen at random from the probability distribution.
This distinction means that P(𝐻0|“anything”) makes no sense in classical hypothesis testing; the truth of a particular hypothesis is not a random variable: it is either true or it is not. This is one of the mistakes in suggesting that a p-value is the “probability that an observation is due to chance”: statements like P(𝐻0|data) only have meaning in a Bayesian context and, more strikingly, the p-value is computed assuming the null hypothesis is true (i.e. that sample differences are caused by random chance).
“You must be 95% certain to send a citizen to jail” can be interpreted as:
P(Jail | Innocent) ≤ 0.05 by a Frequentist
P(Innocent | Jail) ≤ 0.05 by a Bayesian
1. “Bayes’ theorem ≠ Bayesian inference”
Bayes’ theorem is not a Bayesian approach per se. The theorem is a formula foundational to probability theory that connects inverse conditional probabilities. Its application allows easy access to conditional probabilities for events given background information.
Bayesian inference on the other hand is the reallocation of credibility across possibilities/hypotheses. In cases where we have complete knowledge of a probability space (that is mutually exclusive and exhaustive) and thus exact priors, we have the true long-run frequency of an event.
In the most trivial cases, data have definitive and deterministic relations to the candidate causes and so we can make an observation and completely rule out one or more possibilities e.g. a perfect fingerprint. In reality inferences are probabilistic: the data relate to their underlying causes with degrees of certainty, and measurement processes add noise to the underlying generating process. This better represents data analysis in the real world, and we now must supply “subjective” priors.
Examples of problems with complete knowledge and exact priors
In the cookie jar problem, where the identity of the jar (A, B, …) is the value of the parameter being estimated, it is trivial to get to the correct probability using natural frequencies without ever appealing to the theorem (30/[30+20]). We can get to the same answer using Bayes’ theorem if we consider the person selecting the cookie as part of the system and treat the identity of the jar as a random variable itself. In this case we must assume that a random choice is made and therefore that uniform prior probability is spread across the jars. The Monty Hall problem presents a similar case.
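Both routes to the cookie jar answer can be checked side by side. The jar contents below are an assumption (the common textbook version, with 30 vanilla and 10 chocolate cookies in jar A and 20 of each in jar B), since the post only quotes the 30/[30+20] ratio; the function name is also just for illustration.

```python
from fractions import Fraction

# Assumed jar contents (textbook version of the problem, not stated in the post)
jars = {
    "A": {"vanilla": 30, "chocolate": 10},
    "B": {"vanilla": 20, "chocolate": 20},
}

def posterior_over_jars(jars, observed, prior=None):
    """Bayes' theorem with the jar identity treated as a random variable."""
    if prior is None:
        # A random choice of jar means a uniform prior
        prior = {name: Fraction(1, len(jars)) for name in jars}
    unnormalised = {}
    for name, contents in jars.items():
        likelihood = Fraction(contents[observed], sum(contents.values()))
        unnormalised[name] = prior[name] * likelihood
    evidence = sum(unnormalised.values())  # P(observed)
    return {name: value / evidence for name, value in unnormalised.items()}

posterior = posterior_over_jars(jars, "vanilla")
natural_frequency = Fraction(30, 30 + 20)  # vanilla cookies in A over all vanilla cookies

print(posterior["A"], natural_frequency)
```

The natural-frequency shortcut only agrees with the theorem here because the assumed jars hold the same number of cookies and are equally likely to be picked.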
If we estimate our probabilities with sample proportions of outcomes and produce a posterior point estimate this is still a frequentist approach. However, unlike the Monty Hall problem we are now bringing in a disease prevalence prior — albeit an empirical one — and one can dispute the results by bringing in other priors. Similarly, the proper probabilistic treatment of Sally Clark’s innocence uses Bayes’ theorem to access the correct conditional value. True Bayesian inference is used when we must turn to statistical estimation using distributions (prior and likelihood) instead of point estimates.
2. “How Bayesian reasoning can give us both “correct” and also “subjective” probabilities”
It happens that credibility is measured on the same scale as probability, and this Bayesian interpretation of probability doesn’t always sit well with frequentists due to the subjectivity of the prior. However it can be argued that the likelihood is also subjective: Bayesian modellers commit a priori to a joint distribution of the data and parameters, p(θ, data), and not simply the prior. (Nonetheless the frequency guarantees of confidence intervals and the quantification of uncertain beliefs using credible intervals converge once there is enough information.)
It follows from 1. that despite the subjectivity of Bayesian inference, we can still all agree that Bayes’ theorem gives us access to the conditional probability of real interest. Take the canonical disease screening example. We must recognise that:
- Tests are not the event. We have a test for cancer, which is separate from the event of actually having cancer
- Tests are flawed. Tests detect things that don’t exist and miss things that do, leading to errors in the test probabilities
3. “Precision and Recall are directly related through Bayes’ theorem”
This becomes clear if we define precision and recall as conditional probabilities, for disease present D and positive test +ve.
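Written out under the standard definitions (an assumption on my part, with D’ denoting disease absent), the two quantities and the link between them are:

```latex
\text{precision} = P(D \mid +\text{ve}), \qquad \text{recall} = P(+\text{ve} \mid D)

P(D \mid +\text{ve})
  = \frac{P(+\text{ve} \mid D)\, P(D)}{P(+\text{ve})}
  = \frac{P(+\text{ve} \mid D)\, P(D)}
         {P(+\text{ve} \mid D)\, P(D) + P(+\text{ve} \mid D')\, P(D')}
```

Precision is therefore recall reweighted by the prior P(D) and normalised by the overall rate of positive tests.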
Read from right to left, precision asks “given a +ve test, what’s the probability the disease is present” while recall asks “given a disease present, what’s the probability the test is +ve”. These are two very different things, and it is wrong to assume that a screening test with high recall has high precision too; this is the prosecutor’s fallacy (explained visually next). One can have 100% recall but low precision due to false positives i.e. P(D|+ve) ≪ P(+ve|D). More realistically, the product of a large P(D’) (a small disease prevalence) with a non-negligible false alarm rate P(+ve|D’) can reduce the precision far below the recall.
4. “The effect of a small prior done visually” (the prosecutor’s fallacy)
A well-reasoned concern was published by a heart doctor about the new ECG feature on the Apple Watch. The crux of his piece rested on the rarity of atrial fibrillation (about 1% of people), which would cause false alarms, and many of them given the number of sales one can expect from a company that generated more profit in 3 months than Amazon has ever done. The red line below left should be committed to memory by anyone who builds or uses classifiers or diagnostic tests; small priors have a dramatic effect on false positives even with tests that have a specificity = sensitivity = 99%!
One way to understand this is to consider a confusion matrix whose areas are proportional to each quadrant’s probability. If we artificially vary the prior P(D) (normally a fixed estimate from the data) toward 0, the precision TP/(TP+FP) = 1/(1+FP/TP) tends to 0, because the TP area shrinks quickly whilst the FP area grows.
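The collapse of precision as the prior shrinks is easy to reproduce numerically. The sketch below uses the 99% sensitivity and specificity quoted above; the function name and the particular prevalences tried are my own choices.

```python
def precision(prevalence, sensitivity=0.99, specificity=0.99):
    """P(D | +ve) from the areas of the confusion matrix."""
    tp = sensitivity * prevalence                   # true positive area
    fp = (1.0 - specificity) * (1.0 - prevalence)   # false positive area
    return tp / (tp + fp)                           # same as 1 / (1 + fp / tp)

for prior in (0.5, 0.1, 0.01, 0.001):
    print(f"P(D) = {prior:<6} precision = {precision(prior):.3f}")
```

With a 1% prevalence the true positive and false positive areas are essentially equal, so roughly half of all alarms are false even for this very accurate test.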
It’s worth pointing out that while precision (and NPV) are dependent on the population being tested — influenced by the prevalence of disease — sensitivity, P(+ve|D), and specificity, P(-ve|D’), are conditioned on the diseased/healthy state (they are determined horizontally above).
ML tangent: Since recall and specificity are conditioned on the true class label, ROC curves are invariant to the baseline probability (class imbalance). This is a key distinction from PR curves, since precision does depend on it. Where the baseline probabilities of the problem are important, and the “positive” class is more interesting than the negative class, PR curves may be more useful in characterising classifier performance.
Another way to see the effect of a small prior is by representing the probabilities as Venn diagrams which allow the fractions to be visualised more readily.
The graphic shows again how a rare disease (small red area) with a test that has 99% sensitivity/recall (large intersection within red) and 99% specificity (large white area) can still yield many false positives and hence low precision (large green area).
I’ve left one final intuitive tool, using natural frequencies, in Appendix B.
5. Empirical Bayes: “Look at your data then set your priors”
Empirical Bayes is useful in ranking and analysing a large list of proportions such as CTRs and gene expression levels. Knowledge of the entire distribution is brought into each estimate, which shrinks the credible intervals whilst pulling the expected values toward the overall distribution, particularly for those observations with smaller amounts of data. It may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model (see below) where the prior’s hyperparameters are set to their most likely values, instead of being integrated out.
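A rough sketch of the idea for click-through rates follows. Everything here is illustrative: the data are simulated, the helper names are hypothetical, and the Beta prior is fitted by a crude method of moments on the raw rates (which overstates the prior variance a little) rather than by maximum likelihood.

```python
import numpy as np

def fit_beta_prior(rates):
    """Method-of-moments fit of a Beta(alpha, beta) prior to observed proportions."""
    mean = rates.mean()
    var = rates.var(ddof=1)
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def shrink(clicks, views, alpha, beta):
    """Posterior mean of each rate under the shared Beta prior."""
    return (clicks + alpha) / (views + alpha + beta)

# Simulated stand-in for a large list of CTRs (hypothetical numbers)
rng = np.random.default_rng(1)
views = rng.integers(20, 2000, size=500)
true_ctr = rng.beta(8, 92, size=500)
clicks = rng.binomial(views, true_ctr)

raw = clicks / views
alpha, beta = fit_beta_prior(raw)         # "look at your data, then set your priors"
shrunk = shrink(clicks, views, alpha, beta)

fewest = np.argmin(views)
print(f"prior mean {alpha / (alpha + beta):.3f}")
print(f"item with fewest views: raw {raw[fewest]:.3f} -> shrunk {shrunk[fewest]:.3f}")
```

Each shrunk estimate is a weighted average of the raw rate and the prior mean, with weight views / (views + alpha + beta) on the raw rate, so items with little data are pulled hardest toward the overall distribution.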
6. “Prior information can formally inject common-sense into the equation”
In the contrived coin toss experiment with a coin of bias p, the Bayesian’s input of common sense manifests as a “head start” in their maximum a posteriori estimate of the coin’s bias for a Beta(α, β) prior distribution of potential biases,
p_MAP = (s + α − 1) / (s + f + α + β − 2),
with number of successes s and failures f. For a uniform prior, Beta(α=1,β=1), we revert to the intuitive maximum likelihood answer of s/(s+f) from the pure frequentist, who assumes that the sample data is representative of the population of coin tosses because that’s all there is. This makes a difference with small sample sizes. “4 heads and 0 tails you say? The coin is 100% biased”.
However it should be said that since Bayesians have a full posterior distribution over parameter values there is a preference for expected values — integrating the full posterior to account for full uncertainty and potential multi-modality.
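To make the “head start” concrete, here is a small sketch comparing the three estimates for the 4 heads, 0 tails example. It relies on the standard Beta-Binomial result that the posterior is Beta(s + α, f + β); the Beta(2, 2) prior is simply an assumed example of mild common sense that the coin is probably fair-ish.

```python
def mle(s, f):
    """Maximum likelihood estimate: the observed proportion."""
    return s / (s + f)

def map_estimate(s, f, alpha, beta):
    """Mode of the Beta(s + alpha, f + beta) posterior.
    Valid when both posterior parameters are >= 1 and not both equal to 1."""
    return (s + alpha - 1) / (s + f + alpha + beta - 2)

def posterior_mean(s, f, alpha, beta):
    """Expected value of the Beta(s + alpha, f + beta) posterior."""
    return (s + alpha) / (s + f + alpha + beta)

s, f = 4, 0
print("MLE:                    ", mle(s, f))
print("MAP, uniform prior:     ", map_estimate(s, f, 1, 1))
print("MAP, Beta(2, 2) prior:  ", round(map_estimate(s, f, 2, 2), 3))
print("mean, Beta(2, 2) prior: ", posterior_mean(s, f, 2, 2))
```

The uniform prior reproduces the “100% biased” answer, while even a weak prior pulls the estimate back toward something more believable.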
7. “Prior information can regularise a model”
Lasso regression constrains large coefficient estimates and shrinks some to exactly zero, effectively performing variable selection and choosing a simpler model. This helps avoid overfitting and is much better suited to dealing with datasets that contain collinearity.
Another way to phrase this is that we have a prior belief that most coefficients should be close to zero, and this can be expressed through a prior distribution (rather than through a cost function with a penalty term). This is exactly what can be achieved using Bayesian regression with the appropriate priors.
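The correspondence can be written down directly. As a sketch, assume Gaussian noise with variance σ² and independent Laplace priors with scale b on each coefficient (both assumptions of mine, not spelled out in the post); the MAP estimate is then exactly the lasso solution:

```latex
y \mid X, \beta \sim \mathcal{N}(X\beta,\ \sigma^2 I), \qquad
p(\beta_j) = \frac{1}{2b} \exp\!\left(-\frac{|\beta_j|}{b}\right)

\hat{\beta}_{\text{MAP}}
  = \arg\max_{\beta}\ \big[ \log p(y \mid X, \beta) + \log p(\beta) \big]
  = \arg\min_{\beta}\ \left[ \frac{1}{2\sigma^2} \lVert y - X\beta \rVert_2^2
      + \frac{1}{b} \sum_j |\beta_j| \right]
  = \arg\min_{\beta}\ \left[ \lVert y - X\beta \rVert_2^2
      + \lambda \lVert \beta \rVert_1 \right],
  \qquad \lambda = \frac{2\sigma^2}{b}
```

A tighter prior (smaller b) means a larger penalty λ. Swapping the Laplace prior for a Gaussian one gives the ridge penalty in the same way.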
8. “Uncertainty in your neural network predictions”
Bayesian deep learning aims to move from point estimates of weights to full distributions for all latent variables. As a by-product we get to propagate through the uncertainty of random variables to the uncertainty of the prediction. Note that we effectively double the number of parameters we must learn (mean and spread).
Wide and deep neural networks represent more complex Bayesian models that can contain millions of parameters, and in these instances Monte Carlo methods are slow to converge: they may take weeks to discover the full posterior. To help solve this, a technique called Variational Inference has been gaining momentum through probabilistic programming libraries such as Uber’s Pyro and Google’s Edward2.
1. “Monte Carlo techniques are not just for Bayesians”
Random sampling can be used to produce a much more flexible null hypothesis significance test (NHST) framework than an analytic “off-the-shelf” model with full choice over the test statistic. With such a framework you are forced to make your model assumptions explicit in setting up the simulations. Moreover it is quick to prototype different test statistics and models that may turn out to yield different results — something potentially noteworthy.
For example, if you wanted to compute the p-value of the Hamming similarity between two vectors you could:
1. Randomly shuffle your two vectors and compute the Hamming similarity (the null hypothesis is that the vectors are unrelated)
2. Repeat step 1 many (N) times and count the number of times (x) that the simulated statistic is larger than the observed value for the two original vectors
The proportion of times that the simulated vectors were more similar than the original vectors, x/N, is the (one-sided) p-value by definition.
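The recipe above can be sketched as follows. The function names and example vectors are hypothetical, and I count simulated values that are at least as large as the observed one (>= rather than strictly larger), which is the more conventional definition of “at least this extreme”.

```python
import numpy as np

def hamming_similarity(a, b):
    """Fraction of positions at which the two vectors agree."""
    return np.mean(a == b)

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """One-sided p-value under the null that the vectors are unrelated."""
    rng = np.random.default_rng(seed)
    observed = hamming_similarity(a, b)
    count = 0
    for _ in range(n_permutations):
        # Shuffling destroys any positional relationship between the vectors
        simulated = hamming_similarity(rng.permutation(a), rng.permutation(b))
        if simulated >= observed:
            count += 1
    return count / n_permutations

# Hypothetical example: b is a copy of a with roughly 10% of its bits flipped
rng = np.random.default_rng(42)
a = rng.integers(0, 2, size=200)
b = a.copy()
flip = rng.random(200) < 0.1
b[flip] = 1 - b[flip]

p_related = permutation_p_value(a, b)
print(f"p-value for related vectors: {p_related}")
```

Changing the test statistic is a one-line edit to `hamming_similarity`, which is the flexibility the section is describing.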
2. “We still care about the normalising constant P(data) in sampling problems”
If we only care about point estimates in our posterior (e.g. finding the MAP) we can drop P(data), a normalising constant, since we just care about exploring the parameter space to maximise likelihood × prior. This is an optimisation problem: find the θ that maximises the proportionality. P(data) is equally irrelevant if doing hypothesis comparison on the same dataset using the Bayes Factor.
This situation is different for MCMC sampling problems. A denominator is necessary to squeeze out fully-fledged probabilities. This is a sampling distribution problem: find P(θ|data), which requires integration over the entire sample space. This can be computationally difficult!
3. “The intractability of the normalising constant”
Notice how marginalising out θ in the denominator is the same problem as attempting to normalise the numerator with an unknown constant so that it integrates to 1. This is by deliberate setup of a “probability” — from the law of total probability the numerator will be a subset of the denominator that defines the scope of the entire space.
Consider a system of binary variables (say quantum spin states in the Ising model) where we seek a probability mass function of the spin configurations. Calculating the integral in the denominator (where x is the data) can be difficult even here, since to produce a normalised probability we must sum over all of the possible spin configurations, the number of which increases exponentially with the number of variables. For a small 10 × 10 grid this is 2¹⁰⁰ different configurations to sum over to calculate the average magnetisation.
MCMC is our heuristic knight in shining armour — it allows us to sample from a distribution g(x)/k without knowing or caring about k and it does this by stepping through the probability space with a weighted random walk that’s biased toward the more likely states.
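A minimal random-walk Metropolis sampler shows why k never appears: only the ratio g(proposal)/g(current) is used, so any constant cancels. This is a sketch of one simple MCMC variant, not necessarily what a given library does; the target g (a standard normal with its constant stripped off), the step size and the names are all assumptions for illustration.

```python
import numpy as np

def unnormalised_density(x):
    """g(x): a standard normal with the 1/sqrt(2*pi) constant deliberately dropped."""
    return np.exp(-0.5 * x**2)

def metropolis(g, n_samples=50_000, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: sample from g(x)/k without ever computing k."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x, gx = x0, g(x0)
    for i in range(n_samples):
        proposal = x + rng.normal(scale=step)   # symmetric random-walk step
        g_prop = g(proposal)
        # Accept with probability min(1, g(proposal) / g(x)); k cancels in the ratio
        if rng.random() < g_prop / gx:
            x, gx = proposal, g_prop
        samples[i] = x                          # a rejected move repeats the current state
    return samples

samples = metropolis(unnormalised_density)
print(f"sample mean {samples.mean():.3f}, sample sd {samples.std():.3f}")
```

The acceptance rule is what biases the walk toward more likely states: uphill moves are always taken, downhill moves only some of the time.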
1. “Hypothesis testing = Parameter estimation”
To answer questions that are posed as hypothesis tests, we can elegantly transpose the question to be one of parameter estimation and inferring the HPD. This style feels much more natural and in this way we get the effect sizes between groups. The PyMC3 library uses deterministic random variables (values fully determined by their parent values) to handle all the transformations necessary under the hood that ultimately output the desired P(B-A > 0).
NHST (say a Chi-square test or logistic fit) can at most tell us “these two things are not likely to be the same” (unless we bake a difference threshold into our hypotheses).
When running a test we’re often aiming to understand not just whether a variant will perform better, but by how much. For example, when testing whether a drug improves IQ more than a placebo, if we have achieved a boost of 3 IQ points, even one that sits within our posterior credible interval, can we really say that the drug is effective and worthy of approval?
The primary focus of NHST is avoiding Type I errors; there is a preference to stick to default actions (the null hypothesis is assigned with this assumption). If a new recommendation algorithm is only slightly better than the current model but has a p-value of 0.11, we might be tempted to discard it. However not all false positives are created equal. For example, choosing variant B with a CTR of 8% when A’s is 8.3% is a very different mistake to make than if A’s were 15%. Again, the magnitude is also of interest.
4. “Remind me, what’s a p-value again?”
Interpretability is crucial in communicating ideas and articulating byzantine concepts like p-values takes some care. Would you rather report to your boss that “the probability that A is greater than B is 7%”, or perhaps “assuming the null hypothesis that A is equal to B, the probability that we would see a result at least this extreme in A versus B is 3%”?
Bayesians are also better equipped to posit stronger statements to make decisions from e.g. P(B-A > δ) where δ can be any difference we specify (set δ=0 for any improvement).
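For conversion-style data, P(B-A > δ) can be estimated without any MCMC by sampling from conjugate Beta posteriors. This is a sketch under assumptions of my own: uniform Beta(1, 1) priors, binomial data, and made-up counts. It is not the PyMC3 route mentioned earlier.

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, delta=0.0, draws=200_000, seed=0):
    """Monte Carlo estimate of P(rate_B - rate_A > delta)."""
    rng = np.random.default_rng(seed)
    # Beta(1, 1) prior + binomial likelihood -> Beta posterior (conjugacy)
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=draws)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=draws)
    return np.mean(post_b - post_a > delta)

# Hypothetical experiment: A converts 80/1000 visitors, B converts 100/1000
p_any = prob_b_beats_a(80, 1000, 100, 1000)              # any improvement
p_big = prob_b_beats_a(80, 1000, 100, 1000, delta=0.02)  # at least 2 points better

print(f"P(B - A > 0)    = {p_any:.3f}")
print(f"P(B - A > 0.02) = {p_big:.3f}")
```

The same posterior draws answer both questions, so the decision threshold δ can be chosen after the fact to match whatever improvement is worth acting on.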
1. “Assumption: observations are related to one another”
Hierarchical modelling can be seen as a middle ground between pooling all measurements together and creating separate regressions for treatment groups. It is distinct from empirical Bayes since we are no longer fixing the prior’s hyperparameters at point estimates but are modelling their uncertainty through a hyperprior distribution.
Through such a design we impose a model structure with our assumptions that our treatment groups of observations are related to one another. Depending on the scenario this can be sensible, say justified using the biological concept of common descent. Everything we use to justify our model structure comes from assumptions extrinsic to the data — domain knowledge — that we still have to qualitatively and quantitatively verify. There is no free lunch; we have to describe our assumptions explicitly and encode them in the model structure or in the probability distributions used e.g. you might ultimately justify a parsimonious model following Occam’s razor.
“All models are wrong, but some are useful” — George E. P. Box
2. “Rare species are not alone”
When we have a new sample for which we have very few observations we are able to borrow evidence from the rest of the population to make more tightly-bound inferences about the new sample. In some cases the collection of new data is simply not feasible e.g. the beak length of a new rare species of bird.
A non-hierarchical design in this scenario will give a nonsensically wide credible interval (characterised by the highest posterior density [HPD] interval) for a rare species. This is because we impose identical uninformative priors on each species that “let the data speak for itself” thus leaving us with large uncertainties. In the non-hierarchical case, the prior parameters are point estimates and not drawn from a parent distribution.
3. “The generative story” and “information pooling”
The generative story is as follows: the population as a whole has a parental hyperprior distribution (requiring external justification of common ancestry) that constrains the region from which the prior parameters of individual species independently draw. While the arrows below indicate generative direction, Bayesian updating happens in the reverse direction. We can therefore see that while each species has its own μ in the likelihood as well as its own half-Normal prior distribution, all data is used to estimate the two hyperprior half-Cauchys at the very top. Hence information is pooled across groups!