Bayesian Musings from a Non-Statistician

Ah-ha! moments in making sense of Bayesian statistics

This post is aimed at those who have already dipped their toe into Bayesian statistics without any rigorous theoretical training whatsoever, i.e. people like myself.

Table of Contents

101 Recap
Primary Frequentist Differences
Some Ah-ha!s
MCMC
AB Testing
Hierarchical Modelling

101 Recap

  1. “Probability and statistical inference have inverse objectives”

With probability modelling we set the parameters for a specified model and simulate potential outcomes. Statistical inference on the other hand ingests measurements from the world and backtracks to estimate the parameters of a chosen model.

This bidirectional relationship underpins the way we validate a statistical technique’s ability to capture the essence of what we measure: we synthesise data from a known distribution, blur the signal with noise that represents random sampling error (nothing is said about measurement error or biased sampling), and check how well our point estimates recover the true parameters.
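A minimal numpy sketch of this validation loop (the model, parameter values and sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Probability direction: fix the "true" parameters and simulate outcomes,
# the noise standing in for random sampling error.
true_mu, true_sigma = 5.0, 2.0
data = rng.normal(true_mu, true_sigma, size=200)

# Inference direction: backtrack from the measurements to point estimates
# of the parameters of the chosen (normal) model.
mu_hat, sigma_hat = data.mean(), data.std(ddof=1)
print(f"true mu    = {true_mu},  estimate = {mu_hat:.2f}")
print(f"true sigma = {true_sigma},  estimate = {sigma_hat:.2f}")
```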

Similarly, Bayesians criticise and revise their models by analysing a posterior predictive distribution (cf. the point estimates above). These “posterior predictive checks” examine how data simulated from the fitted posterior model deviates from the observed data.

2. “Conditional probabilities can be visualised”

It can be helpful to consider probabilities as having geometric representations — as areas within a complete probability space. Conditional probabilities “collapse” the probability space from the “universe” to that which the hypothesis is conditioned on.

P(A|B) = P(A∩B)/P(B), i.e. the area fraction |A∩B|/|B|

3. “All hypothesis testing is asking is whether the evidence makes the null hypothesis look ridiculous”

Hypotheses are uncertain descriptions of what the world looks like. They can either be discrete (innocence or guilt, which die did I roll) or continuous (mean level of gene expression). Either way, Bayesians think of assigning probability distributions across the hypotheses — prior and posterior distributions to be precise. There is nothing fundamentally different between the prior and the posterior — they represent your state of knowledge at any point in the process with the data you’ve observed so far.

The notion of “updating the prior” is not just a Bayesian cliché but a sensible description of the process, and large amounts of data cause two initially very different prior beliefs to reach consensus.

Primary Frequentist Differences

  1. “Statistics is the science of changing your mind”

Frequentist statistics is characterised by changing your mind about default actions based on the “unsurprising-ness” of the data (the [in]famous p-value). Bayesians change their minds about prior beliefs instead. Frequentists need not form a belief to have a default action — this is purely something one commits to without analysing any data.

2. “Frequentists see the data as random and parameters fixed; Bayesians see the data as fixed and parameters as random variables”

Inference from a frequentist’s perspective is predicated on the notion that sampled data is drawn from a large, or infinite, sample space that we may take expectations over. The randomness of data gives the standard error in regression coefficient estimates — uncertainty in what would happen to the estimates if we observed more data sets that look like our own. This sampling error is the variance in a classifier’s predictions as discussed in machine learning (regression is a part of the ML toolkit after all). A high variance means you’ve overfit on your one sample and have boldly claimed to make sense out of the randomness.
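As a rough sketch of that idea, a bootstrap treats resampled versions of our one sample as stand-ins for “more data sets that look like our own” and reads the standard error off the spread of the re-estimated coefficient (made-up data, plain numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

# One observed sample from a simple linear model y = 2x + noise.
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Refit the slope on bootstrap resamples and use the spread of the
# estimates as the standard error of the coefficient.
slopes = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    slopes.append(np.polyfit(x[idx], y[idx], deg=1)[0])

print(f"slope estimate:           {np.polyfit(x, y, deg=1)[0]:.3f}")
print(f"bootstrap standard error: {np.std(slopes):.3f}")
```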

Bayesians believe the data is fixed and it is mutually-exclusive hypotheses over which a probability distribution exists; the uncertainty “captured” in our state of knowledge of the system quantifies our strength of belief in the parameter values themselves for the observed data thus far.


Some Ah-ha!s

1. Bayes’ theorem ≠ Bayesian inference

Bayes’ theorem is not a “Bayesian approach” per se. The theorem is simply a mathematical formula that connects inverse conditional probabilities. Its application allows easy access to conditional probabilities for events given background information, typically in a finite sample space. For example, in the Monty Hall problem we pick one of three doors at random giving us a prior of 1/3 for each door. No statistics is involved here.

Prior and posterior probability spaces for the Monty Hall problem of three mutually-exclusive events. Bayes’ theorem makes the complexity of the problem more manageable.
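A quick simulation makes those prior and posterior numbers tangible; this sketch simply counts wins under the “stay” and “switch” policies:

```python
import random

def monty_hall_trial(switch: bool) -> bool:
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car, pick = random.choice(doors), random.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = random.choice([d for d in doors if d not in (pick, car)])
    if switch:
        pick = next(d for d in doors if d not in (pick, opened))
    return pick == car

trials = 100_000
stay = sum(monty_hall_trial(False) for _ in range(trials)) / trials
swap = sum(monty_hall_trial(True) for _ in range(trials)) / trials
print(f"P(win | stay)   ≈ {stay:.3f}")  # ≈ 1/3, the prior on the original door
print(f"P(win | switch) ≈ {swap:.3f}")  # ≈ 2/3, the remaining posterior mass
```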

In addition, if we estimate all our probabilities with sample outcome proportions to produce a posterior point estimate, this is still a frequentist approach. The proper probabilistic treatment of Sally Clark’s innocence requires Bayes’ theorem to access the conditional probability of real interest, yet it remains a frequentist approach built on point estimates.

Bayesian inference is predicated on the notion that we can invoke prior and posterior distributions on the unknown parameters. If we supply distributions for the prior and likelihood (both as functions of the unknown parameters) instead of point estimates we have ourselves a Bayesian approach that brings along all the inference baggage: a probabilistic interpretation of the parameters as random variables, and subjective priors. (Yet it can be argued that the likelihood is also subjective — Bayesian modellers commit a priori to a joint distribution of the data and parameters, p(θ, data), and not simply the prior.)

In summary, frequentist inference and Bayesian inference are defined by their goals, not their methods. The frequency guarantees of confidence intervals and the quantification of uncertain beliefs using credible intervals converge once there is enough information.

2. “Precision and recall are directly related through Bayes’ theorem”

This becomes clear if we define precision and recall as conditional probabilities, for disease present D and positive test +ve.

Precision: P(D|+ve)

Recall (TPR): P(+ve|D)

Read from right to left, precision asks “given a +ve test, what’s the probability the disease is present” while recall asks “given a disease present, what’s the probability the test is +ve”.

In the canonical disease testing example it is misleading to think that a high recall test implies a high precision. One can have 100% recall but low precision with many false positives, i.e. P(D|+ve) ≪ P(+ve|D).

3. “The effect of a small prior/the prosecutor’s fallacy, done geometrically”

A well-reasoned concern was published by a heart doctor about the new ECG feature on the Apple Watch. The crux of his piece rested on the rarity of atrial fibrillation (about 1% of people), which would cause false alarms, and a great many of them given the number of sales one can anticipate from a company that generated more profit in 3 months than Amazon has ever generated. The graph below left should be committed to memory by anyone who builds or uses classifiers or diagnostic tests; small priors have a dramatic effect on false positives even with tests that have a specificity = sensitivity = 99%!

Left, probability of true disease given a positive test, with specificity = sensitivity = 99%, revealing dramatic variation with the magnitude of the prior. Right, grey areas are the quadrants needed to calculate P(D|+ve), which is underdetermined from just the sensitivity and specificity of a diagnostic test. The sensitivity controls the boundary between TP and FN and the specificity the boundary between TN and FP.

One way to see why this is the case is to consider a confusion matrix whose areas are proportional to each quadrant’s probability. I’ve staggered the + category for ease in representing equally high specificity and sensitivity (99%). We “artificially” vary the prior P(D) toward 0 (normally this is a fixed estimate from the data). This gives a precision (1/[1 + FP/TP]) that tends to 0: the TP area shrinks very quickly whilst the FP area grows.

Another way to see this is by representing the probabilities as Venn diagrams which allow the fractions to be visualised more readily.

A: atrial fibrillation; B: positive ECG. P(A|B) ≠ P(B|A)

The visual shows that a small prior (rare disease), even with 99% sensitivity (large overlap inside red, high recall) and 99% specificity (large white area) can still yield many false positives (large green area).
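The same arithmetic in a few lines of Python; only the prevalence changes, yet the precision collapses (the prevalence values are arbitrary):

```python
def precision(prior, sensitivity=0.99, specificity=0.99):
    """P(D | +ve) via Bayes' theorem, i.e. TP / (TP + FP)."""
    tp = sensitivity * prior              # P(+ve and D)
    fp = (1 - specificity) * (1 - prior)  # P(+ve and not D)
    return tp / (tp + fp)

for prior in (0.5, 0.1, 0.01, 0.001):
    print(f"prevalence {prior:>6.3f}  ->  P(D|+ve) = {precision(prior):.3f}")
```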

4. Empirical Bayes: “Look at your data then set your priors”

Empirical Bayes is useful in ranking and analysing a large list of proportions such as CTRs and gene expression levels. Knowledge of the entire distribution is brought into each estimate, shrinking the credible intervals and pulling the expected values toward the population-level mean, particularly for those observations with smaller amounts of data. It may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model (see below) where the hyperpriors are set to their most likely values instead of being integrated out.
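A toy sketch of the idea for CTRs (made-up counts, and a crude method-of-moments fit of the Beta prior rather than a proper maximum-likelihood fit):

```python
import numpy as np

# Made-up click data: a handful of ads, some with very few impressions.
clicks      = np.array([2,   50,  3,  400, 0,  11])
impressions = np.array([9, 1000, 12, 8000, 4, 230])
raw_ctr = clicks / impressions

# Empirical Bayes: fit a Beta prior to the whole collection of observed
# rates, then update every ad with that shared prior.
m, v = raw_ctr.mean(), raw_ctr.var()
k = m * (1 - m) / v - 1
alpha0, beta0 = m * k, (1 - m) * k

# Posterior mean per ad: small-n ads are pulled hardest toward the
# population-level rate, and their credible intervals shrink accordingly.
shrunk_ctr = (clicks + alpha0) / (impressions + alpha0 + beta0)
for n, r, s in zip(impressions, raw_ctr, shrunk_ctr):
    print(f"n={n:>5}  raw={r:.3f}  shrunk={s:.3f}")
```

The ad with 4 impressions gets pulled strongly toward the population rate, while the ad with 8,000 impressions barely moves.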

5. “Prior information can formally inject common-sense into the equation”

In the contrived coin toss experiment with a coin of bias p, the Bayesian’s input of common sense manifests as a “head start” in their maximum a posteriori (MAP) estimate of the coin’s bias under a Beta(α, β) distribution of potential biases,

p_MAP = (s + α − 1) / (s + f + α + β − 2),

with number of successes s and failures f. For a uniform prior, Beta(α=1,β=1), we revert to the intuitive maximum likelihood answer of s/(s+f) from the pure frequentist who assumes that the sample data is representative of the population of coin tosses because that’s all there is. This makes a difference with small sample sizes. “4 heads and 0 tails you say? The coin is 100% biased”.

However it should be said that since Bayesians have a full posterior distribution over parameter values there is a preference for expected values — integrating the full posterior to account for full uncertainty and potential multi-modality.
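For the four-heads example above, here is a small sketch comparing the three estimates (the Beta(2, 2) prior is an arbitrary stand-in for “common sense”):

```python
def mle(s, f):
    """Maximum likelihood: the sample proportion."""
    return s / (s + f)

def map_estimate(s, f, alpha=2, beta=2):
    """Mode of the Beta(s + alpha, f + beta) posterior."""
    return (s + alpha - 1) / (s + f + alpha + beta - 2)

def posterior_mean(s, f, alpha=2, beta=2):
    """Expected value of the Beta(s + alpha, f + beta) posterior."""
    return (s + alpha) / (s + f + alpha + beta)

s, f = 4, 0  # "4 heads and 0 tails you say?"
print(f"MLE:                {mle(s, f):.2f}")             # 1.00, "100% biased"
print(f"MAP with Beta(2,2): {map_estimate(s, f):.2f}")    # 0.83, the head start at work
print(f"Posterior mean:     {posterior_mean(s, f):.2f}")  # 0.75, integrating the posterior
```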

6. “Uncertainty in your neural network predictions”

Bayesian deep learning aims to move from point estimates of weights to full distributions for all latent variables. As a free byproduct we get to propagate the uncertainty in those random variables down into the uncertainty of the prediction. Note that we effectively double the number of parameters we must learn (a mean and a spread for each weight).
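A bare-bones illustration of the propagation step for a single “layer” with hand-picked weight means and spreads (nothing here is trained; it only shows how weight uncertainty becomes prediction uncertainty via Monte Carlo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Instead of one point estimate per weight we carry a mean and a spread
# (hand-picked here purely for illustration).
w_mean, w_std = np.array([0.8, -1.2]), np.array([0.05, 0.30])
b_mean, b_std = 0.1, 0.02

def predict(x, n_samples=5000):
    """Sample weights from their distributions and propagate their
    uncertainty through to the prediction."""
    w = rng.normal(w_mean, w_std, size=(n_samples, 2))
    b = rng.normal(b_mean, b_std, size=n_samples)
    preds = w @ x + b
    return preds.mean(), preds.std()

mu, sigma = predict(np.array([1.0, 2.0]))
print(f"prediction: {mu:.2f} ± {sigma:.2f}")
```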


MCMC

  1. “Monte Carlo statistical techniques are not just for Bayesians”

Random sampling can be used to produce a much more flexible null hypothesis significance test (NHST) framework than an analytic “off-the-shelf” model. With such a framework you are given full choice over the test statistic and are forced to make your model assumptions explicit when creating a simulation. Moreover it is quick to prototype different test statistics and models that may turn out to yield different results — something noteworthy in and of itself.
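A permutation test is the simplest version of this: pick any test statistic you like, then simulate the null hypothesis by shuffling the group labels (the data below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two observed groups and a test statistic of our choosing.
a = np.array([12.1, 10.4, 13.2, 11.8, 12.9, 10.9])
b = np.array([11.2, 10.1, 10.8, 11.5,  9.9, 10.6])
observed = a.mean() - b.mean()

# Simulate the null "the group labels don't matter" by shuffling the labels
# many times and recomputing the statistic each time.
pooled = np.concatenate([a, b])
null_stats = []
for _ in range(10_000):
    rng.shuffle(pooled)
    null_stats.append(pooled[:len(a)].mean() - pooled[len(a):].mean())

p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"observed difference: {observed:.2f}, permutation p-value: {p_value:.3f}")
```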

2. “We still care about the normalising constant P(data) in sampling problems”

If we only care about point estimates from our posterior (e.g. finding the MAP) we can drop P(data), a normalising constant, since we just care about exploring the parameter space to maximise likelihood × prior. This is an optimisation problem: find the θ that maximises the quantity the posterior is proportional to. P(data) is equally irrelevant if doing hypothesis comparison on the same dataset using the Bayes factor.

This situation is different for MCMC sampling problems. The denominator is necessary to squeeze out fully-fledged probabilities. This is a sampling problem: characterise P(θ|data), whose normalisation requires integrating over the entire parameter space. This can be computationally difficult! MCMC is our heuristic knight in shining armour — it allows us to sample from a distribution g(x)/k without knowing or caring about k.
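A minimal random-walk Metropolis sketch makes the point: the acceptance ratio only ever compares g at two points, so the unknown constant k cancels out (the target below is an arbitrary unnormalised bimodal density):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(theta):
    """Unnormalised target, i.e. likelihood * prior, known only up to k."""
    return np.exp(-0.5 * (theta - 3.0) ** 2) + 0.5 * np.exp(-0.5 * (theta + 2.0) ** 2)

# Random-walk Metropolis: accept with probability min(1, g(proposal)/g(current)),
# so k never needs to be computed.
samples, theta = [], 0.0
for _ in range(50_000):
    proposal = theta + rng.normal(scale=1.0)
    if rng.uniform() < g(proposal) / g(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[5_000:])  # drop burn-in
print(f"posterior mean ≈ {samples.mean():.2f}")
```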

3. The intractability of the normalising constant

Notice how marginalising out θ in the denominator is the same problem as attempting to normalise the numerator by an unknown constant so that it integrates to 1. This is by deliberate construction of a “probability”: by the law of total probability, the denominator sums the numerator over the entire space, so the numerator is always a subset of the denominator, which defines the scope of that space.

The problem of intractability can be illustrated by considering a system of binary variables (say quantum spin states) that may interact with one another. To compute the normalising constant you must sum over all 2^N configurations, which explodes exponentially with the number of variables N.
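A brute-force sketch of that blow-up for a small spin system; every extra variable doubles the work of computing the normalising constant Z (the interactions are random numbers purely for illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

N = 16                                   # already 2^16 = 65,536 configurations
J = rng.normal(scale=0.1, size=(N, N))   # random pairwise interaction strengths
J = (J + J.T) / 2

def energy(spins):
    s = np.array(spins)
    return -0.5 * s @ J @ s

# The normalising constant sums over every configuration of the N binary
# variables -- adding one more variable doubles the number of terms.
Z = sum(np.exp(-energy(spins))
        for spins in itertools.product([-1, 1], repeat=N))
print(f"Z summed over {2**N:,} configurations: {Z:.3e}")
```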


AB Testing

  1. “Hypothesis testing = Parameter estimation”

From a Bayesian perspective, hypothesis testing and parameter estimation are two sides of the same coin. PyMC3 uses deterministic random variables (values which are fully determined by their parent values) to handle, under the hood, all the transformations that ultimately output the desired P(B > A) = P(B − A > 0).
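A sketch of what that can look like in PyMC3, with made-up conversion counts; the `delta` deterministic is the transformation that hands us P(B > A) directly from the posterior samples:

```python
import pymc3 as pm

# Made-up results: 1,500 visitors per variant.
n_A, conv_A = 1500, 63
n_B, conv_B = 1500, 82

with pm.Model():
    p_A = pm.Beta("p_A", alpha=1, beta=1)
    p_B = pm.Beta("p_B", alpha=1, beta=1)
    pm.Binomial("obs_A", n=n_A, p=p_A, observed=conv_A)
    pm.Binomial("obs_B", n=n_B, p=p_B, observed=conv_B)
    # Deterministic RV: fully determined by its parents, tracked in the trace.
    pm.Deterministic("delta", p_B - p_A)
    trace = pm.sample(2000, tune=1000, return_inferencedata=False)

print("P(B > A)        =", (trace["delta"] > 0).mean())
print("P(B - A > 0.5%) =", (trace["delta"] > 0.005).mean())
```

The final line is exactly the stronger statement from point 4 below: P(B − A > δ) for a δ of our choosing.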

2. Magnitude is more important than statistical significance

NHST (say a Chi-square test or logistic fit) can at most tell us “these two things are not likely to be the same” (unless we bake a difference threshold into our hypotheses).

When running a test we’re often aiming to understand not just whether a variant will perform better, but by how much. For example, when testing whether a drug improves IQ more than a placebo, if the drug achieves a boost of only 3 IQ points, even one that lies within our posterior credible interval, can we really say that the drug is effective and worthy of approval?

3. “The frequentist approach is more risk averse in decision making”

The primary focus of NHST is avoiding Type I errors; there is a preference to stick to default actions. If a new recommendation algorithm is only slightly better than the current model but has a p-value of 0.11, we might be tempted to discard it. However, not all false positives are created equal. For example, choosing variant B with a CTR of 8% when A’s is 8.3% is a very different mistake from choosing it when A’s is 15%. Again, it is the magnitude that is also of interest.

4. “Remind me, what’s a p-value again?”

Interpretability is crucial in communicating ideas, and articulating byzantine concepts like p-values takes some care. Would you rather report to your boss that “the probability that A is greater than B is 7%”, or perhaps “assuming the null hypothesis that A is equal to B, the probability that we would see a result at least this extreme in A versus B is 3%”?

Bayesians are also better equipped to posit stronger statements to make decisions from e.g. P(B-A > δ) where δ can be any difference we specify (set δ=0 for any improvement).


Hierarchical Modelling

  1. “Assumption: observations are related to one another”

Hierarchical modelling can be seen as a middle ground between pooling all measurements together and creating separate regressions for each treatment group. Through such a design we impose a model structure that encodes our assumption that the treatment groups of observations are related to one another. Depending on the scenario, this can be a sensible assumption (e.g. justified using the biological concept of common descent) or a weak one.

Everything we use to justify our model structure comes from assumptions extrinsic to the data — domain knowledge — that we still have to qualitatively and quantitatively verify. There is no free lunch; we have to describe our assumptions explicitly and encode them in the model structure or in the probability distributions used, e.g. you might ultimately have to justify why you apply Occam’s razor in designing such a model.

“All models are wrong, but some are useful” — George E. P. Box

2. “Rare species are not alone”

When we have a new sample with very few observations, we are able to borrow evidence from the rest of the population to make more tightly-bound inferences about it. In some cases the collection of new data is simply not feasible, e.g. the beak length of a new, rare species of bird.

Before (left) and after (right) applying a hierarchical design — we see “variance shrinkage” in our estimates of the unknown species’ beak depth.

A non-hierarchical design in this scenario will give a nonsensically wide credible interval (characterised by the high posterior density [HPD] interval) for a rare species. This is because we impose identical uninformative priors on each species that “let the data speak for itself” thus leaving us with large uncertainties. In the non-hierarchical case, the prior parameters are point estimates and not drawn from a parent distribution (e.g. uppermost half-Cauchys below).

3. “The generative story”

The generative story is as follows: the population as a whole has a parental hyperprior distribution (requiring external justification of common ancestry) that constrains the region from which the prior parameters of individual species are independently drawn. Hyperpriors one level up are good enough here to capture the hierarchical structure.

https://github.com/ericmjl/bayesian-stats-modelling-tutorial
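As a rough PyMC3 sketch of that generative story (not the tutorial’s exact model; the beak measurements are invented), with half-Cauchy hyperpriors on the scales:

```python
import numpy as np
import pymc3 as pm

# Invented beak measurements for three species; the third is the rare one
# with a single observation.
beaks = {
    "species_a":    np.array([10.1, 10.4, 9.8, 10.6, 10.2, 9.9, 10.3]),
    "species_b":    np.array([11.0, 11.3, 10.8, 11.1, 11.4, 10.9]),
    "rare_species": np.array([10.7]),
}
y = np.concatenate(list(beaks.values()))
species_idx = np.concatenate([[i] * len(v) for i, v in enumerate(beaks.values())])

with pm.Model():
    # Parental hyperpriors: the population constrains where each species'
    # own prior parameters can come from.
    mu_pop = pm.Normal("mu_pop", mu=10.0, sigma=5.0)
    sd_pop = pm.HalfCauchy("sd_pop", beta=5.0)

    # Each species' mean is an independent draw from the population-level prior.
    mu_species = pm.Normal("mu_species", mu=mu_pop, sigma=sd_pop, shape=len(beaks))

    sd_obs = pm.HalfCauchy("sd_obs", beta=5.0)
    pm.Normal("obs", mu=mu_species[species_idx], sigma=sd_obs, observed=y)

    trace = pm.sample(2000, tune=2000, return_inferencedata=False)

# The rare species borrows strength from the others: its posterior is far
# tighter than its single data point alone could justify.
print(trace["mu_species"].mean(axis=0))
print(trace["mu_species"].std(axis=0))
```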

Somewhat paradoxically, by introducing more parameters we ultimately relax the constraints put on the model.