Intro to Bayesian Probabilistic Reasoning

Published in Labs Notebook · Mar 9, 2023

By Jake Metzger and Alexandria Pabst — Accenture Labs

Setting the Scene

This post is an opinionated, whirlwind introduction to the motivations of Bayesian epistemology, probability, and statistical inference. We get into some philosophical and mathematical details here, which produce downstream scientific and technological differences as well. This will set the groundwork for our following discussions on the real-world applications of probabilistic reasoning.

Warning for the squeamish: there is some math ahead. But don’t be afraid, it’s cool math. Just look at Reverend Thomas Bayes here, rockin’ shades from the 1750s:

More seriously, just keep your eye on the high-level concepts as we walk through the math, and you’ll be just fine.

Bayes’ Rule

In probability, Bayes’ Rule is a consequence of the standard definition of conditional probability (“the probability of X given Y”, written P(X∣Y)). It can be stated a few equivalent ways, each emphasizing a different reading that will be explained below:

(1) P(X∣Y) = P(X)P(Y∣X)/P(Y)

(2) P(X∣Y) = P(Y∣X) × [P(X)/P(Y)]

(3) P(X∣Y) = P(X) × [P(Y∣X)/P(Y)]

(1) Bayes’ Rule as the Transformation of Geometric Areas

Formula (1) above lends itself to a nice geometric interpretation as a transformation of areas. Suppose that we know P(X), P(Y), and P(Y|X), and we want to know P(X|Y). Then we can construct the initial rectangle shown on the left, with total dimensions (1, P(Y)) such that its total area is P(Y). Note that the P(Y) edge is the short edge. We subdivide the short axis into [P(Y|X) + P(Y|~X)] and the long axis into [P(X) + P(~X)]. Then, note the green sub-rectangle formed by P(X)P(Y|X), whose area is in terms of known quantities.

Then consider a similarly constructed rectangle on the right, but note the purple subrectangle which represents the area P(X∣Y), our target that we don’t yet know. In order to transform the green area to the purple area, note that we need to multiply the vertical axis by P(X)/P(Y) and multiply the horizontal axis by 1/P(X). This means that the purple area is equal to P(X)P(Y∣X)∗P(X)/P(Y)∗1/P(X) which, after simplification, is precisely Bayes’ Rule in formula (1).

(2) Bayes’ Rule as Reversing Conditional Probabilities

In formula (2), we can see that we can determine the conditional P(X∣Y) in terms of the conditional P(Y∣X) by multiplying by a ratio of the marginal (unconditional) probabilities P(X) and P(Y). That is, Bayes’ Rule allows us to reverse the direction of the conditional: if we know the conditional of Y on X, we can obtain the conditional of X on Y.

(3) Bayes’ Rule as Updating Probabilities based on Data

In formula (3), we can also see Bayes’ Rule as updating P(X) to P(X∣Y), conditioning X on Y by multiplying by a ratio of the likelihood P(Y∣X) and the marginal probability P(Y). We can interpret P(X) as an estimate that is ignorant of data Y, and by conditioning X on Y, we obtain an updated probability for X that takes Y into account. Under this interpretation, P(X) is called the prior probability (think: probability before data), and P(X∣Y) is called the posterior probability (think: probability after data).
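Because formulas (1), (2), and (3) are algebraic rearrangements of a single identity, a few lines of Python with arbitrary example probabilities can confirm that all three readings compute the same number:

```python
# Numeric sanity check that formulas (1)-(3) are the same identity.
# The probabilities here are arbitrary example values.
p_x = 0.3            # prior P(X)
p_y_given_x = 0.8    # likelihood P(Y|X)
p_y_given_not_x = 0.2

# marginal P(Y) via the law of total probability
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

form1 = p_x * p_y_given_x / p_y        # (1) single-fraction form
form2 = p_y_given_x * (p_x / p_y)      # (2) reverse conditional, scaled by marginals
form3 = p_x * (p_y_given_x / p_y)      # (3) prior updated by likelihood/marginal

print(form1, form2, form3)  # all three agree
```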

“Traditional” Probability and “Bayesian” Probability

Bayes’ Rule, as a consequence of definitions, is used by everyone who makes reference to probability. However, what exactly “probability” means is open to some interpretation. One interpretation of probability is the traditional frequentist view, which interprets probabilities as (roughly speaking) frequencies of events observed in the long run.

As an example, a frequentist would say that the probability that a coin lands HEADS is the hypothetical proportion of heads to the total number of flips if we were to repeat the flips infinitely many times. A frequentist might also say that probabilities, properly speaking, belong to random events, not to hypotheses about model parameters. That is, there is no probability that, “The coin is a fair coin”. The hypothesis is either true or false; it is not a random variable under a frequentist interpretation of probability, because it is not the output of a random data generating process with a long-run frequency. With this view, we cannot talk about which hypothesis is more likely, but we can talk about how unlikely the data we gathered would be under each hypothesis and choose the hypothesis that best fits the data we observed.

For so-called Bayesians, however, probability is less about hypothetical, long-run proportions of a random data generating process. Rather, Bayesians are concerned with our state of information and the dynamics of moving from a state of ignorance to a state of knowledge. That is, Bayesian probability is epistemic. Even under this umbrella there are different kinds of Bayesian probability, some more subjective and some more objective, though Bayesian probability is frequently pigeonholed as only being the subjective kind. Such a characterization, though widely popular, is untrue for reasons that need not be expanded here. The point is that the machinery of Bayesian probability is tied to understanding Bayes’ Rule not only as a way to invert conditional probabilities (as in formula (2)) but as a way to update from a less-informed state of information to a data-informed one in a unified, probabilistically consistent way (embodied in formula (3)). To further clarify, consider the difference between synchronic and diachronic applications of Bayes’ Rule in the following subsections.

Synchronic Bayes

In the real world, there are two broad ways we acquire information. We either receive all of our relevant information at once, synchronically, or we receive it piecemeal over time, diachronically. For information received synchronically, everyone (Bayesian, non-Bayesian) agrees that probability estimates should follow formula (2), the synchronic interpretation of Bayes’ Rule.

The Marble Jar

Suppose I have a jar with 2 Yellow, 1 Red, and 1 Blue marble, equally sized and weighted.

All interpretations of probability will agree on things like the probability of picking a Blue marble (with replacement) after picking a Yellow one, P({Y,B}∣{Y}). All the relevant information is available synchronically. To drive the point home, we can use Bayes’ Rule (formula (2)) to invert the conditional as follows:

P({Y,B}∣{Y}) = P({Y}∣{Y,B}) × P({Y,B})/P({Y})

Noting that in this case P({Y}∣{Y,B}) = 1, P({Y,B}) = (2/4) × (1/4) = 2/16 = 1/8, and P({Y}) = 2/4, this yields the result (1/8)/(2/4) = 1/4. This is what we might guess without Bayes’ Rule by a simple counting argument: we’re guaranteed to keep the first Y marble, so the probability of {Y,B} should just be the probability of pulling the Blue, which is equal to the ratio of the count of Blues (1) to the total count of marbles (4).

Again, each approach to probability agrees with this result, Bayesian and non-Bayesian.
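This agreement is easy to check with a quick Monte Carlo simulation (a sketch using a simulated jar, nothing more):

```python
import random

# Monte Carlo check of the marble jar example: drawing with replacement
# from a jar of 2 Yellow, 1 Red, 1 Blue, estimate P(second draw is Blue
# given first draw was Yellow). Counting says this should be 1/4.
random.seed(42)
jar = ["Y", "Y", "R", "B"]

first_yellow = 0
then_blue = 0
for _ in range(100_000):
    first, second = random.choice(jar), random.choice(jar)  # with replacement
    if first == "Y":
        first_yellow += 1
        if second == "B":
            then_blue += 1

print(then_blue / first_yellow)  # ≈ 0.25
```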

Diachronic Bayes

However, there are some differing opinions when it comes to applying Bayes’ Rule diachronically, particularly starting from a state of ignorance.

The Mystery Marble Jar

Suppose I instead have a jar with 4 marbles that could be Yellow, Red, or Blue, but I don’t know in what proportions. What is the probability of picking a Yellow marble?

The traditional frequentist might say something like, “Because we don’t know the proportions of the marbles in the jar and we don’t (yet) have any pulls from the jar, we can’t say anything about the probabilities P(Y), P(R), or P(B). Without any data, we cannot compare any of the possible marble proportions in the jar”.

Those that subscribe to a Bayesian interpretation of probability, aka “Bayesians”, see probabilities as epistemic values as opposed to frequencies of a data generating process. A Bayesian in the above situation might utilize a Principle of Indifference to argue that, given their state of information, Y, R, and B are symmetric. And since P(Y) + P(R) + P(B) = 1, they can assign a tentative probability of 1/3 to each.

We could be more explicit about this assessment as follows. Without loss of generality, let us consider the probability of selecting a Yellow marble. There are 5 possibilities for the number of Yellow marbles (Ny) in the jar: 0, 1, 2, 3, and 4. By counting the possibilities for the other marbles (i.e. the possible states of the jar), we know that there is only 1 way for Ny=4, 2 ways for Ny=3, 3 ways for Ny=2, 4 ways for Ny=1, and 5 ways for Ny=0, for a total of 15 equally weighted jar states. Taking the expectation via weighted sum, we have:

E[Ny] = (4×1 + 3×2 + 2×3 + 1×4 + 0×5)/15 = 20/15 = 4/3

That is, we expect, on average, that there are 4/3 Yellow marbles in the jar. That makes our probability of drawing a Y out of 4 marbles P(Y)=1/3.
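The counting argument above is small enough to reproduce in a few lines of Python (a direct sketch of the same enumeration):

```python
# Enumerate the possible states of the mystery jar by the number of
# Yellow marbles it contains. For each count Ny, the remaining
# 4 - Ny marbles split between Red and Blue, giving (4 - Ny) + 1 states.
ways = {n_yellow: (4 - n_yellow) + 1 for n_yellow in range(5)}

total_states = sum(ways.values())  # 15 equally weighted jar states
expected_yellow = sum(k * w for k, w in ways.items()) / total_states

print(expected_yellow)      # 4/3 yellow marbles expected on average
print(expected_yellow / 4)  # P(Y) = 1/3
```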

Let us now pull a marble from the mystery jar and have it be Yellow.

A traditional frequentist might say that, while there is a significant amount of uncertainty, the proportion of Y, R, and B that best fits this data point is a jar full of Yellow marbles (#Y=4, #R=0, #B=0), noting that any other proportion would necessarily have a lower probability of yielding a Yellow marble. (Any real-life frequentist, though, would likely refuse this estimate as essentially meaningless, which is perhaps fair enough in practice, though it seems a tad unprincipled to move from refusing to define any probability estimate at all to an estimate that is totally unbalanced based on a single data point.)

By contrast, the Bayesian above would just update their previous estimate to a new one reflecting the new data. First, let’s consider this update using Bayes’ Rule. After that, we’ll sanity check the result using the Principle of Indifference.

Our form of Bayes’ Rule is formula (3) above. Writing X for the event that our next draw is Yellow and {Y} for the observed Yellow draw:

P(X∣{Y}) = P(X) × P({Y}∣X)/P({Y})

Note that P({Y}) = P(Y) = 1/3 in this case. The numerator is calculated as follows, splitting on whether the two draws happen to select the same marble (probability 1/4) or different marbles (probability 3/4):

P(X)P({Y}∣X) = P(both draws are Yellow) = (1/4)(1/3) + (3/4)(1/3)(1/3) = 1/12 + 1/12 = 1/6

Putting these together:

P(X∣{Y}) = (1/6)/(1/3) = 1/2

That is, the Bayesian has updated their probability estimate for drawing a Yellow marble from 1/3 up to 1/2 based on the available data. Note that this is a full 50 percentage points below the frequentist maximum likelihood estimate of 1!

To understand why the Bayesian estimate makes sense, we can also calculate this new estimate directly, using the previously mentioned Principle of Indifference. We now know, for sure, that at least one marble is Yellow. Our uncertainty spans evenly over the three remaining marbles in the jar, each of which, by the same indifference reasoning as before, has a 1/3 probability of being Yellow. So, if we directly estimate how many Yellow marbles are in the jar, we have 1 + 3×(1/3) = 2 out of 4 marbles, which is a probability of 1/2 for drawing a Yellow marble.
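As a mechanical check, we can enumerate every equally likely possibility directly. This sketch assumes, as the symmetry argument above does, that each marble’s color is independently and uniformly Yellow, Red, or Blue:

```python
from itertools import product

# Enumerate all 3**4 equally likely colorings of the 4 marbles, then all
# ordered pairs of draws (with replacement). Among outcomes where the
# first draw is Yellow, count how often the second draw is also Yellow.
yellow_first = 0
yellow_both = 0
for jar in product("YRB", repeat=4):
    for i in range(4):          # marble picked on the first draw
        for j in range(4):      # marble picked on the second draw
            if jar[i] == "Y":
                yellow_first += 1
                if jar[j] == "Y":
                    yellow_both += 1

print(yellow_both / yellow_first)  # 0.5
```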

Isn’t Bayesian probability too subjective for real work?

Often one will hear Bayesian probability described as subjective, up to the whims of whoever is doing the inference. This isn’t strictly true: while subjectivist Bayesianism is perhaps the historically dominant Bayesian view, there are a host of modern objective interpretations and methods of Bayesian probability, such as the use of uninformative or maximum entropy prior distributions (I, tentatively, subscribe to the latter camp). The essence of Bayesian probability is not subjectivity; it is rather how to apply probabilistically consistent rules of reasoning given one’s state of information, whether one is man or machine.

Often this concern about subjectivity is driven by the companion assumption that traditional frequentist probability is more objective because it purports to represent observable, repeatable frequencies. Unfortunately, this assumption is also false. One example in practice is the odd dependence of frequentist statistical results on the sampling intentions of the researcher: whether a frequentist statistical result is “significant” depends intimately on the intentions of the researcher gathering the experimental data rather than on the data itself. Failure to account for this fact leads to “P-hacking”, a recurring bane of academic publication and experimental replication.

For example, suppose a frequentist has collected coin flip data with 22 HEADS and 13 TAILS. Whether or not this result is significant depends on whether the researcher intended to stop at 35 coin flips, or instead was operating under the (potentially unconscious) intention to flip the coin until the results reached a target significance threshold, or otherwise followed a rule that deviated from the experimental design assumed by the underlying statistical machinery. That is, two researchers with the same data in hand can report different significance levels based on their private data collection intentions at the time the data was collected. This is partly why, in modern practice, high-quality frequentist data collection procedures must often be pre-registered. In the machine learning and data science world, we often have little to no idea how a given dataset was gathered, much less the sampling intentions of the researchers producing it. We hope that our datasets are gathered unproblematically, but in practice datasets are messy and biased, and we have to find ways to tolerate or correct for these problems. Bayesian approaches rely on fewer contingencies of data collection, focusing instead on the data in hand as opposed to what the data would have looked like had the dataset been repeatedly collected under the same sampling intentions.
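To make this concrete, here is a small simulation sketch (all numbers are illustrative) of how “peeking” at the data with the intention of stopping at significance inflates the false positive rate of a test on a perfectly fair coin:

```python
import random
from functools import lru_cache
from math import comb

# We flip a FAIR coin and test H0: p = 0.5 with a two-sided exact
# binomial test. A "fixed" researcher tests once at n = 60; a "peeking"
# researcher tests after every flip from n = 20 onward, hoping to stop
# as soon as p < 0.05. Both see the same kind of data.

@lru_cache(maxsize=None)
def p_value(n, k):
    """Two-sided exact binomial p-value under p = 0.5."""
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2**n
    return min(p, 1.0)

random.seed(0)
TRIALS, N_MAX = 2000, 60
fp_fixed = fp_peek = 0
for _ in range(TRIALS):
    heads = 0
    rejected_early = False
    for n in range(1, N_MAX + 1):
        heads += random.random() < 0.5
        if n >= 20 and p_value(n, heads) < 0.05:
            rejected_early = True  # the peeker would have stopped here
    if p_value(N_MAX, heads) < 0.05:
        fp_fixed += 1
    if rejected_early:
        fp_peek += 1

print(fp_fixed / TRIALS)  # at or below the nominal 0.05
print(fp_peek / TRIALS)   # substantially higher, despite a fair coin
```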

Furthermore, it’s worth pointing out that even though Bayesians and frequentists disagree on the somewhat niche, philosophical question of the nature of probability, their probability estimates agree in the limit. That is, as more concrete data is provided, Bayesian and frequentist probability estimates converge to each other. This is true even for the truly subjectivist Bayesian who allows arbitrary (nonzero) prior distributions: insofar as one is probabilistically consistent in using Bayes’ Rule to update one’s probability estimates, data will eventually overwhelm a mistaken prior estimate. Note that this is precisely the regime in which frequentist guarantees are defined: the long-run limit.

Meanwhile, in the small-to-medium data regime, where prior distributions do play a role, you get the natural regularization of a prior distribution (which can either be informed by previous expectations or be uninformative), and you get the very natural propagation of uncertainty using Bayes’ Rule, rather than ad hoc frequentist uncertainty estimates that typically rely on the Central Limit Theorem, which only strictly holds in the long run (which, in practice, we are never actually in).

Colin Howson and Peter Urbach in their classic defense of Bayesian scientific reasoning conclude the following,

Our view, and we believe the only tenable view, of the Bayesian theory is of a theory of consistent probabilistic reasoning. Just as with the theory of deductive consistency, this gives rise automatically to an account of valid probabilistic inference, in which the truth, rationality, objectivity, cogency or whatever of the premises, here prior probability assignments, are exogenous considerations, just as they are in deductive logic. (Scientific Reasoning: The Bayesian Approach, p. 301)

That is, Bayesian inference is fundamentally about probabilistic consistency in inference, and that in this way it’s as objective as the rules for deductive logic. Just as we do not expect deductive logic to solve the subjectivity of accepting the premises of an argument, we should not expect probabilistic inference to solve the purported subjectivity of defining prior distributions. Indeed, it can be argued that the dominant non-probabilistic alternative simply makes its sources of subjectivity more implicit, and thus more insidious.

(Also see: Kolmogorov Axioms, Cox’s Theorem, Dutch Books, the Principle of Indifference, the Principle of Maximum Entropy, and the Central Limit Theorem.)

Bayesian Statistics: From Bayes’ Rule to “Bayes Rules!”

If you made it through the above section, congratulations! Getting into a little of the nitty-gritty is important to understanding what makes Bayesian stats and Bayesian ML different. On the one hand, Bayesian stats just uses Bayes’ Rule to answer statistical questions and, in that sense, is simpler than moving from probability theory to traditional, non-probabilistic statistics with its seemingly ad hoc inference rules. Below, I’ll explain (and oversimplify) some highlights of a Bayesian statistical approach that may not be immediately obvious from the above discussion.

First, Bayesian statistics aims to directly answer, “How likely is my hypothesis given my data?” This may seem a small point, but it’s worth noting that traditional statistics does not directly answer this question. Instead, it answers the question, “How likely is my data given my hypothesis?”, which is the converse question. A traditional hypothesis test prescribes that if a test statistic extracted from the data is too unusual (as specified by a significance level) under a null hypothesis, we should reject the null hypothesis. This procedure, when repeated, provides a certain long-run, probabilistic guarantee on its error rate, but it says little about whether the specific hypothesis you rejected is more likely than any alternative given your data.

A Bayesian analysis, however, can answer whether a default hypothesis is more likely than its alternatives, albeit with a different set of probabilistic guarantees. This is because a Bayesian analysis can treat competing hypotheses themselves as (epistemic) random variables about which we can be uncertain, as represented by a probability distribution.

Second, Bayesian statistics is concerned with posterior probability distributions as opposed to maximizing likelihoods to fit data. The result of a fully Bayesian analysis is not just a single most likely value but a full posterior distribution over values, each weighted in proportion to its probability.

As an example, consider Bayesian linear regression:

Here, we don’t just get a single fit line but a range of fit lines that are consistent with the given data, giving an intuitive sense of uncertainty about the true data generating process.
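As a sketch of what this looks like in code (assuming a conjugate Gaussian model with a known noise level; the data and hyperparameters here are made up for illustration), we can compute the posterior over the regression weights in closed form and sample several plausible lines from it:

```python
import numpy as np

# Minimal Bayesian linear regression: Gaussian prior on the weights,
# known Gaussian noise. The posterior over weights is available in
# closed form, so we can sample many plausible fit lines from it.
rng = np.random.default_rng(0)

# synthetic data from a "true" line y = 1 + 2x plus noise
x = np.linspace(0, 5, 20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=x.shape)

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
alpha, beta = 1.0, 1.0 / 0.5**2            # prior precision, noise precision

# closed-form posterior over weights w = (intercept, slope)
S_inv = alpha * np.eye(2) + beta * X.T @ X
S = np.linalg.inv(S_inv)
m = beta * S @ X.T @ y

# draw several plausible fit lines from the posterior
lines = rng.multivariate_normal(m, S, size=10)
print(m)  # posterior mean weights, close to (1, 2)
```

Plotting each row of `lines` as a line over `x` gives exactly the “range of fit lines” picture described above.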

Third, a major theme in Bayesian statistics is uncertainty propagation. Typically we want to estimate not only the most likely value of a parameter but also some measure of how reliable we take that estimate to be. We then want to carry that uncertainty into any downstream reasoning as well, keeping it in our final results. That is, if we’re uncertain about variable A, and A causes B which causes C, we should have some amount of uncertainty about C that propagates forward from A. If we later want to investigate a variable D, where C causes D, then, because we already have the uncertainty of C, we can just use Bayes’ Rule to propagate C’s uncertainty to D. Because Bayesian statistical approaches update distributions to distributions with data, it’s easy to propagate uncertainty accordingly, like stacking building blocks of analysis.
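A minimal sketch of this idea (the variables A, B, C and their relationships here are entirely made up) is to push samples of A’s distribution through the downstream computations, so that C inherits A’s uncertainty automatically:

```python
import numpy as np

# Uncertainty propagation by sampling: if we are uncertain about A, and
# B and C are computed downstream of A, pushing samples of A through the
# chain gives us a full distribution for C, not just a point estimate.
rng = np.random.default_rng(1)

a = rng.normal(10.0, 2.0, size=100_000)  # posterior-like uncertainty in A
b = 3.0 * a + 1.0                        # B caused by A
c = np.sqrt(np.abs(b))                   # C caused by B

print(c.mean(), c.std())  # the downstream estimate carries A's uncertainty
```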

However, traditional frequentist approaches to uncertainty most typically use maximum likelihood estimates together with confidence intervals. Because confidence intervals do not represent distributions of uncertainty (just ranges that a value is either inside or outside), propagating that uncertainty forward into subsequent analyses is nontrivial. Also, as before, frequentist uncertainty typically answers the converse of the question we actually care about: a confidence interval describes the long-run behavior of the interval-constructing procedure over hypothetical repeated samples, not how confident we should be in a hypothesis given our data. Thus the forward propagation of uncertainty does not follow a simple rule in the way described above. This also comes up, for example, in the contrast between Bayesian and traditional approaches to meta-analysis, where we try to combine information from multiple experiments into a single conclusion. The Bayesian approach to meta-analysis typically consists of just applying Bayes’ Rule, whereas traditional meta-analyses have more complicated requirements (because, again, frequentist uncertainty is intimately tied to long-run frequencies generated by particular experimental setups and sampling intentions, not directly to what the actual data in hand says).

Closing Remarks

In this introduction, we’ve discussed how a Bayesian approach to probability differs from the traditional approach by focusing on updating one’s state of information using Bayes’ Rule. We’ve also discussed how Bayesian statistics, as an extension of Bayesian probability, allows for the natural propagation of uncertainty through different parts of our analysis and how it enables us to directly answer the questions we care about (“How likely is my hypothesis given my data?”) rather than settling for point or interval estimation on the converse question (“How likely is my data given my hypothesis?”). We touched on some issues for non-Bayesians in scientific contexts and pushed back on the assumption that traditional probability is more “objective” than Bayesianism in practice. All of this sets the groundwork for accelerating a Bayesian approach using machine learning, deep learning, and graphical learning and integration with utility theory and causal inference, which will be covered elsewhere.

Further Resources (Books and Articles)

  • Howson, C. and Urbach, P. (2006). Scientific Reasoning: The Bayesian Approach. Open Court.
  • Van Horn, K. S. (2003). Constructing a logic of plausible inference: a guide to Cox’s theorem. International Journal of Approximate Reasoning, 34(1):3–24.
  • Chechile, R. A. (2020). Bayesian Statistics for Experimental Scientists: A General Introduction Using Distribution-Free Methods. MIT Press.
  • Pearl, J., Glymour, M., and Jewell, N. P. (2016). Causal Inference in Statistics: A Primer. Wiley.
  • Titelbaum, M. G. (2022). Fundamentals of Bayesian Epistemology 1: Introducing Credences. Oxford University Press.
  • Titelbaum, M. G. (2022). Fundamentals of Bayesian Epistemology 2: Arguments, Challenges, Alternatives. Oxford University Press.
  • Williamson, J. (2010). In Defence of Objective Bayesianism. Oxford University Press.

Accenture Personae

  • Jake Metzger: Associate Principal R&D Scientist with Accenture Labs’ Systems and Platforms Team: https://medium.com/@jake.metzger
  • Alexandria Pabst: Associate Principal R&D Scientist with Accenture Labs’ Digital Experiences Team: https://medium.com/@alexandriapabst
  • Hayden Freedman: PhD Candidate in Software Engineering at UC Irvine; Research Associate with Accenture Labs’ Systems and Platforms Team: https://medium.com/@hfreedma
  • Neda Abolhassani: Principal R&D Scientist with Accenture Labs’ Systems and Platforms Team: https://medium.com/@neda.abolhassani
  • Ana Tudor: Technology R&D Specialist with Accenture Labs’ Systems and Platforms Team
  • Sanjoy Paul: IEEE Fellow and Managing Director of Accenture Labs’ Systems and Platforms Team
