Trust the Process, Doubt the Procedure

NBA playoff win chances via Bayesian inference

A survey of methods in R with code

Daniel McNichol
Coεmeta

--

This is a parable about simple, straightforward questions of fact, & how they often devolve into complex matters of data processing, analysis & decision-making under fragile epistemic limits, in the real world.

(This is Part 3 of 4, on Bayesian alternatives to frequentist hypothesis tests.
Each post stands on its own.
Parts 1 & 2 are summarized in the brief recap below.
Part 4 on decision theory is here.)

Run it Back Twice: Parts 1 & 2 Recap

Wagers, Probability, Data Wrangling & Frequentist Statistical Tests

via NBC Sports

In Part 1, I described a bet between myself & my friend / colleague Nat, in which he asserted that away teams in NBA Playoff series are more likely to win the series if they WIN & THEN LOSE the first two away games, than if they split the first two games in the opposite order, despite an identical series score heading into game 3. I demurred, sensing a gambler’s fallacy at play:

I go on to describe my reasoning & the fundamental probability theory underlying it, which I won’t repeat here.

I also describe the semi-complex data collection & processing required to address the question at hand, utilizing 538’s excellent Historical NBA Elo dataset (CC BY license), & surface the results & initial summary statistics in an interactive Data Studio dashboard, reproduced below (with default filters set to conditions best suited to the original wager, under modern, post-2002 NBA Playoff rules).

As can be seen, between 2003–2015, over 84 playoff series, away teams splitting the first 2 games of the series had essentially the same win %, regardless of the order of the win & loss (39.1% to 39.5%).

I was tempted to declare victory here, but decided to try to salvage the 1984–2002 best of 7 playoff series to increase the sample size. You can see for yourself by adjusting the filters above, but here are the topline results:

Now things get a bit muddled. At a sample size of 134 total series, the win-then-lose segment has pulled into a 43% to 36% advantage, all of which accrued between 1984–2002 (since we’ve already seen the groups were essentially tied from 2003 onward):

Between 1984–2002, first-round NBA Playoff series were best of 5, & are thus excluded here, which explains some of the lower incidence of series beginning with split results over that period. (The first round is also by nature the round with the most total series in play, so excluding it costs our sample the most series.)

But is this advantage real? Or simply expected random fluctuation around otherwise equal winning chances?

This sounds like a question about statistical significance, but, as you might have heard, scientists are rising up against it, as the (mostly Bayesian) statisticians have long advocated.

In Part 2 I explored why, showing how traditional frequentist statistical tests unanimously support my side of the bet, but suffer from chronic issues of hidden assumptions, misunderstanding & mechanical misapplication.

via xkcd

Cool cool. Traditional so-called Null Hypothesis Significance Testing (NHST) is on the outs, but surely the Bayesian alternatives will save us?

oh…oh no.

Well then.

How to proceed?

Bets must be settled. Decisions must be made. Science must advance.

Enter the hairy field of decision theory/science/analysis.

Comparative Data Analysis

Bayesian approaches

Bayesian statistics represent a fundamentally different philosophical orientation to questions of probability than that of traditional frequentist statistics. Historically, frequentism arose later than Bayesianism but came to dominate scientific practice in the 20th century, before meeting new skepticism around the early 21st century. The technical & philosophical differences, though fundamental, can be subtle & abstruse to the uninitiated, & are beyond the scope of this post (but see here for more).

via unsplash

For our purposes, there are two major points of differentiation:

  1. Explicit vs hidden assumptions
    In part 2, we saw how vulnerable frequentist methods can be to obscure, unexamined assumptions & naive conventions. Bayesian methods also require assumptions (notably in the form of ‘prior probabilities’), but make them explicit & force us to reckon with them by formally specifying them in each application.
  2. Intuitive vs inscrutable outputs
    In part 2 we also grappled with some of the curious creatures of frequentist inference: p-values (which nearly no one can clearly explain), statistical power (which is typically either ignored or misunderstood), multiple comparison corrections (in which the validity & interpretation of a test result changes simply by running another test on the same data, or intending to, or by “peeking” at results before all data is collected) & null hypotheses (which we can never truly prove or accept). Bayesian methods, by contrast, output a probability estimate which means what most people would expect it to mean: the likelihood of being true, or an appropriate degree of belief, given the parameters of the analysis. Thus it does not require the contortionist language & logic of frequentism, but can be stated more plainly & intuitively, & so can be more readily understood by a variety of audiences.

So let’s see it in action.

Below, we’ll survey the following 3 R packages, then suggest further resources for going deeper:

  1. BEST (Bayesian Estimation Supersedes the t-test)
  2. BayesFactor
  3. BayesAB

Bayesian hypothesis testing

As alluded to above, statistical significance & hypothesis testing are foreign concepts to ‘pure’ Bayesian analysis, but there have been efforts to translate these conventions to leverage the advantages of Bayesian methods without requiring a complete paradigm shift for practicing (data) scientists.

BEST
One such approach developed by John Kruschke is explicitly (if presumptuously) christened “Bayesian Estimation Supersedes the t-test” or BEST. In addition to the R package, John created a wealth of supporting resources, including videos, a journal article & an entire book.

From his article’s abstract:

Bayesian estimation for 2 groups provides complete distributions of credible values for the effect size, group means and their difference, standard deviations and their difference, and the normality of the data. The method handles outliers. The decision rule can accept the null value (unlike traditional t tests) when certainty in the estimate is high (unlike Bayesian model comparison using Bayes factors). The method also yields precise estimates of statistical power for various research goals. The software and programs are free and run on Macintosh, Windows, and Linux platforms.

Usage of the BEST method is straightforward (if a bit limited & inflexible) as it is primarily intended as a simple Bayesian drop-in replacement for the common t-test in traditional hypothesis testing scenarios.

Following the vignette for installation & basic usage, we can apply BEST with vanilla default settings as follows:
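
(A minimal sketch of that default call, assuming w_l_wins & l_w_wins are 0/1 vectors of series outcomes for the two groups (1 if the away team won the series), as used with the other packages later in this post:)

#install.packages("BEST")
library(BEST)
# fit the default BEST model to the two groups of series outcomes
BESTout <- BESTmcmc(w_l_wins, l_w_wins)
# main output: posterior distribution of the difference of means
plot(BESTout)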

Main output:

What does this tell us?

The “difference of means” of our two groups is the difference in series win % between the win-then-lose teams & lose-then-win teams. We already knew this difference was about .08 or 8%, simply the difference between 43.3% & 35.8%. But this model gives us a probability distribution around that average difference, which we can use to quantify our degree of belief or level of certainty in this difference (or any other difference we might wish to consider), as well as establish a quantitative decision rule for the purpose of accepting or rejecting hypotheses of interest.

For instance, the plot also specifically annotates the probability mass (or ‘density’) on either side of 0 (19% < 0 < 81%). This essentially quantifies a probabilistic estimate that one of these groups’ true win % is above or below the other’s. Per this model output, there’s an 81% chance that the W_L group has a higher true win % than the other.

So does Nat win the bet? Not quite.

For one thing, you’ll also notice the “95% HDI” bold black line along the bottom of the histogram. That is the Highest Density Interval, which shows the range of values containing most of the distribution (95% here), where every point inside the interval has higher probability than any point outside the interval. This is an example of a credible interval, which is a Bayesian counterpart to frequentist confidence intervals, & can be used for similar purposes.

Here, the 95% HDI overlaps 0, meaning a conventional decision rule would conclude that a hypothesis alleging a “true difference in win %” cannot be accepted with high confidence. Further, we see that the 95% credible interval spans from -10% to +25%, a wide range of probable differences in win %.

What if we narrow the HDI to only 80% of the most probable values?

plot(BESTout, credMass = .8)

The 80% HDI still includes 0 or “no difference”, ranging from -4% to +19% difference in win %. (We’d have to lower our confidence level to about 60% before the HDI no longer contains 0, which is only marginally better than a coin flip.)
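
(These intervals can also be pulled numerically from the posterior samples: BESTmcmc returns an MCMC data frame with columns mu1 & mu2 for the two group means, & the hdi() function comes from the HDInterval package that BEST loads. A rough sketch, assuming the BESTout object from above; values are approximate:)

# approximate HDIs of the difference of means, straight from the posterior samples
hdi(BESTout$mu1 - BESTout$mu2, credMass = 0.95) # roughly -0.10 to +0.25
hdi(BESTout$mu1 - BESTout$mu2, credMass = 0.80) # roughly -0.04 to +0.19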

So far we’ve been using this Bayesian model in a very frequentist way, essentially attempting to reject a null hypothesis (“0% difference”) at various (arbitrary) probabilistic thresholds (95% or 80% intervals). This doesn’t take advantage of one of the major benefits of Bayesian analysis: the full posterior probability distribution (as opposed to simple point estimates at various ‘confidence’ or ‘significance’ levels).

For instance, what if we’re only interested in the probability of a particular range of differences in win %? E.g., “How probable is a win % difference greater than 5% in either direction?”. We can answer that question by specifying a “ROPE”, or “Region of Practical Equivalence” around 0:

plot(BESTout, ROPE = c(-0.05,0.05) )

This tells us that there is a 30% probability that the true difference of means (i.e. win %) is between -5% & +5%, which we have defined as “practically equivalent to zero”. This leaves a 70% probability that the true win % difference is beyond 5% in either direction.

But the original wager was actually that the W_L teams would have a higher win % than L_W teams, & at one point we did specify a threshold of at least +5%. We can also produce a precise probability estimate for this by specifying a comparison value of .05:

plot(BESTout, compVal = .05)

We see that there is a 62.5% probability of a > 5% difference in win % (again, closer to a coin flip than a sure thing).
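
(Each of these probabilities is simply a proportion of the posterior samples, so we can also compute them directly. A rough sketch, again assuming the BESTout object & its mu1 / mu2 columns; exact values will vary slightly from run to run:)

# proportions of posterior samples beyond the thresholds of interest
post_diff <- BESTout$mu1 - BESTout$mu2
mean(post_diff > 0) # ~0.81: W_L true win % exceeds L_W's
mean(abs(post_diff) < 0.05) # ~0.30: within the +/- 5% ROPE
mean(post_diff > 0.05) # ~0.625: W_L advantage greater than 5%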

So where does this leave us?

We’ve noted that (per this default application of BEST):

  • There is an 81% probability that the W_L group has a higher true win %
  • Yet 95% & 80% HDIs both contain 0, meaning we can’t rule out “no difference in win %” with high confidence. (In fact, we’d need to lower our ‘confidence’ level to 60% to pass this threshold)
  • There’s a 30% probability that the true difference in win % is within 5% of 0 in either direction, which we might define as “practically 0”. But more relevantly, there’s a 62.5% probability that W_L teams have a >5% greater true win % than L_W teams.

This gives us many probabilistic estimates, but no clear or definitive verdict. This is, in fact, the double-edged sword of Bayesian inference: it permits much more flexible, intuitive & nuanced probabilistic analysis of scientific questions, but is less naturally suited to simplistic binary determinations (‘true’ or ‘false’, ‘significant’ or not, etc).

(Recall our two points of differentiation of Bayesian analysis from above:
1: Explicit vs hidden assumptions & 2: Intuitive vs inscrutable outputs)

So while the weight of evidence in this Bayesian analysis leans towards Nat’s side of the bet, there is no simple binary decision in sight. (Compare this to our previous frequentist hypothesis tests, which were unanimously in my favor, but suffered from simplistic, arbitrary & often hidden decision rules & fragile assumptions.)

This is why BEST employs the default 95% HDI & its implicit decision rule (which, if uncritically applied, would also support my side of the bet). One of the benefits of Bayesian analysis, though, (as mentioned) is richer & more nuanced probabilistic analysis, which promotes more careful & deliberate evaluation of hypotheses than is often the case with rote, mechanistic application of frequentist hypothesis tests (as seen in Part 2).

For instance, we can inspect the full standard BEST output via the simple call:

plotAll(BESTout)

(Refer to the vignette for details on these outputs.)

In addition to providing a wealth of information capable of addressing a wide variety of questions, the outputs are straightforward probabilities & meant to be interpreted as such. For example:

There’s an 81% chance that W_L teams have a higher win % than L_W teams. However, that chance drops to 62.5% if we’re interested in a 5% or greater win % advantage.

Compare this to a comparably valid statement of frequentist inference:

We fail to reject the null hypothesis at the 5% alpha level. With a p-value of .377, we’d expect a result at least this extreme to occur purely by chance 38% of the time. Thus the result is not statistically significant.

So we’ve seen how BEST’s application of Bayesian inference to traditional hypothesis testing highlights two of its main advantages: explicit assumptions & intuitive outputs. But we haven’t yet mentioned perhaps the most (in)famous feature of Bayesian probability: the incorporation of prior information.

In keeping with its intent to offer a drop-in t-test replacement, BEST will automatically derive minimally informative priors by default, & does not offer especially flexible customization on this front. So we’ll move on to the next method.

Diagnostic plots of samples from the posterior via the BayesFactor package

BayesFactor
Another adaptation of Bayesian probability to traditional hypothesis testing applications comes in the form of so-called “Bayes Factors”. Put simply, a Bayes Factor is just a ratio of how likely the observed data are under each of two competing hypotheses (also known as the ‘likelihood ratio’). As such, the Bayes Factor is used to quantify the credibility of, & ultimately decide between, competing hypotheses (as encoded in statistical models). Thus Bayes Factors are often proposed as Bayesian alternatives to frequentist p-values (& are also met with similar criticisms).

The BayesFactor R package from Richard D. Morey enables straightforward application of this method to a variety of traditional statistical tests & models, explained in depth in the documentation page as well as a series of blog posts.

Similar to BEST above, BayesFactor also provides a simple drop-in replacement for t-tests by way of the ttestBF() function:

#install.packages("BayesFactor")
library(BayesFactor)
bf <- ttestBF(w_l_wins, l_w_wins)
bf

Output:

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.2630487 ±0%
Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS

What does this tell us?

It shows the Bayes factor likelihood ratio in the form of the “Alternative” hypothesis (numerator) vs the “Null” hypothesis (denominator) — using default parameters.

Recall from Part 2 of this series, the null hypothesis is my hypothesis: that there is no meaningful difference in series win % between our groups. The alternative hypothesis (or Nat’s hypothesis) is the counterpart: that there is a meaningful difference.

The Bayes factor in this case is 0.263, indicating that the Alt. hypothesis is less likely than the Null to be true (equal likelihood would yield a Bayes Factor ratio of 1). We can invert & restate this finding in terms of the likelihood of the Null vs the Alt by simply inverting the Bayes factor object (aka taking the reciprocal):

1 / bf # or equivalently t(bf)

Output:

Bayes factor analysis
--------------
[1] Null, mu1-mu2=0 : 3.801578 ±0%
Against denominator:
Alternative, r = 0.707106781186548, mu =/= 0
---
Bayes factor type: BFindepSample, JZS

Here, comparing the likelihood ratio of the Null to the Alt hypothesis, we see that the Null is 3.8x more likely to be true (which is the same as taking the reciprocal of 0.263). (The r value in these outputs relates to the scale of the Bayesian prior, here left to the default setting.)
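
(That prior scale can be adjusted via ttestBF’s rscale argument, which accepts the presets "medium" (the default, ~0.707), "wide" & "ultrawide", or a numeric value. A quick sketch, not run for this analysis:)

# e.g. a wider Cauchy prior on the standardized effect size
ttestBF(w_l_wins, l_w_wins, rscale = "wide")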

According to conventional interpretations, a Bayes factor of this size is just barely considered ‘substantial’, but not nearly ‘strong’ or ‘decisive’. Thus we have a weak advantage for the Null (aka me), but one that is at least “worthy of mention”. (This generally agrees with the frequentist NHST findings in Part 2, which supported the Null with weak effect sizes, as well as BEST’s conventional decision rules discussed above.)

“K” here stands for the Bayes factor. Via wikipedia
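
(Another way to read a Bayes factor: it scales prior odds into posterior odds. A back-of-envelope sketch, under the added assumption that we grant both hypotheses equal prior odds:)

# posterior odds = Bayes factor x prior odds
bf_null_vs_alt <- 3.8 # from the inverted output above
posterior_odds <- bf_null_vs_alt * 1 # assuming 1:1 prior odds
posterior_odds / (1 + posterior_odds) # ~0.79 posterior probability of the Null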

Let’s pause here to consider some advantages of Bayes factors over traditional frequentist methods:

As described in the previous installment: one of the major complaints about frequentist NHST is that we can never truly accept the Null hypothesis; we can only “fail to reject” it. As I wrote then:

This not only subjects us to the sort of tortuous language & logical backflips endemic to Fisherian frequentism, but leaves us without a substantive conclusion or basis for decision-making, per se.

Bayes factors, however, explicitly & directly compare the relative likelihoods of competing hypotheses, given the weight of available evidence, allowing us to clearly quantify & express an empirical preference.

Another gripe with frequentist NHST involves its prohibition / penalization of ‘multiple comparisons’ or ‘multiple hypothesis tests’ using the same data. Bayes factors (& Bayesian methods more generally) have no such prohibition. In fact, such practices are encouraged.

As with BEST above, we might want to iteratively test or explore more nuanced hypotheses, such as the likelihood of only positive effects (e.g. only positive differences in win %), or differences only within a specific range of practical significance.

For instance, to evaluate a hypothesis of only positive differences, we can simply define a nullInterval where the effect size (‘d’) is between 0 & positive infinity:

bf_pos <- ttestBF(w_l_wins, l_w_wins, nullInterval = c(0, Inf))
bf_pos

Output:

Bayes factor analysis
--------------
[1] Alt., r=0.707 0<d<Inf : 0.4201821 ±0%
[2] Alt., r=0.707 !(0<d<Inf) : 0.1059153 ±0%
Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS

Note that we now get two Alt. hypotheses, both still evaluated against the point-null hypothesis of ‘0 difference between groups’. The first Alt ([1]) is the one we requested (a positive effect between 0 & infinity), while [2] is simply the inverse (a negative effect between negative infinity & 0). Both compare unfavorably to the Null, but the positive effect ([1]) fares better than the negative ([2]). In fact, we can produce a Bayes factor of these Bayes factors as follows:

bf_pos[1] / bf_pos[2]

Output:

Bayes factor analysis
--------------
[1] Alt., r=0.707 0<d<Inf : 3.967151 ±0%
Against denominator:
Alternative, r = 0.707106781186548, mu =/= 0 !(0<d<Inf)
---
Bayes factor type: BFindepSample, JZS

This tells us that the hypothesis of positive effect is nearly 4x more likely than the negative effect hypothesis.

We should remember, though, that both compare poorly to the Null hypothesis of no effect. We can display this by simply plotting our Bayes factor object. But while we’re at it, why not compare all three of our Alt. hypotheses evaluated so far? We can again do this simply by concatenating the bf objects before plotting:

plot( c(bf, bf_pos) )

Here we have Bayes factors plotted for our 3 Alt hypotheses vs the Null. Each is < 1, indicating smaller likelihoods than the Null (no effect). The “negative” Alt is the smallest, followed by the hypothesis of effects of either sign, with the “positive” Alt showing the largest likelihood among Alt hypotheses (closest to 1).
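
(If we want the numbers behind this plot rather than the picture, extractBF() returns them as a data frame. A small sketch, assuming the bf & bf_pos objects from above:)

# pull the Bayes factors out of the combined object
extractBF( c(bf, bf_pos) )$bf # ~0.263, 0.420, 0.106, each vs the point Null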

As with the BEST method above, we see the richer, more flexible & more intuitive nature of Bayesian inference on display here. But also similar to BEST, BayesFactor employs standard priors with limited flexibility, considerably constraining the full scope of Bayesian analysis. (This is a reasonable tradeoff to facilitate drop-in alternatives to traditional NHST tests.)

Further, as alluded to in the preface above, many Bayesian critics of frequentist hypothesis testing approaches are equally skeptical of these Bayesian alternatives, along with the entire NHST paradigm. In addition to the aforementioned frequentist flaws (hidden assumptions, rampant misunderstanding & mechanical misapplication), the thrust of these broader critiques of NHST has to do with the artificiality & scientific irrelevance of Null Hypotheses themselves. I.e., a hypothesis of exactly 0 effect is almost never true, or scientifically interesting. And so an entire scientific paradigm built upon such an edifice is fundamentally, fatally flawed. Traditional Bayesian analysis is thus proposed as especially well-suited to displace this paradigm.

BayesAB

A/B testing, the web-native incarnation of the traditional Randomized Controlled Trial, is an obvious & increasingly prevalent use-case for NHST. And due to the continuous, iterative nature of web / digital analytics, where data is collected as long as users interact with the product, Bayesian methods are especially well-equipped to analyze these experiments. Aforementioned issues with frequentist testing such as multiple comparisons, confusing power analyses & interpretability of results all come to a sharp point in modern web testing & experimentation. Thus the BayesAB package was developed by Frank Portman to provide easy access to Bayesian methods specifically from the perspective of A/B testing.

Via the BayesAB website intro (bolding mine):

For example, repeated testing is an issue in A/B tests. This is when you repeatedly calculate the hypothesis test results as the data comes in. In a perfect world, if you were trying to run a Frequentist hypothesis test in the most correct manner, you would use a power test calculation to determine sample size and then not peek at your data until you hit the amount of data required. Each time you run a hypothesis test calculation, you incur a probability of false positive. Doing this repeatedly makes the possibility of any single one of those ‘peeks’ being a false positive extremely likely.

Would you rather say “P(A > B) is 10%”, or “Assuming the null hypothesis that A and B are equal is true, the probability that we would see a result this extreme in A vs B is equal to 3%”?

Like the approaches above, BayesAB is again “intended to be a drop-in replacement for common frequentist hypothesis test such as the t-test and chi-sq test”, but provides a richer & more flexible interface for specifying priors & thus conducting more varied analyses.

Let’s take it for a spin.

Specifying a prior
The first thing we have to do (after installing & loading the package) is to specify our prior. Bayesian priors are meant to represent our prior expectations before data is collected & analyzed.

There are many approaches to this: as we saw above, the BEST method automatically applied minimally informative priors while BayesFactor utilized non-informative priors with some potential customization of the ‘scale’. All this means is that we start with a very loose sense of what the data will tell us, & thus this ‘prior’ information has minimal impact on the ultimate output.

BayesAB, in contrast, allows more flexible specification of priors, so we can choose along a spectrum of very weak to very strong priors. (More details in the docs.)

So what is a reasonable prior expectation for our application?

If we were to approach the question purely subjectively & a priori, we might simply ask ourselves: how often do we expect away teams in NBA playoff series to win the series? Since away teams are always the lower-ranked team in the series, we’d expect them to lose more than they win. And because the away team is most often considerably lower-ranked than their opponent (first round matchups are: 1st place vs 8th place, 2nd vs 7th, 3rd vs 6th & 4th vs 5th), we might expect the away team win rate to be somewhere around 20–30%.

We can of course also calculate the empirical win rate, prior to segmenting teams into W_L & L_W categories, for a more empirical prior. Here we can simply refer to our dashboard with appropriate filter selections:

We see the total series win % in the bottom right: 24%, right in line with our estimate. But if we think a bit further, we’re actually interested in the probability of an away team winning the series given that they win 1 out of the first 2 games. This might nudge our expectation back towards 50%, since these teams demonstrate an ability to defeat their opponent early in the series. In fact, if we check our data again, we see the empirical win rate is 40%:

So we’ll center our prior probability on about 40%, but how strong is this belief? This is essentially asking: how narrow or wide should we make the prior distribution?

As described in the docs & vignette, BayesAB employs conjugate priors & provides helper functions to find a reasonable fit. Since our data describes binary outcomes (wins or losses), it is best modeled by a Bernoulli distribution (just like coin flips), & the conjugate prior for Bernoulli distributions is the Beta distribution.
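
(Conjugacy just means the posterior has the same form as the prior, so updating it is simple arithmetic. A rough illustration of the Beta-Bernoulli update that bayesTest performs under the hood for the bernoulli case, using one of the priors considered just below (Beta(12, 17)) & the W_L group’s series counts from later in this post:)

# Beta prior + Bernoulli data -> Beta posterior
alpha0 <- 12; beta0 <- 17 # prior roughly centered on 0.41
wins <- 29; losses <- 67 - 29 # W_L group: 29 series wins out of 67
alpha_post <- alpha0 + wins
beta_post <- beta0 + losses
alpha_post / (alpha_post + beta_post) # posterior mean win rate, ~0.43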

Using BayesAB’s plotBeta() convenience function, we find several plausible options for a prior distribution:

#install.packages("bayesAB")
library(bayesAB)
# weak prior
plotBeta(3, 4)
# stronger prior
plotBeta(12, 17)

Output:

Both distributions are centered on roughly .40, but the former admits a much broader range of probable values than the latter, making it a so-called “weak” or “minimally informative” prior.

Now that we’ve chosen some priors, we can fit the Bayesian model to our data via the bayesTest() function:

# fit model
AB_strongprior <- bayesTest(w_l_wins, l_w_wins,
priors = c('alpha' = 12, 'beta' =17),
distribution = 'bernoulli')
# display summary output
summary(AB_strongprior)
# produce plots
plot(AB_strongprior)

Output:

Quantiles of posteriors for A and B:

$Probability
$Probability$A
       0%       25%       50%       75%      100%
0.2270689 0.3930053 0.4264228 0.4609739 0.6679833

$Probability$B
       0%       25%       50%       75%      100%
0.1842002 0.3410363 0.3739977 0.4076374 0.6115122

--------------------------------------------

P(A > B) by (0)%:

$Probability
[1] 0.77255

--------------------------------------------

Credible Interval on (A - B) / B for interval length(s) (0.9) :

$Probability
        5%       95%
-0.1473586 0.5354154

--------------------------------------------

Posterior Expected Loss for choosing B over A:

$Probability
[1] 0.02552246

This tells us:

  • There’s a 77% probability that “A > B”, or in other words, that W_L teams’ win rates are greater than L_W teams’ win rates. (Compare to BEST’s estimate of 81%)
  • A 90% “credible interval” (the Bayesian counterpart to a confidence interval) for the relative difference in win rates, (A - B) / B, spans from -15% to +54%, indicating that we cannot have high certainty of a positive difference in win %.
  • If we choose wrongly by betting on the W_L teams when in fact the L_W true win rate is actually greater, we’d expect W_L teams to have about a 2.5% lower win rate than L_W teams, on average. (This is the so-called “Posterior Expected Loss”.)
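
(All of these summary figures derive from the posterior samples stored on the test object, so they can be recomputed directly. A quick sketch, assuming the AB_strongprior object from above; values will vary slightly across runs:)

# P(A > B): the share of posterior draws where the W_L win rate exceeds the L_W win rate
post_A <- AB_strongprior$posteriors$Probability$A
post_B <- AB_strongprior$posteriors$Probability$B
mean(post_A > post_B) # ~0.77, matching the summary output above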

How would these estimates change if we started with our weak prior?

AB_weakprior <- bayesTest(w_l_wins, l_w_wins, 
priors = c('alpha' = 3, 'beta' = 4),
distribution = 'bernoulli')
summary(AB_weakprior)
plot(AB_weakprior)

Output:

Quantiles of posteriors for A and B:

$Probability
$Probability$A
       0%       25%       50%       75%      100%
0.2215005 0.3926072 0.4314859 0.4708627 0.6816420

$Probability$B
       0%       25%       50%       75%      100%
0.1575208 0.3259168 0.3632656 0.4017386 0.6083010

--------------------------------------------

P(A > B) by (0)%:

$Probability
[1] 0.80051

--------------------------------------------

Credible Interval on (A - B) / B for interval length(s) (0.9) :

$Probability
        5%       95%
-0.1489544 0.6740775

--------------------------------------------

Posterior Expected Loss for choosing B over A:

$Probability
[1] 0.02503369

The more diffuse prior exerted less of an influence over the posterior probabilities, so they remained slightly more separated, thus the estimated difference is slightly greater: 80% vs the previous 77%.

Yet the credible interval is also wider: -15% & +67% vs the previous -15% & +54%. This is again due to greater uncertainty expressed in the prior & carried through to the posterior.

The expected loss, however, stays roughly the same.

Of course, “strong” & “weak” are relative terms when it comes to priors. We could go even more extreme with a completely “uninformative” or “flat” “uniform prior” (generally a bad idea):

AB_flatprior <- bayesTest(w_l_wins, l_w_wins, 
priors = c('alpha' = 1, 'beta' = 1),
distribution = 'bernoulli')
summary(AB_flatprior)
plot(AB_flatprior)

Or an absurdly strong / confident prior:

AB_strongerprior <- bayesTest(w_l_wins, l_w_wins, 
priors = c('alpha' = 400, 'beta' = 600),
distribution = 'bernoulli')
summary(AB_strongerprior)
plot(AB_strongerprior)

The flat prior produced results not too dissimilar from our weak prior, but our massively overconfident prior nearly obliterated any difference between our two groups.

This should demonstrate both the flexibility of Bayesian priors, as well as the sensitivity they introduce into the modeling process. Of course, Bayesians argue that this sensitivity also exists in frequentist inference, but there it is obscured by implicit assumptions, rather than made explicit to be grappled with in the open.

So where do we land?

Reasonable priors yield a probability of 77–80% that W_L teams have a higher true win rate than L_W teams, very close to what the BEST method showed. But a 90% credible interval around the estimated relative difference in win rates, (A - B) / B, spanned from -15% to +54%, overlapping 0 & indicating considerable uncertainty.

We can again quantify a probability estimate of a five-percentage-point-or-greater win rate advantage for W_L teams as follows:

# subtract 100k samples of the posterior of B from samples of A   
a_minus_b <- AB_weakprior$posteriors$Probability$A - AB_weakprior$posteriors$Probability$B
# find the % of differences that are > 5%
length( a_minus_b[a_minus_b >= 0.05] ) / length(a_minus_b)

Output:

[1] 0.5891

This indicates a 59% probability, similar to BEST’s 62.5% estimate, & again closer to a coin flip than a sure bet.

Other Methods, True Bayesian Modeling

So far we’ve considered Bayesian methods as neatly encapsulated in R packages with more or less narrow applications, meant to supplant traditional frequentist hypothesis tests. But Bayesian analysis isn’t a simple package or suite of methods. Instead, it’s a rich paradigm of scientific information processing governed by a simple mathematical rule & probabilistic worldview.

As such, an instance of Bayesian data analysis can be expressed in much simpler (or at least more elegant), bare mathematical terms, without the assistance of such convenience functions & packages as we’ve seen so far. One such demonstration is as follows, adapted from the LingPipe blog, itself inspired by a discussion on stats guru Andrew Gelman’s blog (see blog posts for details):

W_L_n = 67 # W_L total series 
W_L_w = 29 # W_L series wins
L_W_n = 67 # L_W total series
L_W_w = 24 # L_W series wins
# SIMULATION
I = 10000 # simulations
theta1 = rbeta(I, W_L_w+1, (W_L_n-W_L_w)+1)
theta2 = rbeta(I, L_W_w+1, (L_W_n-L_W_w)+1)
diff = theta1-theta2 # simulated diffs
# OUTPUT
quantiles = quantile(diff,c(0.005,0.025,0.5,0.975,0.995))
print(quantiles,digits=2)
print("Probability W_L teams have higher win rate than L_W teams:")
print(mean(theta1>theta2))
# VISUALIZATION
plot(density(diff),
xlab="theta1 - theta2",
ylab="p(theta1 - theta2 | y, n)",
main="Posterior Simulation of W_L - L_W win rates",
ylim=c(0,8),
frame.plot=FALSE,cex.lab=1.5,lwd=3,yaxt="no")
abline(v=quantiles[2], col="blue")
abline(v=quantiles[4], col="blue")

Output:

  0.5%   2.5%    50%  97.5%  99.5% 
-0.143 -0.091 0.073 0.231 0.281

[1] "Probability W_L teams have higher win rate than L_W teams:"

[1] 0.8095

Similar results as our previous demonstrations.

Further Resources

Of course, not all questions we’d want to address with Bayesian analysis are as simplistic as mine here, & as that complexity increases, so do the mathematical expressions & computational demands of the solution, making it necessary to have some higher-level framework to work with. This is generally the province of “probabilistic programming”.

In R, the popular options include:

🗣️ Single Biggest Recommendation

To go deeper on Bayesian analysis (& truly, scientific probability & statistics more broadly) imo you can do no better than Richard McElreath’s Statistical Rethinking course, including youtube lectures, textbook, notes & code, etc. The time I’ve spent with these materials has been the single most enriching & intellectually stimulating experience in recent memory.

(The Empire of Chance being a close second, in a similar, if less practical vein)

Other landmark resources:

Conclusion to Part 3

In this post we explored how Bayesian inference addresses critical shortcomings in frequentist hypothesis testing by:

  • making assumptions explicit & deliberate, rather than hidden & thus subject to mechanical misapplication
  • producing intuitive, interpretable results as opposed to inscrutable, logically contorted findings which are, in turn, rampantly misunderstood

Still, the mere application of Bayesian methods to traditional NHST as drop-in replacements does not address the fundamental downsides of the NHST paradigm. For this, fuller Bayesian analysis is needed.

And while we came away with a much richer & more interpretable probabilistic assessment of our question of interest, we’re still left without an unequivocal binary determination: am I right or is Nat?

In most real world applications of statistical inference, we’re not simply interested in an answer as much as an action: what should we do?

So in Part 4 (our final installment of this series), we’ll reexamine everything from a decision-theoretic perspective, which quantifies expected outcomes under different decisions & actions, accounting for the relative costs incurred by these actions, as well as expected gains.

These posts have turned out longer, & taken me longer to write, than I expected, but follow me to be sure to catch the next installments, & check out my other posts in the meantime. Also, please comment with thoughts, objections or corrections!


Follow on twitter: @dnlmc
LinkedIn: linkedin.com/in/dnlmc
Github: https://github.com/dnlmc

--

Daniel McNichol
Coεmeta

Founder & Chief Scientist @ Coεmeta (coemeta.xyz) | formerly Associate Director of Analytics & Decision Science @ the Philadelphia Inquirer