Trust the Process, Doubt the Procedure

NBA playoff win chances via statistical inference

Settling a bet with spicy statistical takes

Daniel McNichol
Coεmeta
16 min read · Jul 15, 2019


This is a parable about simple, straightforward questions of fact, & how they often devolve into complex matters of data processing, analysis & decision-making under fragile epistemic limits, in the real world.

(This is Part 2 of 4, exploring frequentist hypothesis tests & their flaws.
Each post stands on its own.
Find Part 1 here or read the brief recap below. Part 3 exploring Bayesian alternatives is here, & Part 4 on decision theory is here.)

Run it Back: Part 1 Recap

Wagers, Probability Theory, Data Collection, Wrangling & Analysis

via thelead

Last time, I described a bet between myself & my friend / colleague Nat, in which he asserted that away teams in NBA Playoff series are more likely to win the series if they WIN & THEN LOSE the first two away games, than if they split the first two games in the opposite order, despite an identical series score heading into game 3. I demurred, sensing a gambler’s fallacy at play:

I went on to describe my reasoning & the fundamental probability theory underlying it, which I won’t repeat here.

I also described the semi-complex data collection & processing required to address the question at hand, utilizing 538’s excellent Historical NBA Elo dataset (CC BY license). I reproduced that data, as well as intermediate pre-processed outputs, in a public Google BigQuery dataset (under the same license as 538).

Finally, I surfaced the processed results & initial summary statistics intended to resolve the bet in an interactive Data Studio dashboard, reproduced below (with default filters set to conditions best suited to the original wager, under modern NBA Playoff rules — post-2002).

As can be seen, between 2003 & 2015, across 84 playoff series, away teams splitting the first 2 games of the series had essentially the same series win % regardless of the order of the win & loss (39.1% vs 39.5%).

I was tempted to declare victory here, but decided to try to salvage the 1984–2002 best-of-7 playoff series to increase the sample size. You can see for yourself by adjusting the filters above, but here are the topline results:

Now things get a bit muddled. At a sample size of 134 total series, the win-then-lose segment has pulled into a 43% to 36% advantage, all of which accrued between 1984 & 2002 (since we’ve already seen the groups were essentially tied from 2003 on):

Between 1984 & 2002, the first round of the NBA Playoffs was a best-of-5 series. Those series are excluded here, which explains some of the lower incidence of series beginning with split results over that period. (The first round is also, by nature, the round with the most total series in play, so excluding it costs our sample the most series.)

But is this advantage real? Or simply expected random fluctuation around otherwise equal winning chances?

As teased in Part 1: This sounds like a question about statistical significance, but, as you might have heard, scientists are rising up against it, as the (mostly Bayesian) statisticians have long advocated.

Cool cool. Traditional so-called Null Hypothesis Significance Testing (NHST) is on the outs, but surely the Bayesian alternatives will save us?

oh…oh no.

Well then.

How to proceed?

Bets must be settled. Decisions must be made. Science must advance.

Enter the hairy field of decision theory/science/analysis.

Comparative Data Analysis

Frequentist approaches

Before delving into alternative data analytic approaches to resolving the question, it’s worth clarifying that this is a relatively narrow question about historical trends rather than a serious attempt at predictive modeling. For the latter, many more sophisticated approaches are available.

The Scourge of Traditional Frequentist NHST

Much of the recent uproar around statistical significance relates to the unthinking, rote application of these methods as much as the methods themselves. Inadequately understood procedures, unexamined assumptions & arbitrary thresholds are reflexively (mis)applied & presumed to produce robust, authoritative results — which turn out to be anything but.

The proliferation of statistical test flow charts reflects this mechanistic approach:

Two example flow charts

These imply a simple deterministic logic will lead analysts / scientists to the single objectively correct statistical procedure for determining The Truth, with little context or qualification.

Similarly constructed tables offer a bit more information, but are often just as crude, & routinely prescribe tests that, according to well-regarded stats gurus, are approximately never appropriate:

via http://www.biostathandbook.com/testchoice.html

These well-intentioned guides are gross oversimplifications which lead to many of the pitfalls described above. But let’s assume we’re operating in this paradigm for demonstrative purposes.

Frequentist Hypothesis Tests

Here’s a reminder of the data at hand for our statistical inquiry:

This is the post-1983, best-of-7 series data. (There’s no use doing statistical inference on two groups with essentially identical win %’s, as is the case with the post-2002 data.)
The data represented via boxplots

We want to determine if the difference in win % between our two groups is real, however we might define that. Following flow charts or tables such as those above, we’d typically end up with the following hypothesis tests, depending on how we construe the data:

  • two-proportion t-test or z-test
  • chi-square test
  • exact binomial test
  • or, gurus forbid, a Fisher’s exact test

As the name implies, Null Hypothesis Significance Testing is oriented around a “null hypothesis”, which assumes no difference between the two groups being compared, then subjects that assumption to mathematical tests.

In our case, the null hypothesis is in fact my hypothesis:

there is no statistical difference in series win % between away teams who win-then-lose & those who lose-then-win.

The so-called “alternative hypothesis” (aka 👎 Nat’s hypothesis 👎) is the opposite: there is a statistical difference in win %.

Since we’re naively following largely arbitrary conventions here, we’ll keep it consistent & require a confidence level of 95%, aka a significance / alpha level of 5%, thus a “p-value” < .05. These are the quantitative thresholds used as a cutoff point to determine the significance or “realness” of the observed differences in the data. They’re also at the center of the outcry around NHST.

Two-proportion t-test or z-test

via stackexchange

First up is probably the most common of the NHST tests: the t-test or z-test, in which we’ll be specifically testing group proportions.

I’m not going to get into many details of these tests in this post, as those are amply covered, everywhere. Suffice to say that t & z-tests are analogous tests used in subtly different situations, but are practically equivalent for large samples. In fact, nominally different but practically (& sometimes technically) equivalent tests will emerge as a theme here, & in much of statistics.

t & z-tests essentially compare observed results to expected frequencies of such results under assumed normal probability distributions, producing a p-value which indicates how often such results should be expected due to sheer chance if there is no true difference between the groups being compared.
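
To make that concrete, here’s a minimal by-hand sketch of the pooled two-proportion z-statistic & p-value for our data (29 series wins & 38 losses for W_L away teams vs 24 wins & 43 losses for L_W teams, per the vectors & table built below):

# pooled two-proportion z-statistic (no continuity correction), by hand
x1 <- 29; n1 <- 67 # W_L series wins & total series
x2 <- 24; n2 <- 67 # L_W series wins & total series
p_pool <- (x1 + x2) / (n1 + n2) # pooled win % assumed under the null
z <- (x1/n1 - x2/n2) / sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
2 * pnorm(-abs(z)) # two-sided p-value, ~0.38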

There are several approaches in R:

The standard t.test() function only accepts vectors of observations (as opposed to summarized counts, tables or proportions). So first we have to create these vectors of series wins & losses for both groups (win = 1, loss = 0).

# 29 series wins & 38 series losses for away teams that won then lost games 1 & 2
w_l_wins <- c(rep_len(1, 29), rep_len(0, 38))
# 24 series wins & 43 series losses for away teams that lost then won games 1 & 2
l_w_wins <- c(rep_len(1, 24), rep_len(0, 43))

this gives:

> w_l_wins
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> l_w_wins
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

We can then call the t.test() function, using default parameters in keeping with our naive approach:

t.test(w_l_wins, l_w_wins)

Output:

Welch Two Sample t-test
data:  w_l_wins and l_w_wins
t = 0.87932, df = 131.86, p-value = 0.3808
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0932546 0.2425083
sample estimates:
mean of x mean of y
0.4328358 0.3582090

We receive a p-value of 0.38 (considerably above our significance level of < 0.05) as well as a 95% confidence interval overlapping 0, both of which indicate a statistically non-significant difference between the win %’s of our two groups.

Translation: score one for me & the null hypothesis. (kinda, keep reading)

But let’s proceed.

A z-test of proportions is implemented in R via the prop.test() function, which will accept contingency tables as inputs.

First we’ll build our table of series win & losses (as pictured above):

c_tab <- rbind(c(29, 38), c(24, 43))
colnames(c_tab) <- c("W","L")
rownames(c_tab) <- c("W_L","L_W")

which gives:

     W  L
W_L 29 38
L_W 24 43

Then we can call prop.test() on the table. This time we will adjust one default parameter for fairness, which would otherwise inflate our p-value & put Nat’s alternative hypothesis at a disadvantage:

prop.test(c_tab, correct = F)

Output:

2-sample test for equality of proportions without continuity correction
data:  c_tab
X-squared = 0.78034, df = 1, p-value = 0.377
alternative hypothesis: two.sided
95 percent confidence interval:
-0.09046782 0.23972155
sample estimates:
prop 1 prop 2
0.4328358 0.3582090

Once again, our p-value rounds to 0.38 with a confidence interval overlapping 0. Thus, score another for me & the null hypothesis.

Moving on.

Chi-squared test

via wikipedia

A chi-squared test is actually equivalent to the two proportion z-test we just ran.

This test is used to compare the ‘goodness of fit’ of observed frequencies to a ‘model’ distribution or another set of observed frequencies, to estimate the likelihood that both come from the same distribution.

We can demonstrate its equivalence to prop.test() by using the chisq.test() function on our previously constructed contingency table, again disabling the continuity correction parameter:

chisq.test(c_tab, correct = F)

Output:

Pearson's Chi-squared test
data:  c_tab
X-squared = 0.78034, df = 1, p-value = 0.377

Note the identical p-value. More bad news for Nat’s alternative hypothesis.

For fun, & to demonstrate the non-robustness of rote application of these tests, let’s take a look at the results we’d get using the default parameters for prop.test() & chisq.test():

> prop.test(c_tab)

2-sample test for equality of proportions with continuity correction
data:  c_tab
X-squared = 0.49942, df = 1, p-value = 0.4798
alternative hypothesis: two.sided
95 percent confidence interval:
-0.1053932 0.2546469
sample estimates:
prop 1 prop 2
0.4328358 0.3582090

> chisq.test(c_tab)

Pearson's Chi-squared test with Yates' continuity correction
data:  c_tab
X-squared = 0.49942, df = 1, p-value = 0.4798

P-values jumped by roughly .10 (a ~27% relative increase, & twice our alpha level of .05), which could easily be the difference between “significance” & “non-significance” at an arbitrary significance level.

“Exact” tests

Binomial distribution formula via ufl.edu

So-called ‘exact’ statistical tests are significance tests that do not rely on large samples for theoretical (asymptotic) accuracy, as common parametric tests typically do (such as those above).

They still must be applied appropriately (which is an allegedly rare occurrence, as mentioned above) to obtain sound results. And in practice, some are implemented in statistical software via approximate (i.e. inexact) algorithms, due to the computational complexity of the analytical solutions.

All to say, their “exactness” is often a misnomer, as is often the case with statistical jargon. We’ll carry on naively nonetheless.

A “binomial test” is an exact test which determines whether two outcomes or categories are equally likely to occur.

In our case, that fits our null hypothesis, posed as: W_L away teams are equally likely to win the series as L_W away teams.

To execute the test, we’ll compare the series wins & losses of the L_W teams to the series win % of the W_L teams, using the binom.test() function. This essentially uses the W_L team series win % as the expected probability, & tests the null hypothesis that it is equal to the L_W series win %.

binom.test(c(24,43), # series wins & losses of L_W teams
p = 29/(29+38) # series win % of W_L teams
)

Output:

Exact binomial test
data:  c(24, 43)
number of successes = 24, number of trials = 67, p-value = 0.267
alternative hypothesis: true probability of success is not equal to 0.4328358
95 percent confidence interval:
0.2446949 0.4846965
sample estimates:
probability of success
0.358209

We receive a p-value of 0.27, again substantially exceeding our < 0.05 alpha level, indicating a statistically non-significant difference between our groups.
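
Under the hood, the two-sided exact p-value is assembled by summing the probabilities of every possible outcome at least as unlikely as the one observed, given the null win %. A minimal sketch (mirroring, as I understand it, how base R’s binom.test() defines “two-sided”):

n <- 67; x <- 24; p0 <- 29/(29+38) # L_W trials & wins, W_L win % as the null probability
probs <- dbinom(0:n, size = n, prob = p0) # probability of every possible win total under the null
sum(probs[probs <= dbinom(x, n, p0) * (1 + 1e-7)]) # ~0.27, matching the output above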

To cap off our naive romp through frequentist NHST, we’ll crank the naïveté to 11 & run the “hardly ever appropriate” Fisher’s exact test.

This test is classically used on contingency tables, to compare counts or rates between two groups. The controversy has to do with whether or not the margins of the contingency table are fixed by design (as intended), or randomly varying.

Although we almost certainly shouldn’t, we can simply call the fisher.test() function:

fisher.test(c_tab)

Output:

Fisher's Exact Test for Count Data
data:  c_tab
p-value = 0.4799
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.6446336 2.9072699
sample estimates:
odds ratio
1.364105

Our largest p-value yet, still no solace for Nat’s alternative hypothesis.

Before we categorically declare these frequentist tests in my favor, some critical flaws & shortcomings should be noted.

Issues with Frequentist NHST

via xkcd

In addition to the aforementioned rampant mechanical misapplication of these methods, some critical issues must be acknowledged.

1. Statistical Power

via twitter blog

While excessive emphasis is given to “statistical significance” & p-values, which represent the false positive or type I error rate, insufficient emphasis is often applied to the false negative or type II error rate, represented by “statistical power”. (And when “power” is emphasized, it tends to be poorly understood & misapplied at seemingly similar rates as p-values.)

Still, it’s worth noting that, if we were to commit the faux pas of post-hoc power analysis, we’d find that all of our above tests were drastically “underpowered”. This means that their ability to detect a true difference between the two groups is low, & thus the expected false negative rate is high. Thus all of the results ostensibly favoring the null hypothesis could be simply statistical artifacts due to our small sample size, modest win % difference between groups &/or conventional .05 alpha level.

To illustrate:

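The original code isn’t reproduced here, but a rough sketch along these lines, using the pwr package & the observed effect sizes, approximately recovers the numbers below:

library(pwr)

p_wl <- 29/67; p_lw <- 24/67; n <- 67 # observed win %'s & per-group sample size

pwr.t.test(n = n, d = 0.15, sig.level = .05)$power # ~0.14 (using the observed Cohen's d, see below)
pwr.2p.test(h = ES.h(p_wl, p_lw), n = n, sig.level = .05)$power # ~0.14 (two-proportion z-test)
pwr.chisq.test(w = sqrt(0.78034 / 134), N = 134, df = 1, sig.level = .05)$power # ~0.14 (w derived from X-squared / N above)
pwr.p.test(h = ES.h(p_lw, p_wl), n = n, sig.level = .05)$power # ~0.24, a one-proportion approximation of the exact binomial test's power
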
Results give:

  • t.test power: 0.14
  • prop.test power: 0.14
  • chisq.test power: 0.14
  • binom.test power: 0.25

This means that these tests have an estimated 14–25% rate of actually detecting a true difference between the groups, which is…not great.
(Rote convention typically aims for 80% power.)

But, as described in links above, this sort of post-hoc power analysis is meaningless & doesn’t tell us anything more than the p-value already has, given the data we’ve observed. The power is this low as much because of the minimal difference in win % between our groups as because of the sample size.

A Priori Power Analysis
So let’s pretend we did things “properly”, & performed an a priori power calculation to determine what sample size & minimum effect size would be required for a well-powered t-test (which has the most interpretable effect size metric):

Cohen’s d represents effect size as the difference between 2 group means, expressed in standard deviations.

By convention, a value of 0.2 is considered a “small” effect, 0.5 = “medium” & 0.8 = “large”.

The actual Cohen’s d value of our data is 0.15, indicating a difference of < 1/6 of 1 standard deviation between our two groups, aka a marginally-less-than-“small” effect.
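
For reference, a minimal by-hand version of that calculation, reusing the w_l_wins & l_w_wins vectors built earlier:

# Cohen's d: difference in group means divided by the pooled standard deviation
n1 <- length(w_l_wins); n2 <- length(l_w_wins)
s_pooled <- sqrt(((n1 - 1) * var(w_l_wins) + (n2 - 1) * var(l_w_wins)) / (n1 + n2 - 2))
(mean(w_l_wins) - mean(l_w_wins)) / s_pooled # ~0.15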

So how large a sample would we need to detect an effect this “small” with a sufficiently powered test?

# simply make "n" NULL & specify a power level of .8
pwr.t.test(n = NULL, d = .15, sig.level = .05, power = .8, type="two.sample", alternative = "two.sided")

Output:

Two-sample t test power calculation

n = 698.6382
d = 0.15
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group

We’d need a sample of nearly 700 observations in each group (!), as opposed to our current n = 67.

Alternatively, what is the minimum effect size we might observe at our given sample size?

# make "d" NULL while specifying "n" & the desired power
pwr.t.test(n = 67, d = NULL, sig.level = .05, power = .8, type="two.sample", alternative = "two.sided")

Output:

Two-sample t test power calculation

n = 67
d = 0.4875966
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group

We’d need an effect size of at least 0.49 (roughly “medium”) to have a well-powered test at our given sample size & alpha level.

We can demonstrate this dynamic & the required sample sizes at different effect sizes by passing the pwr objects directly into R’s plot() function (then composing them with the magickal patchwork package):

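The original code isn’t reproduced here, but a rough sketch of the idea, assuming (as described above) that plot() on a pwr object returns a composable ggplot:

library(pwr)
library(patchwork)

# required n per group at 80% power for "small", "medium" & "large" effects
pwr_small <- pwr.t.test(d = 0.2, sig.level = .05, power = .8, type = "two.sample")
pwr_medium <- pwr.t.test(d = 0.5, sig.level = .05, power = .8, type = "two.sample")
pwr_large <- pwr.t.test(d = 0.8, sig.level = .05, power = .8, type = "two.sample")

# plot() draws each object's power curve vs sample size;
# patchwork's "/" operator stacks the three curves into one figure
plot(pwr_small) / plot(pwr_medium) / plot(pwr_large)
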
Output:

These “power curves” show the power gains of a two-sample t-test as sample size increases, for “small”, “medium” & “large” effect sizes — specifically annotating the required n for an 80% power level at each effect size.

64 observations per group would be sufficient to detect a “medium” effect, while as few as 26 would suffice to detect a “large” effect.

So what?

We can’t magically increase our sample or effect size, but if we performed this power analysis prior to our observational data analysis, we’d at least know that we’d need a roughly “medium”-sized effect or a considerably larger sample to find a statistically significant result (following naive conventions).

Worth noting: we could also increase the power of these tests by running one-sided or one-tailed versions instead of the two-sided default we naively accepted. This would effectively mean that we test the more specific claim that the W_L win % was greater than the L_W win %, rather than testing for a difference in either direction (greater or lesser).

It could be argued that this is more precisely in line with the actual original wager, but I’ll leave such alternative analysis as an exercise for the reader. (Spoiler alert: power does not increase that much & p-values do not decrease that much.)
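
For the curious, the one-sided variants are a one-line change to the earlier calls (a sketch only; outputs omitted, per the spoiler above):

# test the directional claim that the W_L win % is *greater* than the L_W win %
t.test(w_l_wins, l_w_wins, alternative = "greater")
prop.test(c_tab, alternative = "greater", correct = FALSE)
binom.test(c(24, 43), p = 29/(29+38), alternative = "less") # L_W win % *less than* the W_L benchmark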

Takeaway: statistical power analysis adds a complex wrinkle & yet more mines to the practical minefield of frequentist NHST, complexifying & fragilizing its use in settling matters of fact or belief. It will be seen that these issues are largely avoided, or at least approached differently, in Bayesian analysis.

2. Multiple Testing

via Roy Salomon

The second critical issue with frequentist NHST involves performing multiple tests on the same data set, as we do above.

I won’t go into nearly as much detail here as I did for statistical power, but I feel compelled to note that the frequentist paradigm forbids so-called “multiple comparisons” due to its fundamental predication on, well, frequencies of expected occurrences.

To put it simply: if a simple naive significance test is expected to have a 5% false-positive rate, then you’d expect an average of 1 false positive per 20 such tests.

Now, if you run two tests with a 0.05 alpha level on the same data, you’ve roughly doubled your chances of getting at least one false positive, from 5% to 9.75% (the chance of no false positives across 2 independent tests = .95 * .95 = .9025). Thus, statistical “corrections” must be applied, which open yet another frontier of fragile assumptions & complexity.
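
The arithmetic, sketched in R (with a Bonferroni-style adjustment shown only as the simplest illustrative correction, not one applied in this analysis):

alpha <- 0.05
m <- 1:5 # number of tests run on the same data
fwer <- 1 - (1 - alpha)^m # chance of >= 1 false positive: 0.05, 0.0975, ... ~0.23 at m = 5
bonferroni_alpha <- alpha / m # the crudest correction: tighten the per-test threshold
round(rbind(m, fwer, bonferroni_alpha), 4)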

So although we might wish to arrive at some ecumenical consensus, frequentist procedures again lay roadblocks in the way, which are avoided or at least more gracefully handled in Bayesian analysis.

3. Cannot “Accept” the Null Hypothesis

via wikipedia

The third & final critical frequentist NHST flaw I’ll note is the fact that null hypotheses can never technically be “accepted” or affirmed. Rather, we can only “fail to reject” them. The simplest explanation for this is the old aphorism: Absence of evidence is not evidence of absence.

This not only subjects us to the sort of tortuous language & logical backflips endemic to Fisherian frequentism, but leaves us without a substantive conclusion or basis for decision-making, per se.

Among the many aforementioned criticisms, this is another reason some statisticians have called for abandoning teaching frequentist stats.

Conclusion to Part 2

The broad sweep of traditional frequentist hypothesis tests provided more support to my side of the bet than to Nat’s. Observed differences in win % since 1984 were nowhere near conventional statistical significance thresholds, & effect sizes were seen to be quite low.

Yet we’ve noted at least as many reasons to distrust these results as to believe them, if not more. So in Part 3, we examine Bayesian alternatives which avoid many of these issues & explicitly denote a degree of belief given the evidence at hand, instead of some backwardly constructed claim about frequencies in a hypothetical infinite sample under fragile assumptions.

Then, in Part 4, we’ll reexamine everything from a decision-theoretic perspective.

These posts have turned out longer, & taken me longer to write, than I expected, but follow me to be sure to catch the next installments, & check out my other posts in the meantime.


Follow on twitter: @dnlmc
LinkedIn: linkedin.com/in/dnlmc
Github: https://github.com/dnlmc
