[Statistics] P-values when comparing two binomial random variables

Stephen Jonany
5 min read · Feb 14, 2023


Note: For metrics beyond p-values, see this follow-up post.

Problem: compare two binomials. A solution: p-value testing. Suppose you have two [classifiers / coins / medicines], and want to determine which has a [higher precision / heads-rate / effectiveness]. One way to decide this is to do a hypothesis test using a one-tailed, two-sample binomial proportion test: you compute the p-value, and you reject the null hypothesis (e.g. precision 1 ≤ precision 2) iff the p-value < 0.05. This post is a collection of my learnings from performing this hypothesis test.

P-value definition. The one-sided p-value, given that you have observed concrete numbers p1', p2', is defined as: P(observe the samples p1', p2' or more extreme versions of them | null hypothesis: p1 ≤ p2). If this is low (say, below a significance level of 0.05), we can reject the null hypothesis and conclude that p1 > p2 is more likely.

Main learnings when comparing two binomial random variables:

  • Use Barnard’s exact test when computing the p-value
  • P-value alone should not be used to make decisions.
  • One-sided vs. two-sided tests: the difference, and why one-sided tests condition on equality.
  • Significance level is the Type I error probability (with proof)

Learning 1: Use Barnard’s exact test when computing the p-value

To compute the p-value, use Barnard’s exact test (scipy).
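
For concreteness, here is a minimal sketch of what this looks like with scipy.stats.barnard_exact. The counts are made up, and the table layout / alternative convention below follows my reading of the scipy docs, so double-check it against your own setup.

```python
# Minimal sketch: compare two classifiers' precision with Barnard's exact test.
# The counts are made up for illustration.
from scipy.stats import barnard_exact

k1, n1 = 78, 100  # classifier 1: 78 correct out of 100 positive predictions
k2, n2 = 65, 100  # classifier 2: 65 correct out of 100 positive predictions

# 2x2 contingency table: columns are the two classifiers, rows are correct / incorrect.
table = [[k1, k2], [n1 - k1, n2 - k2]]

# alternative="greater" tests H1: p1 > p2 against H0: p1 <= p2
# (see the scipy docs for the exact table-orientation convention).
res = barnard_exact(table, alternative="greater")
print(res.pvalue)  # reject H0 at the 5% significance level iff this is < 0.05
```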

Q: Why not Fisher’s exact test? A: Inappropriate assumption. Fisher’s test assumes we fixed both the sample sizes and the sum of successes (e.g. the total number of true positives), whereas in practice we usually only fix the former. Here are some quotes:

  • Wiki: “Barnard’s test relaxes this constraint on one set of the marginal totals.”
  • Wiki: “The theoretical difference between the tests is that Barnard’s test uses the double-binomial distribution, whereas Fisher’s test, because of the conditioning, uses the hypergeometric distribution.”
  • Scipy: “As stated in [2], Barnard’s test is uniformly more powerful than Fisher’s exact test because Barnard’s test does not condition on any margin. Fisher’s test should only be used when both sets of marginals are fixed.”

Q: Why not use normal approximations? A: Accuracy. Computing the p-value with a normal approximation is simpler because we can just use the normal distribution’s CDF. However, if the sample size is small, you might as well skip the approximation and use the more accurate binomial distribution. See this NIST chapter for more details.
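
For contrast, here is a minimal sketch of the normal-approximation version (the pooled two-proportion z-test), reusing the made-up counts from the sketch above.

```python
# Minimal sketch: one-sided pooled two-proportion z-test (normal approximation).
from math import sqrt
from scipy.stats import norm

k1, n1 = 78, 100
k2, n2 = 65, 100
p1_hat, p2_hat = k1 / n1, k2 / n2
p_pool = (k1 + k2) / (n1 + n2)  # pooled proportion under H0: p1 = p2

z = (p1_hat - p2_hat) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print(norm.sf(z))  # P(Z >= z), the one-sided p-value for H1: p1 > p2
```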

Learning 2: P-value alone should not be used to make decisions.

[Wasserstein 2016] was very clear on this: “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”

An intuitive example. P-value-based decision-making looks like this: “Reject the null hypothesis and conclude the alternative if the p-value is < 0.05”. Here is an intuitive example of why this decision rule can be faulty. Let the null hypothesis be: bigfoot isn’t real. If we get a low value for P(see a meteor today | bigfoot isn’t real), we shouldn’t conclude that bigfoot is real. Seeing meteors is already a rare event, after all.

Weaknesses of the p-value. [Goodman1999] and [statsexchange] list a few:

  • It is computed assuming H0 is true: “the P value is calculated on the assumption that the null hypothesis is true. It cannot, therefore, be a direct measure of the probability that the null hypothesis is false”
  • It hides the sample size. Worse, you can shrink it just by increasing the sample size: “A small effect in a study with large sample size can have the same P value as a large effect in a small study”
  • It ignores effect size. You don’t know the magnitude of the difference, or the estimated ranges of p1 and p2.

Recommendation: add more metrics. The recommendations, unfortunately, are pretty broad. [Greenland2016] suggests reporting (1) confidence intervals, so we can see the effect sizes, and (2) p-values of alternative hypotheses: “Another way to bring attention to non-null hypotheses is to present their P values; for example, one could provide or demand P values for those effect sizes that are recognized as scientifically reasonable alternatives to the null.” [Benjamin2019] also suggests (3) reporting other measures that involve the alternative hypothesis, such as the Bayes factor or posterior odds.
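
As one concrete (and deliberately simple) way to report effect size alongside the p-value, here is a minimal sketch of Wald-style normal-approximation confidence intervals for p1, p2, and their difference, again with made-up counts.

```python
# Minimal sketch: report effect size via confidence intervals, not just a p-value.
from math import sqrt
from scipy.stats import norm

k1, n1 = 78, 100
k2, n2 = 65, 100
p1_hat, p2_hat = k1 / n1, k2 / n2
z = norm.ppf(0.975)  # multiplier for a 95% two-sided interval

def wald_ci(p_hat, n):
    # Simple Wald interval; fine for illustration, less accurate for extreme p or small n.
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

diff = p1_hat - p2_hat
se_diff = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

print("p1 CI:", wald_ci(p1_hat, n1))
print("p2 CI:", wald_ci(p2_hat, n2))
print("p1 - p2 CI:", (diff - z * se_diff, diff + z * se_diff))
```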

Learning 3: One-sided vs two-sided test

Consider this setup for computing the p-value: P(see the sample and its more extreme versions | H0). Let’s consider how this differs between one-sided and two-sided tests. Assume (1) the two samples are drawn from Bin(n, p1) and Bin(n, p2) with the same n, and (2) the observed counts are k1 and k2 respectively.

For two-sided tests, H0: p1 = p2 and H1: p1 ≠ p2. P-value = sum over (k1', k2') of P(Bin(n, p1) = k1') · P(Bin(n, p2) = k2'), where the sum runs over all pairs (k1', k2') whose joint probability is at most the joint probability of the observed (k1, k2). This is what it means to observe the sample and its more “extreme” (no more likely) versions. For the common p = p1 = p2, we can use the pooled estimate (k1 + k2) / (2n). See khanacademy.
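
Here is a minimal sketch of that sum (made-up counts, equal sample size n for both groups, and p estimated by the pooled proportion).

```python
# Minimal sketch: two-sided exact p-value as a sum over outcomes no more likely
# than the observed one, under the pooled estimate of p.
import numpy as np
from scipy.stats import binom

n, k1, k2 = 100, 78, 65
p_pooled = (k1 + k2) / (2 * n)

pmf = binom.pmf(np.arange(n + 1), n, p_pooled)
joint = np.outer(pmf, pmf)                # joint[a, b] = P(K1 = a) * P(K2 = b)
observed = joint[k1, k2]
p_value = joint[joint <= observed].sum()  # "the sample and its more extreme versions"
print(p_value)
```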

For one-sided tests, the two gotchas to pay attention to are (1) the one-sided inequality sign on the event, which is opposite the inequality direction in H0, and (2) the use of an equality sign in the null hypothesis. Combined, these mean the one-sided p-value should be no greater than the two-sided p-value, so the null is easier to reject. Let’s simplify and compare against a constant to illustrate these two gotchas.

Equality vs. range in the null hypothesis for one-sided tests. Note that in one-sided tests, we still condition on the distribution’s parameter being equal to a constant. This section explains why:

  • See statsexchange: “When you have a composite hypothesis there are many possibilities. In this case, there are two natural types of strategies, either a Bayesian one (i.e. put weights on the different null distribution) or a minimax one (where you want to construct a test that has a controlled error in the worst case. … Hence to avoid talking about minimax people directly take the simple null that is the ‘extreme point’ of the composite setting”
  • The insight is that the original formulation P(Bin(n, p1) ≥ k1 | p1 ≤ k) is at most P(Bin(n, p1) ≥ k1 | p1 = k). If we knew a prior distribution over p1, we could weight this conditional over each point value of p1, but the end result would still be at most the latter term. Thus, the whole family of null hypotheses collapses to the single scenario that gives the highest possible p-value (see the numeric sketch after this list).
  • Note the LHS inequality direction. For H0: p1 ≤ k and H1: p1 > k, the p-value is P(Bin(n, p1) ≥ k1 | p1 ≤ k). The one-sided event “≥ k1” points in the opposite direction from the null hypothesis inequality (≤ k). If you don’t pick the opposing direction, the maximum p-value over the null is attained not at p1 = k but at p1 = 0.
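
Here is a minimal numeric sketch of that collapse: the tail probability P(Bin(n, p1) ≥ k1) only grows as p1 grows, so over the composite null p1 ≤ k it is maximized at the boundary p1 = k (numbers are made up).

```python
# Minimal sketch: over the composite null p1 <= k, the one-sided p-value is
# maximized at the boundary p1 = k.
import numpy as np
from scipy.stats import binom

n, k1 = 100, 62  # made-up observed count
k = 0.5          # null boundary: H0 is p1 <= 0.5

for p in np.linspace(0.05, k, 10):
    tail = binom.sf(k1 - 1, n, p)  # P(Bin(n, p) >= k1)
    print(f"p1 = {p:.2f}: P(K1 >= {k1}) = {tail:.4g}")
# The largest value is at p1 = k, which is the p-value we actually report.
```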

Learning 4: Significance level is Type I error probability

The significance level (the p-value threshold for rejecting the null hypothesis) is the same as the probability of falsely rejecting the null hypothesis, i.e. the Type I error probability. See this for a proof.
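
A minimal simulation sketch of this claim, using the pooled one-sided z-test from earlier and made-up parameters: when the null is true (p1 = p2), rejecting at p-value < 0.05 should happen in roughly 5% of experiments (exact tests on discrete data reject at a rate at most, rather than exactly, the significance level).

```python
# Minimal sketch: under a true null, the rejection rate of "p-value < alpha"
# comes out close to alpha.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, alpha, trials = 500, 0.3, 0.05, 20_000
rejections = 0

for _ in range(trials):
    k1, k2 = rng.binomial(n, p), rng.binomial(n, p)  # H0 is true: p1 = p2 = p
    p1_hat, p2_hat = k1 / n, k2 / n
    p_pool = (k1 + k2) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (p1_hat - p2_hat) / se if se > 0 else 0.0
    if norm.sf(z) < alpha:  # one-sided test of H1: p1 > p2
        rejections += 1

print(rejections / trials)  # Type I error rate, roughly 0.05
```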

References

