[Statistics] Comparing two binomial random variables

Stephen Jonany
Feb 15, 2023


Problem statement. Suppose you have two [classifiers / coins / medicines], and want to determine which one has a [higher precision / heads-rate / effectiveness]. One way to decide this is to collect a few samples, and perform a hypothesis test.

Key questions. There are two key questions we will discuss here: (1) What statistical claims can we make about the sampled results? (2) How many samples should we collect? We will offer some practical options for each in this post.

What statistical claims can we make about our samples?

TL;DR. P-values, confidence intervals (for each point estimate, and for the difference in proportions), and going Bayesian with posterior odds.

1. P-value

  • What. This is often (mis?)used to measure how unlikely a certain hypothesis is (e.g. to reject H0: p1 <= p2). It is also the easiest to compute, because many open-source statistics packages support this computation without the user having to do much math (see the sketch after this list). However, please see this post for more details on why it can be tricky to compute the right number, and why we shouldn’t use p-values alone for decision making.
  • Example claim: The p-value we got is 0.03 < 0.05, so the observed samples are unlikely under the assumption that p1 <= p2. Thus, let us conclude p1 > p2.
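For instance, here’s a minimal sketch (not from the original post) of computing a one-sided p-value with SciPy’s barnard_exact; the counts are made up for illustration.

```python
from scipy.stats import barnard_exact

# Hypothetical data: classifier 1 got 45/100 correct, classifier 2 got 30/100.
# Per SciPy's convention, columns are the two samples; rows are [successes; failures].
table = [[45, 30],
         [55, 70]]

# H0: p1 <= p2 vs H1: p1 > p2, hence alternative="greater".
res = barnard_exact(table, alternative="greater")
print(f"one-sided p-value: {res.pvalue:.4f}")  # reject H0 if below 0.05
```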

2. Confidence interval of each point estimate

  • What. Suppose you get 40/100 heads from your samples. You want to get a sense of how good this estimate is given your sample size.
  • Example claim: “Given that we have n=100 samples, 0.95 of the repeated trials would result in our sampled precision being within 0.1 of the true precision (the one computed on the entire dataset).”
  • Math. The sampled precision p1', for the true hidden precision p1, is a random variable of the form 1/n * Binomial(n, p1), which can be approximated by Normal(mu=p1, var=p1*(1-p1)/n). Note that the sampling distribution is centered around the true mean.
  • Example calculation. Given [p1, n], you know the variance, and the 68–95–99.7 rule of thumb is handy here. E.g. if you have n = 400 and p1 = 0.5, then stddev = sqrt(1/400) * 0.5 = 0.025. Then we can say that with 0.95 confidence, the sampled precision will be within 2 stddev (0.05) of the true precision! This matches the math from [Bommannavar2014]: “using a margin of error of 0.05, we calculate the number of samples necessary to be 384.” (A code sketch follows this list.)
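Here’s a minimal sketch of this calculation (counts made up), both by hand and via statsmodels’ proportion_confint, which uses z = 1.96 instead of the rule-of-thumb 2.

```python
import math
from statsmodels.stats.proportion import proportion_confint

n, successes = 400, 200          # hypothetical: 200/400 correct
p_hat = successes / n
stddev = math.sqrt(p_hat * (1 - p_hat) / n)  # sqrt(p(1-p)/n) = 0.025 here

# 68-95-99.7 rule: ~95% of sampled precisions land within 2 stddev.
print(f"~95% CI: [{p_hat - 2 * stddev:.3f}, {p_hat + 2 * stddev:.3f}]")

# Library equivalent, using the same normal approximation.
print(proportion_confint(successes, n, alpha=0.05, method="normal"))
```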

3. Confidence interval of the difference

  • What. Another quantity to provide a CI for is the difference itself! See this post for examples of both types of CIs.
  • Example claim: “Given that we have n=100 samples, 0.95 of the repeated trials would result in our sampled difference being within 0.1 of the true difference (the one computed on the entire dataset).”
  • Math. p1'-p2', the sampled difference, is approximately Normal(mu=p1-p2, var=(p1*(1-p1) + p2*(1-p2))/n): when you add or subtract independent normal RVs, their variances add. To get a concrete distribution so you can compute the CI, you plug in the sampled p1' and p2' (see the sketch after this list). This is pretty shady, but it looks like that’s the way people compute CIs for proportions. See khanacademy.
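Here’s a minimal sketch of that Wald-style CI, plugging the sampled proportions into the variance formula above; the counts (and allowing different sample sizes n1, n2) are illustrative assumptions.

```python
import math
from scipy.stats import norm

n1, k1 = 100, 45                 # hypothetical: classifier 1 got 45/100
n2, k2 = 100, 30                 # hypothetical: classifier 2 got 30/100
p1, p2 = k1 / n1, k2 / n2

diff = p1 - p2
# var(p1' - p2') = p1(1-p1)/n1 + p2(1-p2)/n2: variances of independent RVs add.
stddev = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = norm.ppf(0.975)              # ~1.96 for a 95% CI
print(f"diff = {diff:.2f}, 95% CI: [{diff - z * stddev:.3f}, {diff + z * stddev:.3f}]")
```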

4. Bayesian metrics: Posterior odds, bayes factor

  • What. A more statistically sound decision-making process than using p-values alone, which only consider the null hypothesis, is to consider both the null and the alternative hypothesis. Specifically, we can compute the posterior odds P(H1|data)/P(H0|data), and conclude whether H1 is more likely than H0 given the data.
  • How. Computing the posterior odds, even when you assume equal priors P(H0) = P(H1), is unfortunately more involved than computing the p-value. I couldn’t find open-source libraries that do this; the closest is R’s ttestBF, but even then it only handles the two-sided setup (H0: p1 = p2, H1: p1 != p2), whereas I’m interested in the one-sided setup (H0: p1 <= p2).
  • Math. Note these facts about the Beta distribution: (1) Beta(1,1) is the uniform probability distribution, and (2) Beta is a conjugate prior to the Binomial distribution. So, if you assume a uniform / Beta prior, the posterior will be a Beta distribution. Given this mathematical fact, you can estimate P(H0|data) through sampling (see statsexchange, and the sketch after this list), and then compute the posterior odds.
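Here’s a minimal sketch of that sampling approach, assuming made-up counts, uniform Beta(1,1) priors, and equal prior odds P(H0) = P(H1).

```python
import numpy as np

rng = np.random.default_rng(0)
n1, k1 = 100, 45                 # hypothetical: classifier 1 got 45/100
n2, k2 = 100, 30                 # hypothetical: classifier 2 got 30/100

# Beta is conjugate to Binomial: with a Beta(1,1) prior, the posterior of
# each p_i is Beta(1 + successes, 1 + failures).
draws = 1_000_000
p1 = rng.beta(1 + k1, 1 + n1 - k1, size=draws)
p2 = rng.beta(1 + k2, 1 + n2 - k2, size=draws)

p_h1 = np.mean(p1 > p2)          # P(H1|data) for H1: p1 > p2
p_h0 = 1 - p_h1                  # P(H0|data) for H0: p1 <= p2
print(f"posterior odds P(H1|data)/P(H0|data) ~= {p_h1 / p_h0:.1f}")
```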

How many samples to collect?

Here are some guidelines you can use to determine the sample size.

Method 1: CI of point estimate. [Bommannavar2014] did this. Please see the section on “Confidence interval of each point estimate”. By declaring the margin of error of a single point estimate, you can derive the sample size needed. E.g. “If we want our sampled precision to be within 0.1 of the true precision 0.95 of the time, we need at least n=100 samples.”
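Here’s a minimal sketch of that derivation, assuming the worst-case p = 0.5 (which maximizes the variance) and the normal approximation from earlier.

```python
import math
from scipy.stats import norm

confidence, moe = 0.95, 0.05     # declared margin of error
z = norm.ppf(1 - (1 - confidence) / 2)  # ~1.96
p = 0.5                          # worst case: maximizes p * (1 - p)

# Solve z * sqrt(p(1-p)/n) <= moe for n.
n = math.ceil(z**2 * p * (1 - p) / moe**2)
print(n)  # 385, matching [Bommannavar2014]'s 384 up to rounding
```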

Method 2: Exact test validations. You can simulate scenarios in which you would want to reject the null hypothesis, and verify that the test’s p-values actually fall below your threshold in those scenarios. E.g. run Barnard’s exact test to compare samples drawn from 0.8 vs 0.7 precision distributions, and verify that, given your sample size, you get p-values < 0.05 for the null hypothesis that p1 <= p2 (see p-value post > Barnard). These exact tests are more precise, so you might end up with a lower sample size requirement than with the other methods.
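Here’s a minimal sketch of such a validation; the candidate sample size, trial count, and 0.8/0.7 precisions are illustrative assumptions, and the counts are kept small because exact tests are slow.

```python
import numpy as np
from scipy.stats import barnard_exact

rng = np.random.default_rng(0)
n, trials, rejections = 100, 100, 0  # candidate sample size, simulation runs

for _ in range(trials):
    k1 = rng.binomial(n, 0.8)        # classifier 1's correct count
    k2 = rng.binomial(n, 0.7)        # classifier 2's correct count
    table = [[k1, k2], [n - k1, n - k2]]  # columns = samples (SciPy layout)
    if barnard_exact(table, alternative="greater").pvalue < 0.05:
        rejections += 1

# If this rate is well below your target, n is too small.
print(f"rejection rate: {rejections / trials:.2f}")
```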

Method 3: Power analysis. With power analysis for p-value testing, you can relate all of these quantities: [p1, p2, power = P(reject H0 | H1), significance level, sample size]. You can use this online tool to compute the sample size. As for the derivation of the formula behind the tool, refer to chapter 4.2 of Fleiss, J. L., Levin, B. and Paik, M. C. (2003). Statistical Methods for Rates and Proportions, Third Edition, John Wiley & Sons, New York. Warning: the book doesn’t really spell out the full derivation. If you want to derive it yourself, there are two phases: (1) start from P(reject H0 | H0) <= alpha to obtain the rejection criterion, then (2) impose P(reject H0 | H1) >= power, replacing the “reject H0” event with the criterion you got from (1).
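If you’d rather stay in code than use the online tool, here’s a minimal sketch using statsmodels’ power machinery; it uses Cohen’s h (an arcsine transform of the two proportions) as the effect size, so its answer may differ slightly from the Fleiss formula. The 0.8/0.7 precisions and 0.8 power are illustrative assumptions.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

h = proportion_effectsize(0.8, 0.7)   # Cohen's h for p1 = 0.8 vs p2 = 0.7
n = NormalIndPower().solve_power(
    effect_size=h,
    alpha=0.05,            # significance level
    power=0.8,             # P(reject H0 | H1)
    alternative="larger",  # one-sided H1: p1 > p2
)
print(f"samples needed per group: {n:.0f}")
```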

References

[Bommannavar2014] Recall estimation for rare topic retrieval from large corpuses
