Bounding Your Sample Size by the Chernoff Bound: How Much is Enough?

Uri Itai
Coinmonks
6 min read · Aug 5, 2024


A Serendipitous Statistical Discussion at The Busy Bean

It was a balmy Friday afternoon, and I had just closed my laptop, ready to embrace the weekend. As I strolled past The Busy Bean, my favorite local café, I spotted my old friend Mark and his wife Sarah huddled over a laptop. After a quick hello, Mark insisted I join them for a coffee. Sarah, he explained, was knee-deep in a research proposal and could use a fresh perspective. Intrigued, I decided to stay.

The café in the city on a Friday afternoon

The aroma of freshly ground coffee beans filled the air as we chatted about life and work. Suddenly, Sarah turned to me with a gleam in her eye. “I hear you’re a data scientist,” she said. “I’ve been grappling with a question for my study. What’s the minimal sample size I need?”

My interest was piqued, and I leaned in. “That’s a great question, but it’s not one-size-fits-all. Can you tell me more about your study?”

Sarah explained that her research involved a binary outcome and that matching the sample’s prevalence to the population was crucial. She also mentioned her assumption of an unbiased sample.

“So,” I clarified, “for a given prevalence p in the general population, you’re looking to determine the minimal sample size needed to ensure that, with high probability, the sample prevalence is close to p?”

Sarah nodded eagerly, relief washing over her face at being understood.

As I sipped my latte, my mind raced through statistical formulas and considerations. I realized that to provide a precise answer, we’d need to delve into concepts like epsilon and delta — statistical parameters that ensure precision and accuracy in sample size determination.

“Well, Sarah,” I began, “to solve this rigorously, we need to consider two key parameters: epsilon and delta. Epsilon represents how close, in relative terms, we want our sample prevalence to be to the true population prevalence, while delta represents the probability of our sample falling outside this range.”

Sarah’s eyes widened with interest. “I’ve heard of these terms, but I’ve never applied them practically. How would we use them here?”

I started doing the math, but it was too much to work through in the café. She asked me to just give her the final number; unlike me, she was not into math. I promised her I would solve it, so we decided to simply enjoy the coffee, and I would work it out later. We had a great talk, they insisted on paying for my coffee and cake, and we parted with warm hugs.

Diving into the Math

When I got home, I couldn’t stop thinking about the problem. To formalize it, I came up with the following formulation involving epsilon and delta. For a sample of size n with a binary variable, we need:

$$\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - p\right| \ge \varepsilon\, p\right) \le \delta$$

Namely, the probability that the average of the binary variables X_i deviates from p by more than a relative error of ε should be at most δ, where ε and δ are chosen by the user.
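To make the condition concrete, here is a minimal Monte Carlo sketch of my own (not part of the café conversation, and the function name, sample sizes, and numbers are illustrative) that draws repeated samples of size n from a Bernoulli(p) variable and estimates how often the sample prevalence lands more than a relative error ε away from p; for a sufficiently large n, this frequency should stay below δ.

```python
import numpy as np

def deviation_frequency(n, p, eps, trials=20_000, seed=0):
    """Estimate P(|sample prevalence - p| >= eps * p) by simulation."""
    rng = np.random.default_rng(seed)
    # Each draw is the number of successes in a sample of size n,
    # so dividing by n gives that sample's prevalence.
    prevalences = rng.binomial(n, p, size=trials) / n
    return float(np.mean(np.abs(prevalences - p) >= eps * p))

# Hypothetical numbers: p = 0.3, a 10% relative error, and two sample sizes.
for n in (500, 4_000):
    print(n, deviation_frequency(n, p=0.3, eps=0.1))
```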

Deriving the Minimal Sample Size

Here’s how you can derive the number of samples n:

Chernoff Bound for Sums of Bernoulli Trials

For n independent Bernoulli trials X1, X2, …, Xn with success probability p, let X be their sum. The mean of X is

$$\mu = \mathbb{E}[X] = np$$

The (two-sided, multiplicative) Chernoff bound states that for any 0 < ε ≤ 1:

$$\Pr\big(|X - \mu| \ge \varepsilon\,\mu\big) \le 2\exp\left(-\frac{\mu\,\varepsilon^{2}}{3}\right)$$

Setting the Error Margin ϵ

To estimate p within a relative error margin of ε, we substitute μ = np and divide through by n so the bound is expressed in terms of the sample prevalence X̄ = X/n:

$$\Pr\big(|\bar{X} - p| \ge \varepsilon\, p\big) \le 2\exp\left(-\frac{n\,p\,\varepsilon^{2}}{3}\right)$$

We want the probability of estimating p within this margin to be at least 1 − δ, so the probability of the complementary event must be at most δ:

$$2\exp\left(-\frac{n\,p\,\varepsilon^{2}}{3}\right) \le \delta$$

Solving for n

Taking logarithms and rearranging,

$$n \ge \frac{3}{p\,\varepsilon^{2}}\,\ln\left(\frac{2}{\delta}\right)$$
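As a sanity check, here is a short Python helper, a sketch of my own rather than something from the conversation, that evaluates the formula above; the function name and the rounding up to a whole number of samples are my additions.

```python
import math

def min_sample_size(p, eps, delta):
    """Minimal n so the sample prevalence is within a relative error eps of p
    with probability at least 1 - delta, per the Chernoff-bound derivation above."""
    return math.ceil(3.0 / (p * eps ** 2) * math.log(2.0 / delta))

# Example with hypothetical numbers: prevalence 0.3, 10% relative error, 95% confidence.
print(min_sample_size(p=0.3, eps=0.1, delta=0.05))  # 3689
```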

Extending to Multiple Independent Binary Variables

If there are k independent binary variables, the same principles apply, but we must ensure that our sample size n is sufficient to provide accurate estimates for each of the k variables. Assuming each variable has a success probability p(i) (where i ranges from 1 to k), we can derive a similar formula for the minimal sample size n.

The minimal sample size n required for k independent binary variables, ensuring that each variable’s sample prevalence is within a relative error ε of its population prevalence with overall confidence 1 − δ, is obtained by applying a union bound over the k variables and taking the worst-case scenario (the smallest p(i)), which requires the most samples. The sample size n is therefore:

$$n \ge \frac{3}{\varepsilon^{2}\,\min_{i} p(i)}\,\ln\left(\frac{2k}{\delta}\right)$$

If the variables are not independent, the union bound still holds, so this sample size remains sufficient; the dependence may even allow a smaller one.
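Here is a sketch of the same calculation for several binary variables, assuming the union-bound argument above; the helper name and the example prevalences are hypothetical.

```python
import math

def min_sample_size_k(prevalences, eps, delta):
    """Minimal n so that every variable's sample prevalence is simultaneously
    within a relative error eps of its population prevalence with confidence 1 - delta."""
    k = len(prevalences)
    p_min = min(prevalences)  # the rarest variable drives the sample size
    return math.ceil(3.0 / (p_min * eps ** 2) * math.log(2.0 * k / delta))

# Three binary variables with prevalences 0.3, 0.1, and 0.05.
print(min_sample_size_k([0.3, 0.1, 0.05], eps=0.1, delta=0.05))  # 28725
```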

Extending to Variables with Multiple Outcome Possibilities

If the variables in question can take on l possible outcomes, with each outcome j having a probability p(j), we need to ensure that the sample proportions for each outcome closely match the population proportions.

For each outcome j, we can use a similar approach to the binary case. The probability that the sample proportion for outcome j deviates from the population proportion p(j) by more than a relative error of ε can be controlled using a concentration inequality such as the Chernoff bound.

The Chernoff bound offers a much tighter limit than tail bounds based on the first or second moment, such as Markov’s inequality or Chebyshev’s inequality, which only produce power-law decay of the tail probability. However, applying the Chernoff bound to sums requires that the random variables be independent, a condition not needed by Markov’s or Chebyshev’s inequalities.

Herman Chernoff (born July 1, 1923) is an American mathematician
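To see the difference in decay, here is a small numeric sketch (my own illustration, with hypothetical values of p and ε) comparing the Chebyshev bound with the Chernoff bound for the event that the sample prevalence deviates from p by more than εp.

```python
import math

def chebyshev_bound(n, p, eps):
    # Var(sample prevalence) = p(1 - p) / n; the deviation threshold is eps * p.
    return p * (1 - p) / (n * (eps * p) ** 2)

def chernoff_bound(n, p, eps):
    # Two-sided multiplicative Chernoff bound for a Binomial(n, p) prevalence.
    return 2.0 * math.exp(-n * p * eps ** 2 / 3.0)

p, eps = 0.3, 0.1
for n in (1_000, 5_000, 20_000):
    print(f"n={n:>6}  Chebyshev={chebyshev_bound(n, p, eps):.4f}  "
          f"Chernoff={chernoff_bound(n, p, eps):.2e}")
```

For small n the Chebyshev bound can actually be the smaller of the two, but as n grows the exponential decay of the Chernoff bound quickly overtakes the 1/n decay of Chebyshev.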

To extend the sample size calculation to l outcome possibilities, we need to consider the combined probability of deviation for all l outcomes. Using a union bound, we get:

$$\Pr\Big(\exists\, j:\ |\hat{p}(j) - p(j)| \ge \varepsilon\, p(j)\Big) \le \sum_{j=1}^{l}\Pr\Big(|\hat{p}(j) - p(j)| \ge \varepsilon\, p(j)\Big)$$

where p̂(j) denotes the sample proportion of outcome j.

Using the conservative bound for each outcome:

$$\Pr\Big(|\hat{p}(j) - p(j)| \ge \varepsilon\, p(j)\Big) \le 2\exp\left(-\frac{n\,p(j)\,\varepsilon^{2}}{3}\right)$$

Summing over all l outcomes and ensuring the total probability of deviation is less than δ, we have:

$$\sum_{j=1}^{l} 2\exp\left(-\frac{n\,p(j)\,\varepsilon^{2}}{3}\right) \le 2\,l\,\exp\left(-\frac{n\,\varepsilon^{2}\,\min_{j} p(j)}{3}\right) \le \delta$$

Solving for n with some algebra, the minimal sample size required for l outcome possibilities, ensuring that the sample proportion of every outcome is within a relative error ε of its population proportion with confidence 1 − δ, is:

$$n \ge \frac{3}{\varepsilon^{2}\,\min_{j} p(j)}\,\ln\left(\frac{2l}{\delta}\right)$$
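Finally, a sketch for the multi-outcome case; it applies the same formula, driven by the rarest outcome, and adds a Monte Carlo check that draws multinomial samples and verifies that all outcome proportions stay within the relative-error band. The function name and the example probabilities are hypothetical.

```python
import math
import numpy as np

def min_sample_size_multi(probs, eps, delta):
    """Minimal n so every outcome's sample proportion is simultaneously within a
    relative error eps of its probability with confidence 1 - delta."""
    l = len(probs)
    return math.ceil(3.0 / (min(probs) * eps ** 2) * math.log(2.0 * l / delta))

probs, eps, delta = [0.5, 0.3, 0.15, 0.05], 0.1, 0.05
n = min_sample_size_multi(probs, eps, delta)

# Empirical check: fraction of simulated samples in which ALL outcome
# proportions fall within eps * p(j) of their probabilities.
rng = np.random.default_rng(1)
counts = rng.multinomial(n, probs, size=5_000)
within_band = np.all(np.abs(counts / n - probs) <= eps * np.array(probs), axis=1)
print(n, within_band.mean())  # the empirical rate should be at least 1 - delta
```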

Conclusion

This comprehensive approach ensures that Sarah’s research proposal would be statistically sound, regardless of whether it involves binary variables, multiple independent binary variables, or variables with multiple outcome possibilities. Sarah was very happy with the clarity and depth of the solution, which would strengthen her research proposal significantly. She even remarked that buying me the coffee was worth it. This method can be generalized to other probability distributions such as Gaussian and Poisson. The major limitation is the assumption that the probability is known and relatively easy to work with, which is not always the case.
