Proportion Estimation and its Discontents

Jireh Tan
5 min read · Feb 6, 2018


One of the first things we learn in Stats 101 is how to estimate proportions using p-values and confidence intervals. A toy example is usually offered, with the TA taking a conspiratorial — perhaps even slightly apologetic — tone: imagine we’re tossing a coin, and we want to know whether it’s fair (p=0.5).

We then go through the rote steps. The TA’s eyes may glaze over as he starts chanting, much like a priest performing the last rites 😵

  1. Toss the coin many times, noting the number of successes and failures
  2. With these quantities, use the normal approximation to the binomial
  3. (mumble something about n and p)
  4. Look up the values in a z-table to get confidence intervals and p-values

And we’re done! (If you’re fancy you might apply some continuity corrections for “small” n, but basically you’re Stats-101-done.)
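For concreteness, here’s what that recipe looks like in code: a minimal Python sketch of the normal-approximation (“Wald”) interval, with no continuity corrections. (The function name is mine, not anything standard.)

    import math

    def wald_interval(successes, n, z=1.96):
        """The Stats-101 recipe: compute p-hat, plug it into the normal
        approximation to the binomial, and use z = 1.96 for 95%."""
        p_hat = successes / n
        # Standard error of p-hat under the normal approximation
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        return p_hat, (p_hat - z * se, p_hat + z * se)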

Do you approve of my frown turned upside down, turned upside down?

Because I enjoy being extra petty, here’s an example. Imagine you’re interested in Donald Trump’s popularity, so you call up 2000 people and ask them if they approve of his presidency. Only 378 people say that they approve. So what’s the approval rate? You apply the TA’s formula. A flurry of p-hats and ns later, you get a mean of 0.189 and a confidence interval of (0.1718, 0.2062). Big!

Armed with this knowledge, you go out into industry and start estimating proportions. In particular, instead of heading to Pew Research or Gallup, you decide to join a startup, and you’re tasked with estimating clickthrough rates on a given button on a website. How droll. You get these numbers: 19902 people visited the website. Only 2 people clicked on the button. You apply the formula, and you get a mean of 0.0001004924, and a confidence interval of (-0.000038773408, 0.0002397582). What? Isn’t a proportion supposed to lie between zero and one? Sad!
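If you want to check my arithmetic, plugging both surveys into the wald_interval sketch from earlier reproduces these numbers, negative lower bound and all:

    p_hat, ci = wald_interval(378, 2000)   # the approval survey
    print(p_hat, ci)                       # 0.189, (≈0.1718, ≈0.2062)

    p_hat, ci = wald_interval(2, 19902)    # the clickthrough data
    print(p_hat, ci)                       # ≈0.0001005, (≈-0.0000388, ≈0.0002398)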

What’s going on here? To understand why, we have to look at point 3 of the recipe, the TA’s mumblings about n and p. But before we do that, let’s first revisit what a confidence interval actually is.

Confidence Intervals: The Recap No One Wants to Read 😢

Let’s start with the textbook definition. According to Wikipedia (as of January 2018), a confidence interval

is a type of interval estimate (of a population parameter) that is computed from the observed data. The confidence level is the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter. In other words, if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level.

Thanks, Wikipedia, super helpful. Not opaque in the slightest! 🙄

What Wikipedia means to say is as follows. Let’s say we’re interested in some unknown quantity, call it X. So you collect data regarding X. Imagine, then, that you have a procedure that, given any set of data, generates an interval on the real line. But you could always have gotten a different set of data! (Going back to our Donald Trump example, you could have phoned a different set of 2000 people.) If you were to repeat your data collection exercise a large number of times and apply the same procedure, you’d get a different interval each time. As long as these intervals resulting from the different data cover the true quantity X at least θ% of the time, these intervals may be called θ% confidence intervals.
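This “repeat the data collection many times” story is easy to act out in code. Here’s a rough simulation sketch (in Python with numpy, and pretending, purely for illustration, that the true approval rating really is 0.189): draw many hypothetical surveys, build the normal-approximation interval from each, and count how often those intervals cover the truth.

    import numpy as np

    rng = np.random.default_rng(0)

    def wald_coverage(true_p, n, trials=100_000, z=1.96):
        """Simulate `trials` repetitions of the survey and return the
        fraction of normal-approximation intervals covering true_p."""
        successes = rng.binomial(n, true_p, size=trials)
        p_hat = successes / n
        se = np.sqrt(p_hat * (1 - p_hat) / n)
        covered = (p_hat - z * se <= true_p) & (true_p <= p_hat + z * se)
        return covered.mean()

    # Phoning 2000 people about a ~19% approval rating behaves nicely:
    # roughly 95% of the simulated intervals cover the truth.
    print(wald_coverage(0.189, 2000))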

The issue is that any given procedure has its benefits and drawbacks. Let’s take a simple example. Say we’re interested in a 95% confidence interval for Trump’s approval rating. What’s a procedure that would generate intervals covering the true approval rating at least 95% of the time? Here’s one: regardless of the data you get, return the interval [0,1]. Since the true approval rating must lie between 0 and 1 inclusive, you’re definitely going to cover it 100% of the time.

What’s the issue with this procedure? It’s… not very informative. 😑 But it is a confidence interval!

(I like giving this example of the [0,1] interval because it defamiliarizes the concept of a confidence interval, a pretty counterintuitive idea that people think they understand, but usually don’t.)

Confidence Intervals Are Not God-Given: Assess Them!

We’ve implicitly stumbled on one of the most important aspects of confidence intervals: coverage. That is, if you advertise a 95% confidence interval, it had better cover the true proportion about 95% of the time.

Any more, and you could probably stand to gain by making the intervals narrower. Remember, the widest interval for estimating a proportion is [0,1], but that’s not interesting, precisely because it’s so wide. You should figure out a way to make the intervals tighter, so that you have a better idea of where the proportion plausibly lies.

Any less, and your procedure fails the definition of a confidence interval: you’re more likely to make a false positive, because your intervals are too narrow. You need to figure out why your confidence intervals aren’t performing as advertised.

Unfortunately, it’s hard to assess true coverage unless you’re running simulations. (A/A tests, for example, are one way to assess coverage.) But here’s one thing I’m telling you now: if you are assessing a proportion p that is super tiny, like a baseline rate for a conversion, or sudden cardiac death in healthy adults, n has to be extremely large in order for the confidence interval produced by the normal approximation to have appropriate coverage. That’s because the standard error is so large relative to p itself that you need to accumulate a lot of information before you attain the advertised coverage.

(Intuitively speaking, if an interval places any confidence at all on the proportion being negative, then that interval can be improved by figuring out strategies to shift that confidence back into [0, 1].)
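To make that concrete, reusing the wald_coverage sketch from the recap above on the clickthrough numbers shows the “95%” normal-approximation interval covering the truth far less often than advertised. For these numbers the coverage comes out around 86%, which is exactly the kind of silent failure an A/A test would surface.

    # A tiny true p with ~20000 visitors: the normal approximation's
    # actual coverage collapses well below the advertised 95%.
    print(wald_coverage(0.0001, 19902))   # roughly 0.86 in simulation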

Further Reading and More Suggestions

I find the work of Tony Cai at Penn really useful for concretizing one’s understanding of confidence intervals for proportions; in particular, Brown, Cai, and DasGupta’s “Interval Estimation for a Binomial Proportion” has a really cool discussion of coverage and best practices for estimating p.

When I find the time, I’ll discuss the Bayesian approach to estimating p, which turns out to be more intuitive for the statistical dabbler: credible intervals!
