Dall-E’s take on confidence intervals

Confidence intervals for the real world (and real people)

Cristian Dagnino · Bain Inside Advanced Analytics · 5 min read · Jul 17, 2023


Statistics doesn’t require you to follow a rigid recipe that mechanically takes you from raw data to a conclusion. Instead, it is more like a playground with certain boundaries — a football field for example.

Using statistics means understanding a host of principles so you can weave a coherent narrative from your evidence or data. You have the freedom to make moves and passes as you navigate the field, but, just as a football player cannot leap into the stands to continue the play, statistics has boundaries of its own: biases to avoid, assumptions to verify, and pitfalls to be wary of.

This is particularly the case when using confidence intervals and their twin sister, the p-value.

In academia, there are stricter rules on what is accepted. For example, having very low p-values (typically <0.05, which corresponds to a 95% confidence level) is seen as a necessary and sometimes even a sufficient condition for publication. These rules can be justified in their setting¹, but don’t necessarily translate to the business context.

Why is the business context different?

First of all, in a business context we make decisions even when we aren’t sure about them. Sometimes knowing that A is more likely than B can be enough. This is similar to how some civil cases require only a “more likely than not” standard of proof, whereas criminal trials require proof “beyond reasonable doubt”; a confidence level of 75% might be good enough in a business setting, but it wouldn’t be in an academic one.

Secondly, in a business setting we care about the practical consequences, the real impact, not just whether the effect is statistically significant. In academia, documenting a significant effect can be important and advance our knowledge even if the size of the effect has few practical consequences.

I’ll show you a simple typology of confidence intervals that illuminates these differences.

Promotion Experiment

We’re losing our cupcake customers, so the Marketing Department proposes we run an experiment. The treatment group gets a 2x1 cupcake voucher by e-mail.

We run the experiment for a month and get the results back. We estimate the extra sales due to the promotion. The effect could range from -2 (people hated our e-mail and actually bought fewer cupcakes) to 2 (people in the treatment group bought a lot more). The numbers I’m using are just examples, but let’s assume we only care about positive effects above 0.5, because anything below that wouldn’t cover the cost of running the promotion. In other words, if the true effect is, say, 0.3, that’s the same as 0 in practical terms.
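Before looking at the intervals themselves, here is a minimal sketch of how such an interval could be computed. The data, the effect sizes, and the helper name `diff_in_means_ci` are all made up for illustration; it uses a plain normal approximation for the difference in means, which is one common choice but not the only one.

```python
import numpy as np
from scipy import stats

def diff_in_means_ci(treatment, control, confidence=0.90):
    """Normal-approximation confidence interval for the difference in means
    (treatment minus control). A sketch; a real analysis might use a
    t-interval or a regression with covariates instead."""
    diff = treatment.mean() - control.mean()
    se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
                 control.var(ddof=1) / len(control))
    z = stats.norm.ppf(0.5 + confidence / 2)  # roughly 1.645 for 90%
    return diff - z * se, diff + z * se

# Made-up data: cupcakes bought per customer during the promotion month.
rng = np.random.default_rng(42)
control = rng.normal(loc=3.0, scale=2.0, size=500)
treatment = rng.normal(loc=3.7, scale=2.0, size=500)

low, high = diff_in_means_ci(treatment, control, confidence=0.90)
print(f"Estimated effect: {treatment.mean() - control.mean():.2f}")
print(f"90% CI: ({low:.2f}, {high:.2f}); relevance threshold is 0.5")
```

Changing the `confidence` argument (to 0.85 or 0.95, say) simply widens or narrows the interval, a lever that will matter again when we compare cases (c) and (d) below.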

I’ll show you six possible confidence intervals² you could get from an experiment such as this and how to interpret them.

Six types of confidence intervals, drawn at 90% confidence level

The blue ones (a, b, c) are not statistically significant, since they include the zero effect, while the green ones (d, e, f) show a statistically significant positive effect.

If we were writing a paper, we could say of all the green ones: “we found a statistically significant positive effect”, so it might be tempting to conclude that the green ones are similar to each other, and that the non-significant blue ones are also similar to each other. For practical purposes, however, this is completely untrue.

What are “practical purposes”? Based on our experiment, suppose we can only take three actions:

  1. Scale up the promotion experiment: we have good evidence that the effect is relevant to us.
  2. Cancel the experiment: we have good evidence that the effect is not relevant to us.
  3. Continue experimenting: we don’t have enough evidence, let’s continue exploring.

Thus, I’m using “practical purposes” to mean that we only care about evidence that helps us choose between these three actions. Let’s look at the six cases:

Case (f) is easy. We have estimated a large effect with high precision: of course we’ll scale up the promotion experiment in this case.

Case (a) is what we call a “precisely estimated zero”. And even though (e) is statistically significant, in practical terms it is equivalent to (a): in both cases we are quite confident that the effect is too small to cover the cost of the promotion, so both lead to cancelling the experiment.

We might think that case (b) is similar to (a): both are not statistically significant and both are centred on zero. With (b), however, it could just be that we don’t have enough data points or that we need a cleaner experiment. The effect might be zero, but it might also be positive; we’re simply unsure, and our best option might be to continue experimenting.

Cases (c) and (d) are quite similar, even though one is statistically significant and the other isn’t. They both suggest that the effect is big enough, but the evidence isn’t conclusive. We would probably choose to continue experimenting in both of these cases.

That (d) is significant while (c) isn’t is just a matter of the choice of significance level. With a 95% confidence interval both would be non-significant, whereas with an 85% confidence interval both would be significant.
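Putting the six cases together, one way to encode this reading of an interval is a small decision rule that compares the interval’s endpoints to the relevance threshold. This is only a sketch: the function name and the illustrative endpoints below are mine, chosen to roughly mimic cases (a) through (f), and the rule is no substitute for judgment about costs and context.

```python
def recommended_action(lower, upper, relevance_threshold=0.5):
    """Map a confidence interval for the promotion effect to one of the
    three actions discussed above."""
    if upper < relevance_threshold:
        # Even the optimistic end of the interval is too small to pay off.
        return "cancel"
    if lower > relevance_threshold:
        # Even the pessimistic end of the interval is big enough to pay off.
        return "scale up"
    # The interval straddles the relevance threshold: we can't yet tell
    # whether the effect is worth acting on.
    return "continue experimenting"

# Illustrative endpoints, roughly matching cases (a) through (f).
examples = {
    "a": (-0.1, 0.1), "b": (-0.7, 0.7), "c": (-0.05, 1.2),
    "d": (0.05, 1.3), "e": (0.1, 0.4), "f": (0.8, 1.4),
}
for case, (lo, hi) in examples.items():
    print(case, recommended_action(lo, hi))
```

Run on these made-up intervals, the rule reproduces the discussion above: (a) and (e) lead to cancelling, (b), (c) and (d) to continued experimentation, and (f) to scaling up.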

This typology also shows one big advantage of confidence intervals over p-values alone: we wouldn’t be able to distinguish between all six of these cases using just p-values.

For more details on interpretation of confidence intervals, you can check these references: [3], [4], [5]

Takeaway

Statistics is a powerful tool, but we need to understand our particular setting in order to use it well. In a business setting we should:

  • Be flexible about p-values and confidence levels, and be willing to lower the required confidence level when the decision warrants it. Sometimes A being more likely than B is enough to make a decision.
  • Be very conscious of the size of the effect and distinguish real world significance from statistical significance.

I hope this simple typology of confidence intervals helps you use them to make better decisions!

[1] For example, we may be worried about p-hacking, so we want to be conservative about what we accept as a conclusion. In many business settings there are fewer incentives to test multiple hypotheses until you get a significant result.
[2] I’m only considering positive effects. For simplicity, I didn’t include the “clearly negative” confidence intervals, but they can be added to complete the typology.
[3] Confidence Intervals: Linking Evidence to Practice
[4] Confidence Intervals and p-Values
[5] Understanding and interpreting confidence and credible intervals around effect estimates
