Inference, estimates, p-values, and confidence limits

A frequentist approach

Dr. Marc Jacobs
6 min read · Sep 28, 2021

We conduct experiments to fulfill our intrinsic need to observe, know, and control. By analyzing the data collected, we hope to answer questions and test hypotheses that extend the knowledge we have about the world and its processes. A little bit more, each time.

In frequentist statistical theory, this hunting exercise is called inference, which is synonymous with extending your sample findings to the population from which the sample is supposed to come.

“Do the findings from my experiment and model resemble a ‘real world’ effect?”

The catalyst for the explosive use of the frequentist paradigm in statistics is the p-value, or probability value. Despite being continuous, and despite depending on the underlying probability distribution used, the p-value is often treated as a dichotomy: a p-value at or below 0.05 indicates a real effect, a p-value above 0.05 indicates chance.

An interpretation that is more true to its nature is that a cut-off of 0.05 means accepting a 5% or lower probability that a significant estimate (or difference) is not 'real' (the False Positive Rate). Often, but mistakenly, one can then be heard saying that the difference found is 'due to chance'.

“A p-value at or below 0.05 indicates a real effect, a p-value above 0.05 indicates chance.”

Whether true or false, this dichotomy is still widely accepted, and often a necessity for publication and for receiving research grants, which has led to an equally persistent form of interpretation.

The use of p-values has gotten way out of control.

These ‘interpretation guidelines’ are both humorous and painful to read, as the p-value trap is all too real. I, for one, have fallen prey to it more than once, as it is alluring in its simplicity and its consequences. In addition, contemporary statistical software has made it far too easy to deploy sophisticated algorithms in search of that coveted p-value boundary. One can easily forget that the appropriateness of p-values (if appropriate at all) rests completely on the appropriateness of the model and probability distribution used.

“By using contemporary statistical software, it has been made far too easy to deploy sophisticated algorithms in search of that coveted p-value boundary.”

For instance, most models assume that the error part is made up of observations that are i.i.d.: independent and identically distributed random variables. This means that when you compare your model predictions to your observations, the differences (the residuals) should contain no structure at all, just random noise.

Normal and homogeneous studentized residuals in a set of 1,000 data points simulated from the Normal distribution.
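To make this concrete, here is a minimal sketch of what such a check could look like; it is my own illustration in Python, not the code behind the figure above: simulate Normal data, fit a simple linear model, and inspect the studentized residuals for any leftover structure.

```python
# Minimal sketch (not the original code): check the i.i.d. assumption by
# simulating data, fitting a model, and inspecting studentized residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

n = 1000
x = rng.uniform(0, 10, size=n)
# True relationship plus i.i.d. Normal noise
y = 2.0 + 0.5 * x + rng.normal(loc=0, scale=3, size=n)

model = sm.OLS(y, sm.add_constant(x)).fit()
studentized = model.get_influence().resid_studentized_internal

# If the i.i.d. assumption holds, the studentized residuals should look like
# Normal noise: roughly mean zero, constant spread, no pattern against x.
print(f"mean of studentized residuals: {studentized.mean():.3f}")
print(f"std  of studentized residuals: {studentized.std():.3f}")
```

Plotting these residuals against the fitted values or against x is the usual next step; any trend or funnel shape hints that the model, not the data, is the problem.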

Once your estimates can be trusted to at least not violate the assumptions of your model — which mimic the assumptions of the probability distribution used to analyze your data — most researchers deem it time to look at the p-values. Luckily, the majority of statistical software also automatically provides confidence intervals.

In short, a confidence interval represents the range of results I could expect if I were to repeat the same experiment and analysis a given number of times. For instance, a 95% confidence interval gives me a range of plausible values, constructed so that in 95 out of 100 repeated trials an interval built this way would capture the true value.
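The repeated-sampling interpretation is easy to verify by simulation. The sketch below, my own illustration rather than anything from the original analysis, builds a 95% t-interval for the mean in many simulated experiments (using the same mean of 791.5 and variance of 1000 as later in this post) and counts how often the interval covers the true mean.

```python
# Sketch: coverage of a 95% confidence interval under repeated sampling.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sd, n, reps = 791.5, np.sqrt(1000), 10, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sd, size=n)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95% interval
    lower, upper = m - t_crit * se, m + t_crit * se
    covered += (lower <= true_mean <= upper)

print(f"Coverage: {covered / reps:.3f}")        # should be close to 0.95
```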

Personally, I consider confidence intervals to represent the redeeming side of p-values, but they are not without caveats of their own. For sure, they have been just as scrutinized. Nevertheless, if you are to use the frequentist approach to statistics, I would rather see you use confidence intervals than p-values.

Confidence limits, but not estimates of the mean, are susceptible to changes in sample size, even in a perfect i.i.d. simulation built from N = 100k draws.
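The point in the figure above can be reproduced with a few lines of code. This is a sketch under my own assumptions about how such a figure is built, not the original script: as the sample size grows, the estimated mean stays put while the 95% interval shrinks.

```python
# Sketch: mean estimates are stable across sample sizes, CI widths are not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, sd = 791.5, np.sqrt(1000)

for n in (10, 100, 1_000, 10_000, 100_000):
    sample = rng.normal(true_mean, sd, size=n)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(0.975, df=n - 1) * se
    print(f"n={n:>6}  mean={m:8.1f}  95% CI width={2 * half_width:7.2f}")
```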

Without falling too much into the trap of defining what confidence limits are, or are not, it is perhaps easier to just simulate how p-values and confidence limits are two sides of the same coin. This means that if I accept a p-value cut-off of 0.05, I should also use 95% confidence limits, i.e. a 0.95 confidence interval. The reason is that they are both bound to the same false positive rate that I accept.

To show how estimates, p-values, and confidence limits interact, I simulated five treatments by drawing 10 samples per treatment from the same Normal distribution (mean = 791.5, variance = 1000). I then compared treatments 1 and 2, using a p-value cut-off of 0.05 and 95% confidence intervals. This process was repeated 50 times, leading to 50 comparisons of treatment 1 vs treatment 2.

Black squares are p-values set against the 0.05 cut-off, red lines are 95% confidence limits, and the green line marks a ‘significant’ difference based on the 0.05 p-value threshold.
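For readers who want to play with this themselves, here is a sketch of the simulation just described. It assumes a two-sample t-test with an equal-variance 95% interval for the difference in means; the original analysis may have used a different model, but the mechanics are the same.

```python
# Sketch: five treatments drawn from the same Normal(791.5, var=1000),
# treatment 1 vs treatment 2 compared in 50 repetitions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)
true_mean, sd, n_per_group, reps = 791.5, np.sqrt(1000), 10, 50

sig_p = sig_ci = 0
for _ in range(reps):
    groups = rng.normal(true_mean, sd, size=(5, n_per_group))
    t1, t2 = groups[0], groups[1]
    p = stats.ttest_ind(t1, t2).pvalue
    diff = t1.mean() - t2.mean()
    se = np.sqrt(t1.var(ddof=1) / n_per_group + t2.var(ddof=1) / n_per_group)
    t_crit = stats.t.ppf(0.975, df=2 * n_per_group - 2)
    excludes_zero = (diff - t_crit * se > 0) or (diff + t_crit * se < 0)
    sig_p += (p < 0.05)
    sig_ci += excludes_zero

print(f"'Significant' p-values out of {reps}:     {sig_p}")
print(f"95% CIs excluding zero out of {reps}:     {sig_ci}")
```

Because the cut-off (0.05) and the interval level (95%) imply the same false positive rate, the two counts agree in every repetition: the two metrics really are two sides of the same coin.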

Despite the small number of repetitions and simulations, the picture above clearly highlights several issues inherent to both p-values and confidence limits:

  1. A single significant effect is found where there should be none. This is actually less than what we would expect given the 5% False Positive Rate we allowed in our simulation. Welcome to the world of simulation and probability!
  2. P-values are inherently uninformative. Not only can different confidence intervals — different in placement, not size — have the same p-value, but the p-value by itself does not really tell you anything about the size of the effect.
  3. Significant p-values can have confidence limits that are close to the border of ‘non-significance’, which equals zero. This means that a p-value, even if deemed ‘significant’, needs to be considered in a more holistic view to assess the usefulness of the estimate it is attached to. Hence, it could very well be that a significant effect is not clinically relevant.

“A p-value, even if deemed ‘significant’, needs to be considered in a more holistic view to assess the usefulness of the estimate it is attached to.”

I already mentioned that the relationship between the p-value and the confidence limit resembles two sides of a coin. Below you can see what happens to the interpretation of the estimates if I do not adjust the level of the p-value to the level of the confidence limit, and vice versa.

Not changing the level of the confidence limit to mimic the level of the p-value means that we will find ‘significant’ p-values combined with non-significant confidence limits, and vice versa. In short, both metrics will have different false positive rates. Due to the small number of repetitions and simulations, the theoretical false positive rates are not exactly reached.
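A small variation on the earlier sketch shows the mismatch directly. This is again my own illustration: testing at the 0.05 level while reporting 99% confidence limits makes the two metrics disagree in a noticeable fraction of repetitions, because they now imply different false positive rates.

```python
# Sketch: a 0.05 p-value cut-off paired with a 99% confidence interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, sd, n, reps = 791.5, np.sqrt(1000), 10, 10_000

disagree = 0
for _ in range(reps):
    a = rng.normal(true_mean, sd, size=n)
    b = rng.normal(true_mean, sd, size=n)
    p = stats.ttest_ind(a, b).pvalue
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    t_crit = stats.t.ppf(0.995, df=2 * n - 2)      # 99% interval
    ci_excludes_zero = (diff - t_crit * se > 0) or (diff + t_crit * se < 0)
    disagree += ((p < 0.05) != ci_excludes_zero)

print(f"p-value and 99% CI disagree in {disagree / reps:.1%} of repetitions")
```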

Personally, I do not like p-values at all, but I do recognize that they are still used a lot and that researchers depend on them. Hence, I believe it would be foolish of me to create a post in which I promote the abolishment of the p-value before first showing its inherent limitations, and its connection to the confidence limit, which I consider to be a far more informative metric.

So, if you are not able to get rid of the p-value, I hope you remember that p-values and confidence limits are entwined, but that the latter is intrinsically far more informative. In effect, the confidence limit provides you with a range of plausible values, provided that your model assumptions are met.

Now, you need to convince journal editors to get rid of mean (standard error) tables, and start showing and reporting confidence limits instead.


Dr. Marc Jacobs

Scientist. Builder of models, and enthusiast of statistics, research, epidemiology, probability, and simulations for 10+ years.