Significance tests, confidence intervals, and effect sizes: setting it straight

Few things are more misunderstood in statistics than everything that has to do with significance testing, p-values, and their proper interpretations. In this post, I aim to clarify what these things are and how they are interrelated, if not to set the record straight. As usual, I'll rely on examples and intuition rather than abstract reasoning and equations, but first a few words about the data I use as illustration. The data set stems from a recent paper of mine in the journal Managing Sport and Leisure, which you find here in open-access form. It pertains to 310 football (i.e., soccer) players in the top-tier Norwegian football league in the 2022 season (excluding the goalkeepers). Note! These data are population data. Even so, I am going to treat them as a random sample from some imaginary superpopulation of football players. The reason is straightforward: since the point of doing inferential statistics (e.g., significance testing, calculating confidence intervals, etc.) is to say something about an unknown population based on known sample data, we need to analyze sample data! That is, if we have access to population data, there is no need for inferential statistics in the first place. (Admittedly, this stance is somewhat debated, but that's my take on it anyway.)

Significance tests in the Null Hypothesis Significance Testing-era (NHST)

The NHST-approach boils down to examining if there is some kind of systematic association between x and y in the unknown population by means of analysis of sample data for x and y. Let’s say we are interested in the association between number of matches played in career (x) and number of goals scored in career (y). Figure 1 presents this analysis in terms of a scatterplot and a regression line. The slope of this line — the regression coefficient — is 0.135: one more match played entails almost 0.14 more goals on average. More matches obviously create more opportunities for scoring goals. The R-squared is 0.31. (If you are rusty on regression analysis, please see my primer here.)

Figure 1. Career goals scored (y) plotted against career matches played (x), with fitted regression line.
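For readers who want to see the mechanics, here is a minimal Python sketch of such a bivariate regression. Everything in it is illustrative: the data are simulated as a stand-in for the actual player data, so the estimates it produces will only roughly resemble the numbers above.

```python
# Minimal sketch of a bivariate regression like the one in Figure 1.
# The data are simulated stand-ins, not the actual player data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 310                                              # same number of players as in the post
matches = rng.uniform(0, 400, size=n)                # matches played in career (x)
goals = 0.135 * matches + rng.normal(0, 20, size=n)  # goals scored in career (y), plus noise

X = sm.add_constant(matches)                         # add an intercept to the design matrix
fit = sm.OLS(goals, X).fit()

print(fit.params[1])   # estimated slope (the regression coefficient)
print(fit.rsquared)    # R-squared
```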

In the NHST-setting, the p-value answers whether we can generalize this regression coefficient of 0.135 to the population of interest. In our analysis, this p-value is less than 0.001, or less than 0.1 percent, making it “statistically significant” at all conventional levels used in statistical research. And here the problems start. That is, what does such a very low p-value mean? Well, for one thing it does not have anything to do with a small chance of being wrong when we argue that our sample regression coefficient also is present in the population. Not by a long shot, sorry. The p-value only concerns the null hypothesis stating that the regression coefficient in question is zero in the population. In other words, what we want to know is how close our regression coefficient of 0.135 is to its unknown counterpart in the population; what we get to know is the probability of obtaining a sample regression coefficient at least as large as 0.135 given that the coefficient in the population from which the sample is drawn is zero. (For a non-technical introduction to significance testing in a regression framework, see my post here.)
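Mechanically, that is also how the reported p-value is produced: the estimated coefficient is divided by its standard error, and the resulting t-statistic is compared with its distribution under the null hypothesis of a zero population coefficient. The sketch below shows the logic; the standard error is a hypothetical number inserted purely for illustration.

```python
# Sketch of how a regression p-value is computed under the null hypothesis
# that the population coefficient is zero. The standard error below is a
# hypothetical stand-in, not a number taken from the paper.
from scipy import stats

b = 0.135     # estimated regression coefficient (sample)
se = 0.0115   # its standard error (hypothetical, for illustration only)
n = 310       # number of players
df = n - 2    # residual degrees of freedom in a bivariate regression

t_stat = b / se
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-sided p-value
print(t_stat, p_value)
```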

The problems with the NHST-approach do not stop here, however. For example, in large samples almost every regression coefficient becomes statistically significant (in the low p-value sense of the term) merely because of the many observations. Furthermore, the size of the p-value does not tell us anything about the magnitude of the effect/regression coefficient. For these reasons (and some others we won't go into), many researchers have advocated that the NHST-approach should be abandoned altogether. What, then, should we report instead? The answer, according to many, is confidence intervals (CIs) and effect sizes. More on this below, continuing in the regression context.
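To make the large-sample point concrete, here is a small simulation (my own illustration, not from the paper): the same practically negligible slope of 0.01 typically fails to reach significance with 100 observations but becomes highly "significant" with a million.

```python
# Simulation of the large-sample problem: a substantively tiny slope becomes
# "statistically significant" once the sample is big enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def tiny_effect_pvalue(n, slope=0.01):
    x = rng.normal(size=n)
    y = slope * x + rng.normal(size=n)        # near-zero effect plus noise
    return stats.linregress(x, y).pvalue

print(tiny_effect_pvalue(100))         # typically nowhere near significant
print(tiny_effect_pvalue(1_000_000))   # typically p < 0.001 for the very same tiny slope
```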

Confidence intervals (CIs)

I’m a big fan of reporting CIs, but let’s get one thing straight for starters. In calculating a CI, we use the same numbers as when calculating a p-value. In a sense, then, a CI does not contain any new information compared to a p-value. That sorted out, a CI nevertheless tells us something about the probable magnitude of the regression coefficient in the population. In our football case, the 95% CI is [0.113, 0.158]. A tempting interpretation of this CI is that, with 95% probability, the regression coefficient in the population lies somewhere between 0.11 and 0.16. Yet this interpretation has one small problem: it is plain wrong. The regression coefficient in the population either lies in the 0.11–0.16 interval or it doesn’t! What the CI means is this: if we were to repeat the study many times over (i.e., drawing repeated samples from the same population) and calculate the CI for the regression coefficient each time, we would expect 95 percent of the CIs to contain the true coefficient in the population. Again, what we want (the size of the regression coefficient in the population) is something rather different from what we get. (Bayesian statistics might have something to chip in here, but let’s not go there.)
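This repeated-sampling interpretation is easy to demonstrate in code. The sketch below uses simulated data with a known "population" slope (a toy version of our football case, not the actual data) and checks how often the 95% CI captures that slope.

```python
# Coverage sketch: draw many samples from a population with a known slope,
# compute a 95% CI for the slope each time, and count how often the interval
# contains the true value. Simulated data, not the actual player data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
true_slope = 0.135            # the "population" coefficient in this toy setup
n, reps = 310, 2000
covered = 0

for _ in range(reps):
    x = rng.uniform(0, 400, size=n)
    y = true_slope * x + rng.normal(0, 20, size=n)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    lo, hi = fit.conf_int(alpha=0.05)[1]     # 95% CI for the slope
    covered += (lo <= true_slope <= hi)

print(covered / reps)         # should land close to 0.95
```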

Despite this obvious shortcoming, the CI might still provide some intuition about the magnitude of the regression coefficient in the population. It might also be useful to show the CI in a graph. Figure 2 does this for our football regression. One feature is especially important, namely that the CI is wider in the areas with fewer observations (players). This makes intuitive sense: where fewer observations are present, we should expect more uncertainty, which shows up as wider CIs.

Figure 2. The regression from Figure 1 with a 95% confidence interval band around the regression line.
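For completeness, here is a hedged sketch of how such a confidence band can be computed with statsmodels. Again the data are simulated, and the right-skewed x is chosen so that few observations fall at the high end, which is where the band widens.

```python
# Sketch of a pointwise 95% confidence band around a fitted regression line.
# Simulated, right-skewed data stand in for the actual player data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.0, scale=60.0, size=310)    # few players with very many matches
y = 0.135 * x + rng.normal(0, 20, size=310)

fit = sm.OLS(y, sm.add_constant(x)).fit()

grid = np.linspace(x.min(), x.max(), 50)
pred = fit.get_prediction(sm.add_constant(grid))
band = pred.conf_int(alpha=0.05)                  # lower/upper limits for the mean of y

widths = band[:, 1] - band[:, 0]
print(widths[0], widths[-1])   # the band is typically widest where observations are sparse
```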

Let’s take stock: (1) A significance test does not tell us anything about the magnitude of a regression coefficient, that is, whether the effect in question is strong in a substantive or practical sense. (2) A confidence interval (CI) does not provide a firm answer regarding the magnitude of the regression coefficient in the population. What should we do then? Enter effect sizes.

Effect sizes

In our regression framework, effect size considerations involve comparing a regression coefficient to a ready-made and established (rule-of-thumb) criterion for what constitutes a small, medium, or large effect. For our bivariate case, this entails calculating the correlation coefficient between number of matches played in career (x) and number of goals scored in career (y). This correlation coefficient is 0.56. The thresholds for a small, medium, or large effect are, respectively, 0.20 (or -0.20), 0.50 (or -0.50), and 0.80 (or -0.80); see here. Our effect might therefore be labelled medium-sized. Effect size calculations might thus illuminate practical or substantive significance (at least in my opinion), especially in the absence of external information against which to judge the effect, information most likely to be found in prior research.
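A minimal sketch of this effect-size step, with simulated stand-in data and the rule-of-thumb thresholds above hard-coded as defaults, might look like this:

```python
# Sketch: compute the Pearson correlation and label it against the
# rule-of-thumb thresholds mentioned in the text (0.20 / 0.50 / 0.80).
# Simulated data stand in for the actual matches/goals variables.
import numpy as np

rng = np.random.default_rng(5)
matches = rng.uniform(0, 400, size=310)
goals = 0.135 * matches + rng.normal(0, 20, size=310)

r = np.corrcoef(matches, goals)[0, 1]

def effect_label(r, small=0.20, medium=0.50, large=0.80):
    a = abs(r)
    if a >= large:
        return "large"
    if a >= medium:
        return "medium"
    if a >= small:
        return "small"
    return "negligible"

print(round(r, 2), effect_label(r))   # medium-sized in this simulated example
```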

Extensions

The reasoning in this post applies equally well, if not better, to the comparison of two means (by a t-test or ANOVA), but that’s something for a later post.

Takeaways and implications

Hypothesis testing within the NHST-framework and its associated p-values are marred by limitations and misunderstandings. Confidence intervals are no panacea in this regard, but they certainly add insight regarding the magnitude of the x-y association in the population. Effect sizes take such considerations one step further.

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you find them here and here. I am on ResearchGate here, and you also reach me at christer.thrane@inn.no
