How to calculate accurate sample size requirements by modeling an effect size distribution

To determine required sample sizes using an a priori power analysis, you need three values: a significance criterion, a level of statistical power you would like to achieve, and an effect size.

The first two details are usually straightforward. Under most circumstances, researchers set their significance criterion at α = .05, and statistical power at 80%. However, the effect size can be a bit harder to determine.

Let’s have a quick look at the influence of three different effect sizes (d = 0.2, 0.5, and 0.8) on the required sample size for a two-sample t-test by using the ‘pwr’ package in R (here’s the script).

The output from this script reveals that to achieve 80% power, I would need 393 participants per group for an effect size of 0.2, 64 participants per group for an effect size of 0.5, and 26 participants per group for an effect size of 0.8.
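
As a rough cross-check on these numbers, here is a minimal Python sketch. It is not the linked R script: it assumes a two-sided test and swaps the exact noncentral-t calculation that ‘pwr’ performs for a normal approximation with a small-sample correction, so it should only be expected to land within about one participant of the figures above. The function name `n_per_group` is my own.

```python
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample t-test.

    Assumption: normal approximation plus a z^2 / 4 correction term,
    not the exact noncentral-t solution used by power-analysis packages.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance criterion
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return 2 * (z_alpha + z_power) ** 2 / d ** 2 + z_alpha ** 2 / 4


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d):.1f} participants per group")
```

The result is left unrounded on purpose, since whether to round to the nearest integer or always round up is a reporting choice.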

Here’s a visualisation of these effect sizes using a script from R Psychologist, which I’ve modified.

Left figure: d = 0.2; centre figure: d = 0.5; right figure: d = 0.8

Why did I choose these three specific effect sizes? Like most researchers, I used Cohen’s guidelines for what constitutes a small (d = 0.2), medium (d = 0.5), and large (d = 0.8) effect size. Cohen proposed that a medium effect size should represent the average effect for a given research area (i.e., the 50th percentile). He also suggested that small and large effects should be equidistant from the medium effect (i.e., the 25th and the 75th percentiles, respectively).

If the average effect is 0.5 and the range of effect sizes is normally distributed, then Cohen’s recommendations are spot on. But Cohen never intended for these guidelines to be used as a one-size-fits-all approach. What if the average effect for a particular research area is d = 0.8? In this case, an effect size of 0.5 wouldn’t accurately represent a medium effect size.

This also has implications for sample size estimation. If the true medium effect size is closer to d = 0.7, then sample size estimation using Cohen’s guidelines won’t be accurate.

By collecting effect sizes and constructing an effect size distribution (ESD), it’s possible to accurately determine small, medium, and large effects by identifying the 25th, 50th, and 75th percentiles of the ESD.
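
The percentile step can be sketched in a few lines of Python. The effect sizes below are invented purely for illustration (they are not the HRV data), and the sketch assumes the values are already expressed as non-negative magnitudes on a common scale. The helper name `esd_benchmarks` is hypothetical.

```python
from statistics import quantiles

# Hypothetical effect sizes (Cohen's d), made up for this example only.
effect_sizes = [0.12, 0.25, 0.31, 0.44, 0.52, 0.58, 0.70, 0.85, 0.97, 1.20]


def esd_benchmarks(ds):
    """Return the 25th, 50th, and 75th percentiles of a list of effect sizes."""
    # n=4 yields the three quartile cut points; "inclusive" treats the data
    # as covering the full range of observed values.
    small, medium, large = quantiles(ds, n=4, method="inclusive")
    return {"small": small, "medium": medium, "large": large}


benchmarks = esd_benchmarks(effect_sizes)
for label, value in benchmarks.items():
    print(f"{label}: d = {value:.2f}")
```

Different percentile interpolation rules give slightly different cut points on small datasets, so the choice of method is worth reporting alongside the benchmarks.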

As a heart rate variability (HRV) researcher, I was interested in the ESD of case-control studies. There are hundreds of HRV case-control studies, which would have taken weeks to go through to extract all the effect sizes. Fortunately, there are 17 published meta-analyses on the topic that allow the easy extraction of effect sizes from individual studies.

In a new paper published in Psychophysiology, I calculated the ESD of 297 heart rate variability (HRV) effect sizes extracted from these meta-analyses.

A histogram of 297 HRV effect sizes from case-control studies with the 25th, 50th, and 75th percentile shown

I found that Cohen’s guidelines for a medium effect size are almost spot on: the 50th percentile in the ESD was d = 0.51. However, the ESD revealed that Cohen’s guidelines underestimate small and large effect sizes in this literature, as the 25th and 75th percentiles were d = 0.26 and d = 0.88, respectively.

Effect sizes and required group sample sizes using data from the HRV effect size distribution (n = 297)

At first blush, these numbers don’t seem that far off from Cohen’s guidelines. However, recalculating sample size requirements using results from the ESD reveals that these differences have important practical implications.

Based on the ESD, ONE HUNDRED AND SIXTY FEWER participants per group are required to achieve 80% power for a small effect than if I were to use Cohen’s guidelines. That is an enormous number, especially if you work with populations that are difficult to recruit.
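
To see where a saving of that size comes from, here is a hedged Python sketch that puts Cohen’s benchmarks side by side with the ESD percentiles reported above (0.26, 0.51, and 0.88). It assumes a two-sided, two-sample t-test at α = .05 and 80% power, and it uses a normal approximation rather than the exact calculation, so the per-group numbers can differ from exact results by about one participant. The names `n_per_group` and `savings` are my own.

```python
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n (normal approximation, an assumption of this sketch)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / d ** 2 + z_alpha ** 2 / 4


# (Cohen's guideline, ESD percentile) for each benchmark, taken from the text.
benchmarks = {"small": (0.2, 0.26), "medium": (0.5, 0.51), "large": (0.8, 0.88)}

savings = {}
for label, (cohen_d, esd_d) in benchmarks.items():
    n_cohen, n_esd = n_per_group(cohen_d), n_per_group(esd_d)
    savings[label] = n_cohen - n_esd  # participants per group saved
    print(f"{label}: {n_cohen:.0f} vs {n_esd:.0f} per group, "
          f"difference of about {savings[label]:.0f}")
```

Because the required sample size grows with the inverse square of the effect size, a shift of 0.06 at the small end matters far more than a similar shift at the large end.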

These recommended sample sizes are based on all HRV case-control studies, which included a mixture of disorders. To be a bit more specific, I also calculated ESD percentiles for a number of subgroups.

Not specific enough? Here’s the dataset and the R code if you just want to look at the social anxiety ESD, for instance.

This ESD analysis is especially good news for me as I’m now working more with psychosis spectrum disorders, which had a 50th percentile of d = 0.81.

Medium effect sizes for psychosis spectrum disorders using Cohen’s guidelines (left; d = 0.5) and the ESD (right; d = 0.81)

Instead of recruiting 64 participants per group for 80% power to detect a medium effect, it turns out I only need to recruit and test 25 participants per group for a case-control study. This ESD will not only save me considerable resources, but also reduce the testing burden on the local population of individuals with psychosis spectrum disorders.
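
The size of this drop follows from how sample size scales with effect size. As a rough guide (a standard normal approximation, not the exact t-test calculation), the per-group sample size is inversely proportional to the square of d, with α and power held fixed:

```latex
n \approx \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{d^{2}}
\quad\Longrightarrow\quad
\frac{n_{d=0.81}}{n_{d=0.5}} \approx \left(\frac{0.5}{0.81}\right)^{2} \approx 0.38
```

In other words, moving the medium benchmark from 0.5 to 0.81 cuts the required sample to a bit under 40% of the original, which takes 64 per group down to roughly 25.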

A final thing to note is the potential influence of publication bias and p-hacking on the ESD. There’s a chance that some studies have been left in the file drawer or been p-hacked, which would distort the ESD. To check whether statistically significant effects are over-represented in HRV case-control research, I constructed a one-sided contour-enhanced funnel plot. As statistical significance can be calculated with a combination of effect size and standard error, it’s possible to superimpose key levels of statistical significance (p = .1, p = .05, p = .01) on a funnel plot.
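
The link between effect size, standard error, and the significance contours can be sketched as follows. This is a minimal Python illustration that assumes a one-sided z-test (effect divided by its standard error, compared against the standard normal); it is not the plotting code used for the figure, and the function names and example study are hypothetical.

```python
from statistics import NormalDist

LEVELS = (0.10, 0.05, 0.01)  # significance contours named in the text


def one_sided_p(d, se):
    """One-sided p-value for effect d with standard error se (z-test assumption)."""
    return 1 - NormalDist().cdf(d / se)


def contour_boundary(se, level):
    """Effect size at which a study with this standard error reaches `level`."""
    return NormalDist().inv_cdf(1 - level) * se


def contour_band(d, se):
    """Label the significance band that a single study falls into."""
    p = one_sided_p(d, se)
    if p < 0.01:
        return "p < .01"
    if p < 0.05:
        return ".01 <= p < .05"
    if p < 0.10:
        return ".05 <= p < .10"
    return "p >= .10"


# Hypothetical study: d = 0.5 with a standard error of 0.25.
print(contour_band(0.5, 0.25))
for level in LEVELS:
    print(f"p = {level}: boundary at d = {contour_boundary(0.25, level):.2f}")
```

Each contour is a straight line through the origin in the effect size by standard error plane, which is why the shaded regions in a contour-enhanced funnel plot fan out as standard error increases.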

A one-sided contour-enhanced funnel plot of 297 HRV case-control effect sizes

There doesn’t seem to be an over-representation of effects in the orange and red significance contours and there are also plenty of studies with p-values greater than 0.1. So together, there’s not much evidence of p-hacking or publication bias, at least from an inspection of this plot.

Accurate effect size estimation is a crucial element of study planning. With too few participants, your study will be underpowered and less likely to replicate. Recruiting too many participants is simply a waste of resources.

By making the R script available to model the ESD, I hope that other researchers can use this approach to calculate accurate sample sizes for their own work.


Made it this far and would like to hear more? I discuss this paper in a recent episode of Everything Hertz, a podcast I co-host with James Heathers. Subscribe to Everything Hertz via iTunes or your favourite podcast app.
