Are Pilot Studies Screwed?

Michael Mullarkey
Jul 9, 2018


Definitely and Definitely Not¹

Pilot studies are massively overrated and massively underrated at the same time. So how can we use them well? Hopefully by the end of this post we’ll all have some better ideas.

“Pilot study” feels like a vague term. A Dictionary of Epidemiology defines a pilot study as “A small-scale test of the methods and procedures to be used on a larger scale.” That still feels vague, until you realize what’s missing from the definition: any test of how well the treatment works.

Also missing: Humans temporarily winning the battle against gravity

It turns out the technical term “pilot study” isn’t that vague; we just might be abusing it. Multiple sources emphasize that pilot studies are for determining whether we can effectively recruit, retain, randomize, assess, and implement our intervention in our target population. These sources also explicitly label testing the “effectiveness”² of the intervention in a pilot study as out of bounds, terrible practice. But what’s the harm in assessing whether the intervention worked? It feels a little weird to run a pilot study with the goal of treating depression without assessing whether you treated depression³.

Unfortunately, an estimate of how effective your intervention is from a pilot study will be so unstable that it’s useless at best. Why? Because there’s a confidence interval around any given effect size, and that confidence interval gets wider as the sample size goes down. The wider the confidence interval, the less that effect size can tell you about whether your intervention worked or not.

At least no matter how sad you are, you aren’t sad and also this terrifying clown doll

Just how wide are these confidence intervals at low sample sizes? Let’s be somewhat generous and say the pilot study involves 20 participants. They start out at a mean of 5.80 on our measure of depression⁴, with a standard deviation of 1.30. Over the course of the pilot study they improve to a mean of 5.30, standard deviation of 1.24. We’re thrilled, because our within-groups effect size⁵ is d = 0.40, meaning we’ve made people 4/10ths of a standard deviation better on the measure. This change is above and beyond the minimal important difference effect size of d = 0.24 that might indicate clinical significance.

But then we check the confidence interval around the effect size, and we realize we have basically no information at all. The effect size ranges from making people’s depression clinically significantly worse at d = -0.24 to a ginormous positive treatment effect of d = 1.01. This gets even worse if you have only 10 people, with effect sizes ranging from -0.51 to 1.26. You can check my math with this online effect size calculator or the compute.es package in R.
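If you want to poke at these intervals yourself, here’s a minimal sketch in base R (my own illustration, not the calculator linked above). It treats the pre- and post-test scores as two groups of size n, the way the example above does, and uses the common large-sample approximation for the variance of Cohen’s d, so calculators that use exact noncentral-t intervals will give slightly different endpoints.

```r
# Minimal sketch: Cohen's d and an approximate 95% CI from summary statistics,
# treating pre- and post-test scores as two groups of size n (as in the example above).
# Uses the common large-sample variance approximation for d, so the endpoints
# differ slightly from the intervals quoted in the post.
d_with_ci <- function(m_pre, sd_pre, m_post, sd_post, n) {
  sd_pooled <- sqrt(((n - 1) * sd_pre^2 + (n - 1) * sd_post^2) / (2 * n - 2))
  d <- (m_pre - m_post) / sd_pooled
  se_d <- sqrt((n + n) / (n * n) + d^2 / (2 * (n + n)))
  c(d = d, ci_lower = d - 1.96 * se_d, ci_upper = d + 1.96 * se_d)
}

round(d_with_ci(5.80, 1.30, 5.30, 1.24, n = 20), 2)  # d = 0.39, CI about [-0.23, 1.02]
round(d_with_ci(5.80, 1.30, 5.30, 1.24, n = 10), 2)  # d = 0.39, CI about [-0.49, 1.28]
```

Whichever formula you use, the point stands: halving the sample makes an already uninformative interval even wider.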

Even if we luck into a much larger effect size, we can’t trust it for planning our main study. If we use that inflated effect size, our main study will end up way underpowered to detect the actual effect. Why might people use these inflated effect sizes anyway? One of the first pieces of information that jumps out at me from these NIH guidelines is a recommendation to power the main trial to detect the minimal important difference between groups. In depression, that effect size is again estimated to be d = 0.24⁶. We can see why people might want to use inflated pilot study effect sizes in power analyses, because a study with 80% power⁷ to detect that effect size between two groups would require 548 participants. That’s not technically impossible, but recruiting that many participants is incredibly difficult logistically. The temptation to use a pilot effect size of, say, d = 0.80 becomes more understandable when you realize that effect size lets us recruit only 52 participants.
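To see where those participant numbers come from, here’s a small sketch using the pwr package in R (my own illustration; neither the NIH guidelines nor this post prescribes this code): a standard two-sample power analysis at 80% power and α = .05, first for the minimal important difference and then for an inflated pilot effect size.

```r
# Small sketch with the pwr package: required sample size per group for an
# independent-samples t-test at 80% power and alpha = .05.
library(pwr)

# Powering for the minimal important difference (d = 0.24)
pwr.t.test(d = 0.24, power = 0.80, sig.level = 0.05, type = "two.sample")
# n comes out to about 274 per group, i.e. roughly 548 participants total

# Powering for an inflated pilot effect size (d = 0.80)
pwr.t.test(d = 0.80, power = 0.80, sig.level = 0.05, type = "two.sample")
# n comes out to about 26 per group, i.e. roughly 52 participants total
```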

This pup is tired of paragraphs chock full of numbers, so there won’t be any more in this post!

On the other hand, a harmful-seeming or null “effect” in the pilot study might dampen enthusiasm and resources for a truly promising intervention. There’s no magical statistical technique⁸ that can get around the pitfalls of testing intervention effectiveness with so few people. As much as it pains me, this is probably one of those times where it’s better to do nothing (in regards to “just checking to see how the intervention did in the pilot”) than to do something.

So pilot studies are horribly overrated for testing whether an intervention “works” on the primary outcome. But they’re fantastically underrated for making sure our intervention “works” in a more day-to-day, process way. Not properly piloting can lead to breaches of trust with the participants we partner with and, in one high-profile case, over a billion dollars wasted on data collection that isn’t scientifically useful⁹.

The difficulties associated with improper piloting don’t have to have eye-catching dollar signs attached to cause a lot of negative ripples. The graduate student who has to try to make sense of a study with serious design flaws after the fact. The PI whose new favorite measure isn’t feasible to collect given the new study design. The staff members who have to work way overtime to make up for oversights in the participant visit structure¹⁰.

So how can we pilot well? Again, the NIH has some broadly useful guidelines here, including a list of measures we should make sure we’re tracking.

This list of measures seems like a great start, but there’s little information on how to run the pilot study day to day. I’m sure individual labs have great procedures, but I haven’t been able to find any that are open/publicly available. If I’m wrong, please let me know!

I suspect one reason for this is that, at least in some cases, whether a study is a pilot isn’t decided ahead of time. Jessica Flake talked at the APS and SIPS conferences about how most of our responses to the replication crisis have addressed statistical practices. However, measurement and design are more foundational than statistics in our pyramid of evidence.

From https://osf.io/n2dzv/

Measurement is a topic for another blog post¹¹, but design is important here. It’s tempting to turn a “pilot study” into a “preliminary efficacy study” if there’s a statistically significant result on the main outcome. Prioritizing that possibility, and the publication that could come with it, might lead to designs that less effectively measure feasibility or other outcomes listed in the NIH graphic above.

For example, certain manipulation checks that are essential for feasibility testing may decrease the effectiveness of the intervention. That’s not a big deal if you’re truly running a feasibility study, but it leads to some tough decisions if you’re maybe running a “test” of the intervention’s effectiveness. I can’t know for certain that people de-prioritize key feasibility measures in search of statistical significance on their main outcome, but the current incentive structures point researchers in that direction.

Bottom line: Pilot studies can be hugely helpful (and stay underrated) when they’re properly defined and used. In order for all of us to get better at running pilot studies, we have to admit whether they’re a pilot study ahead of time and acknowledge what those types of studies can¹² and can’t¹³ do.

I’m sure there are cool resources I’m missing here, so please tweet them to me @mcmullarkey so I can add them to this article!

1 ^ Alternate Subtitle: Because Only People Who Already Have a PhD Can Drop F Bombs in the Titles of Their Blogs

2 ^ The scare quotes are foreshadowing!

3 ^ Though depression measurement is a subject for another blog post, as well as for one Eiko Fried has already written

4 ^ Magically not measurement invariant, unidimensional

5 ^ I recognize within-groups effect sizes can (and often should) be much larger than between-groups effect sizes, but this type of example holds in many circumstances

6 ^ Yep, I referred to this earlier but I don’t want people to have to go back and read. Though I guess I now sent people down here which might take longer…

7 ^ If you’re wondering what Power is, that’s totally understandable, and here you go

8 ^ At least none that I know of, though sequential analysis comes closest

9 ^ If anyone knows of any other hard data on the costs of not piloting well, please let me know. I’ll admit this section is probably based too much on my intuition

10 ^ There are plenty of other problems with small samples not specific to pilot studies

11 ^ Check out my Twitter thread on Jessica and Eiko’s session at SIPS if you want a primer

12 ^ Feasibility!

13 ^ Testing intervention effects!
