How many participants should you optimally have for your usability study? Is 5 a large enough sample size, or should only samples of more than 30 be considered scientifically valid? Are there any heuristics for arriving at an optimal sample size?
During the 18th century, scurvy was a major cause of death among sailors. In a particularly disastrous voyage by George Anson (an attempt to circumnavigate the earth), only 188 of the 1800 sailors who began the journey survived.
Most of the sailors succumbed to scurvy while at sea. At that time, the cause of scurvy as well as the concept of vitamins was unknown.
In what is considered one of the first reported controlled clinical experiments in the history of medicine, James Lind, a Scottish physician in the Royal Navy, developed the theory that citrus fruits cured scurvy. This discovery gradually led to the near-eradication of scurvy, which had caused the death of 2 million sailors between 1500 and 1800 A.D.
What’s interesting is that James Lind brought about an end to this dreaded disease with a study that he conducted on 12 sailors, i.e. a sample size of just 12.
How’s that for a sample size of an experiment with a significant scientific discovery?
Usability testing is narrower in scope than other kinds of studies such as User Experience Research or Market Research. Usability testing is about behavior (i.e. how a user interacts with your product), while market research is about opinions and preferences (e.g. what a user feels about your brand colors).
Typical usability studies may include determining reasons for drop-offs from a checkout page, testing if users can successfully complete the process steps of a feature, or testing if the internal search in a website yields relevant results.
A usability study should discover the most pertinent usability problems at the minimum cost, and therein lies the challenge. Minimum cost usually means a minimum number of test participants, since costs are likely to increase linearly with the number of participants. However, too few participants can lead to erroneous results, which will end up doing more harm than good.
Thus we arrive at the question at hand: How many respondents should you optimally recruit for your usability study?
Sample Size Considerations
1. Should you conduct a usability study at all?
The answer is, unequivocally, yes. It is difficult to think of a product that cannot benefit from a usability study.
Developers might write thousands of lines of code and push it to production. However, for customers the feature might as well be invisible if they cannot understand what it does or how they should use it.
2. Adhering to scientific principles
A common snap answer to the question at hand might be 30, since it is widely regarded as a large sample size, quoted by many textbooks and research papers.
However, the choice of 30 as an acceptably large sample size is quite arbitrary. One explanation even claims that the number 30 gained popularity because it made students' textbooks pretty by fitting everything on one page.
Thus though 30 is a valid answer, it is as valid as 20 or 65. Larger sample sizes generally lead to more accurate results. Regardless of sample size it would be useful to check the statistical significance of your results.
In a usability test, we are interested in proportions — for example the proportion of people who failed to complete a task at all or within a certain amount of time. When using sufficiently large sample sizes, the Wald method for the binomial distribution gives us a simple formula to derive the required sample size with a 95% confidence level within an acceptable error margin.
Required Sample Size = 1/B² (where B is the acceptable error margin)
Thus if we are fine with a plus or minus 10% error margin, we need a sample size of 1/(0.1)² = 100.
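As a rough sketch of where this shortcut comes from: the Wald margin of error is B = z * sqrt(p(1 - p) / n), and taking z as roughly 2 for 95% confidence with the worst-case proportion p = 0.5 reduces it to n = 1/B². The Python below compares the shortcut with the fuller form. The function names are hypothetical, and z = 1.96 and p = 0.5 are assumptions chosen for illustration, not values prescribed by this article.

```python
import math


def _ceil_clean(x):
    # Round away floating-point noise (e.g. 99.99999999999999) before rounding up.
    return math.ceil(round(x, 9))


def shortcut_sample_size(margin):
    """Rule of thumb from the text: n = 1 / B^2."""
    return _ceil_clean(1 / margin ** 2)


def wald_sample_size(margin, z=1.96, p=0.5):
    """Fuller Wald form: n = z^2 * p * (1 - p) / B^2 (z and p are assumed defaults)."""
    return _ceil_clean(z ** 2 * p * (1 - p) / margin ** 2)


# Compare both estimates for a few acceptable error margins.
for margin in (0.20, 0.10, 0.05):
    print(margin, shortcut_sample_size(margin), wald_sample_size(margin))
```

The shortcut is slightly conservative: it asks for a few more participants than the fuller form, which is usually a safe direction to err in.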
To elaborate with an example, let’s say you want to test if your users can successfully use a coupon during checkout. Let’s assume you will be worried if fewer than 80% of users are able to do that. You run a test with 25 subjects and discover that 14 of them, i.e. 56%, could complete the task. A sample size of 25 corresponds to an error margin of (+/-)20%, since 1/(0.2)² = 25. Thus our true success rate is likely to lie between 36% (56–20) and 76% (56+20). Even the upper bound of this interval falls below our acceptable level.
Though this is a somewhat convenient example it should hopefully serve to explain the method above.
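The coupon example can be reproduced in a few lines. This is a minimal sketch that inverts the rule of thumb above (B = 1/sqrt(n)); the function names and the 0.80 threshold variable are hypothetical, introduced only for illustration.

```python
import math


def approx_margin(n):
    """Error margin implied by the rule of thumb n = 1 / B^2."""
    return 1 / math.sqrt(n)


def approx_interval(successes, n):
    """Approximate 95% interval for the true success rate, clamped to [0, 1]."""
    p_hat = successes / n
    b = approx_margin(n)
    return max(0.0, p_hat - b), min(1.0, p_hat + b)


acceptable_rate = 0.80  # assumed threshold from the coupon example
low, high = approx_interval(14, 25)
print(f"observed interval: {low:.0%} to {high:.0%}")
print("below acceptable level" if high < acceptable_rate else "inconclusive or acceptable")
```

If the upper bound had landed above 80%, the test would have been inconclusive, and a larger sample would be needed to narrow the interval.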
You can also use a sample size calculator to get a more accurate estimate.
Pausing to reflect for a moment, should we treat every usability study, even a small one, with the same rigour as a scientific study, or can we afford to be more forgiving?
There is the question of cost, but it would be pointless to conduct a study with unscientific methods that render the validity of the results questionable.
3. Literature and expert opinions
What do industry experts say about the number of test subjects? Jakob Nielsen speaks of 5 as an acceptable sample size. He cites diminishing returns: each additional participant uncovers fewer new usability problems. However, according to Wikipedia, this claim has been challenged both empirically and mathematically.
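The diminishing-returns argument is commonly expressed with the Nielsen and Landauer model, in which the share of problems found by n users is 1 - (1 - L)^n, where L is the chance that a single user reveals a given problem. The sketch below assumes L = 0.31, the average Nielsen reports; the function name is hypothetical, and the right value of L for your product may well differ, which is a large part of why the rule of 5 is contested.

```python
def share_of_problems_found(n_users, detect_rate=0.31):
    """Expected share of usability problems found by n_users (Nielsen-Landauer model)."""
    return 1 - (1 - detect_rate) ** n_users


# With the assumed average detection rate, returns flatten quickly.
for n in (1, 3, 5, 10, 15):
    print(n, f"{share_of_problems_found(n):.0%}")

# With a lower assumed detection rate, 5 users find far fewer problems.
print(f"{share_of_problems_found(5, detect_rate=0.10):.0%}")
```

The second call illustrates the criticism: if problems are harder to spot than the assumed average, five participants may uncover fewer than half of them.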
For qualitative research and thematic analysis, a sample size of 6 to 400+ has been suggested depending on the data collection methods and the size of project. It is also suggested that the number should not be decided at the outset but be arrived at as the study progresses.
4. Criticality of Task and Business Impact
The more critical the task being tested, the more care should be taken over its testing methodology. In an article on the subject, Carlos Rosemberg describes an ingenious method to deduce priority based on task criticality, issue frequency and issue impact.
The basics that should work for everyone — Anyone can call out the error of not having language options for a website intended for a global audience. It’s a no-brainer. For someone with a usability background, having non-standard page elements that are not readily recognisable might fall in the same category of no-brainer usability problems.
Similarly, page basics such as text readability, standard web elements, affordances, element grouping, navigation should work for anyone with some prior exposure to web pages. Even with a small sample size, failure with usability of these factors may be flagged as valid problems.
We can thus talk about a category of page elements that can be tested for usability problems even with a small sample size.
5. Convincing other stakeholders
Most of the time you are working in a team and cannot take decisions by yourself. Even when you are convinced about a certain method or a finding, you need buy-in from superiors and colleagues to act on it. You must be able to defend your method when challenged, and the best way to do that is by proving adherence to scientific methods.
6. Recruitment methods
Some oft-suggested methods such as “guerilla research” and “hallway recruiting” are likely to be rife with sampling errors and cognitive biases. For example, if you are “intercepting” people walking by your work-station, you may approach friends and colleagues more often than strangers.
While it is “quick and dirty”, it might not be the most valid.
It is important to note that for recruiting, it’s not just the “who” but, more importantly, the “when”: the importance of catching the user in the act of using your product cannot be over-emphasized. For a website, “live recruiting” and web intercepts are thus more reliable methods.
The recruitment method may influence the sample size you decide on.
7. The costs of erroneous results
There are structural engineers who study and certify the structural integrity of a 100-storey building. A usability study of a website is a comparatively low-stakes undertaking. Moreover, it affords the luxury of continuous improvement, or of rolling back to a previous version if the fixes are found to be ineffective.
Knowing this could make us compromise the study methodology. Our approach would be a lot different if there were no option of rollback or if lives were at stake.
It is almost impossible to empirically come up with a number that holds good for every situation. The key, as it usually is, is to adapt to the situation.
a. Varying the sample size as per business impact and task criticality: It might be prudent to identify the more critical aspects of the test and allocate more resources to them. A larger sample size is likely to give better results and is thus preferable.
b. Deciding acceptable error rates and deriving sample sizes: You can use any statistical approach relevant for your study, like the one mentioned above. Checking statistical significance, for example with a p-value or a confidence interval, tells you how much confidence your results deserve. You can also make use of readily available sample size calculators.
c. Budgeting: Planning the study, especially with regard to budget, can avoid a bumpy ride later. Make sure you have an idea of what you are going to end up with.
d. Involving other stakeholders: Get buy-in from other stakeholders on what can be considered legitimate methods for the study.
e. Fixing other lower-hanging aspects of the study, especially recruitment: Choosing more valid methods in other aspects of the study is likely to bring down errors.
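For item b, one way to obtain a p-value for a task-completion result is an exact one-sided binomial test. The sketch below applies it to the earlier coupon example (14 of 25 successes against an acceptable rate of 80%). The choice of test and the function name are assumptions made for illustration, not a method prescribed by this article.

```python
from math import comb


def binomial_lower_tail(successes, n, p_null):
    """P(X <= successes) when X ~ Binomial(n, p_null): a one-sided exact p-value."""
    return sum(
        comb(n, k) * p_null ** k * (1 - p_null) ** (n - k)
        for k in range(successes + 1)
    )


# How likely is a result of 14 or fewer successes if the true rate were 80%?
p_value = binomial_lower_tail(14, 25, 0.80)
print(f"p-value: {p_value:.4f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

A small p-value here means that a result this poor would be unlikely if the true completion rate really were 80%, which supports treating the coupon flow as a genuine usability problem.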
We hope this article will help you fine-tune your approach for determining the optimal number of participants in your usability studies.