Who needs backup dancers when you can have confidence intervals?

Published in

IBM Design

8 min read · Aug 15, 2017

At IBM, I’m constantly struggling to increase my sample sizes of research participants. I mean, how many eligible Power Systems administrators with data center footprints of fewer than 100 servers are out there and want to talk to me about baseboard management controllers? ... Anyone?

UX research on enterprise-class servers is not exactly a topic that you can waltz into Starbucks with and see if someone wants to take your survey. I’ve had to get creative over the years with how I can collect research that is as valid as possible with the limited pool of users I can find. Most of my UX research is qualitative in nature, but I try to augment that as much as I can with quantitative metrics to substantiate some of my claims.

Qualitative research is great at telling us the why, but you need a mixture of quantitative research as well to tell you what is happening and what the biggest areas of opportunity are. For example, it can be helpful to collect something like a System Usability Scale (SUS) score or a Kano model as a benchmark to establish the baseline at the beginning of a project, and then measure again at the end to gauge the impact created through design. Find out where the gaps are first, and then dig into why they are happening with more qualitative methods.

By using quantitative measurements as benchmarking tools, there can be some comfort in simply measuring directional improvement. Your score doesn’t necessarily have to be statistically valid if your SUS at the beginning of the project was a 36 and by the end was an 88; a gap like that indicates at least some kind of positive trend in the right direction.

However, something still irks me about presenting any numbers like these and having to confess that my sample size was only 10 users. It’s like you’re in the hot seat in front of a table of executives, making bold claims about why users think their product is crap, and then suddenly your argument starts looking a lot more hole-ridden, going hot to tepid in a matter of minutes.

So how can you substantiate your quantitative benchmarks (especially those with smaller sample sizes)? Calculate a confidence interval!

Disclaimer: I am by no means a statistics expert, and am only going to walk you through an example with step-by-step instructions for calculating a 95% confidence interval. I’ll include some other great resources at the bottom that have a much more in-depth explanation and robust tutorials on confidence intervals as well!

What is a confidence interval?

A confidence interval is a range of values, calculated from your sample, that is likely to contain the true value for the whole population you are studying. Think about it: the smaller the sample size you have, the more your results can swing depending on who happened to be in it. So how well do those responses represent the entire population? I can’t talk to all 100,000 users in my market segment, so how can I use the responses from the 30 people I did talk to in order to hypothesize what all 100,000 might say? And if I talk to 30 other people in a second study, how likely is it that their responses will be the same as the first 30 people I talked to?

So, the first lesson to remember: the smaller the sample, the more your results can vary from one sample to the next, and the bigger the margin of error. If you talk to only 2 people, their answers could be complete opposites! The more people you talk to, the more representative your data will be. The margin of error is the amount your estimate could plausibly be off in either direction; every time you’ve seen someone cite in a study that “the result was X, plus or minus 8%,” that is the margin of error.

You can calculate confidence intervals at varying degrees of confidence. A 95% confidence interval is pretty standard, but depending on the scenario you might want to have more or less confidence. For example, if you’re designing a piece of emergency medical equipment, you want to be as certain as possible about your findings, so you might require 99% confidence. If you’re conducting research about a new pattern of trout fly, being 80% confident in your findings might be enough. For most software-related projects, a 95% confidence interval should suffice.

*Visually, you can see that if you require higher confidence, your confidence interval will be wider.*

Your confidence interval will get wider in the following situations:

Higher variability in the sample
Higher confidence required
Smaller sample sizes
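To see those three effects in numbers, here is a minimal Python sketch. The function name `margin_of_error` and all of the input values are made up for illustration, and it assumes a normal (z) multiplier, which is only an approximation for small samples.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(std_dev: float, n: int, confidence: float = 0.95) -> float:
    """Approximate margin of error for a mean, using a normal (z) multiplier."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * std_dev / sqrt(n)


# Hypothetical baseline: standard deviation of 2.0, 30 participants, 95% confidence.
baseline = margin_of_error(2.0, 30)

more_variable = margin_of_error(3.0, 30)         # higher variability in the sample
more_confident = margin_of_error(2.0, 30, 0.99)  # higher confidence required
smaller_sample = margin_of_error(2.0, 10)        # smaller sample size

for label, value in [
    ("baseline", baseline),
    ("more variable", more_variable),
    ("99% confidence", more_confident),
    ("smaller sample", smaller_sample),
]:
    print(f"{label:>15}: +/- {value:.2f}")
```

Each of the three variations should print a larger margin than the baseline, which is the same thing as a wider interval.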

I’ll walk you through all of the steps you need to get through in order to calculate a 95% confidence interval for continuous data, with an example scenario.

What a 95% confidence interval looks like; the yellow area represents the total population.

One of the first things you should think about when calculating a confidence interval is what kind of data you have:

Continuous data (Likert scales, ratings, anything that is non-binary) will require you to calculate your confidence interval using the average (mean). (I’ll go through an example of this type below.)
Discrete binary data (1 or 0, yes or no, pass or fail, etc.) will require you to calculate your confidence interval using the proportion.
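This article only walks through the continuous case, but for a rough idea of what the proportion-based version could look like, here is a sketch in Python. The pass/fail results are hypothetical, and the calculation assumes the simple normal approximation with the same "multiply by two" shortcut used later in this article. Treat the result as rough, especially with very small samples or proportions close to 0 or 1.

```python
from math import sqrt

# Hypothetical task-completion results: 1 = pass, 0 = fail.
results = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

n = len(results)
proportion = sum(results) / n  # share of participants who passed

# Standard error of a proportion (normal approximation).
standard_error = sqrt(proportion * (1 - proportion) / n)

# Rough 95% margin of error: two standard errors.
margin = 2 * standard_error

# A proportion cannot go below 0 or above 1, so clamp the limits.
lower = max(0.0, proportion - margin)
upper = min(1.0, proportion + margin)

print(f"{proportion:.0%} passed, roughly {lower:.0%} to {upper:.0%}")
```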

Example Scenario

Let’s say that you conducted a survey for your current project, and on your last question you had users rate your product on a 7-point Likert scale from very positive (1) to very negative (7). You need to vet how representative your results from your 11-person sample are in the greater population of users.

Since this is a continuous data set, there are 4 steps we’ll follow, and (don’t worry!) I’ll take you through each one. Follow the process below in order to calculate your confidence interval, and figure out how these results might sway a bit should you collect a different sample.

Average (mean) (If this was discrete binary data, you would use a proportion here)
Standard deviation
Standard error of the mean
Confidence interval (upper limit and lower limit)

Step 1: Calculate the average.

If I have all of my respondents’ Likert scale choices in an Excel sheet, I can just use the =AVERAGE() function to calculate the mean. If you do it manually, simply add up all of the Likert scores and then divide by the total number of scores (11, in this case). For this data, the average came out to 4.27.
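If you would rather script it than use Excel, the same step might look like this in Python. The article does not list the individual responses, so the `ratings` below are hypothetical values, picked only because they reproduce the 4.27 average.

```python
from statistics import mean

# Hypothetical ratings (1 = very positive, 7 = very negative) from 11 respondents.
ratings = [1, 1, 2, 3, 4, 5, 5, 6, 6, 7, 7]

# Manual version: add up the scores and divide by how many there are.
manual_average = sum(ratings) / len(ratings)

# Library version, the equivalent of Excel's =AVERAGE().
average = mean(ratings)

print(round(average, 2))  # 4.27
```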

Step 2: Find the standard deviation.

As I noted above, if you have all of your survey results in Excel, then you can simply use the =STDEV() function on your data set to calculate the standard deviation. If done manually, subtract the mean from each score, square each of those differences, add the squares up, divide the total by n - 1, and take the square root of the result. The standard deviation simply measures the variability between your data points within the sample you collected.
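Written out, the usual formula for the standard deviation of a sample looks like this, where each x_i is one respondent’s score and x-bar is the mean from step 1:

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n - 1}}
```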

Remember from above that n = the number of people in your study (in this case, 11). If you are calculating the variance for an entire population, divide by n. For a sample of the entire population, divide by n - 1. In this case, we only have a sample of 11 people from a greater population of users, so we would divide by (n - 1), or (11 - 1).
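Here is a small Python sketch of the n versus n - 1 distinction. The `ratings` are hypothetical (the article does not list the raw responses); `stdev` divides by n - 1 and `pstdev` divides by n.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Hypothetical ratings from 11 respondents (not the article's actual data).
ratings = [1, 1, 2, 3, 4, 5, 5, 6, 6, 7, 7]

n = len(ratings)
average = mean(ratings)

# Manual version: sum of squared differences from the mean, divided by n - 1.
squared_differences = sum((x - average) ** 2 for x in ratings)
manual_sample_sd = sqrt(squared_differences / (n - 1))

sample_sd = stdev(ratings)       # sample: divides by n - 1
population_sd = pstdev(ratings)  # entire population: divides by n

print(f"sample: {sample_sd:.2f}, population: {population_sd:.2f}")
```

Dividing by n - 1 always gives the slightly larger number, which is the more cautious choice when you only have a sample.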

Step 3: Find the standard error of the mean.

The standard error of the mean estimates how much variability there could be between samples, while the standard deviation measures how much each piece of data within your sample might vary. To find it, divide the standard deviation from step 2 by the square root of your sample size (n).
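As a formula, the standard error of the mean is the standard deviation s divided by the square root of n. The standard deviation for this example is not printed in the text; the value of roughly 2.24 used below is an assumption, worked backward from the final interval reported at the end of the article (2.92 to 5.62, so a margin of 1.35 and a standard error of about 0.675).

```latex
SE_{\bar{x}} = \frac{s}{\sqrt{n}} \approx \frac{2.24}{\sqrt{11}} \approx 0.675
```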

Step 4: Calculate the confidence interval.

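That’s it! You’re ready to calculate the confidence interval now! The margin of error is how far above and below your mean the interval extends, so it marks the upper and lower limits of your confidence interval. To find it, all you have to do is multiply the standard error that you found in step 3 by two. Then add the margin of error to your mean to get the upper limit, and subtract it from your mean to get the lower limit.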
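Putting the four steps together, a minimal Python sketch might look like the following. The function name `confidence_interval` and the `ratings` are hypothetical; the ratings were picked only because they reproduce the numbers in this example, and the multiplier of 2 is the shortcut described above rather than an exact critical value.

```python
from math import sqrt
from statistics import mean, stdev


def confidence_interval(data, multiplier=2.0):
    """Return (lower, upper) as mean +/- multiplier * standard error."""
    average = mean(data)                                    # step 1
    standard_deviation = stdev(data)                        # step 2 (divides by n - 1)
    standard_error = standard_deviation / sqrt(len(data))   # step 3
    margin = multiplier * standard_error                    # step 4
    return average - margin, average + margin


# Hypothetical ratings from 11 respondents (not the article's actual data).
ratings = [1, 1, 2, 3, 4, 5, 5, 6, 6, 7, 7]

lower, upper = confidence_interval(ratings)
print(f"{lower:.2f} to {upper:.2f}")  # 2.92 to 5.62
```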

*Statistical note: if you’re familiar with statistics, here’s where you could get more complicated by looking up the appropriate z-value (or, for small samples like this one, t-value) for various levels of confidence in order to gain a more precise measurement. For the sake of this article and the practical application of confidence intervals to everyday UX research, I’m rounding the value required for a 95% confidence interval (about 1.96) up to 2 to make it easier.
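For the curious, here is a rough Python comparison of the shortcut with the looked-up values. The standard error of 0.675 is an assumption, worked backward from the interval in this example, and 2.228 is the standard table value of t for 95% confidence with 10 degrees of freedom (n - 1 for 11 participants).

```python
from statistics import NormalDist

average = 4.27          # mean from step 1
standard_error = 0.675  # assumed: implied by the interval in this example

multipliers = {
    "rule of thumb (2)": 2.0,
    "z-value": NormalDist().inv_cdf(0.975),  # about 1.96
    "t-value, df = 10": 2.228,               # from a standard t-table
}

intervals = {}
for label, multiplier in multipliers.items():
    margin = multiplier * standard_error
    intervals[label] = (average - margin, average + margin)
    print(f"{label:>18}: {average - margin:.2f} to {average + margin:.2f}")
```

With a sample this small, the t-value gives a somewhat wider interval than multiplying by two, so the shortcut is a little optimistic here.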

There you have it! You calculated a confidence interval. You can use this additional data to back up your argument that while your sample size was small (only 11 participants), you can say with 95% confidence that the average rating across your whole population of users would fall between 2.92 and 5.62 on your Likert scale.

You can use this additional lens to back up any data-driven conclusions you are making in presentations, as well as to give yourself a reasonable sense of how conclusive your results really are.

Here’s a great step-by-step tutorial on confidence intervals from Usable Stats, if you want to practice calculating one yourself: http://www.usablestats.com/tutorials/CI

MeasuringU has a lot of great resources on confidence intervals, including: https://measuringu.com/ci-10things/

At IBM in Austin, a group of design researchers meets for lunch a few times a month to discuss research topics of interest. Afterwards, the researchers in IBM Power Systems, the primary conversation facilitators, collect and note the highlights of the conversation. This is one of a series of posts about the lunches from IBM Power Systems researchers.

Stefanie Owens is a UX researcher at IBM based in Austin, Texas. The above article is personal and does not necessarily represent IBM’s positions, strategies or opinions.