Embracing the Power of Statistics in Usability Testing: Insights from a UX Designer | Part 2

Adrianna Modrzyńska
Design@ING

--

Buckle up, because we’re about to level up our knowledge journey! In the previous chapter, we took a comfy stroll through the basics of statistics, laying down a rock-solid base for you. Now, we’re going to step it up and wander into a more exciting territory. But hey, don’t sweat it! I won’t throw a truckload of tricky stuff at you. Our main goal? To arm you with the key tools for winning at quantitative usability testing.

In this sequel, we’re going to get our hands dirty by learning how to choose just the right statistical test. To make this as real as possible, we’ll be digging into some examples from everyday life. We’ll also crack open the mystery of sampling and explore how it can switch up our view on statistics.

Hypothetically speaking, this needs mentioning

When we set out to run a test, we always carry a hunch or two, even if we keep them to ourselves. It could be as simple as believing users will act quicker if we move a button to the top of a webpage or guessing that Design A will outdo Design B by 10%. We call it a hypothesis.

Let’s delve into hypothesis testing, which primarily swings around two core ideas: the null hypothesis and the alternative hypothesis. Picture the null hypothesis as the status quo, where everything is business as usual. The alternative hypothesis, on the other hand, is our object of curiosity, hinting at something fresh and significant happening. We put this method to work to hunt for proofs that question the null hypothesis and lend weight to the potential of the alternative hypothesis.

Table of error types

Even though it might be tempting to prove the null hypothesis wrong, we must always remember that our mission in hypothesis testing is to uncover the truth, not to cling onto our initial guesses. It could be a bit of a letdown to find out that our original assumptions are not actually true. However, embracing these results is vital in making data-driven decisions.

Hypothesis testing isn’t bulletproof either. It can stumble upon two main types of mistakes: Type I and Type II errors. A Type I error, or a false positive, pops up when we wrongly dismiss a true null hypothesis. On the flip side, a Type II error, or a false negative, surfaces when we overlook rejecting a false null hypothesis.

To give you a clearer picture, in usability testing, Type I errors might sprout from bugs in our test, causing some participants to struggle with finishing the task. Meanwhile, Type II errors could creep in when a group of participants sail through the task, but the app drops the ball on logging their progress. These are just examples, as a wide variety of reasons could set off these errors. To cut down on these errors, it’s crucial to take into account the specifics of your statistical analysis.

Let’s choose the right test!

Alright, time to tackle an exciting question: how do we pick the best test for our dataset? A few key considerations can guide us to the right answer.

Types of statistical tests in usability testing

Is your numerical data normally distributed?

Remember the numerical data we chatted about in part 1 of this series? Like how long it takes to complete a task? We need to find out if this data fits into a pattern we call ‘normal distribution’. If it does, great! We can roll with something called a parametric t-test. But what if our data isn’t playing by the ‘normal distribution’ rules, or we haven’t checked? No worries! We can switch gears to the non-parametric counterpart, which goes by the name of the Mann-Whitney U test.
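If you’d like to see how that decision plays out in practice, here’s a minimal Python sketch using SciPy. The task times are invented purely for illustration: we run a Shapiro-Wilk normality check on each group and let the result steer us towards the t-test or the Mann-Whitney U test.

```python
from scipy import stats

# Hypothetical time-on-task samples (in seconds) for two designs
design_a = [34, 41, 29, 55, 38, 47, 62, 33, 45, 39]
design_b = [28, 31, 26, 40, 30, 35, 44, 27, 33, 29]

# Shapiro-Wilk: a p-value above 0.05 means we found no evidence
# against normality for that sample
_, p_a = stats.shapiro(design_a)
_, p_b = stats.shapiro(design_b)

if p_a > 0.05 and p_b > 0.05:
    print("Both samples look normal: go with the parametric t-test")
else:
    print("Normality is doubtful: go with the Mann-Whitney U test")
```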

Did you test with two different groups?

Next, we’ve got to figure out if we’re dealing with two separate groups in our test. This brings us to another consideration: is our data paired or unpaired? In user experience studies, using paired tests can stir in some biases we’d rather avoid. Say, a participant might learn a thing or two from the first design they see, and this could color their reaction to the designs that follow. One way to sidestep this is by mixing up the order in which designs are shown. More often than not, in usability testing, we turn to unpaired tests to contrast different groups of participants or to compare separate studies. So, let’s keep these factors in mind as we advance on our statistical journey!

Example for numerical data

A study I conducted recently revolved around direct debit. Here, my aim was to compare and contrast two distinct designs via a usability test. I started by conducting a ‘normality test’ — a statistical technique used to understand whether our dataset follows what we call a ‘normal distribution’. Imagine a perfectly balanced seesaw, where the data points on either side are evenly distributed. That’s what a normal distribution looks like!

In this case, my normality test told me that our dataset was a bit off-balance. It did not follow a normal distribution. So, I switched to a different statistical tool, the Mann-Whitney U test. This test doesn’t assume that data is normally distributed and it can help to identify any significant differences between two groups — or, in our case, two designs.

Mann-Whitney U-test for two groups

In this testing world, our good old H0 stands for a ‘no difference’ scenario between the two designs. It’s our neutral starting point before we gather any evidence. My aim was to compare the designs by focusing on ‘time on task’ — the amount of time users took to complete tasks using the designs. To my surprise, users were quicker with Design 2 than Design 1.

The final piece of the puzzle is the ‘p-value’. Think of it as a measuring stick for our significance level, often represented as α. If the p-value is below 0.05, it’s like a green light indicating there is a meaningful difference between the designs. In other words, it supports our ‘alternative hypothesis’ — the idea that something interesting is going on between the two designs. With the p-value below 0.05 in my test, I concluded that Design 2 was indeed significantly faster than Design 1. So, based on this evidence, I decided to move forward with Design 2.
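For a feel of what that analysis looks like in code, here’s a minimal SciPy sketch. The time-on-task numbers below are made-up stand-ins rather than the data from my study, and 0.05 is the usual significance level.

```python
from scipy.stats import mannwhitneyu

# Invented time-on-task samples (in seconds); design_2 tends to be faster
design_1 = [52, 61, 48, 70, 55, 66, 58, 73, 60, 64]
design_2 = [41, 45, 38, 52, 44, 49, 40, 55, 43, 47]

stat, p_value = mannwhitneyu(design_1, design_2, alternative="two-sided")

alpha = 0.05  # significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0, the designs differ in time on task")
else:
    print(f"p = {p_value:.4f} >= {alpha}: not enough evidence to reject H0")
```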

Does your data identify as categorical?

Another consideration to keep in mind is whether the data you’re dealing with is categorical. Think conversion rates: did a website visitor make a purchase, or did they leave without buying anything? To dig into this, our go-to statistical test is the chi-square test. This test is like our very own detective, helping us to uncover if there’s a significant link between different categories.

But, before we jump into using the chi-square test, there are a couple of things to keep in mind. Firstly, you need a decent group size, usually about 50 or so, for this test to work well. Secondly, you want at least five expected observations in each cell of a little thing we call a ‘contingency table’. Don’t worry, we’ll dive into what this table looks like in just a bit.

Now, what happens if these conditions are a bit out of reach? Well, there’s another approach known as Fisher’s exact test. It’s a handy alternative that steps up when the chi-square test isn’t the best fit. So, keep these points in mind when you’re working with categorical data!

Example for categorical data

Let’s imagine you’re testing a new mobile app, with data pouring in about the success or failure of a task. This data revolves around two main factors: how tech-savvy the users are (high or low digital proficiency), and what type of device they’re using (iOS or Android). The burning question in our minds is: do these factors interact significantly with each other?

Contingency table of digital proficiency for iOS and Android users

Welcome to the world of the chi-square test, where we scrutinize the association between rows and columns in something called a ‘contingency table’.

So, what exactly is this contingency table? Well, think of it as a grid where you tally the number of people falling into each combination of our two factors: digital proficiency and device type. For instance, how many high proficiency users are on iOS, how many on Android, and so on.

To work out the chi-square value, we calculate ‘expected values’, which are the values we would expect to see if there was no association between our two factors. We get these expected values from a bit of multiplication and division involving the row and column totals. For example, the expected value for high proficiency iOS users would be (232 * 208) / 394 = 122.5.

In our imaginary test, we end up with a chi-square value of 0.258. This value suggests that the link between the variables is pretty weak or non-existent. The p-value, or the probability that we’d see these results if there were no real association, comes out as 0.6114. Because this is more than 0.05, we say that the association is not statistically significant. So, it seems like it doesn’t matter whether our users are on iOS or Android, or how tech-savvy they are — these factors don’t seem to interact significantly.
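If you’d like to run this kind of check yourself, here’s a minimal SciPy sketch. The cell counts below are my own reconstruction based on the totals quoted above (232 high-proficiency users, 208 iOS users, 394 participants overall), so treat them as illustrative rather than the raw study data.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Contingency table: rows = digital proficiency, columns = device
#                     iOS  Android
table = [[120, 112],  # high digital proficiency
         [88,  74]]   # low digital proficiency

# correction=False gives the plain Pearson chi-square (no Yates correction)
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

print(f"chi-square = {chi2:.3f}")      # ~0.258
print(f"p-value    = {p_value:.4f}")   # ~0.6114, i.e. not significant
print("expected counts:\n", expected.round(1))

# If any expected count had dropped below 5, Fisher's exact test
# would be the usual fallback for a 2x2 table
if (expected < 5).any():
    _, p_fisher = fisher_exact(table)
    print(f"Fisher's exact p-value = {p_fisher:.4f}")
```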

Let’s bring this closer to home. Suppose you want to know if there’s an association between whether your customers are with ING or not, and whether they prefer design A or B of your app. You can use the same chi-square test to uncover any underlying associations!

Where can you find these tests?

You can find information about these tests readily available on the internet, and many resources offer them free of charge:

➡️ Omni Calculator
➡️ Statistics Kingdom
➡️ Social Science Statistics
➡️ Good Calculators
➡️ Graph Pad
➡️ G*Power — free tool to compute statistical power

Heroes of the day: the Likert Scale and Semantic Differential

We use opinion scales quite often in our usability tests, but sometimes we don’t know what to do with them. Platforms like SurveyMonkey offer nifty ways to summarize feedback scales, like the popular “top 2 box” score, where you add up the top two response options, such as “extremely likely” and “very likely.” But let’s delve deeper into another intriguing technique — the “weighted approach.”

Before we jump in, let’s pause for a word about ‘statistical significance.’ It’s the backbone of any data analysis, helping us figure out if the patterns we see are legit or just chance happenings. It’s like a reality check for our results!

Let’s illustrate with an example. Imagine we’ve asked people testing two separate designs, A and B, an SEQ (Single Ease Question). Since the SEQ score represents categorical data, our trusty chi-square test will help us determine if our findings are significant. But we’re not dealing with a typical contingency table here, nor are we looking for associations. Instead, we’re going to use a different type of chi-square test called the “goodness of fit.”

The “goodness of fit” test is our yardstick for comparing what we thought would happen (expected data) versus what actually happened (observed data). There’s a key player in this test: the test statistic. This single number gives us a quick snapshot of our test results.

Table of observed and expected results for SEQ for two designs

For design A, our test statistic (χ² or chi-square if you like) comes out as 10.2. This number is a big deal because it helps us compare our data with our expectations.

How do we make this comparison? We use something called the “region of acceptance,” a zone of values where the null hypothesis seems reasonable. For design A, this region stretches from zero up to 9.4877, our critical value. You can find critical values with statistical software, online calculators, or a chi-square distribution table.

If the test statistic lands within the region of acceptance, our data lines up well with our expectations, hinting that our null hypothesis might be right. In this case, though, our test statistic (10.2) falls outside the region of acceptance, suggesting our observed data doesn’t quite match what we expected. Which is great!

On the other hand, for design B, our chi-square is 2.4, which does fall within the region of acceptance. In simpler terms, our test participants seem neutral on whether the website is easy to use or not.
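Here’s a minimal SciPy sketch of that goodness-of-fit check for design A. The observed counts are the ones used in the weighted-average calculation below, and the expected counts assume an even split of the 50 responses across the five options, which is what reproduces the chi-square of 10.2 quoted above.

```python
from scipy.stats import chisquare, chi2

# Observed SEQ responses for design A (strongly disagree ... strongly agree)
observed = [16, 14, 5, 5, 10]        # 50 responses in total
expected = [10, 10, 10, 10, 10]      # assuming an even split across the options

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# Critical value for alpha = 0.05 with df = 5 - 1 = 4
critical = chi2.ppf(0.95, df=4)

print(f"chi-square     = {stat:.1f}")       # 10.2
print(f"critical value = {critical:.4f}")   # 9.4877
print("reject H0" if stat > critical else "fail to reject H0")
```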

Let’s spice things up with the weighted method, assigning numerical values to each response option. It’s like giving each response a score, allowing us to calculate a weighted average. Since design B’s results weren’t statistically significant, let’s focus on design A. Assigning a weight of 1 for “strongly disagree” and up to 5 for “strongly agree”, we get:

((16 * 1) + (14 * 2) + (5 * 3) + (5 * 4) + (10 * 5)) / 50 = 2.58

Our weighted average lands near the “neither agree nor disagree” option. This technique gives us a more nuanced snapshot of the overall sentiment, considering both the number of responses and the intensity of opinions. It’s a rich and precise way to distill the feedback from our usability tests!
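For completeness, the weighted average is a one-liner once you have the counts. A quick sketch with NumPy, using the design A numbers from above:

```python
import numpy as np

scores = [1, 2, 3, 4, 5]       # strongly disagree ... strongly agree
counts = [16, 14, 5, 5, 10]    # design A responses (50 in total)

weighted_average = np.average(scores, weights=counts)
print(round(weighted_average, 2))   # 2.58
```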

Sample misconception

Sample size — how big, how small? It’s a hot topic, like choosing the right cake size for a birthday party. Some folks say, “Let the specific test dictate the size!” Others insist, “Five participants should do the trick.” Then, there are those who follow a rigid sample size for each test as if it were a secret recipe.

But, while everyone’s arguing, they’re overlooking one important ingredient: representativeness. It doesn’t matter if you have 5, 50, or 1000 participants — if they don’t represent your target audience, your test is like baking a chocolate cake for someone allergic to chocolate. As Jeff Sauro emphasizes in his book, “Quantifying the User Experience,” when it comes to improving snowshoes, insights from 5 Arctic explorers beat those from 1000 surfers, hands down. So, the lesson is clear — never, ever lose sight of representativeness when picking your sample size.

To give you a more specific example, imagine ING wants to spruce up its online home loan application process. They run a usability test to ensure the platform caters to all potential customers. Who they pick for their test becomes a big deal.

What if they only use a small group of bank employees, insiders who understand banking lingo and processes? They’d be relying on what we call ‘proxy users.’ And these folks might not have the same experience as an average Joe navigating the system.

To capture the full picture, the bank needs to rope in a mix of potential home loan applicants from across Australia — people of different ages, incomes, education levels, geographic locations, and both first-time and repeat buyers. This could be achieved through random selection or targeted recruiting to ensure all groups get a fair representation.

Sampling isn’t just about grabbing a bunch of participants — it’s about getting the right participants.

We could talk all day about sampling representativeness, but let’s save that for another article. If you’re keen to learn more, you can check this out: https://measuringu.com/sampling-s/

Now that we’ve selected the right participants, let’s ponder the size of the sample. The battle between statistically-determined and rigid approaches rages on. The decision often boils down to the context, resources, and objectives of the usability test.

Opting for a fixed number, like “30 people,” can be a practical workaround when resources are tight. Whether due to budget, time, or logistics, large-scale usability tests aren’t always doable.

This convention has been widely accepted in usability testing, largely due to its practicality and tradition.

However, for high-stakes projects, statistical methods usually take center stage, providing more reliable results. And as this article is all about how statistics shape usability testing, let’s dig a little bit deeper.

I’ll aim to keep the explanation as straightforward as possible, considering there’s a wide array of methods to determine sample size. Please note that the examples provided here are not exhaustive and merely represent some of the many approaches you could take.

Firstly we need to understand the distinction between types of usability evaluation. Experts often differentiate between formative and summative evaluation, but we’ll keep it simple. Before you decide your sample size, you need to identify the purpose of your usability evaluation:

Finding usability issues or user requirements

You want to identify problems users might encounter, and determine which elements in the interface are the culprits. It’s all about finding and fixing the hurdles users face.

Choosing a sample size for this usability evaluation is like planning a surprise party. Follow these steps for success:

  • Determine the minimum percentage of issues you want to spot (like deciding how many friends must attend the party).
  • Choose your desired confidence level for spotting these issues (80%, 85%, 90%, 95%, or 99%).
  • Use the binomial probability formula:

Sample Size = log(1 - Probability of Spotting It) / log(1 - Chance of It Popping Up)

  • For example, to catch problems affecting at least 5% of users with 95% confidence, you’d need around 59 participants, as the quick sketch below shows.
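Here’s a tiny Python sketch of that calculation, using the 5% problem frequency and 95% confidence from the example:

```python
import math

problem_frequency = 0.05   # the problem affects at least 5% of users
confidence = 0.95          # chance we want of spotting it at least once

n = math.log(1 - confidence) / math.log(1 - problem_frequency)
print(math.ceil(n))        # 59 participants
```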

Estimating parameters and benchmarking

You’re aiming to measure certain features, like the completion rate, average task completion time, or the perception of usability, and to see how the interface’s performance measures up against a predefined benchmark or standard.

Regardless of whether you’re estimating a proportion (binary data) or a mean (continuous data), the required sample size depends on the variability in the population, the desired level of confidence, and the desired margin of error.

When estimating a proportion, like a completion rate, you can use the following formula:

n = (Z² * P * (1-P)) / E²

When estimating a mean, like the average task completion time, you can use the following formula:

n = (Z² * σ²) / E²

  • n is the sample size.
  • Z is the z-score associated with your chosen confidence level.
  • P is the estimated proportion in the population. If you don’t know, you can often use 0.5.
  • E is the margin of error you’re willing to accept.
  • σ is the standard deviation of the metric in the population. If you don’t know it, you might be able to estimate it based on a pilot study or similar previous studies.

Examples

Suppose you are conducting usability testing for a new website feature, and you want to estimate the proportion of users who can complete a task without assistance. You want to be 95% confident in your estimate and have a margin of error of 5 percentage points.

Assuming that you don’t have a good estimate for the proportion who can complete the task (P), it’s often recommended to use 0.5 as a conservative estimate.

So we have:

  • Z = 1.96 (for 95% confidence level)
  • P = 0.5 (maximizing required sample size)
  • E = 0.05 (5 percentage points)

Plugging these into the formula:

n = (Z² * P * (1-P)) / E² = (1.96² * 0.5 * 0.5) / 0.05² = 384.16

We always round up in sample size calculations, because you can’t have a fraction of a person in your sample. So in this case, you would need a sample size of approximately 385 users.
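In code, the same calculation is a few lines (a minimal sketch mirroring the inputs above):

```python
import math

Z = 1.96   # z-score for a 95% confidence level
P = 0.5    # conservative estimate of the proportion
E = 0.05   # margin of error (5 percentage points)

n = (Z**2 * P * (1 - P)) / E**2
print(n, "->", math.ceil(n))   # 384.16 -> 385 users
```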

Now, let’s look at an example of estimating a mean, such as average task completion time:

Suppose you’ve run a pilot test of 10 users and found that the standard deviation of task completion times is around 20 seconds. You want to be 99% confident in your estimate, and you want your estimate to be within 5 seconds of the true mean.

So we have:

  • Z = 2.58 (for 99% confidence level)
  • σ = 20 (standard deviation)
  • E = 5 (5 seconds)

Plugging these into the formula:

n = (Z² * σ²) / E² = (2.58² * 20²) / 5² = 106.5024

Again, we round up, so you would need a sample size of approximately 107 users.
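And the matching sketch for the mean, using the pilot-study inputs above:

```python
import math

Z = 2.58     # z-score for a 99% confidence level
sigma = 20   # standard deviation from the pilot study (seconds)
E = 5        # margin of error (seconds)

n = (Z**2 * sigma**2) / E**2
print(n, "->", math.ceil(n))   # 106.5024 -> 107 users
```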

Making comparisons

You’re comparing two or more interfaces to find out which one comes out on top in terms of completion rates, task times, or satisfaction scores.

When it comes to drawing comparisons, consider these key elements:

  • The magnitude of the difference you’re aiming to spot.
  • The relationship between subjects — whether you’re looking within a single group over time (paired), or comparing two different groups (unpaired).
  • The power to find a difference, usually set at a robust 80%.

When it comes to comparison studies between and within groups, the sea of numbers can seem overwhelming. There’s a tool designed to simplify these processes for you. Instead of presenting you with complex computations, I’d suggest trying G*Power.

Below you can find how this tool calculates sample size for continuous data (for example, if you’re about to test time on task) for within- and between-subject groups.

For a within-subject group, you’ll need to select the options below. As a rule of thumb, our input parameters are a power of 0.8 and an alpha of 0.05; for effect size we use Cohen’s d, which ranges from 0.2 (small) to 0.8 (large).

Within subject sample size calculation for continuous data

For a between-subject group, you’ll need to select the options below. This example shows the same input parameters as the previous one, but these are strictly tied to your study, so your parameters can be different. The allocation ratio describes how you would like to split the sample size between the two groups.

Between subject sample size calculation for continuous data
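If you prefer to stay in code rather than a GUI, statsmodels can run the same kind of power analysis as G*Power. This is only a minimal sketch with assumed inputs (a medium effect size of Cohen’s d = 0.5, alpha = 0.05, power = 0.8, equal allocation); plug in the parameters that fit your own study.

```python
import math
from statsmodels.stats.power import TTestPower, TTestIndPower

effect_size = 0.5   # Cohen's d: assumed medium effect, adjust for your study
alpha = 0.05
power = 0.8

# Within-subject (paired) comparison
n_within = TTestPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)

# Between-subject comparison (two independent groups), equal allocation (ratio=1)
n_between = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha,
                                        power=power, ratio=1.0)

print(f"within-subject:  {math.ceil(n_within)} participants")
print(f"between-subject: {math.ceil(n_between)} participants per group")
```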

Try it yourself!

And so we wrap up, underlining the indisputable importance of statistics in usability testing. Just like the magic mirror in fairy tales, statistics provide us with an unbiased, reliable reflection of the real-world performance of our design, reinforcing the insightful stories we gather from user behaviors and feedback.

With a statistical lens, we can dive deeper into the data, unveiling the hidden gems of insight that guide our decision-making and fuel improvements. It’s like solving a fascinating puzzle, where statistics reveal the patterns, highlight the trends, and spotlight the critical differences in user behavior. This in-depth understanding allows us to pinpoint usability issues, prioritize enhancements, and gauge the effectiveness of our designs.

In the vibrant, user-focused world we inhabit today, harnessing the power of statistical analysis is key to ensuring our products are usable and successful. A blend of qualitative and quantitative methods leads to a balanced recipe that delivers designs that captivate and engage users, steering us towards our objectives. So, let’s celebrate the critical role of statistics in usability testing, which transforms our designs into experiences that truly spark joy and satisfaction in users!

--

Adrianna Modrzyńska
Design@ING

UX Designer at ING, blending creativity and strategic thinking to engineer captivating user experiences that redefine the digital banking landscape.