Embracing the Power of Statistics in Usability Testing: Insights from a UX Designer | Part 1

Adrianna Modrzyńska
Design@ING

--

As UX designers, our primary focus centres on creating delightful and intuitive experiences for users. At the heart of our craft lies an unwavering commitment to making informed decisions about our designs.

I have come to realise that statistics play a crucial role in understanding user behaviour and making data-driven design decisions. Now, you might be wondering, “But aren’t statistics the domain of statisticians?” True, I may not have a formal background in statistics, but through years of experience and a genuine curiosity, I have developed a profound appreciation for the role of statistics in usability testing.

In this blog post, I aim to shed light on why statistics matter in the realm of quantitative user research and how it can empower us as UX designers to create even better products.

When initially crafting this article, I must admit that my first attempt consisted of a hefty information overload regarding statistics. It included detailed explanations of which test to choose based on the type of experiment conducted and even delved into the nitty-gritty of calculating various statistical measures. However, upon reflection, I realised that such an approach could easily overwhelm readers. Instead, let’s take a step back and begin with the fundamentals.

What the heck are descriptives all about?

When venturing into the world of statistics, it’s important to start with something referred to as descriptives. They play a crucial role in summarising and describing our data.

Even if statistics isn’t your cup of tea, chances are you’re acquainted with the notion of the mean. You might have even found yourself calculating it a couple of times while carrying out usability tests with SEQ. Essentially, the mean boils down to being nothing more than an arithmetic average.

The concept of the mean is classified under a broader term known as central tendency, and if I were to ask you about your understanding of central tendency, you would likely hit the mark. It refers to the value positioned at the centre of our dataset. Central tendency includes other measures such as the median and mode. However, when it comes to usability testing, these measures are rarely used or almost entirely disregarded.

How the mean is calculated

But what we do use quite often to describe our datasets are variance and standard deviation (SD).

Allow me to provide a brief overview of the latter, as it tends to be favoured over variance because it has the same unit of measurement as the dataset you’re assessing. This characteristic makes it more intuitive and easier to relate to the original data.

Standard deviation measures how spread out a set of data points is. It tells us how much the individual data points differ from the mean. Think of it this way: Imagine you have a group of numbers, like 5, 7, 8, 10, and 12. The mean of these numbers is 8.4. The standard deviation would tell you how much these numbers tend to differ from that mean.

In the context of usability testing, take task completion time as an example. A high standard deviation in task time indicates greater variability among participants in completing the task. This suggests that some participants may have found the task more challenging or encountered difficulties, resulting in longer completion times. It could point to potential usability issues or inconsistencies in task design or instructions.

Example 1. Calculating standard deviation

Below is a dataset capturing the task completion time of 10 users. Although quantitative research typically involves larger participant numbers, for the purpose of this example, I have used a smaller sample size.

49.2s, 34.5s, 50.3s, 60.2s, 45.8s, 43.9s, 51.1s, 44.0s, 52.6s, 39.9s

The mean for this dataset equals 47.15s. Now we need to subtract the mean from each number and square the result. For the first data point, it looks like this: (49.2 − 47.15)² ≈ 4.20.

Now, calculate the mean of these squared results. Their sum is 464.63. Given the sample size of 10, dividing this sum by that number gives us 46.46. Finally, taking the square root of this value gives us the standard deviation.

Standard deviation = √46.46 ≈ 6.82.

Based on that, you can conclude that users typically take between roughly 40.3 and 54.0 seconds to complete the task (the mean plus or minus one standard deviation).
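If you’d rather let code do the arithmetic, here is a minimal Python sketch that mirrors the steps above (it uses the population formula, dividing by n, just as we did in the example):

```python
import statistics

# Task completion times (in seconds) for the 10 participants from Example 1
task_times = [49.2, 34.5, 50.3, 60.2, 45.8, 43.9, 51.1, 44.0, 52.6, 39.9]

mean_time = statistics.mean(task_times)   # arithmetic average
sd_time = statistics.pstdev(task_times)   # population SD (divides by n, as in the example)

print(f"Mean: {mean_time:.2f}s")                  # ≈ 47.15s
print(f"Standard deviation: {sd_time:.2f}s")      # ≈ 6.82s
print(f"Typical range: {mean_time - sd_time:.1f}s – {mean_time + sd_time:.1f}s")
```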

But wait, there is more…

The descriptives I mentioned earlier can reveal hidden patterns and trends and help identify relationships within the data. They can also help you understand how your data is distributed.

Among the various distributions, the most common is the normal distribution. At this point, I need to touch on the topic of data types. Although we will explore this subject in greater detail later, let me briefly introduce numerical and categorical data.

When analysing these data types, we encounter parametric and non-parametric data analysis approaches. Parametric data refers to datasets that exhibit a normal distribution, characterised by a histogram with a bell-shaped curve. Non-parametric data, on the contrary, lacks this characteristic bell-shaped histogram and does not conform to a normal distribution.

The “Bell Shaped” normal distribution

Assessing whether your dataset follows a normal distribution can be as simple as taking a visual look at it. However, if you’re unsure, you can always employ a normality test for a more definitive answer. Ideally, having a dataset that follows a normal distribution makes data analysis much easier. But the truth is that usability data doesn’t always follow a normal distribution.
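If you prefer the visual route, a quick histogram is often enough to spot an obviously skewed dataset. A minimal sketch with matplotlib, reusing the task times from Example 1:

```python
import matplotlib.pyplot as plt

# Task completion times (in seconds) from Example 1
task_times = [49.2, 34.5, 50.3, 60.2, 45.8, 43.9, 51.1, 44.0, 52.6, 39.9]

# With only 10 data points a few bins are enough; larger samples reveal the shape better
plt.hist(task_times, bins=5, edgecolor="black")
plt.xlabel("Task completion time (s)")
plt.ylabel("Number of participants")
plt.title("Does the task time look roughly bell-shaped?")
plt.show()
```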

Example 2. Normality test

Getting back to our task time data, we can employ the Shapiro-Wilk test [1], which is a standard test of normality.

In this specific test, our focus is on the hypothesis that the data adheres to a normal distribution; the test may also lead us to reject that hypothesis. To decide, we need to establish a measure that tells us when to accept or reject the hypothesis. This measure is the significance level, usually set at 0.05, or 5%.

Any outcome above the significance level means we cannot reject our hypothesis, while any result below it leads us to reject the hypothesis of a normal distribution.

The Shapiro-Wilk test gives us a probability value, also known as the p-value. This value measures the evidence against our hypothesis of a normal distribution: the greater the p-value, the harder it is to reject the hypothesis.

For our dataset the p-value equals 0.9966, which gives us no reason to reject the assumption that our task times are normally distributed.
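If you want to run the same check yourself, here is a minimal sketch using SciPy’s implementation of the Shapiro-Wilk test on the task times from Example 1; with this data the p-value should come out well above the 0.05 significance level:

```python
from scipy import stats

# Task completion times (in seconds) from Example 1
task_times = [49.2, 34.5, 50.3, 60.2, 45.8, 43.9, 51.1, 44.0, 52.6, 39.9]

statistic, p_value = stats.shapiro(task_times)

alpha = 0.05  # significance level
print(f"W statistic: {statistic:.4f}, p-value: {p_value:.4f}")

if p_value > alpha:
    print("No reason to reject normality – parametric tests are an option.")
else:
    print("Normality rejected – consider non-parametric tests.")
```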

The reason we check whether our data follows a normal distribution is that the choice of statistical test depends on your data type and its distribution. By understanding whether your data is parametric (follows a normal distribution) or non-parametric, you can select the appropriate statistical test that aligns with the nature of your dataset. This consideration ensures accurate and meaningful analysis.

Getting acquainted with data types & scales of measurement [2]

Now you understand statistical descriptives, the normal distribution, and how to include them in your usability testing data analysis. As you continue beyond this paragraph, I assure you that you’ll feel empowered with newfound knowledge to use in your professional endeavours.

As I mentioned before, data types can be broadly classified into two main categories: numerical and categorical. Numerical data further divides into discrete and continuous, which can in turn be measured on ratio or interval scales, providing a quantitative framework for analysis.

On the other hand, categorical data encompasses nominal and ordinal subcategories, capturing qualitative information that adds depth to your findings.

In this article, what we’re really going to explore are the properties that help determine how data should be analysed, which we refer to as scales of measurement. However, for the sake of simplicity, we can use the terms “scales” and “data types” interchangeably.

Data types / Scales of measurement

Numerical data, in general, is continuous, meaning that you can choose any two arbitrary numbers and always insert another number between them. The most common type of numerical data is ratio data, typically observed in natural measurements like lengths, heights, widths, or time-related metrics such as task duration or transaction value.

In interval data, the intervals between data points are equal, but there is no true zero. To get a grasp of this type of data, think of temperature measured in degrees Celsius or Fahrenheit. But don’t stress out; we won’t encounter interval data frequently in usability testing.

Another type of data is categorical. It divides into two subcategories: nominal and ordinal. Nominal categories are arbitrary and can be labelled or numbered in any way, with the labels themselves carrying no meaning. For example, we could assign the label 1 to Android and 2 to iOS for user platforms. The numbers hold no quantitative relationship; their purpose is solely to distinguish between different categories.

We also have ordinal categories, which follow a specific ordering, often represented by numbers. Examples of ordinal values include the days of the week or months of the year. December naturally follows November, and we can assign the months numbers from 1 to 12.

During usability testing, a prime illustration of ordinal data is the Likert scale. This scale allows participants to provide their responses by assigning a numerical value to options such as “strongly agree” (scored as 7) or “strongly disagree” (scored as 1).
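To make the distinction concrete, here is a small, hypothetical sketch in pandas that treats Likert responses as ordered categories rather than plain numbers (the response values are invented purely for illustration):

```python
import pandas as pd

# Hypothetical 7-point Likert responses from a usability questionnaire
# (1 = strongly disagree, 7 = strongly agree)
responses = pd.Series([7, 5, 6, 4, 7, 6, 5])

# Treat the responses as ordered categories, reflecting their ordinal nature
likert_dtype = pd.CategoricalDtype(categories=list(range(1, 8)), ordered=True)
likert = responses.astype(likert_dtype)

print(likert.value_counts().sort_index())      # how many participants chose each option
print("Median response:", responses.median())  # the median respects the ordinal ordering
```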

By grasping the above concepts of data types, you’ll be able to choose the correct statistical test, which will ultimately help you analyse your datasets and make informed, data-driven decisions about your designs.

In this article, we’ve explored the basics of statistics within the context of usability testing. We touched upon fundamental concepts such as descriptives and the normal distribution. We also discussed the importance of understanding data types and their scales of measurement. While this article provided a solid foundation, there’s still more to explore!

In the upcoming second part, we’ll dive deeper into the topic. We’ll unravel the intricacies of selecting the appropriate statistical test, examine real-world examples, and delve into the nuances of sampling. So, stay tuned for an insightful continuation where we’ll uncover the practical aspects of statistical analysis in usability testing.

--

Adrianna Modrzyńska
Design@ING

UX Designer at ING, blending creativity and strategic thinking to engineer captivating user experiences that redefine the digital banking landscape.