Are you using science in your user testing? Here’s how you can start

Abigail Rumsey
Waitrose & Partners Digital
6 min read · Apr 4, 2022


Photo by Tsvetoslav Hristov on Unsplash

As user researchers, we sometimes rely on ‘quick and dirty’ methodologies that don’t leave time for a scientific analysis; we’re just looking for a ‘good enough’ answer. However, I want to show that it doesn’t always take much additional effort to get an answer that is more scientifically robust. In this article I’ll show how I used a few spreadsheet formulas to analyse the task completion times from two first click tests done in Usability Hub. Let’s head into the world of quantitative user experience research (or Quant UXR if you’re in a hurry) and see if we can be quick, dirty AND extra nerdy!

Statistics in research is all about probability: how likely it is, or isn’t, that something happened completely at random. As researchers it is helpful, and sometimes critical, to be able to say that our results are not just a coincidence but highly likely to mean something. You might have done first click tests before and said something like “Users took, on average, 7 seconds longer to complete the task using Design B”. However, this doesn’t tell you whether it was just down to chance that a few users did the task quicker with Design A. A statistical test will tell you how likely it is that the difference between the average times is real and not just down to chance.

First click tests are an easy place to dive into statistical methods because they are quick to run and you don’t have to do much processing to get the numbers you need. I often use a first click test to compare how long it takes users to perform a task across two (or more) designs. I like to use Usability Hub to run first click tests because it makes it super easy to recruit participants and you can export the results with one click.

The type of statistical test that you need to do to compare two separate first click tests is a two-sample t-test. This assumes that you have used different participants to test the two different designs. If it was the same participants testing the two designs you would use a paired t-test.

Heads up: further down the page it may start to look complicated, but don’t worry, it mostly comes down to a couple of formulas that you stick in a spreadsheet, and I’ve shared mine so you can copy them.

How to analyse your first click test results

First things first, you need the raw data. On Usability Hub that means clicking the ‘Export X results to CSV’ button. Open the CSV file in a spreadsheet and you will see that you have a column showing the task duration for each participant in milliseconds.

Just a note that you only want to measure the participants who completed the task correctly, i.e. clicked in the right place. If you used Usability Hub and ran a navigation test where you created hotspots to show where the right place to click was, the spreadsheet should have ‘hit’ columns that say TRUE for a correct click. If you ran a first click test, you have to do some extra work to find out which participants clicked in the right place. Hopefully Usability Hub will fix this gaping hole in the data some time soon. In the meantime, a helpful guy called Shane has written all about how to check whether a click was in the right place or not.
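If you’re working in Google Sheets, a FILTER formula can then pull out just the durations for the correct clicks in one go. Assuming, purely for illustration, that the durations sit in column D and the hit column in column E, something like this would do it:

=FILTER(D2:D40, E2:E40=TRUE)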

Back to our statistical analysis… Grab the task duration data for the correct participants from each test and stick it in another sheet. Keep it in milliseconds for now. You are going to be comparing the average time taken to complete the task between the two designs, so you will need to work out the average or, to be scientific about it, the mean. The top brains of Quant UXR, Sauro and Lewis, recommend taking something called the geometric mean because time data is skewed (basically, humans will never complete a task in the fastest time possible, and some can be extremely slow). I’m sure there’s a formula we can use to calculate the geometric mean, but let’s just use the MeasuringU calculator. Paste in your task duration data for each test separately and it will tell you the geometric mean: https://measuringu.com/calculators/time_intervals/
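As it happens, both Google Sheets and Excel do have a built-in GEOMEAN function. If you’d rather stay in your spreadsheet, and assuming your task durations sit in a range like B2:B18, the geometric mean is simply:

=GEOMEAN(B2:B18)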

You should now have two geometric means. Count the number of correct participants for each test. Now you can start doing proper statistics!
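One shortcut for the counting: a COUNT formula over each duration range does it for you, e.g. assuming the same layout as above:

=COUNT(B2:B18)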

Doing the calculations

The formula for a two-sample t-test, i.e. finding out whether there is a significant difference between two averages (where there are two separate groups of participants), is this:

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where x̄₁ and x̄₂ are the means from samples 1 and 2, s₁ and s₂ are the standard deviations from samples 1 and 2, n₁ and n₂ are the sample sizes from samples 1 and 2, and t is the test statistic.
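To make that concrete with some made-up numbers: if Design A had a mean of 12 seconds, a standard deviation of 4 and 17 participants, and Design B had a mean of 15 seconds, a standard deviation of 5 and 13 participants, then t = (12 − 15) / √(4²/17 + 5²/13) ≈ −3 / 1.69 ≈ −1.77 (a negative t just means the first group was faster).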

This formula just needs to be translated into a spreadsheet formula, which will provide your t statistic. My formula in Google Sheets looked like this:

=(B20-C20)/SQRT((POWER(STDEV(B2:B18), 2)/B22)+(POWER(STDEV(C2:C14), 2)/C22))
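To decode the cell references (adjust the ranges to match your own sheet): B2:B18 and C2:C14 hold the raw task durations for the two tests, B20 and C20 hold the two geometric means, and B22 and C22 hold the two participant counts.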

And here’s a Google Sheet that shows it in action so that you can replicate it:

https://docs.google.com/spreadsheets/d/19YEWQKDlG41wlFW53qAoR9hNi2cGiKuYB4Z51krltL4/edit?usp=sharing

Once you’ve got your t statistic you need a p value (I know, this is too many letters now). The p value is the most important part as it gives you the probability that the difference between your two tests was due to chance. You want this to be as low as possible: usually below 0.10, or below 0.05 if you want to be more sure. A p value below 0.10 means that there is only a 10% likelihood that the difference between the tests is due to chance, so conversely there is a 90% likelihood that the difference is not due to chance.

Luckily, there’s a spreadsheet function (TDIST) that you can use to calculate the p value, which is the same in both Google Sheets and Excel. TDIST requires the t statistic, degrees of freedom (the simple version is to add the number of participants in both tests together and take away 2) and tails (for this type of test this is 2).
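For example, with 17 participants on one test and 13 on the other, that would be 17 + 13 − 2 = 28 degrees of freedom.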

This is what it looked like in my sheet:

=TDIST(B26,(B22+C22-2),2)
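One caveat: TDIST (certainly in Excel, and as far as I know in Google Sheets too) expects a non-negative t statistic, so if your subtraction order produced a negative t, wrap it in ABS:

=TDIST(ABS(B26),(B22+C22-2),2)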

Reporting the results

In my example I got a p value of 0.15, which meant there was a 15% chance that the difference between my tests was due to chance. In other words, I could only be 85% sure that it was quicker to complete the task using the new design. This is less than the 90% certainty I was aiming for. If I wanted to be more certain I would need to test with more people and run the t-test again.

So, how do we report this?

In my report, I said:

“A two-sample t-test was used to test whether the difference in time to complete the task between the two designs was statistically significant.

The t-test found that we can be about 85% sure that it is quicker to complete the task using Version A. This isn’t a significant difference between Version A and the control because statistical tests usually require at least 90% confidence.

Recommended next step: A/B test the two designs on the live site to gather further data on ease of completing this task.”

Have a go at following this method the next time you are comparing first click tests. It may take a while the first time you work through it, but soon you’ll be doing stats like a pro! Let me know how you get on in the comments.

If you are interested in learning more about how you can use statistics to more accurately report on quantitative user data, I highly recommend the book Quantifying the User Experience by Jeff Sauro and James R. Lewis.
