Task Recency and the System Usability Scale (SUS)

David Weintraub
6 min read · Dec 22, 2022


The System Usability Scale (SUS)

One area of interest for many user researchers is usability. Jakob Nielsen (2012) defines usability as a system's ease of use and pleasantness. While usability describes an attribute of a system, it is measured as a human experience. As crucial as this construct is to user researchers, measuring it is not straightforward. One well-known usability measurement is the System Usability Scale (SUS) (Brooke, 1996). Research participants complete this 10-item questionnaire after interacting with a system, rating items such as “I thought the system was easy to use” and “I felt very confident using the system” on a 5-point Likert scale. The responses are converted into a single score, or usability measurement, between 0 and 100 (a minimal sketch of the scoring arithmetic follows the list below). The SUS has become an industry standard for good reasons.

  1. The SUS produces an intuitive, single score. Although usability is complex to measure, it is an intuitive concept, so a single usability score is easy to grasp: systems with high SUS scores should be highly usable.
  2. Since the SUS has become an industry standard, a wealth of prior research helps practitioners interpret SUS scores. From that research, we know that a score of 68 is average and a score above 80 is excellent (Sauro, 2011).
  3. The SUS has established construct validity. Participants who complete usability tasks within a system without errors are more likely to give that system a higher SUS score (Peres, Pham, & Phillips, 2013; Sauro, 2012).
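
For readers who want to see the score conversion concretely, here is a minimal Python sketch of Brooke's standard scoring arithmetic (the function name is mine):

```python
def sus_score(responses):
    """Convert ten SUS item responses (each 1-5) into a 0-100 score.

    Per Brooke (1996): odd-numbered items are positively worded and
    contribute (response - 1); even-numbered items are negatively worded
    and contribute (5 - response). The sum (0-40) is multiplied by 2.5.
    """
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("expected 10 responses, each between 1 and 5")
    raw = sum((r - 1) if i % 2 == 0 else (5 - r)  # index 0 is item 1 (odd)
              for i, r in enumerate(responses))
    return raw * 2.5


print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # 85.0
```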

Task Recency and the SUS

This article examines whether SUS scores are more strongly associated with the completion of later tasks than of earlier tasks during usability testing. Such a finding would suggest that SUS scores are disproportionately predicted by the difficulty of the final task(s) in a usability test.

Limitations of the SUS

This article hypothesizes that SUS scores are more strongly associated with the completion rates of later tasks than of earlier tasks during usability testing. This hypothesis stems from what I perceive as two limitations of the SUS: (1) it is entirely subjective, and (2) it is completed once, as a post-test measurement.

  1. The SUS is entirely subjective. Responses are based on participants’ thoughts, feelings, and impressions. Whether this is truly a limitation is debatable: since usability is a subjective experience, it could be argued that a subjective measurement is a direct measurement. If you’re interested in subjective experience, ask about one’s subjective experience. As you can imagine, however, measuring subjective experience this way is not without concern. For example, survey respondents given subjective rating scales are known to be biased toward rating things positively (i.e., “acquiescence bias”). I would argue that the subjective nature of the SUS leaves it susceptible to cognitive factors beyond a system’s usability, such as memory recall after testing.
  2. The SUS is completed once, as a post-test measurement. Participants complete the SUS after finishing several usability tasks. After all, since the SUS is meant to describe the usability of a system as a whole, participants should be allowed to interact with the system comprehensively before rating it. As a result, SUS scores are likely biased toward participants’ most recent experiences with the system (i.e., the later tasks, nearer to the administration of the questionnaire). Human memory is limited and susceptible to retroactive interference, and it seems reasonable that these qualities would shape how participants respond to the SUS.

Examining Historic Usability Testing Data

The association between task completion and SUS scores was analyzed using historical data collected by Q2’s Product Research team. The dataset included 133 participants across 25 qualitative, moderated usability tests conducted between 2021 and 2022, with an average of 5.2 tasks per test. Each task a participant attempted was categorized as a success or a failure: a success when the participant completed the task without assistance from the research moderator, and a failure when the participant could not complete it without the moderator’s aid.
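
The underlying dataset is proprietary, so for illustration the sketches below assume a simple long-format layout with one row per participant-task pair. All column names here are hypothetical, not Q2's actual schema:

```python
import pandas as pd

# Hypothetical long-format layout: one row per participant-task pair.
# The participant's single post-test SUS score is repeated on each row.
df = pd.DataFrame({
    "participant_id": [1, 1, 1, 2, 2, 2],
    "test_id":        ["A", "A", "A", "A", "A", "A"],
    "task_index":     [1, 2, 3, 1, 2, 3],  # 1 = first task in the test
    "completed":      [False, True, False, True, True, True],  # unassisted success
    "sus_score":      [55.0, 55.0, 55.0, 90.0, 90.0, 90.0],
})
```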

Average SUS scores were calculated based on whether participants [1] completed the first task, [2] failed the first task, [3] completed the final task, or [4] failed the final task of their test. Only the first and final tasks were compared because the total number of tasks differed across tests. The average SUS score from participants who failed the first task is 3.9 points lower than that of participants who completed it. In contrast, the average SUS score from participants who failed the final task is 9.6 points lower than that of participants who completed it. The gap in SUS scores between failed and completed tasks is nearly 2.5x greater for final tasks than for first tasks. This analysis suggests that SUS scores are more strongly associated with the completion rates of later tasks than of earlier tasks during usability testing.
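
Under the hypothetical layout above, the first/final comparison might look something like this sketch:

```python
# Flag each participant's first and final task within a test.
grp = df.groupby(["test_id", "participant_id"])["task_index"]
df["is_first"] = df["task_index"] == grp.transform("min")
df["is_final"] = df["task_index"] == grp.transform("max")

# Mean SUS by task outcome, separately for first and final tasks.
first = df[df["is_first"]].groupby("completed")["sus_score"].mean()
final = df[df["is_final"]].groupby("completed")["sus_score"].mean()

print(first[True] - first[False])  # SUS gap on first tasks
print(final[True] - final[False])  # SUS gap on final tasks
```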

A second analysis was conducted to further explore the association between task completion and SUS scores. Average SUS scores were again calculated separately for completed and failed tasks; however, rather than comparing only the first and final tasks, this analysis considered every task position. Weighted averages were calculated at the following positions, counting backward from the final task (task 0): [1] 6 to 8 tasks from the final task, [2] 3 to 5 tasks from the final task, and [3] 0 to 2 tasks from the final task. Weighting was used because data become sparse at positions far from the final task (e.g., the position eight tasks from the final task had only six data points).
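
Here is a sketch of that binning, again using the hypothetical layout from above. Pooling rows within a bucket weights each task position by its data count, which is one reasonable reading of the weighted averages described here:

```python
# Distance from the final task (0 = the final task itself).
last = df.groupby(["test_id", "participant_id"])["task_index"].transform("max")
df["from_final"] = last - df["task_index"]

# The article's three position buckets: 0-2, 3-5, and 6-8 tasks from final.
df["bucket"] = pd.cut(df["from_final"], bins=[-1, 2, 5, 8],
                      labels=["0-2", "3-5", "6-8"])

# Mean SUS per bucket and outcome; pooling rows within a bucket weights
# each position by its count, so sparse positions do not dominate.
print(df.groupby(["bucket", "completed"], observed=True)["sus_score"]
        .agg(["mean", "size"]))
```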

As in the previous analysis comparing first and final tasks, the association between task completion and average SUS scores is strongest for tasks toward the end of usability testing, i.e., 0 to 2 tasks from the final task (see Figure 1). Within that range, the average SUS score from failed tasks is 11.6 points lower than the average SUS score from completed tasks. The difference is only 3.3 points for tasks 6 to 8 from the final task and 1.4 points for tasks 3 to 5 from the final task. This analysis further supports the argument that final tasks, and tasks near the final task, disproportionately predict SUS scores.

Conclusion

Based on the results of these analyses, SUS scores are more strongly associated with the completion rates of later tasks than of earlier tasks during usability testing. These results are noteworthy because they suggest SUS scores are disproportionately predicted by the difficulty of the final task(s) during usability testing. It should be acknowledged that these results are correlational; an experimental design is warranted before making a strong causal claim. I attribute this pattern to the two limitations described above: the SUS is entirely subjective, and it is completed once, as a post-test measurement. Alternative measures should combine behavioral and subjective measurements collected throughout testing. I would also like to reiterate that the SUS has significant strengths, as described at the beginning of this article. So, rather than reading this article as an argument against collecting SUS scores, I would suggest using these findings to inform your interpretation of SUS scores.

I will end this article by turning to you, the reader. I am curious whether you can replicate similar results using your usability testing data. Please reach out to me either way!

References

Brooke, J. (1996). SUS: A quick and dirty usability scale. In P.W. Jordan, B. Thomas, B.A. Weerdmeester & I.L. McClelland (Eds.), Usability Evaluation in Industry (pp. 189–194). London: Taylor & Francis.

Nielsen, J. (2012). Usability 101: Introduction to Usability. Nielsen Norman Group. https://www.nngroup.com/articles/usability-101-introduction-to-usability/

Peres, S.C., Pham, T., & Phillips, R. (2013). Validation of the System Usability Scale (SUS): SUS in the Wild. Proceedings of the Human Factors and Ergonomics Society 57th Annual Meeting.

Sauro, J. (2011). Measuring usability with the System Usability Scale (SUS). MeasuringU. https://measuringu.com/sus/

Sauro, J. (2012). Predicting task completion within the System Usability Scale. MeasuringU. https://measuringu.com/task-comp-sus/
