Learning overview

Combined metrics to compare across different designs or users

Joline
5 min read · Nov 19, 2023

Various metrics can be tied to different goals, including performance, efficiency, learnability, engagement, awareness, and conversions. Once we’ve gathered user feedback for a deliverable, the challenge lies in encapsulating these results into a holistic metric that offers a quick, preliminary understanding. The process involves combining each participant’s results into a holistic score and then comparing the observed results across different designs or products, or against a target or a competitor’s performance. If we’ve collected several metrics for 20 participants, and we have two sets of such data to compare, how can we produce comprehensive metrics that show which design works better, which area is most problematic, and which factor has (or does not have) an effect on an aspect of the experience? In this article, I want to summarize some tips I learned from Measuring the User Experience.

Combining Metrics Based on Target Goals

Data points should be aligned with predefined target goals. For a study, we might have primary, secondary, or other goals. We should determine which will be the focus and compare the observed results against that objective. A useful metric in this context is the percentage of users who achieved the combined set of goals. For task-oriented studies, where measures like task completion, task time, and errors are relevant, evaluating whether participants met our stated goal becomes crucial. We can establish standards for comparison based on historical data, expert benchmarks, competitor standards, or a business goal. For example, if we are most concerned about task time, the target might be “at least 90% of tasks completed in under 70 seconds”.
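As a minimal sketch (with hypothetical data and thresholds), scoring participants against such a combined goal could look like this:

```python
# Hypothetical per-task completion times in seconds; None = task not completed.
task_times = {
    "p1": [45, 62, 55],
    "p2": [80, 66, None],
    "p3": [50, 58, 61],
}

TARGET_SECONDS = 70   # "completed under 70 seconds"
TARGET_RATE = 0.90    # "at least 90% of tasks"

def met_goal(times, target_seconds=TARGET_SECONDS, target_rate=TARGET_RATE):
    """True if the participant completed at least target_rate of tasks under target_seconds."""
    ok = [t is not None and t < target_seconds for t in times]
    return sum(ok) / len(ok) >= target_rate

# Percentage of participants who achieved the combined set of goals.
share_meeting_goal = sum(met_goal(t) for t in task_times.values()) / len(task_times)
print(f"{share_meeting_goal:.0%} of participants met the combined goal")
```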

Combining Metrics Based on Percentage

When expert or competitor standards are absent, insights from participants become valuable. We can define the minimum, maximum, and observed values for each metric tested in a study. The observed value comes from the participant. The maximum value can come from the participant who gives the best performance, or from experts as a reference. The minimum value can come from the participant who gives the worst performance, or from our understanding of previous experiences. For example, if we collect ease-of-use ratings on a 5-point scale, we can convert a participant’s rating into a percentage by simply dividing the rating by the maximum rating. In all similar cases, we can divide the score obtained from each participant by the corresponding maximum, whether that is 15 tasks for task completion rate or a highest satisfaction rating of 7 for “very satisfied”. However, in cases like average time per task, there may not be dramatic differences across participants’ task times, which makes it difficult and inaccurate to treat one participant’s slightly better performance as the maximum value. In this case, we can divide the difference between the observed value and the minimum value by the difference between the maximum and minimum values found in the study, and then subtract the result from 1. In the case of task time, the normalized task time percentage is as follows:

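Spelled out, with the longest and shortest times observed in the study serving as the maximum and minimum:

Normalized Task Time % = 1 - (Observed Time - Shortest Time) / (Longest Time - Shortest Time)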
*If reversing is not required, we don’t need to subtract the second part of the equation from 1.

We first find the difference between the observed time and the shortest time in the study. Then we divide it by the total range, which is the difference between the longest and shortest times recorded by participants in the study. Since a higher task time indicates worse performance, we subtract the result from 1 to reverse the scale, so that higher values now indicate better performance. The result shows how close a participant’s observed time is to the best performance, relative to the entire range from best to worst. For a metric like page visits, a higher number already means better performance (e.g., exploring many pages on a website) and a lower number indicates worse performance (e.g., visiting only a few pages and leaving), so no reversal is needed. After we generate a percentage score for each metric, we can average all the metrics’ scores to get an overall percentage for each participant.
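Here is a minimal sketch of that normalization, using hypothetical data for three metrics (task completion out of 15 tasks, ease-of-use rating out of 5, and task time reversed with the min-max formula above):

```python
participants = {
    "p1": {"tasks_completed": 13, "ease_of_use": 4, "task_time_s": 95},
    "p2": {"tasks_completed": 15, "ease_of_use": 5, "task_time_s": 70},
    "p3": {"tasks_completed": 10, "ease_of_use": 3, "task_time_s": 140},
}
TOTAL_TASKS = 15
MAX_RATING = 5

times = [p["task_time_s"] for p in participants.values()]
t_min, t_max = min(times), max(times)

scores = {}
for name, p in participants.items():
    completion = p["tasks_completed"] / TOTAL_TASKS                 # higher is better
    ease = p["ease_of_use"] / MAX_RATING                            # higher is better
    time_score = 1 - (p["task_time_s"] - t_min) / (t_max - t_min)   # reversed: faster is better
    scores[name] = (completion + ease + time_score) / 3             # overall percentage per participant

print(scores)
```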

We might encounter two problems. First, for some metrics like task time, a higher number indicates worse performance. Second, we might need to apply weights to metrics based on their importance. Let’s explore the scenarios.

1. When a higher score indicates worse performance

We should ensure each percentage derived from each metric follows the same scale, where a higher number is interpreted as better performance and vice versa. If not, we should reverse the scale accordingly. For example, say we have an ease-of-use rating from 1 to 5, where 1 means “very easy” and 5 means “very difficult”. Subtracting the rating from 6 (so 5 becomes 1 and 1 becomes 5) reverses the scale while keeping it in the same 1 to 5 range.

2. Applying weights to metrics

Let’s say we give both the “task time” and “number of tasks completed” a weight of 1 and assign a weight of 2 to “ease of use ratings”:

Weighted Average = (1×Normalized Task Time + 1×Normalized Tasks Completed + 2×Normalized Ease of Use Rating) / (1 + 1 + 2)

*Note that we should eliminate outliers for more accurate results
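A minimal sketch of that weighted average, using hypothetical normalized 0–1 scores for one participant:

```python
# Weights reflect how much each metric matters to the study goals.
weights = {"task_time": 1, "tasks_completed": 1, "ease_of_use": 2}

def weighted_average(normalized, weights):
    """normalized: dict of metric name -> 0-1 score for one participant."""
    total = sum(weights[m] * normalized[m] for m in weights)
    return total / sum(weights.values())

print(weighted_average(
    {"task_time": 0.80, "tasks_completed": 0.87, "ease_of_use": 0.75},
    weights,
))
```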

Combining Metrics Based on Z-Scores

The z-score tells us how many standard deviations above or below the mean a score is. It is a useful tool for comparing sets of data. We convert each metric to its z-score and then calculate the average z-score for each participant, so that we can compare two versions of a product, different user groups, or conditions within the same usability test. Once all data points are transformed into z-scores, we can plot the z-scores from two sets of data on a graph and examine the slope and direction of the trend lines. The direction of a line suggests whether the variables presented in the graph have a negative or positive correlation. We can also compare two lines, for example when a baseline study is followed by a subsequent study that changes a specific variable. By observing whether the two lines are parallel or divergent, we learn whether the tested variable has an effect on the results.

Excel’s STANDARDIZE() function to calculate z-scores
The textbook shows a similar chart. Here I asked Bard to generate a similar chart for me lol.
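The same calculation Excel’s STANDARDIZE(value, mean, standard_dev) performs can be sketched in Python with hypothetical data, where each metric is already on a “higher is better” scale:

```python
from statistics import mean, stdev

# One value per participant for each metric.
metrics = {
    "tasks_completed": [13, 15, 10, 14],
    "ease_of_use":     [4, 5, 3, 4],
    "time_score":      [0.62, 1.0, 0.0, 0.81],
}

def z_scores(values):
    """How many standard deviations each value sits above or below the mean."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

standardized = {name: z_scores(vals) for name, vals in metrics.items()}

# Average z-score per participant (column-wise average across metrics).
per_participant = [mean(col) for col in zip(*standardized.values())]
print(per_participant)
```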

Using a Single Usability Metric (SUM)

Finally, we can combine various metrics into a single usability score. SUM (Single Usability Metric) creates a standardized and easily understandable view of overall usability performance. The approach can be used to identify the best- and worst-performing results for a specific task, pinpoint the most problematic tasks, and offer a holistic view of usability performance. You can find additional resources on this approach here: SUM: Single Usability Metric
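As a rough illustration only (not the published SUM procedure, which has its own standardization steps), averaging normalized task-level scores can already flag the most problematic task; the data below is hypothetical:

```python
# task -> normalized 0-1 scores (e.g., completion, time, satisfaction).
task_scores = {
    "create account": [0.95, 0.80, 0.90],
    "find pricing":   [0.60, 0.45, 0.50],
    "export report":  [0.85, 0.70, 0.75],
}
combined = {task: sum(s) / len(s) for task, s in task_scores.items()}
worst_task = min(combined, key=combined.get)
print(combined, "-> most problematic:", worst_task)
```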
