Taking Think-Alouds One Step Further

Tony Wang
Zensors MHCI Capstone 2018

--

With a month remaining in our Masters capstone project at the beginning of July, we decided to conduct a round of user research that could help shape the future of Zensors development in a way that traditional user testing can’t. At this point in the process we were thinking:

How can we do meaningful research now that we’re done iterating and evaluating on our design?

The answer was something we’ve been interested in studying since we first joined the project:

Identify mental models that lead to success with using Zensors.

We chose two different methods to experiment with tools for investigating users’ mental models. Since these techniques were new to our team, we wanted to explore two independent ways of capturing how a user might naively think about Zensors. In this article, I’ll introduce both methods and talk about our experience working with them.

Understanding the Limitations of Think-Alouds

Before we dive into the details of our methods, let’s take a step back and ask an important question: Is there any point to using two independent measures of mental models?

One inspiration for this study was understanding the limitations of think-aloud protocol. Broadly used in usability testing as a way to collect data on what a user is thinking as they use a product, think-alouds are touted by Don Norman as the most popular usability tool. Yet on the same page Norman also writes:

“There are also more advanced knowledge-elicitation methods for gaining deeper insights into mental models, but for most design teams, a few quick think-aloud sessions will suffice” — Don Norman

So what are these techniques and how might we leverage them to benefit the Zensors project?

Part of my research into the cognitive psychology literature landed on the work of Rowe and Cooke (1995), who studied the mental models of Air Force technicians. Technicians frequently have to troubleshoot complex systems, and their ability to perform on the job depends on forming a specific understanding of how they would approach fixing the problem. In their work on mental models, Rowe and Cooke explore three different techniques: laddering interviews, relatedness ratings, and diagramming. We opted to borrow laddering interviews and relatedness ratings for our own research.

Study Design

We developed two different “conditions” for this test, splitting our participants across two versions of the mental model study: laddering interview + think-aloud, or relatedness ratings + think-aloud.

Independent Measure #1: Laddering Interviews

Laddering interviews are a well-known technique used by market researchers to understand the connections between attitudes and underlying values. Compared to traditional research interviews, laddering interviews focus on connecting surface attributes of behavior or attitude (e.g. actions or thoughts about a product) with underlying belief systems (e.g. the product is tied to a core ethical value).

UXMatters.com has an easy-to-follow introduction explaining the technique. To summarize in one sentence: laddering interviews assume that there is an underlying structure to the way people connect personal values to attributes of their favorite brands. By asking “why” and “how”, you can traverse the ladder to focus on deeper concepts or shallower representations.

We adopted laddering interviews because of their potential for uncovering underlying values in our users. One of the biggest questions we’ve dealt with is the notion of privacy: Do people find what Zensors does to be creepy?

(Hint: They do.)

We conducted laddering interviews with 7 participants in our final few weeks on the capstone project. Each interview followed the structure below, and afterward we conducted a think-aloud to assess performance.

  1. Introduce them to Zensors and provide a brief description
  2. Ask about work habits, current tools, and data needs from the past week to get the participants warmed up
  3. Based on their responses in step 2, begin probing surface behaviors or actions related to data collection
  4. Ask “why” 20 times without annoying the participant

We coded interview data from each participant using three standard layers: attribute, consequence, and value. We assigned categories to similar chunks of data, then structured those categories into a ladder diagram showing the connections between the various themes in an interview. Here’s an example of what one interview looked like after analysis.

From this diagram, you can see how “creating and maintaining an efficient and safe environment is important” is supported by a number of related concepts.
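To make the coding step a little more concrete, here is a minimal sketch of how a coded ladder can be represented as data. Aside from the value statement quoted above, the labels and structure are illustrative placeholders rather than our actual coded interviews.

```python
# A minimal sketch of representing coded laddering data: each chunk of
# interview data gets a layer (attribute, consequence, or value) and links
# "upward" toward deeper concepts. The attribute and consequence labels
# below are illustrative, not taken from a real interview.

from dataclasses import dataclass, field

@dataclass
class LadderNode:
    label: str                                    # category assigned to similar chunks of data
    layer: str                                    # "attribute", "consequence", or "value"
    supports: list = field(default_factory=list)  # deeper nodes this one supports

# One small ladder: surface attribute -> consequence -> core value
value = LadderNode("Creating and maintaining an efficient and safe environment is important", "value")
consequence = LadderNode("Can respond to room problems quickly", "consequence", [value])
attribute = LadderNode("Checks room status every morning", "attribute", [consequence])

def print_ladder(node, depth=0):
    """Walk the ladder from shallow attributes toward underlying values."""
    print("  " * depth + f"[{node.layer}] {node.label}")
    for deeper in node.supports:
        print_ladder(deeper, depth + 1)

print_ladder(attribute)
```

The point is simply that every theme sits at one of the three layers and points toward the deeper concepts it supports, which is what the ladder diagram visualizes.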

Independent Measure #2: Relatedness

This method was a lot trickier for us to implement. A relatedness task can be thought of as a more structured type of card sort, in which we present a closed set of options for the user to make judgments about. Rowe and Cooke implemented relatedness exactly this way: in the lab, using physical cards. Since all of our usability testing is remote, we needed a way to conduct the study without physically manipulating cards.

One option was to use an online card sorting tool, which would give us a structured way of gathering data. The downside is that we’d have to give our participants an external URL outside the Google tools we were already using (we were on Google Meet), introducing more complexity to our protocol. The second option was to find a familiar format that most participants would intuitively understand.

Ultimately, we settled on using Google Forms to conduct our relatedness task. We built a survey of 8 features presented in 28 total pairs, and used randomization to make sure ordering effects didn’t creep into our data. We asked participants to rate the degree to which two features relate to each other in terms of impacting their ability to optimize conference rooms. (Conference room utilization was a common complaint in our earlier research with facilities managers and coworking professionals.)
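For readers curious about the pairing logic, here is a minimal sketch of how 8 features expand into 28 pairs with a randomized presentation order. The feature names are placeholders; in our study, Google Forms handled the randomization for us.

```python
# A minimal sketch of generating the 28 feature pairs for a relatedness
# survey. Feature names are placeholders, not the actual Zensors features.

import itertools
import random

features = [f"Feature {i}" for i in range(1, 9)]   # 8 placeholder features

pairs = list(itertools.combinations(features, 2))  # 8 choose 2 = 28 pairs
random.shuffle(pairs)                              # randomize presentation order

for a, b in pairs:
    print(f"How related are '{a}' and '{b}' when it comes to optimizing conference rooms?")
```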

We settled on a 6-point Likert scale because we were worried that our participants, some of whom have expertise in facilities management but not technology, might rely on a middle option as an easy way out. Even if participants were unsure, they had to pick a side: related or unrelated. This meant every response reflected some sort of judgment we could use in our analysis.
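Once the ratings come back, the pairwise responses can be folded into a symmetric relatedness matrix for each participant, which is a convenient form for downstream analysis. The sketch below uses made-up ratings, not our actual response data.

```python
# A minimal sketch (with assumed ratings) of turning pairwise 1-6 responses
# into a symmetric relatedness matrix for one participant.

import numpy as np

features = [f"Feature {i}" for i in range(1, 9)]
index = {name: i for i, name in enumerate(features)}

# Hypothetical responses: (feature_a, feature_b, rating on a 1-6 scale)
responses = [
    ("Feature 1", "Feature 2", 5),
    ("Feature 1", "Feature 3", 2),
    ("Feature 2", "Feature 3", 6),
    # ... the remaining 25 pairs would follow
]

matrix = np.zeros((len(features), len(features)))
for a, b, rating in responses:
    matrix[index[a], index[b]] = rating
    matrix[index[b], index[a]] = rating   # relatedness is symmetric

print(matrix)
```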

Performance Measure: Think-Aloud

Finally, we examined the performance of each participant using a think-aloud task. Instead of running thematic analysis on the data, we opted to conduct cognitive task analysis (CTA). CTA is a method developed from cognitive psychology and is better suited to knowledge-based tasks than the action-based tasks that traditional task analysis examines. I transcribed all of the think-aloud data and worked with the team to identify where errors and successes happened during our tests. Based on criteria such as success or failure to accomplish a task, the number of inefficiencies when completing a task, and the ability to verbalize goals, we systematically selected our top-performing participants.
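As an illustration of how those criteria can be tallied into a rough ranking, here is a small scoring sketch. The weights and participant data are hypothetical; in our study we made these judgments as a team while reviewing the transcripts.

```python
# A minimal sketch of tallying think-aloud criteria into a rough performance
# score. Participant data and weights below are hypothetical.

participants = {
    # participant_id: (task_success, num_inefficiencies, verbalized_goals)
    "P1": (True, 2, True),
    "P2": (False, 5, True),
    "P3": (True, 0, False),
}

def performance_score(success, inefficiencies, verbalized):
    """Higher is better: reward success and goal verbalization, penalize inefficiencies."""
    return (2 if success else 0) + (1 if verbalized else 0) - 0.5 * inefficiencies

# Rank participants from strongest to weakest performance
ranked = sorted(
    participants.items(),
    key=lambda item: performance_score(*item[1]),
    reverse=True,
)

for pid, data in ranked:
    print(pid, round(performance_score(*data), 1))
```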

Wrapping Up

While I haven’t gone into some of the insights we’re starting to see from this line of research work, I wanted to introduce some of the interesting methods we’ve been experimenting with as part of the MHCI capstone process. As we wrap up our project, we’ll be looking at packaging and presenting more of our research data. More to come!

--


Tony Wang
Zensors MHCI Capstone 2018

UX research, online communities, and languages | Masters Candidate in HCI @ CMU