Sprint 4: Understanding Human-AI Collaboration — Insights from Hands-On Experiments and Research Synthesis

Honda Research Institute MHCI @ CMU · Published in 99P Labs · 7 min read · Apr 10, 2024

Written by the 2024 99P Labs x CMU MHCI Capstone Team
Edited by 99P Labs

The 99P Labs x CMU MHCI Capstone Team is part of the Master of Human-Computer Interaction (MHCI) program at Carnegie Mellon University.

Catch up on Sprint 3 here!

Returning from spring break, we immersed ourselves in exploring potential solutions by conducting secondary research on team performance measurement and running two in-person teaming experiments. In Sprint 4, our primary goal was to develop a pretotype showcasing the prospective contents of a “living research guide” for Human-AI Teaming (HAIT) research, which we plan to test in the upcoming round of experiments. We conducted two experiments in different contexts: one explored metrics for Human-AI Teaming through a consumer’s use case, and the other through a researcher’s use case.

Secondary Research

Whiteboard with our conceptual model of variables involved in HAIT research

Before delving into experiment design, we aimed to understand the research metrics and methodologies currently used in HAIT. Although HAIT has been an active research area for the past 30 years, it still lacks a standardized methodology, which hinders researchers’ ability to build on existing findings. Through an additional literature review, we compiled a list of independent and dependent variables frequently explored in HAIT research.

We synthesized numerous artifacts and models that address key questions:

  • What are common independent variables controlled in HAIT research?
  • What are common dependent variables, typically the study objectives, measured in HAIT research?
  • What are some common metrics researchers use to measure the dependent variables?
  • What are the methods to measure these metrics?

Probing question to guide our future direction:

How might we format the variables as a guide to help researchers tasked with designing HAIT studies?
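As a thought experiment toward answering that question, the sketch below shows one way the compiled variables could be organized as a small, machine-readable guide. The entries are generic examples of measures commonly reported in the literature (e.g., trust questionnaires, NASA-TLX for workload), not our actual compiled list, and the structure is only a starting point.

```python
# Illustrative sketch of a "living research guide" backbone for HAIT studies.
# The variables, metrics, and methods below are generic placeholders commonly
# seen in the literature, not the list we compiled in our review.
HAIT_GUIDE = {
    "trust": {
        "type": "dependent variable",
        "metrics": ["self-reported trust rating", "reliance rate"],
        "methods": ["post-task Likert questionnaire", "behavioral logging"],
    },
    "workload": {
        "type": "dependent variable",
        "metrics": ["perceived workload score"],
        "methods": ["NASA-TLX survey"],
    },
    "ai_transparency": {
        "type": "independent variable",
        "levels": ["no explanation given", "acknowledges and explains errors"],
    },
}

def lookup(variable: str):
    """Return the guide entry for a variable, if one has been catalogued."""
    return HAIT_GUIDE.get(variable, "not yet catalogued")

print(lookup("trust"))
```

A structure like this could eventually let a researcher query which metrics and methods have been used for a given variable, which is the kind of support we want the guide to provide.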

Consumer-Facing Pilot Test

To explore human-AI teaming dynamics in a real-world context, we conducted a Wizard of Oz experiment simulating a classroom navigation AI. Our goal was to observe how mistakes and confusing directions from the AI would influence a user’s trust in the AI. To create this experience, a team member acted as the “AI,” guiding a participant through CMU’s confusing buildings using a pre-written script.

Process

We employed a Wizard of Oz method to simulate a classroom navigation AI experience without the need for fully developed artificial intelligence. This allowed us to focus on how users reacted to the concept of an AI assistant. The experiment involved three distinct navigation scenarios:

Simple navigation route used in the pilot study

Initially, a straightforward route was designed to establish a baseline level of trust. This route directed users to a familiar destination, laying the foundation for trust assessment. The second route introduced an intentional error by the “AI,” allowing us to explore how transparency about mistakes influences trust and to observe participants’ behavior closely. Lastly, a deliberately confusing route tested reactions to a less competent AI.

Throughout the experiment, participants had the freedom to interrupt the AI for clarification or question its directions. Following the completion of the routes, participants were asked to complete a questionnaire evaluating their subjective feelings of trust towards the AI.

Findings

Interestingly, we discovered that when the AI openly acknowledged its mistakes and promptly corrected them, it instilled greater trust. This underscores the significance of transparency in human-AI interactions. We also noted a correlation between participants’ perceived trust and their questioning behavior. When participants felt confident in the AI’s capabilities, they posed fewer questions, engaging actively and following the AI’s instructions without constant verification. Conversely, when faced with unclear instructions or a sense of disorientation, participants asked more frequent and probing questions, indicating a decline in trust, and sought reassurance through confirmatory queries like “Are you sure this is the correct way?” and “Is this the right turn?”

This pilot test gave us valuable insight into the intricacies of assessing qualitative metrics and moved us toward our objective of translating qualitative observations into quantitative measures.
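As a first, hedged step in that direction, the sketch below pairs one behavioral proxy for trust (the number of clarification questions asked on a route) with post-route self-reported trust ratings and checks whether the two track each other. The numbers are invented for illustration, not data from the pilot, and Spearman’s rank correlation is just one reasonable choice for ordinal Likert ratings.

```python
# Illustrative sketch: relate a behavioral proxy for trust (clarification
# questions asked per route) to self-reported trust on a 1-7 Likert scale.
# All numbers are made up for demonstration; they are not our pilot data.
from scipy.stats import spearmanr

# Hypothetical per-participant observations: (questions asked, reported trust).
observations = [
    (0, 7), (1, 6), (2, 6), (4, 4), (6, 3), (7, 2),
]

questions_asked = [q for q, _ in observations]
reported_trust = [t for _, t in observations]

# Spearman's rank correlation handles the ordinal Likert scale better than
# Pearson's r and does not assume a linear relationship.
rho, p_value = spearmanr(questions_asked, reported_trust)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```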

Reflections

Updated navigation map with a more complex route, based on our pilot observations

Given the insights gleaned from our preliminary pilot test, we’ve refined our study approach. Firstly, we’ve removed potential human influences on participant performance, such as the researcher’s physical presence and the researcher voice-acting as the AI navigator. While technical constraints mean we will continue to use the Wizard of Oz method for the AI system, we intend to use text-to-speech applications over video calls to guide participants through the scripted interactions. Secondly, we’ve shifted to a between-subject study design. This adjustment makes it easier to compare participants’ trust in the AI across transparency conditions, without one route’s experience coloring the next.
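As a rough illustration of the scripted setup we have in mind, here is a minimal sketch assuming the open-source pyttsx3 library as the text-to-speech engine; the script lines are placeholders, and any offline or cloud TTS tool could stand in for it. The researcher simply triggers each pre-written line over the video call rather than voice-acting the AI navigator.

```python
# Minimal sketch of a scripted, text-to-speech "wizard" for the navigation
# study, assuming the pyttsx3 library as one possible TTS option.
# The lines below are placeholders, not our actual study script.
import pyttsx3

SCRIPT = [
    "Hello! I will guide you to your classroom today.",
    "Walk straight ahead and take the second door on your left.",
    "Sorry, that was my mistake. Please turn around and take the stairs instead.",
]

def run_wizard(lines):
    engine = pyttsx3.init()
    engine.setProperty("rate", 160)  # slightly slower than default speech
    for line in lines:
        # The researcher presses Enter to play each scripted direction.
        input(f"Press Enter to play: {line!r}")
        engine.say(line)
        engine.runAndWait()

if __name__ == "__main__":
    run_wizard(SCRIPT)
```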

Researcher-Facing Pilot Test

Introduction

In addition to the consumer-facing HAIT task, we structured a HAIT task that might be performed in productivity or professional settings. In this experiment, we had participants imagine they had to plan a party in ten minutes.

Process

Our session began with a pre-activity survey containing two simple questions about the participant’s experience with party planning. We wanted to standardize our results across participants, since prior experience could affect how we measured trust in the system. We then read aloud the prompt that participants would work from:

Amid the Martian landscape, there’s a big dome where you, a friendly Martian, are throwing an out-of-this-world party. You have just sent out invitations using Martian communication tech, inviting fellow Martians and even visitors from other planets. This party is going to be the talk of the century, a wild, fun night the stars are witnessing from above! How would you go about planning it? Luckily, you don’t need to do this alone — your trusty [AI helper, Janet] will be helping you make this event come to life! We would like to hear you talk through your process as you work together to make this party come to life!

Following this prompt, we asked our participant questions to assess their current level of trust in using an AI for similar or desk-based tasks. We also wanted to gauge their comfort and familiarity with using AI, so that our results would be more generalizable.

Pilot experiment with researchers present; the participant explains his outcomes while thinking aloud

With this, we told our participant that they had ten minutes to come up with a plan and asked them to talk through their process with us. To observe the participant without interfering, we gave them space and asked permission to record them during the activity, which helped put them in a mental space closest to a desk environment. Additionally, we encouraged our participant to use GPT-4’s voice feature to communicate their thoughts and ideas, paired with markers, paper, and an endless supply of Post-it notes!

After ten minutes, we asked our participant to answer a set of questions evaluating their trust in working with the AI. We also wanted to learn more about their process of working verbally with an AI, to understand their comfort level and the other potential applications they might propose for a similar feature.

Reflection

Moving forward, we have outlined several enhancements for future iterations of similar research. Firstly, we plan to refine our post-activity questions by adopting a Likert-scale format, asking participants to assign numerical ratings to various aspects of human-AI teaming and giving us a more structured approach to data collection and analysis. Additionally, we will ensure that participants have access to the conversation transcript as they engage with the AI agent, addressing a suggestion raised by our participant.

In response to valuable feedback from our faculty and advisors, we are committed to simplifying the participant prompt. Specifically, we propose confining the scenario to Earth to mitigate the cognitive load associated with task complexity. By limiting the scope in this manner, we can investigate the influence of task complexity on trust, presenting participants with three distinct levels based on factors such as the number of guests or the venue setting. This adjustment will facilitate a more focused examination of trust dynamics within the human-AI interaction context.
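To make both refinements concrete, the sketch below uses hypothetical Likert items, guest counts, and venues (none of these are finalized study parameters): one helper aggregates a participant’s 1–7 ratings into a single trust score, and another assigns a participant to one of the three complexity levels.

```python
# Hypothetical sketch of the two refinements above: (1) aggregating 1-7
# Likert ratings into a single trust score, and (2) placing a participant
# in one of three task-complexity conditions. Items, guest counts, and
# venues are placeholders, not finalized study parameters.
import random

LIKERT_ITEMS = [
    "The AI understood what I was trying to accomplish.",
    "I felt comfortable relying on the AI's suggestions.",
    "I would team up with this AI on a similar task again.",
]

CONDITIONS = {
    "low":    {"guests": 5,   "venue": "a living room"},
    "medium": {"guests": 30,  "venue": "a community hall"},
    "high":   {"guests": 100, "venue": "an outdoor park"},
}

def trust_score(ratings):
    """Average one 1-7 rating per Likert item into a single score."""
    assert len(ratings) == len(LIKERT_ITEMS)
    assert all(1 <= r <= 7 for r in ratings)
    return sum(ratings) / len(ratings)

def assign_condition(participant_id, rng=random):
    """Randomly place a participant in one complexity level."""
    level = rng.choice(list(CONDITIONS))
    return participant_id, level, CONDITIONS[level]

print(assign_condition("P01"))
print(trust_score([6, 5, 7]))  # -> 6.0
```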

Next Steps

Although we won’t be pursuing both studies further at this stage, we recognize the enhancements and adaptations we could implement if circumstances allow. Leveraging the insights from our experiments and our cumulative knowledge from past research activities, we’ve begun crafting an early-stage pretotype concept for a field-testing guide, currently taking the form of a smart chatbot. Our observations during the sessions revealed that voice-based interactions tended to feel more fluid than text-based ones. Initially, we’ll focus on high-level interactions and functionalities, and we intend to fine-tune our pretotype assumptions through collaborative sessions with CMU researchers. We’ll lead them through diverse scenarios to evaluate how effectively the guide supports their research efforts and to gauge how valuable it is to them. These observations will direct our brainstorming sessions for refining specific functionalities and identifying additional use cases as we wrap up the spring semester and gear up for summer development.

~ Vroom Vroom~

This project is not intended to contribute to generalizable knowledge and is not human subjects research.

Read Sprint 5 here!

Get all the latest updates from the team! Follow 99P Labs here on Medium and on Linkedin!

Hi there! We’re team Hondasss from Carnegie Mellon's MHCI program on our 8-month journey defining the future of Human-AI Teaming for Honda Research Institute!