CHI 2023 Editors’ Choice

by Werner Geyer (IBM Research, US), Vivian Lai (Visa Research, US), Vera Liao (Microsoft Research, Canada), Justin Weisz (IBM Research, US)

Published in Human-Centered AI, Jun 9, 2023

Photo of #CHI2023 taken by fotoduda. More photos available at https://chi2023.acm.org/for-attendees/photo-gallery/

The premier international conference on Human-Computer Interaction, ACM CHI, brings together researchers and practitioners interested in interactive digital technologies. This year’s event took place in Hamburg, Germany from April 23–28. Artificial intelligence was front and center at CHI with many sessions on topics including Human-AI Collaboration, Cognition and Bias, AI for Health, Trust, Transparency and Fairness, Interactions with AI and Robots, and Conversation, Communication and Collaborative AI. The searchable version of the program contains links to published papers and presentation recordings.

Researchers in human-computer interaction and related fields presented an astounding body of work at CHI this year, with 879 works accepted to the conference. For this article, a few of our editors identified papers that stood out to them because they were interesting and made a significant contribution to human-centered AI. We hope you find these papers as inspiring as we did.

Note that the selection of papers identified below wasn’t based on a systematic in-depth search for all papers around HCAI at CHI but rather reflects a personal view of our editors. Given the volume of papers at CHI, we most certainly missed many other great papers.

Help me help the AI

Summary By Vivian Lai

A number of exciting papers were presented at CHI 2023. However, one of my favorite papers was “Help Me Help the AI”: Understanding How Explainability Can Support Human-AI Interaction by Sunnie S. Y. Kim, Elizabeth Anne Watkins, Olga Russakovsky, Ruth Fong, and Andrés Monroy-Hernández. This work investigates how explainability in AI applications can support human-AI interaction. Using a real-world AI application, the Merlin bird identification app, as a testbed, they conducted a mixed-method study with 20 users. Unlike prior studies that use proxy testbeds, this study examined the actual needs of AI application users and offers practical insights into building AI applications. The most fascinating finding of this study is that users are interested in improving their collaboration with the AI system (and thus “Help me help the AI”), rather than understanding the AI’s outputs or predictions, which is, arguably, the traditional purpose of AI explanations.

The authors addressed three questions in their work:

What are end-users’ XAI needs in real-world AI applications?
How do end-users intend to use XAI explanations?
How do end-users perceive existing XAI approaches?

What are end-users’ XAI needs in real-world AI applications?

Although participants were curious about the details of the AI used by the application, only those with a high-AI background and/or high-domain interest actively sought out this information. Participants unanimously expressed the need for information to improve collaboration with the AI system. This information included the AI’s capabilities and limitations and the AI’s confidence.

How do end-users intend to use XAI explanations?

Besides understanding the AI’s outputs, participants intended to use explanations for calibrating trust, improving their own task skills, collaborating more effectively with AI, and giving constructive feedback to developers.

Figure reproduced from Kim et al. 2023 showing four different kinds of XAI explanations

How do end-users perceive existing XAI approaches?

Participants rated four kinds of XAI approaches: heatmap-based, example-based, concept-based, and prototype-based explanations. Overall, they most preferred prototype-based explanations. This finding was not particularly surprising, as the heatmap-based explanations were confusing and challenging to understand. Similarly, while example-based explanations were helpful to a certain degree, they did not add value beyond the examples already shown in the application alongside identification results, since they came with no additional annotations or details. Lastly, as prior work has shown, the coefficients used by concept-based explanations could be difficult for users without an AI background to understand and interpret. Although prototype-based explanations have been shown to be effective in explaining AI output for visual tasks, they do not necessarily have the same effect on textual tasks.

Kudos to the authors for studying end-users’ explainability needs in a real-world context, and I hope this will encourage more research revolving around real-world needs and usage!

What do we want when we write with AI?

Summary By Vera Liao

There has been much excitement around Large Language Models (LLMs) at CHI this year. And one of the most popular applications of LLMs is to provide writing support, for social interactions, scientific writing, creative work, and more. This excitement even dominated many conversations I had with people at CHI. While I am also excited about this topic, I found myself more drawn to works that show risks of writing-support technologies, and ones that point out that there is still a long way to go for us to understand and support what people truly want from these technologies.

Among a few other papers conducting experimental studies on writing with AI (e.g., comparing different types of suggestions or different prompting strategies), I was quite struck by the paper Co-Writing with Opinionated Language Models Affects Users’ Views by Maurice Jakesch, Advait Bhat, Daniel Buschek, Lior Zalmanson, and Mor Naaman. By asking participants to write about a controversial topic with the assistance of an LLM with certain biases (introduced through biased prompt design), they show that the language model’s biases not only influenced participants’ writing but also shifted their own attitudes on the topic, as measured after the writing task. The majority of participants were also not aware of the model’s biases or of its effects on their writing and attitudes. The authors call out individual and societal risks of such latent persuasion by language models, including the risk of it being exploited for targeted opinion influence.

Occasionally, I hear an argument against considering the risks of AI writing support: “they are just like calculators.” I think the study is a sobering reminder that writing is often not just an instrumental task but plays a fundamental role in our learning, communication, and attitude and value formation, and that writing-support technologies can have a profound impact on how we see the world.

Another paper that I enjoyed, and that made me further contemplate the risks of AI writing support, is Social Dynamics of AI Support in Creative Writing by Katy Gero, Tao Long, and Lydia Chilton. The authors interviewed 20 creative writers about their writing practices and their attitudes toward writing support (both human and computer), and laid out answers to three important questions: What tasks do writers desire help with? How do writers perceive support actors? What values do writers hold for writing? The last point is especially interesting: besides “creativity”, the creative writers that they interviewed put slightly more emphasis on “authenticity” (their own voice) and “intention” (their own goals or subjectivity). One can easily see that AI writing-support tools do not always support, and can even threaten, authenticity or intention. There might even be inherent tensions among these values. While I do worry about the harms that ill-designed technologies can do to authors’ agency, control, authenticity, intrinsic motivation, and well-being, I remain curious about how the human-AI writing partnership and writers’ values will evolve with this new generation of writing-support tools.

One AI Does Not Fit All

Summary By Justin Weisz

One of my favorite papers was One AI Does Not Fit All: A Cluster Analysis of the Laypeople’s Perception of AI Roles by Taenyun Kim, Maria D. Molina, Minjin (MJ) Rheu, Shuo S. Zhan, and Wei Peng. This work examined how laypeople (i.e., people who are not AI experts) viewed different kinds of AI systems in terms of how autonomous they are and how much people need to be involved with them. The authors conducted a large-scale survey (N=727) in which people rated different types of AI systems, including an AI housekeeper, an AI personal assistant, an AI driver, an AI journalist, an AI doctor, and more, on several different scales: (1) mind perception, which included the AI’s ability to possess agency and the ability to sense and feel; (2) moral agency, which included the extent to which the AI is capable of moral conduct and the degree to which it relies on predetermined human programming versus having its own intention; and (3) credibility, which included measures of trustworthiness, expertise and competence, and goodwill (e.g., how caring the AI is).

There were two primary findings in this work that I found very interesting. First, through a cluster analysis, the authors identified four categories of AI role: as a tool (low autonomy, low human involvement), as a servant (low autonomy, high human involvement), as an assistant (high autonomy, low human involvement), or as a mediator (high autonomy, high human involvement). Although I’m not a huge fan of the language of “AI servant,” there is value in knowing about these four distinct categories.

Figure reproduced from Kim et al. 2023 showing four different AI role categories

The second major finding is that there were significant differences in people’s attitudes toward each of these roles. People felt that tools were less trustworthy, less expert, and less kind than mediators. They also had a less positive attitude toward tools than toward other kinds of roles, and they were less willing to adopt tools than other kinds of roles.

Figure reproduced from Kim et al. 2023 showing differences in how participants rated the credibility, attitude, and social approval of different AI role categories

I am intrigued by the possibility that these relationships are causal: if a designer re-frames a “tool” as a “mediator,” will people’s expectations follow suit? What about their actual user experience?

I am reminded of the recent debate On AI Anthropomorphism by Ben Shneiderman and Michael Muller. Ben argues that systems like ChatGPT should be framed as tools and that anthropomorphism should be avoided (e.g. by not having ChatGPT use the first-person pronoun “I”). This study shows the consequences of making that decision: people will have lower expectations of trustworthiness, expertise, and goodwill. I leave it as an exercise to the reader to determine if those are desirable outcomes.

Responsible AI is participatory AI

Summary By Werner Geyer

I felt strongly about two papers highlighting the challenges in designing and developing responsible AI systems:

Fairness Evaluation in Text Classification: Machine Learning Practitioner Perspectives of Individual and Group Fairness by Zahra Ashktorab, Benjamin Hoover, Mayank Agarwal, Casey Dugan, Werner Geyer, Hao Bang Yang, and Mikhail Yurochkin, and
Designing Responsible AI: Adaptations of UX Practice to Meet Responsible AI Challenges by Qiaosi Wang, Michael Madaio, Shaun Kane, Shivani Kapania, Michael Terry, and Lauren Wilcox.

Both papers make a strong case for diverse stakeholder participation along the entire AI development lifecycle, up to the end user. They also illustrate the challenges we need to overcome when involving stakeholders during the development process.

Ashktorab et al. studied 24 ML practitioners to understand how they made decisions about mitigating bias in ML models. When presented with different model views and metrics on toxicity, participants’ choices of a suitable model mostly aligned with the goals the model was trained for, including group and individual fairness metrics. Group fairness requires equitable treatment of groups of people, for example, comparable loan approval rates for men and women. The goal of individual fairness is to achieve similar treatment for similar individuals, for example, two job candidates who differ only in gender or name should receive similar treatment.
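To make the two fairness notions concrete, here is a minimal Python sketch. It is not the tooling or the metrics used in the paper: the function names, the toy scorer, and all of the data are hypothetical and chosen only to mirror the loan and job-candidate examples above. Group fairness is approximated as the gap in approval rates between groups, and individual fairness as the change in a score when only one attribute of a person is swapped.

```python
from typing import Callable, Dict, Sequence


def approval_rate(decisions: Sequence[int]) -> float:
    """Share of positive decisions (1 = approved, 0 = denied)."""
    return sum(decisions) / len(decisions)


def group_fairness_gap(decisions_by_group: Dict[str, Sequence[int]]) -> float:
    """Largest difference in approval rate between any two groups."""
    rates = [approval_rate(d) for d in decisions_by_group.values()]
    return max(rates) - min(rates)


def individual_fairness_gap(
    score: Callable[[dict], float],
    person: dict,
    attribute: str,
    alternative: str,
) -> float:
    """Score change when only `attribute` is swapped to `alternative`."""
    counterfactual = {**person, attribute: alternative}
    return abs(score(person) - score(counterfactual))


def toy_score(candidate: dict) -> float:
    """Deliberately biased, hypothetical scorer used only for illustration."""
    base = 0.75 if candidate["years_experience"] >= 5 else 0.5
    penalty = 0.25 if candidate["gender"] == "female" else 0.0
    return base - penalty


# Hypothetical loan decisions: 3 of 4 men approved, 2 of 4 women approved.
loans = {"men": [1, 1, 0, 1], "women": [1, 0, 0, 1]}
print(group_fairness_gap(loans))  # 0.25

# Two candidates who differ only in gender should score alike; here they do not.
candidate = {"years_experience": 6, "gender": "female"}
print(individual_fairness_gap(toy_score, candidate, "gender", "male"))  # 0.25
```

A gap of zero on either measure would indicate equal treatment under that definition. The two measures can disagree, which is one reason practitioners need to look at both.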

The authors observed that practitioners were polarized between overpredicting toxicity (making sure the majority of toxic comments are identified by the model) and underpredicting toxicity (making sure the number of false positives is minimized), and did not fully understand the consequences of either choice for affected identity groups, for example, comments about an identity group being flagged as toxic although they are not. This polarization often led participants to make incorrect decisions: 42% of subjects chose a model that may cause harm to an identity group. The paper discussed how ML practitioners were guided by their own personal experience when making choices, in particular when given the chance to “test drive” the classifier with their own identity groups or when coming up with individual fairness utterances. No matter how good a mitigation algorithm is, in practice, without a user-centered and participatory design approach, addressing fairness issues may often fail.
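The trade-off between overpredicting and underpredicting can be illustrated with a small Python sketch. This is an assumption-laden toy, not the classifier or data from the study: the function name, the group labels, the scores, and the thresholds are all hypothetical. It computes, for each group, the share of toxic comments caught (recall) and the share of non-toxic comments wrongly flagged (false positive rate) at a given decision threshold.

```python
from typing import Sequence, Tuple


def rates_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (recall, false_positive_rate) when flagging score >= threshold.

    Labels use 1 = toxic and 0 = not toxic. Assumes both classes are present.
    """
    flagged = [s >= threshold for s in scores]
    true_pos = sum(f and y == 1 for f, y in zip(flagged, labels))
    false_pos = sum(f and y == 0 for f, y in zip(flagged, labels))
    positives = sum(labels)
    negatives = len(labels) - positives
    return true_pos / positives, false_pos / negatives


# Hypothetical classifier scores. In group_b, harmless comments that mention
# the identity group receive inflated toxicity scores.
comments = {
    "group_a": ([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0]),
    "group_b": ([0.9, 0.6, 0.7, 0.55], [1, 1, 0, 0]),
}

for threshold in (0.5, 0.8):
    for group, (scores, labels) in comments.items():
        recall, fpr = rates_at_threshold(scores, labels, threshold)
        print(f"threshold={threshold} {group}: recall={recall}, fpr={fpr}")
```

With the lower threshold, every toxic comment is caught in both groups, but every harmless comment in group_b is also flagged. With the higher threshold, no harmless comment is flagged, but half of the toxic comments are missed. Looking at these rates per group, rather than only in aggregate, is what surfaces the harm to a specific identity group.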

Wang et al. studied a different user group, UX practitioners and Responsible AI (RAI) subject matter experts who were involved in addressing RAI concerns in the design and development of AI products. This paper illustrates three emerging RAI practices, including:

Building and reinforcing an RAI “lens,” which involves becoming ambassadors for RAI for the larger team and sensitizing and educating team members about potential harms AI systems can cause,
Responsible prototyping, which includes understanding users’ mental models and “test driving” ML models during design, and
Responsible evaluation of AI applications, which encourages the inclusion of a diverse set of experiences and perspectives in the development of RAI and calls for deeper user research and user involvement in RAI development.

In practice, Wang et al. found that involving a diverse set of stakeholders is often not possible given product schedules and the role expectations of UX practitioners. It is also interesting to note that this work was done in the context of large language models, i.e., UX practitioners had to work with pre-trained models that were given to them without the ability to re-train. As a result, UX practitioners, rather than data scientists, ended up addressing bias issues: they engaged in prompt engineering (an activity they were not previously familiar with) to “test drive” the models and to uncover and mitigate potential issues. The study highlights how UX practitioners’ work evolved to include the “hidden” and often unacknowledged work of doing responsible AI design (i.e., what they did was often invisible to their superiors and not part of their official job responsibilities). The authors identified the need for novel tools and methodologies to support RAI design and development, and highlighted the challenge of including diverse groups of users in the evaluation process.

Both papers highlight ample opportunities for follow-up research and the importance of a human-centered approach. They are also an important read for practitioners as they outline how we can make more progress towards fair and unbiased AI systems in practice.