CHI 2024 Editors’ Choice

by Daniel Buschek (University of Bayreuth, Germany), Justin Weisz (IBM Research AI, US), Elizabeth Anne Watkins (Intel Labs, US), and Zahra Ashktorab (IBM Research AI, US)

Justin Weisz
Human-Centered AI
14 min readJul 31, 2024

--

Editor’s Note: This article was updated on August 1 to include a new paper summary provided by Zahra Ashktorab (IBM Research AI, US), to include the Human-Centered Explainable AI (HCXAI) workshop, and to correct a typo.

Photo credit: HCIL at University of Maryland

The premier international conference on Human-Computer Interaction, ACM CHI, brings together researchers and practitioners interested in interactive digital technologies. This year’s event took place in Honolulu, Hawaii, US, from May 11–17. As with last year’s CHI, AI was front and center as a primary topic of research inquiry by the HCI community. Many workshops focused on topics in human-AI interaction, including the Workshop on Theory of Mind in Human-AI Interaction (ToMinHAI), the Workshop on Trust and Reliance in Evolving Human-AI Workflows (TREW), the Workshop on Human-Centered Explainable AI (HCXAI), and the Generative AI and HCI Workshop (GenAICHI). Many paper sessions also focused on topics in human-centered AI, including AI and Interaction Design, User Studies on Large Language Models, Evaluating AI Technologies (A and B), Coding with AI, Generative AI for Design, and many more. With so much content, it can be daunting to identify which papers to read or which recorded presentations to watch!

But don’t worry — we’re here to help! We’ve each selected one of our favorite CHI papers — those papers that really stood out to us for their contributions to human-centered AI — and written a brief commentary about why we liked it.

We hope you find these papers as inspiring as we did!

Bridging Text Entry for Prompting with Direct Manipulation of Graphical Interface Objects

Summary by Daniel Buschek

One paper and presentation that stood out to me is “DirectGPT: A Direct Manipulation Interface to Interact with Large Language Models” by Damien Masson, Sylvain Malacria, Géry Casiez, and Daniel Vogel.

Figure reproduced from Masson et al. (2024)

This paper explores direct manipulation (DM) principles for interaction with large language models. It shows concrete examples for interaction with text and (vector) images. A user study compares the proposed concepts to interaction with ChatGPT. Notably, the direct manipulation is realized as an “interface layer”, that is, it does not require changing the model architecture. It rather translates DM actions into prompts. This seems promising with regards to other researchers being able to pick up and integrate these ideas into their own interactive prototypes. I’m looking forward to seeing if next year’s CHI will feature more such interactions.

A key aspect that I like about this work is that it addresses the combination of the new world of prompting with graphical user interfaces. Many of the recent “AI tool” products seem to just add a sidebar with a chatbot. In contrast, this work demonstrates design directions for a much deeper integration. This integration utilizes direct manipulation principles, with input actions such as selection and drag & drop. Some ideas even go across modalities, such as dragging a part of a figure into the line of text (see figure).

Overall, the paper shows concrete examples for text and image-related interactions with LLMs. Beyond these specific designs, its concepts seem fundamentally relevant, for example, in that they inspire us to think more about referencing interface objects across text prompts and graphical user interfaces.

Another aspect that I like about this work is that it supports users in reusing prompts by turning past prompting interactions into GUI elements (e.g. buttons). I’ve been thinking for a while now that retyping a prompt seems not worth it for repeated tasks — yet the current chatbox UIs for LLMs (e.g. as in ChatGPT) do not support reuse at all. Another great paper at CHI’24 investigating this ad-hoc creation of tools from prompts is DynaVis by Priyan Vaithilingam.

Reimagining the Prompt Engineering Workflow with ChainForge

Summary by Justin Weisz

I really enjoyed Ian Arawjo and Chelse Swoops’ presentation of “ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing,” which received an 🏅 Honorable Mention award! LLMs are enabling many new kinds of applications to be developed, all based on one (or more) natural language prompts. But it can be difficult to get these prompts right — today, prompt engineering can feel more like an art than science. There’s a real need for tools that help AI builders craft prompts through a more disciplined approach than trial and error. This is where ChainForge comes in.

Figure reproduced from Arawjo et al. (2024)

ChainForge presents a graphical canvas in which users can build prompts and test them against multiple LLMs. In the figure, the user is testing different ways of formulating a prompt injection attack, which is a way of getting an LLM to bypass its instructions and produce a different output. Although this example is malicious in nature, it illustrates a common pattern:

  • a user wants to prompt an LLM to perform a particular task,
  • they write multiple prompts and variations thereof, then
  • they evaluate each prompt against a set of examples to see how well each prompt performed.

The “Commands” panel defines a small set of different tasks and the “Prompt injections” panel has different ways to perform a prompt injection attack (to ignore the command and instead respond with “LOL”). The “Prompt Node” is where the magic happens: for every combination of command and prompt injection, it builds a prompt by combining the two and sends it off to a number of different LLMs. Because of the principle of generative variability, it lets the user query the LLM multiple times (3 in this example) instead of just once since the results that come back may differ each time.

Next, the user evaluates how well each model was susceptible to the prompt injection attack by defining a custom evaluation function (the top-right pane that defines the “evaluate” function, which just searches the response to see if “LOL” is present). Finally, they produce a visualization (in the bottom-right pane) to show how each model fared — what percentage of the time did each model include “LOL” in its output.

ChainForge offers additional ways to inspect and compare the outputs of these kinds of prompt experiments. For example, the interface below lets users view each combination of command and prompt injection (labeled “input”) and see what each LLM’s response was.

Figure reproduced from Arawjo et al. (2024)

One of the reasons I really liked this paper is because of how the authors used a human-centered approach to first understand peoples’ existing process in conducting prompt engineering (what we call the “as-is state”). Only then did they devise a tool that completely reimagined their workflow (what we call the “to-be state”). Specifically, their user research uncovered that prompts were often parameterized, meaning they were like a template with some number of slots that could be filled in with various options. They also found that people heavily relied on existing spreadsheet tools to manage the different prompts & parameters. But, spreadsheets are challenging: even with a small number of prompts, parameters, and LLMs, the combinatorial space could be large (e.g. testing a prompt having 3 parameters with 3 options each against 5 LLMs = 45 rows in the spreadsheet). Then, it becomes a challenge to compare all of those outputs to determine which prompt/parameter combinations performed the best. As seen in the figures above, ChainForge makes it much easier to set up the prompt experiments, run them, and visualize the results at multiple levels of granularity.

Another reason why I liked this paper is because of how the authors observed different types of unanticipated use cases. ChainForge was initially designed to help users select models and test prompts, but through real-world usage, the authors found several other valuable uses:

  • Prototyping data processing pipelines by importing tabular data and processing each row with the LLM, and sharing the results with their team;
  • Evaluating LLMs for factual accuracy; and
  • Learning how LLM responses differ when queried in different languages.

These types of insights were only possible because ChainForge was developed in the open and available for anyone to use. Ian stressed this point at the end of his CHI talk. He argued that the incentive structure for HCI research is broken because it places too much emphasis on writing papers and not enough emphasis on building artifacts that are useful for people.

Ian Arawjo and Chelse Swoopes present their paper at CHI 2024. Photo credit: Justin Weisz.

I’ve thought a lot about Ian’s argument in the weeks since CHI, but in the context of an industry research lab. At IBM Research, we have a dual mission: to help our business be successful (product & customer impact) and to advance the state of science (scientific impact). For us, it’s not enough to run a study, write a paper, and move on. We also strive to find ways our work can make a positive impact for our customers, either through direct engagements or by enhancing our product offerings. I’ve seen how great HCI research work has had this dual impact — one recent example is my own CHI paper on design principles for generative AI, which not only made a conceptual contribution to the CHI community, but also helped us train thousands of designers within IBM on generative AI. The paper was nice, but seeing the impact of our work across IBM was incredible.

So, I agree with Ian’s assessment that writing papers is not enough and that we need to think hard about the incentive structures for HCI work. Which achievement would you be more proud of: writing a collection of papers that garners a few hundred citations or contributing to a new open-source tool that makes a real difference for a few hundred people? The key insight (of which I suspect Ian would agree) is that these options are not exclusive!

Can LLMs replace human participants in user research?

Summary by Elizabeth Anne Watkins

Is it a good idea to substitute LLMs for human participants in industrial and psychological research? Several scholars and even a startup company think so. But many human-computer interaction researchers argue that a fundamental commitment to understanding human beings and the sociotechnical contexts in which they live are required for designing usable and trustworthy computational systems. I was gratified to see this conversation taken up at CHI 2024 in the paper “The Illusion of Artificial Inclusion” ​​by William Agnew, A. Stevie Bergman, Jennifer Chien, Mark Díaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, and Kevin R. McKee.

The authors review a landscape of written artifacts that propose substituting human participants in research with LLMs, including peer reviewed articles, white papers, and technical reports.. They catalog the most common arguments for substituting humans with LLMs, such as:

  • increasing the speed and scale of development,
  • lowering costs by reducing reliance on human labor,
  • augmenting demographic diversity in datasets, and
  • protecting human participants from harm.

Despite these attractive benefits, there are a number of obstacles to such substitution, both practical and existential. The paper begins by enumerating practical obstacles, with the first being LLMs’ widely-publicized fondness for hallucinations. Hallucinations are responses from an LLM that contain “false or misleading information presented as fact” (Wikipedia), and they call into question the veracity of any data generated by an LLM. Second is what the authors call “value lock-in.” When a large language model is trained, any perceptions, values, and beliefs present in the training data set are calcified at that moment of training. Yet, human culture is continuously evolving and ongoing cultural shifts or changing norms would not be reflected in the model unless it was also continuously tuned. The third obstacle is that large language models are largely trained on written artifacts produced by people living in western, educated, industrialized, rich, and democratic societies (known as W.E.I.R.D.). LLMs struggle with producing marginalized or minoritized views given such data biases. Lastly, the authors point out that UX and psychological research are largely embodied. Physical cues such as shifting in one’s chair, waiting a long time to answer a question, or even the dilation of one’s pupil provide dense information to a keen-eyed researcher paying attention to their participant. These are signals an LLM cannot reproduce.

The most compelling argument of the paper arises from the existential obstacles of substituting LLMs for humans. The authors point out that any proposals for substitution must be critically appraised “within the broader context of human participation,” and they identify two fundamental tensions.

The first is that LLM substitution conflicts with values of representation and inclusion. Human participation in research and development activities is meant to uplift and include people who are often marginalized. LLM substitution does not “make present” (Agnew et al. 2024, p.7) peoples’ lived experiences: simulating such experiences by generating text with models that often possess a representation bias is simply another way to devalue members of marginalized communities.

The second critical tension is in the value of understanding: the “basis of psychological research and insight is not objective measurement but intersubjective corroboration” (Agnew et al. 2024, p.8). In other words, the expertise of the researcher, gained through years of training, lies in knowing how to combine objective indicators with personal experience. Part of this value lies in recognizing that all observations, even “objective” measures of science, are cast and interpreted through the lens of subjective experience. Often associated with philosophical schools of thought such as phenomenology and the philosopher Edmund Husserl, such recognition is useful for qualitative social scientists. A phenomenological approach (and the reflexivity it often inspires) can enrich the interpretation of data, inform the design of studies, improve methodological rigor, and lead to more holistic and human-centered scientific inquiries.

Personally, I found these critical tensions resonate with my own work as a social scientist specializing in qualitative research with human participants. Oftentimes, participants communicate moments of friction, consternation, and frustration in complex ways. Sometimes their words come across as positive but their tone betrays sarcasm. Or, they use phrasing that appears to be about one topic, but when examined in context with my questions and our conversation, is about another topic entirely. I build on these observations to identify high-priority findings through a combination of intuition and semi-structured follow-up questions. My training and experience guide me to know when something important is lurking beneath the surface, to know when and how to dig deeper into that suspicion, and to know when an insight is truly substantive and a worthwhile contribution to my research questions. This is an interactive process that happens between the experimenter and the participant as they build rapport with one another. LLMs can only simulate this type of experimenter-participant relationship. Some of my most groundbreaking findings have come from getting to know participants, and becoming familiar with the vagaries and patterns of how they describe their relationships with machines and AI. This familiarity supports how I interpret their behaviors and words through the lens of my reflective experience, and that interpretation produces a hypothesis, on which I follow up by gathering as much empirical data as I can. The polysemy of language is one of the key features of the interpretive research paradigm, so I agree wholeheartedly with the authors’ argument that LLM-produced text is an insufficient component of intersubjective corroboration — one of the cornerstones of research with human participants.

In closing, while the advancement of LLMs are certainly upending the practice of research with human participants, the authors suggest that any fruitful path forward will contain clear frameworks for evaluating and assessing the outputs of any experiments in this vein, to ensure “authentic — rather than artificial — inclusion” (Agnew et al. 2024, p. 9).

Using Large Language Models to Evaluate Large Language Models

Summary by Zahra Ashktorab

I really enjoyed the paper “EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria” by Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. This paper presents a new tool, EvalLM, that allows users to iteratively refine prompts for large language models by evaluating them against different criteria defined by the user.

Figure reproduced from Kim et al. (2024)

To develop EvalLM, the authors conducted formative interviews with developers and found that they spent “significant effort in manually evaluating [LLM] outputs,” along with difficulties in assessing subjective evaluation criteria, such as “Language Simplicity” or “Relatability.” The authors derived six design goals from this work which guided the development of then led to the implementation of EvalLM. LLM evaluation is a multi-faceted, dynamic, and iterative process. Users need the ability to iterate and revise their criteria and they require tools that support the multi-faceted nature of evaluation. Kim et al. considered these challenges when developing their design goals for EvalLM. Although their design guidelines focus on the evaluation of a specific LLM prompt, their design guidelines could easily be applied to evaluating LLM models for a given use case.

  1. Automate evaluation of generated outputs according to user-defined criteria.
  2. Facilitate inspection of automatic evaluations through explanations.
  3. Allow for the definition of criteria based on out- put data and prior literature.
  4. Review the user-defined criteria to identify potential revisions.
  5. Surface unreliable evaluations during large-scale evaluations.
  6. Aid designers in tracking and comparing the effect of prompt changes.

In an evaluative study, the authors examined how EvalLM aided users in creating evaluation criteria and refining their LLM prompts based on evaluating them against their criteria. They found that t users were able to produce more diverse criteria and examine more outputs when using EvaLLM compared to when doing this process manually.

One reason why I liked this paper is that the authors took a user-centered approach to develop EvalLM. As a result, their tool addresses a lot of the needs expressed by the users in their formative interviews. As a researcher who has worked on automated evaluation methods (see e.g. Desmond et al. 2024 and Pan et al. 2024), many of the challenges identified in their formative work are challenges that we have also observed amongst LLM developers and researchers.

There is a widespread use of LLMs to judge and evaluate output generated by other LLMs given the high costs of using human workers. Ensuring the quality and validity of this process is of paramount importance. Many studies focus on benchmarking evaluation strategies, whether through direct assessment (providing a scalar indicator of quality) or pairwise comparison (determining which of two outputs is preferred based on specific criteria). EvalLM demonstrates how a human-centered approach results in a human-driven evaluation process in which the user remains in control of defining and refining evaluation criteria and evaluating model outputs.

Get Involved

Looking for ways to get involved in the human-centered AI research community? We recommend:

Image credit: Daniel Buschek

--

--

Justin Weisz
Human-Centered AI

Manager, Senior Research Scientist, & Strategy Lead, Human-Centered AI at IBM Research. My opinions are my own.