Exploring the Intersection of AI and HCI at CHI: Insights for Legal Research

Sally Gao
Thomson Reuters Labs
6 min read · Jul 25, 2024

In May 2024, I had the pleasure of attending CHI (pronounced “kai”) in Honolulu, Hawaii. As the top international conference on Human-Computer Interaction (HCI), CHI covers topics such as user experience design, interactive systems, and AI ethics, including issues of trust, safety, and privacy in the context of emerging technologies such as artificial intelligence (AI) and machine learning (ML).

CHI 2024 at the Hawai’i Convention Center in Honolulu.

Why should TR Labs be interested in CHI?

While it’s not an ML conference, CHI should interest TR Labs because this community is ultimately about designing and evaluating computer systems and technologies that are used by people. The conference provides a unique platform to gain insights into user experience design best practices for integrating AI-powered features, which can inform our work at the Labs. Conferences such as CHI also help us stay informed about the ethical and societal implications of AI in the legal domain and elsewhere, ensuring responsible development and deployment of AI-driven legal research solutions.

A topic of particular interest to me and my work on Labs’ Foundational Research team is LLM evaluation. Below, I cover the most relevant and thought-provoking presentations on the application of large language models (LLMs) and AI in various domains, starting with the keynote speech by Kate Crawford. I then discuss four presentations on prompt engineering, LLM evaluation, and human-LLM collaboration, and their potential impact on the legal research domain.

Highlighted Talks

Keynote — “Rematerializing AI” by Kate Crawford

Kate Crawford, an influential academic whose writing has been called a “conscience for AI,” delivered a salient keynote on generative AI’s impact on our society, environment, and economy.

First, Crawford discussed how the release of ChatGPT has given rise to an ideological cult in Silicon Valley centered around “X Risk” and “AI Doomerism.” She cautioned against the influence of AI insiders and venture capitalists who believe that Artificial General Intelligence (AGI) poses an existential risk to humanity.


Instead, Crawford thinks we should be paying attention to the material and societal aspects of AI. Examples include the ecological implications of large-scale computation on the demand for minerals, energy and water, as well as the hidden labor behind AI systems and the precarious conditions faced by crowd-workers.

Looking ahead, Crawford emphasized the need for rethinking AI design and development, calling for collaboration between regulators and industry. She discussed the potential for sustainable generative AI, fair conditions for data laborers, and the importance of research transparency in the face of increasing secrecy in the corporate AI landscape.

ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

ChainForge video presentation

ChainForge is an open-source visual toolkit designed to support prompt engineering and on-demand hypothesis testing of generative LLMs. The authors identified three stages of LLM hypothesis testing: opportunistic exploration, limited evaluation (prototyping and scored evaluation), and iterative refinement (for tweaking, testing, and refining). As we move through the stages, the process becomes less free and more systematic. Notably, ChainForge was developed and released in the open before this paper was written, with many features added as a result of requests from real-world users.
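
ChainForge itself is a visual tool, but the workflow it supports can be described in a few lines of code. Below is a minimal sketch of the “limited evaluation” stage: scoring a couple of prompt variants across models on a small test set. The `call_model` helper, the model names, and the scoring criterion are all hypothetical stand-ins, not part of ChainForge’s API.

```python
# Minimal sketch of scored prompt comparison (the "limited evaluation" stage).
# `call_model` is a hypothetical helper standing in for whatever LLM client
# you use; the model names and criterion below are placeholders.

PROMPT_VARIANTS = [
    "Summarize the following case holding in one sentence:\n{text}",
    "In one sentence, state the holding of this case:\n{text}",
]
MODELS = ["model-a", "model-b"]  # placeholder model names

def contains_citation(output: str) -> bool:
    """Toy scoring criterion: does the summary mention a case name?"""
    return " v. " in output

def evaluate(test_inputs, call_model):
    """Score every prompt variant x model x input combination."""
    results = []
    for prompt in PROMPT_VARIANTS:
        for model in MODELS:
            for text in test_inputs:
                output = call_model(model, prompt.format(text=text))
                results.append({
                    "prompt": prompt,
                    "model": model,
                    "score": contains_citation(output),
                })
    return results
```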

I was excited to see this tool because it addresses a critical pain point in the LLM development process by providing a structured yet flexible workflow for hypothesis testing and prompt refinement. Here at TR, I know that many product teams are grappling with the challenges of iterative LLM experimentation, and ChainForge is an example of the type of solution we should consider as we continue to explore LLM applications.

The ChainForge talk ended with a call for HCI developers to “build in the wild” rather than for academia!

A call for HCI developers to “build in the wild.”

EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria

Similar to ChainForge, EvalLM was developed to support prompt refinement workflows. It enables developers to refine prompts by quickly evaluating multiple outputs against user-defined criteria, using an LLM judge to score outputs on diversely sampled inputs.
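
To make the idea concrete, here is a minimal sketch of LLM-as-judge evaluation on user-defined criteria, in the spirit of EvalLM rather than its actual implementation. The `ask_llm` helper and the example criteria are assumptions for illustration.

```python
# Minimal sketch: ask a judge LLM to rate one output against each
# user-defined criterion. `ask_llm` is a hypothetical helper that sends a
# prompt to the judge model and returns its text reply.

CRITERIA = [
    "Cites the controlling authority",
    "Uses plain language a non-lawyer can follow",
]

def judge_output(question: str, output: str, ask_llm) -> dict:
    """Return a criterion -> score (1-5) dictionary for one candidate output."""
    scores = {}
    for criterion in CRITERIA:
        prompt = (
            f"Question: {question}\n"
            f"Candidate answer: {output}\n"
            f"Criterion: {criterion}\n"
            "Rate how well the answer satisfies the criterion on a 1-5 scale. "
            "Reply with only the number."
        )
        reply = ask_llm(prompt)
        try:
            scores[criterion] = int(reply.strip())
        except ValueError:
            scores[criterion] = None  # judge reply was not a clean number
    return scores
```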

A comparative study showed that EvalLM helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with fewer revisions compared to manual evaluation. The authors claimed that this process helped users look at the “bigger picture” rather than fixating on specific prompts or input samples, which is definitely important if we want to design prompts that generalize well to future data.

At the end of the presentation, I asked a question about whether the tool addresses or mitigates the potential biases of LLM-generated judgments. While acknowledging the potential for bias, the presenter suggested that manual human evaluation is still the only way to mitigate this problem.

EvalLM video presentation

Human-LLM Collaborative Annotation Through Effective Verification of LLM Labels

This study presents a multi-step human-LLM collaborative approach to ensure the accuracy and reliability of annotations. The process involves 1) using LLMs to generate data labels and explanations, 2) using a classification model to verify the quality of the LLM labels, and 3) having human annotators re-annotate the labels flagged as potentially incorrect.
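
A minimal sketch of that three-step loop is below. The `llm_label`, `verifier`, and `human_review` functions and the confidence threshold are hypothetical stand-ins for an LLM labeling call, a trained quality classifier, and a human annotation interface; they are not from the paper’s code.

```python
# Minimal sketch of the LLM-label -> verify -> human-re-annotate pipeline.
# All three callables are hypothetical stand-ins supplied by the caller.

def annotate(examples, llm_label, verifier, human_review, threshold=0.8):
    labeled = []
    for ex in examples:
        # 1) LLM proposes a label plus a natural-language explanation.
        label, explanation = llm_label(ex)

        # 2) A verification model estimates how likely the LLM label is correct.
        confidence = verifier(ex, label, explanation)

        # 3) Low-confidence items are routed to a human for re-annotation.
        if confidence < threshold:
            label = human_review(ex, label, explanation)

        labeled.append({"example": ex, "label": label, "confidence": confidence})
    return labeled
```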

The authors found that human re-annotation accuracy was higher when annotators were shown both LLM labels and explanations for some datasets, but not others. Interestingly, when the LLM labels were wrong, showing them hurt human accuracy, and it made no difference whether annotators saw only the label or the label plus an explanation. In the context of our LLM-based work at TR, this raises the question of whether LLM annotations should be shown when creating gold data for Instruction Fine-Tuning (IFT).

Human-LLM Collaborative Annotation video presentation

RELIC: Investigating Large Language Model Responses using Self-Consistency

To address LLM hallucinations in open-ended questions, an interactive system called RELIC helps users gain insight into the reliability of generated text using self-consistency. The idea behind self-consistency is that if an LLM gives inconsistent answers to the same prompt when prompted multiple times, this is a sign that the LLM is not confident and the answer should not be trusted. However, self-consistency is hard to define and measure for open-ended tasks, where there are many acceptable answers. In response, the authors developed a “divide and conquer” strategy: atomic facts (claims) are extracted from the generated text, each claim is turned into a question, and a smaller LLM uses the other sampled responses to verify each claim in turn.
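
The sketch below shows the general shape of such a self-consistency check; it is not RELIC’s actual implementation. The `generate`, `extract_claims`, and `is_supported` helpers are assumptions: the first samples a response from the LLM, the second asks a (smaller) model to split a response into atomic claims, and the third checks whether another sampled response supports a claim.

```python
# Minimal sketch of claim-level self-consistency scoring. All helpers are
# hypothetical callables supplied by the caller, not part of RELIC.

def consistency_scores(prompt, generate, extract_claims, is_supported, n_samples=5):
    """Score each claim in a primary response by how many alternate samples support it."""
    primary = generate(prompt)
    alternates = [generate(prompt) for _ in range(n_samples)]

    scores = {}
    for claim in extract_claims(primary):
        support = sum(is_supported(claim, alt) for alt in alternates)
        scores[claim] = support / n_samples  # low scores flag possible hallucinations
    return primary, scores
```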

While this method strikes me as potentially costly, it presents an interesting approach to hallucination detection, a challenging problem with no established solution. On the upside, LLM calls are getting cheaper, and fact extraction could plausibly be handled by smaller models.

RELIC video presentation

Overall Impressions

  • There is growing interest in prompt engineering, LLM evaluation, and human-LLM collaboration in the HCI research community. For legal research, these advancements could lead to more accurate and reliable AI-assisted tools.
  • However, it is crucial to consider the potential biases and errors introduced by LLMs and to develop robust evaluation methods to ensure the trustworthiness of the generated content.
  • As the field continues to evolve, we can expect to see more solutions that leverage the strengths of LLMs to revolutionize domain-specific applications that are important to TR.

I really enjoyed attending CHI 2024! The conference provided a unique opportunity to think about responsible AI development, the need for robust evaluation methods, and the potential for human-LLM collaboration to enhance AI-assisted tools. I hope some of these insights will help us develop innovative solutions that meet the needs of our customers as we continue to explore LLM applications at TR.
