Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems

Zana Bucinca
Harvard HCI


This blog post summarizes our work by Zana Bucinca, Phoebe Lin, Krzysztof Gajos, and Elena Glassman, which won the Best Paper Award at the 25th International Conference on Intelligent User Interfaces.

Artificial Intelligence (AI) supports us in making more and more decisions. Sometimes, AI is deployed to help users with relatively low-risk decisions, like when to leave the house to catch a bus. However, AI is also being deployed to help us make life-changing decisions, like whose loan application to approve or which neighborhoods to send more police officers to. In these contexts, it is dangerous to rely on AI alone. Even if highly accurate, an AI might systematically err on decisions that humans, with their domain expertise and awareness of unique contextual factors, would get right. Therefore, composing human+AI teams has garnered a lot of enthusiasm as a way to get the 'best of both worlds' in high-stakes decision making.

Because human+AI teams can harness the strengths of both humans and AIs, experts have theorized that they should perform better than either people or AIs alone. However, recent research shows that this may not be true: when people are teamed up with high-performing AIs, these combined teams usually perform worse than AIs alone.

We believe that part of the problem lies in the pragmatic decisions that we — researchers — make when evaluating our innovations. AI-powered decision support tools are only a part of human+AI teams, and what ultimately matters is how those teams perform on real tasks. Yet, when new technologies are created, they are typically evaluated in simplified settings that do not reflect the real context of use.

Our findings, reported in a brand new paper, demonstrate that such simplified evaluations may produce misleading results, giving us a false sense of progress and possibly steering our efforts in unproductive directions. Specifically, two main findings highlight the pitfalls of current evaluation practices for human+AI teams:

  1. Results of human+AI team performance on proxy evaluation tasks, such as how well people simulate an AI’s decision, do not predict performance of a human+AI team on actual tasks.
  2. Subjective evaluation measures, e.g., humans’ rating of trust in an AI, do not predict their performance as a human+AI team on actual tasks.

Finding 1: Results from proxy tasks do not predict results from actual tasks

One decision we often make is to evaluate with proxy tasks, such as asking a person to predict the AI's decision given the AI's explanation of its reasoning or its underlying model. We assume that performance on these proxy tasks will predict how well people can tell whether an AI's recommendation is good or bad, a skill necessary to use a decision support tool effectively.

Figure 1. Proxy task with (a) example-based explanations and (b) feature-based explanations. These two kinds of explanations require different kinds of reasoning to understand (inductive reasoning for example-based explanations, and deductive reasoning for feature-based explanations) and are, therefore, predicted to require different kinds of mental effort to use effectively.

Our results suggest that such proxy tasks do not predict how well human+AI teams will perform on actual decision-making tasks. With the same underlying AI, we asked people to complete proxy tasks (i.e., What will the AI decide?) and actual tasks (i.e., What is your decision?). When performing the proxy task, participants trusted and preferred the example-based explanations (shown in Fig. 1(a)) more, and rated them as less mentally demanding than the feature-based explanations (Fig. 1(b)). In the actual task, the opposite was true: participants trusted and preferred the feature-based explanations more, and rated them as less mentally demanding than the example-based explanations.

Figure 2. Results of proxy tasks do not predict the results of the actual tasks.

Finding 2: Subjective measures do not predict objective performance

We also often design our explanations and explainable interfaces based on subjective evaluation measures, such as preference and trust. We assume that what people say works best for them actually works best for them.

In our study, however, these subjective measures of preference and trust did not predict how the human+AI teams performed on the actual task. Participants trusted and preferred the feature-based explanations more, yet they detected the AI's mistakes significantly better (and, consequently, made better decisions) with the example-based explanations.

Figure 3. Participants trusted and preferred the feature-based explanations more (a), yet they detected the AI's mistakes significantly better with the example-based explanations (b).


We believe the difference in results between the proxy task and the actual task arises from how people allocate their cognitive effort in the two settings. In the proxy task, people were asked specifically to focus on the recommendations and explanations provided by the AI. In contrast, in the actual task, people were asked to make difficult decisions with the aid of the AI-provided recommendations. Because humans are adept at conserving cognitive resources, participants in the proxy task engaged carefully and analytically with the AI's explanations, while participants in the actual task focused on making good decisions and many appear to have considered the explanations only superficially. Consequently, when the AI made an incorrect recommendation, participants working on the actual task were less likely than those working on the proxy task to examine the AI's explanation carefully and spot the error.

The gap between subjective preferences and objective performance can be explained similarly. Evaluating an AI's explanation and incorporating that information into one's own decision-making requires cognitive effort, and prior research provides ample evidence that preference does not predict performance on cognitive tasks: people may prefer simpler constructs yet perform better with more complex ones. Thus, participants working on the actual task said they trusted and preferred the feature-based explanations (which were more concise and may have appeared simpler), but they were more likely to spot the AI's mistakes, and consequently made more correct decisions, when presented with the example-based explanations.

Implications for research on new AI technologies

In short, our results suggest that the only reliable way to assess the effectiveness of an AI-based decision aid is to assess the performance of real human+AI teams working on real tasks.

In our study, two currently common shortcuts, using simplified "proxy" tasks and collecting subjective measures of trust and preference, both led to different conclusions than studying actual performance on an actual task.