How to Evaluate Chatbots (and Conversational User Interfaces) With Heuristics?

Gary Hsieh
Aug 4, 2023

In this article, we present a set of heuristics for evaluating chatbots and conversational user interfaces. We also describe how we arrived at these heuristics and share some insights we gathered through this work.

keywords: heuristic evaluation, chatbots, conversational agents, conversational user interfaces (CUI), UI/UX design

Heuristic evaluation is a common technique for a quick and formative evaluation of user interfaces. All you need is an interface to evaluate, a few evaluators, and a set of heuristics (i.e., usability principles). The evaluators examine the interface and judge its compliance with the heuristics. Jakob Nielsen developed this technique and the 10 heuristics that are widely used to evaluate user interfaces.

However, the 10 original heuristics proposed by Nielsen may not be appropriate for all kinds of interfaces. This has led to adaptations of these heuristics for different technologies, such as VR and peripheral displays. For chatbots like ChatGPT or conversational user interfaces like Alexa, some of the original heuristics may be less meaningful (e.g., what does it mean to have searchable help documentation for conversational interfaces?).

Thus, we adapted the original heuristics and developed 11 heuristics for evaluating these conversational user interfaces (CUIs). We found that evaluators were able to identify more usability issues with chatbots using our heuristics than with the original heuristics. Our research was published at the ACM Conference on Human Factors in Computing Systems.

11 Heuristics for Evaluating Conversational User Interfaces (CUIs)
CI-H1. Visibility of system status
CI-H2. Match between system and the real world
CI-H3. User control and freedom
CI-H4. Consistency and standards
CI-H5. Error prevention
CI-H6. Help and guidance
CI-H7. Flexibility and efficiency of use
CI-H8. Aesthetic, minimalist and engaging design
CI-H9. Help users recognize, diagnose and recover from errors
CI-H10. Context preservation
CI-H11. Trustworthiness

Here are the descriptions of our heuristics for CUIs, with examples of usability issues.

CI-H1. Visibility of system status

The system should always keep users informed about what is going on, through appropriate feedback within reasonable time, without overwhelming the user.

Example heuristic violations:

  • For a CUI designed to ask a bunch of questions, if there is no feedback on how far along the user is in the questionnaire.
  • For a voice-based CUI, when it isn’t clear to the user if they are in a particular state.

CI-H2. Match between system and the real world

The system should understand and speak the users’ language — with words, phrases and concepts familiar to the user and an appropriate voice — rather than system-oriented terms or confusing terminology. Make information appear in a natural and logical order. Include dialogue elements that create a smooth conversation through openings, mid-conversation guidance, and graceful exits.

Example heuristic violations:

  • If the CUI uses words or terms that are unfamiliar to the users.
  • If the CUI speaks in an unnatural way.

CI-H3. User control and freedom

Users often choose system functions by mistake and will need an option to effortlessly leave the unwanted state without having to go through an extended dialogue. Support undo and redo.

Example heuristic violations:

  • Not providing users the ability to stop or cancel a command.
  • Not providing users the ability to redo a command.

CI-H4. Consistency and standards

Users should not have to wonder whether different words, options, or actions mean the same thing. Follow platform conventions for the design of visual and interaction elements. Users should also be able to receive consistent responses even if they communicate the same function in multiple ways (and modalities). Within the interaction, the system should have a consistent voice, style of language, and personality.

Example heuristic violations:

  • If the same command is called different things or responds inconsistently at different points of interaction.
  • If the CUI uses different tones and voices throughout (without a clear purpose/explanation).

CI-H5. Error prevention

Even better than good error messages is a careful design of the conversation and interface that reduces the likelihood of a problem occurring in the first place. Be prepared for pauses, conversation fillers, and interruptions, as well as dialogue failures, dead ends, or sidetracks. Proactively prevent or eliminate potential error-prone conditions, and check and confirm with users before they commit an action.

Example heuristic violations:

  • If the users are provided with an answer set, and they do not have the freedom to choose “none of the above” or to exit the interaction.
  • In contexts where errors are likely (e.g., a voice-based chatbot where the system is not confident about the user input), if the system does not confirm with users before proceeding.
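
The second violation above could be avoided with a confirmation step like the one sketched below. This is only an illustration and not something from our study: the `RecognizedInput` type, the `next_step` function, and the 0.8 threshold are all hypothetical, and a real system would need to tune the threshold to its own speech recognizer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognizedInput:
    """Hypothetical output of a speech recognizer."""
    text: str
    confidence: float  # assumed to range from 0.0 to 1.0


def next_step(recognized: RecognizedInput, threshold: float = 0.8) -> Optional[str]:
    """Return a confirmation prompt when confidence is low, else None.

    None means the system may proceed with the command without asking.
    The default threshold is an arbitrary value chosen for illustration.
    """
    if recognized.confidence < threshold:
        return f'Did you say "{recognized.text}"?'
    return None
```

For example, `next_step(RecognizedInput("call mom", 0.55))` returns a prompt asking the user to confirm, while the same input with a confidence of 0.95 returns `None` and the command can proceed.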

CI-H6. Help and guidance

The system should guide the user throughout the dialogue by clarifying system capabilities. Help features should be easy to retrieve and search, focused on the user’s task, list concrete steps to be carried out, and not be too large. Make actions and options visible when appropriate.

Example heuristic violations:

  • If the CUI’s capabilities are not clear to users.
  • For CUIs with unique interface elements, if the CUI does not explain or have a feature to explain how the interface works.

CI-H7. Flexibility and efficiency of use

Support flexible interactions depending on the use context by providing users with the appropriate (or preferred) input and output modality and hardware. Additionally, provide accelerators, such as command abbreviations, that are unseen by novices but speed up the interactions for experts, to ensure that the system is efficient.

Example heuristic violations:

  • For menu/button-based CUIs, if users should be able to select multiple responses but are restricted to selecting only one.
  • For voice-based interactions, if the CUI does not support verbal shortcuts for commands.

CI-H8. Aesthetic, minimalist and engaging design

Dialogues should not contain information which is irrelevant or rarely needed. Provide interactional elements that are necessary to engage the user and fit within the goal of the system. Interfaces should support short interactions and expand on the conversation if the user chooses.

Example heuristic violations:

  • If the CUI asks more questions/collects more information than necessary.
  • If the CUI uses too many irrelevant social utterances.

CI-H9. Help users recognize, diagnose and recover from errors

Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution.

Example heuristic violations:

  • If an error message does not explain the problem.
  • When the user performs an action that is incorrect or that the CUI does not recognize, if the CUI does not constructively guide the user to a solution.

CI-H10. Context preservation

Preserve the context of the conversation topic intra-session and, if possible, inter-session. Allow the user to reference past messages for further interactions to support implicit user expectations of conversations.

Example heuristic violations:

  • If the CUI does not retain a memory of the user’s previous responses/interactions within the same session (ideally across sessions).
  • If the CUI turns off or “sleeps” too quickly, where the user needs to restart the session from the beginning.

CI-H11. Trustworthiness

The system should convey trustworthiness by ensuring privacy of user data, and by being transparent and truthful with the user. The system should not falsely claim to be human.

Example heuristic violations:

  • If the CUI claims to be a human.
  • If the CUI is not explicit or does not provide a clear feature for users to explore how the data will be stored and used.

As with the original Heuristic Evaluation, all you have to do is recruit a few evaluators, ask them to interact with the chatbot several times, and inspect the interface using these heuristics. When evaluators notice heuristic violations, they make note of the issues and rate their severity. This evaluation can be done for fully functioning chatbots, or for low-fidelity and even paper prototypes.
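
The notes that evaluators take can be kept in a simple structured log. The following is a minimal sketch and not part of our published method: the `Finding` record, the `summarize` helper, and the sample findings are hypothetical, and the 0 to 4 severity scale is an assumption borrowed from the severity ratings commonly used with Nielsen's heuristic evaluation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The 11 CUI heuristics listed in this article, keyed by their IDs.
HEURISTICS = {
    "CI-H1": "Visibility of system status",
    "CI-H2": "Match between system and the real world",
    "CI-H3": "User control and freedom",
    "CI-H4": "Consistency and standards",
    "CI-H5": "Error prevention",
    "CI-H6": "Help and guidance",
    "CI-H7": "Flexibility and efficiency of use",
    "CI-H8": "Aesthetic, minimalist and engaging design",
    "CI-H9": "Help users recognize, diagnose and recover from errors",
    "CI-H10": "Context preservation",
    "CI-H11": "Trustworthiness",
}


@dataclass(frozen=True)
class Finding:
    """One usability issue noted by one evaluator (hypothetical record format)."""
    evaluator: str
    heuristic_id: str
    description: str
    severity: int  # assumed scale: 0 (not a problem) to 4 (catastrophe)

    def __post_init__(self):
        if self.heuristic_id not in HEURISTICS:
            raise ValueError(f"unknown heuristic: {self.heuristic_id}")
        if not 0 <= self.severity <= 4:
            raise ValueError("severity must be between 0 and 4")


def summarize(findings):
    """Group findings by heuristic.

    Returns (heuristic_id, issue_count, mean_severity) tuples,
    sorted with the highest mean severity first.
    """
    by_heuristic = defaultdict(list)
    for finding in findings:
        by_heuristic[finding.heuristic_id].append(finding.severity)
    rows = [(hid, len(sevs), mean(sevs)) for hid, sevs in by_heuristic.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Made-up findings, for illustration only.
    findings = [
        Finding("E1", "CI-H3", "No way to cancel a command mid-dialogue", 3),
        Finding("E2", "CI-H3", "Cannot redo the previous command", 2),
        Finding("E1", "CI-H11", "Bot does not say how data is stored", 4),
    ]
    for hid, count, avg in summarize(findings):
        print(f"{hid} {HEURISTICS[hid]}: {count} issue(s), mean severity {avg:.1f}")
```

Sorting by mean severity is just one way to prioritize; a team might instead sort by the number of evaluators who reported an issue, or review every finding rated at the top of the scale first.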

Here are the 3 key ways that these heuristics differ from the original heuristics.

  1. Inclusion of new heuristics to support users’ implicit expectations and to ensure the CUI does not mislead users about its identity, nor withhold important information about how user data will be used (Context Preservation and Trustworthiness).
  2. A balance between giving users the required information on how to interact with the CUI and not overwhelming them with too much information to recall (Visibility of System Status and Help and Guidance).
  3. Emphasis on error handling in the heuristics. CUI interactions, particularly voice systems, may lead to confusion when users can’t figure out how to leave a conversation or what actions they can take next.

Gary Hsieh

associate professor in human centered design @UW, designing technologies to support positive and prosocial behaviors