Design for Artificial Intelligence

Kore · Published in The Startup
13 min read · Apr 29, 2020

In traditional software systems, outcomes are discrete and creators know how their systems will behave: they follow a rigid set of instructions. AI/ML-based systems bring a fundamental shift to this way of thinking. Instead of programming a system to perform a specific action, its creators provide data and nurture the system to produce outcomes based on that input. These systems learn over time.

Source: https://woktossrobot.com/aiguide/

Systems with AI/ML components are probabilistic in nature and the product might work differently for different users. There are three key considerations that product creators need to keep in mind when building AI products:

  1. Building trust with users of the AI system
  2. Designing feedback — Helping users improve the AI model
  3. Handling errors and failures — What to do when things go wrong

Building trust with users of the AI system

Because AI-driven systems are based on probability and uncertainty, it is important for users to understand how and when to trust the system. The right level of explanation is key to helping users understand how the system works. When users have the right level of control over the system, they can understand how and when to trust it to help accomplish their goals.

Explaining how the AI system works is key to building trust. When onboarding users, it is important to explain what the system can do and how well it can do it.

Users shouldn’t implicitly trust your system; rather, the system should help them assess whether a result is trustworthy. Some users are completely averse to AI decisions, while others trust the system too much. Ideally, both extremes should be avoided.

Build trust with the users of your AI system by:

  • Explaining how the system uses data
  • Explaining confidence of system results
  • Partially explaining in cases where giving a full explanation would lead to confusion
  • Reducing bias in your datasets

Explaining how the system uses data

Whenever possible, the AI system should explain the following aspects of data use:

Scope
Show an overview of the data being collected about the user and which aspects of the data are being used for what purpose. At the same time, the system should inform the user that a lack of data might mean that they need to use their own judgement. Counterfactuals may be used to explain why the AI did not make a certain decision or prediction.

Scope (Source: https://woktossrobot.com/)

Reach
Explain whether the system is personalized to the user (Spotify daily playlist) or aggregated (Amazon product suggestions) across all users.

Reach (Source: https://woktossrobot.com/)

Editability
Tell users whether they can remove, reset or edit the data being used by the AI system.

Editability (Source: https://woktossrobot.com/)

Explaining confidence of system results

Confidence levels are a readily available AI system output and are a great tool that can be used to explain the workings of an AI system. Confidence levels help users gauge how much trust to put in AI systems.

Confidence level is a statistical measurement that indicates how confident the AI system is about a particular outcome.

Let’s assume our biryani detector categorizes results into low, medium and high confidence levels. For each result type, we design it to answer differently. When you give it an image of biryani, it has a high level of confidence in its result, and we design it to say “That’s biryani! Looks delicious.” When you give it an image of a pulao, which somewhat resembles biryani, we design it to say “Hey, this might be biryani! You can tell me if it’s not.” When you give it an image of pasta, we design it to say “That’s not biryani!”

Source: https://woktossrobot.com/aiguide/

A confidence level is often a value between 0 (no confidence) and 1 (full confidence), which is then converted into a percentage. In most real-world scenarios, you will never get a definite 0 or 1.

There are multiple ways of indicating confidence levels. Because different user groups may be more or less familiar with what confidence and probability mean, it’s best to test different types of displays early in the product development process.

Categorical
This involves categorising confidence values into buckets (high, medium, low). The category information can then be used to render the UI, alter messaging and indicate further action that the user needs to perform. Cutoff points for the categories are decided by the team creating the system.

Categorical (Source: https://woktossrobot.com/)
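
As a rough illustration, here is a minimal Python sketch of how the biryani detector above might map a raw confidence score to a category and a user-facing message. The cutoff values and the copy are illustrative assumptions a team would choose for itself, not prescribed thresholds.

```python
# A possible mapping from a raw confidence score (0.0–1.0) to a
# categorical bucket and a user-facing message. The cutoffs and the
# copy are illustrative assumptions chosen by the product team.

def bucket_confidence(score: float) -> str:
    """Return 'high', 'medium' or 'low' for a confidence score in [0, 1]."""
    if score >= 0.85:
        return "high"
    if score >= 0.50:
        return "medium"
    return "low"

MESSAGES = {
    "high": "That's biryani! Looks delicious.",
    "medium": "Hey, this might be biryani! You can tell me if it's not.",
    "low": "That's not biryani!",
}

def respond(score: float) -> str:
    """Pick the message that matches the score's confidence bucket."""
    return MESSAGES[bucket_confidence(score)]

print(respond(0.93))  # high confidence: "That's biryani! Looks delicious."
print(respond(0.62))  # medium confidence: invites the user to correct it
```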

N-best alternatives
Rather than showing one result with an explicit confidence level, this method shows multiple results with their confidence levels indicated. E.g. “This photo may contain a goat, a llama or a ram.” This approach is especially useful in low-confidence situations.

N-best alternatives (Source: https://woktossrobot.com/)
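
A minimal sketch of the N-best pattern, assuming the classifier exposes per-class probabilities; the labels and scores below are made up for illustration.

```python
# Showing the N best alternatives instead of a single answer. `scores`
# stands in for the per-class probabilities a classifier would return;
# the labels and values here are made up for illustration.

def top_n(scores: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n labels with the highest confidence, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

scores = {"goat": 0.41, "llama": 0.33, "ram": 0.19, "dog": 0.07}
labels = ", ".join(f"{label} ({conf:.0%})" for label, conf in top_n(scores))
print(f"This photo may contain: {labels}")
# This photo may contain: goat (41%), llama (33%), ram (19%)
```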

Numeric
This method utilises simple percentage values to indicate confidence levels. Showing numeric values is risky since it assumes that users have a baseline understanding of probability (is 80% confidence high or low?).

Numeric (Source: https://woktossrobot.com/)

Partial explanation

There are cases when there is no benefit in providing an explanation, or when giving an explanation would lead to confusion. For example, when adjusting the screen brightness based on the environment, it might not make sense to prompt the user every time the system performs the action. Partial explanations intentionally leave out parts of the system’s workings that might be too complex, unknown or unnecessary. Partial explanations can be general or specific.

General
Explaining how the AI system works in general terms. E.g. “This app recognizes food with 80% accuracy in general.”

Specific
Explaining why the AI provided a particular output at a particular time. E.g. when recognizing a hotdog, the system shows similar images to indicate what data it used to recognize the food item.

Bias

Machine learning is not a pipeline; it’s a feedback loop. Bias in an AI result is a reflection of the bias present in the data used to train the AI. Training data, even when collected from the real world, reflects existing biases, and the model can amplify them. Committing to fairness at each step of ML design is key to tackling bias and thereby building trust with users.

When using real-world data, human bias is invariably introduced through personal experiences and inherent prejudices. The AI model might amplify these patterns even further. Minimising bias often comes down to reducing bias in the data used to train the AI.

Source: https://woktossrobot.com/aiguide/

How ML systems can fail users

  • Representational harm
    When a system amplifies or reflects negative stereotypes about particular groups.
  • Opportunity denial
    When systems make predictions and decisions that have real-life consequences and lasting impacts on individuals’ access to opportunities, resources, and overall quality of life.
  • Disproportionate product failure
    When a product doesn’t work or gives skewed outputs more frequently for certain groups of users.
  • Harm by disadvantage
    When a system infers disadvantageous associations between certain demographic characteristics and user behaviours or interests.

Recommended practices to reduce bias

There is no standard definition of fairness and no single correct model for all ML tasks. Here are a few recommended practices.

  • Update training and test datasets frequently
  • Engage with researchers, social scientists, designers and other experts to generate diverse perspectives.
  • Use representative datasets for model training. Public datasets can be especially biased and often need to be augmented to generate unbiased results; understanding various perspectives is key to reducing bias.
  • Collect feedback regularly

Designing feedback

Helping users improve the AI model

User Feedback

User feedback is the communication channel between your users, your product, and your team. Leveraging feedback is a powerful and scalable way to improve your technology, provide personalized content, and enhance the user experience.

Align feedback with model improvement
When users have the opportunity to offer feedback, they can play a direct role in personalizing their experiences and maximizing the benefit your product brings to them. It is important for creators to design feedback systems for tuning the model. Ideally, what people want to give feedback on aligns with what data is useful to tune your model. However, that’s not guaranteed. Find out what people expect to be able to influence by conducting user research. For example, when using a video recommender system, people may want to give feedback at a different conceptual level than the AI model understands. They may think “show me more videos about parenthood” while the model interprets “show me more videos by this creator”.

Reward function
An AI model is guided by a reward function (closely related to the loss function the model minimizes during training), which helps the model determine whether its results were correct or incorrect, or how correct they were as a percentage. Think of it as a reward the system receives every time it guesses correctly. The AI system will try to optimize for this reward function. The goal of the feedback mechanism is to feed into the reward function by collecting user input.
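
To make the idea concrete, here is a minimal sketch, assuming explicit thumbs up/down feedback is mapped to binary labels and scored with a standard binary cross-entropy loss; the mapping and the loss choice are illustrative assumptions, not the only option.

```python
import math

# Turning explicit user feedback (thumbs up / thumbs down on a
# recommendation) into labels for a standard loss. Binary cross-entropy
# is one common choice; the feedback-to-label mapping is an assumption.

def binary_cross_entropy(predicted: float, label: float) -> float:
    """Low when the predicted score agrees with the feedback label."""
    eps = 1e-7  # clamp to avoid log(0)
    predicted = min(max(predicted, eps), 1 - eps)
    return -(label * math.log(predicted) + (1 - label) * math.log(1 - predicted))

# The model predicted a 0.9 probability that the user would like an item.
print(binary_cross_entropy(0.9, 1.0))  # user pressed "like": small loss (~0.11)
print(binary_cross_entropy(0.9, 0.0))  # user pressed "dislike": large loss (~2.30)
```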

Types of feedback

Implicit
Data gathered about user interactions from product logs. This type of feedback is not explicitly requested, but users need to be aware that the system is collecting it, and their permission is required. Users also need a way to view their logs and an option to opt out. All this information can be provided in the terms of service.

Explicit
Users deliberately provide this feedback, and it is often qualitative in nature, such as whether a recommendation was useful or an answer was incorrect. E.g. forms, likes, dislikes, ratings, open text fields.

Dual
This type of feedback can contain both implicit and explicit signals. Liking a piece of content can signal an explicit positive response as well as tune the recommendation algorithm implicitly.
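
One possible way to record these signals is a single event type that marks each signal as implicit or explicit; the schema below is an illustrative assumption, not a standard.

```python
from dataclasses import dataclass
from enum import Enum
import time

# One way to record feedback signals: a single event type that marks
# each signal as implicit or explicit. The field names and values are
# illustrative assumptions, not a standard schema.

class FeedbackKind(Enum):
    IMPLICIT = "implicit"   # inferred from product logs, with user consent
    EXPLICIT = "explicit"   # deliberately provided by the user

@dataclass
class FeedbackEvent:
    user_id: str
    item_id: str
    kind: FeedbackKind
    signal: str             # e.g. "watched_to_end", "like", "rating:4"
    timestamp: float

# "Dual" feedback: a like is explicit, and it also tunes recommendations.
like = FeedbackEvent("u42", "video_7", FeedbackKind.EXPLICIT, "like", time.time())
watch = FeedbackEvent("u42", "video_7", FeedbackKind.IMPLICIT, "watched_to_end", time.time())
print(like, watch, sep="\n")
```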

Why do users give feedback?

  • Material rewards E.g. Cash rewards, coupons, gifts
  • Symbolic rewards E.g. Virtual badges, social proof
  • Personal utility E.g. Bookmarking, tracking runs
  • Altruism. In many cases, people are just nice. E.g. Community building on Quora
  • Intrinsic motivation E.g. Venting about a product on Twitter

Designing feedback

The most valuable ML systems evolve as their users mature in using the system. Feedback needs to be mutually beneficial to your users and the model. Ideally, you want to provide clear mental models that encourage users to give feedback.

Fail gracefully
Look at failures as opportunities to improve the system. Users will be disappointed the first time the system fails; however, if their mental model establishes the idea that the system improves over time, the failure can establish a feedback relationship. Once this is in place, failure can be perceived not only as forgivable, but also as something users can help fix.

Align the perceived and actual value of giving feedback
The feedback mechanism should tell users why feedback needs to be given and what the benefit of it is. This also helps in avoiding meaningless responses.

Connect feedback with changes in the user experience
Acknowledging user feedback can build trust; however, it is even better if the product shows how the input influences the AI and the user’s experience. E.g. providing genre feedback can change your Spotify playlists.

Provide editability
User preferences change over time. Feedback mechanisms should enable them to control and adjust the preferences they communicate to the ML model.

Handling errors and failures

What to do when things go wrong

Source: https://woktossrobot.com/aiguide/

For AI-based products, the experience often differs from user to user, so what defines an error also differs. Users might test your product in ways you never imagined, which can lead to false starts, misunderstandings and unexpected behaviours. Designing for these cases is critical when building systems with AI/ML components. At the same time, errors are also opportunities for feedback that can help the AI learn faster. Before you start designing how your system will respond to errors, try to identify errors that your users can already perceive, or that you can predict will occur.

Errors and failures can be of the following types:

  • User perceived errors
  • System errors that users don’t perceive
  • Data errors
  • Input errors
  • Relevance errors

User perceived errors

In traditional software systems, there are mainly two types of errors: user errors (the user makes a mistake; creators blame the user) and system errors (errors that arise from a fault in the system; users blame the creators). AI systems introduce a third kind of error, where the result doesn’t make sense in the user’s current context. The context can be the user’s preferences, lifestyle patterns or cultural values. E.g. a recipe app recommending meat recipes even though the user consistently rejects meat-based recipes. These user-perceived errors are often referred to as context errors. User-perceived errors can be categorized as follows:

Context errors
The system is “working as intended,” but the user perceives an error because the actions of the system aren’t well-explained. E.g. a friend’s flight confirmation adds an event to your calendar.
Resolution: These errors can be tackled based on their frequency and severity by changing the system’s function to better align with user expectations. Another approach is to better explain the system’s workings during onboarding.

Fail states
The system fails to provide the right answer, or any answer at all, due to inherent limitations. E.g. an app that recognizes songs fails to find a song in a tribal language.
Resolution: In such cases, the system can ask the user to provide the correct answer, which can then be used to train the model.

System errors that users don’t perceive

These errors are invisible to users and need not be explained in the interfaces. However, being mindful of these would help train the AI.

Happy accidents
The system flags something as a poor prediction or an error; however, users perceive it as an interesting feature. Capitalising on these moments can lead to delight. E.g. asking a smart speaker to water the plants gives a funny response.

Background errors
In these cases, neither the user nor the system registers an error despite the system not working correctly. E.g. a search engine returning an incorrect result that goes unnoticed. Sadly, there’s no easy way out of these error cases; they can only be tackled through internal testing.

Data errors

Data errors are system-level errors and are of the following types:

Mislabeled or misclassified results
The AI system outputs incorrect results due to mislabeled or misclassified training data.
Resolution: Allow users to give guidance or correct the data via feedback mechanisms that feed into the model to improve the dataset or alert the team.

Poor inference or incorrect model
The ML model is not precise despite being given sufficient data. E.g. a food classification app returns a large number of false positives for grapes despite sufficient training data.
Resolution: Allow users to give guidance or correct the data via feedback mechanisms that feed into the model and help tune it.

Missing or incomplete data
Cases of missing or incomplete data are often observed when the model reaches the limits of its capabilities.
Resolution: Explain how the system works, its limitations and what is missing. Allow users to give feedback on the missing data.

Input errors

Input errors are often user-level errors and are of the following types:

Unexpected or incorrect input
When the user enters incorrect input, often assuming that the system will correct it. E.g. expecting auto-correct when the system doesn’t offer it.
Resolution: Check the user’s input against a range of correct answers and provide a suggestion, e.g. “Did you mean this?”
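
A minimal sketch of that resolution, assuming a small vocabulary of known values and Python’s difflib for fuzzy matching; the dish list is an illustrative assumption.

```python
import difflib

# Checking free-text input against a set of known values and offering a
# "Did you mean ...?" suggestion. The vocabulary below is an
# illustrative assumption.

KNOWN_DISHES = ["biryani", "pulao", "pasta", "risotto"]

def suggest(user_input: str) -> str | None:
    """Return the closest known term, or None if nothing is close enough."""
    matches = difflib.get_close_matches(user_input.lower(), KNOWN_DISHES, n=1, cutoff=0.6)
    return matches[0] if matches else None

query = "biriyani"
if query not in KNOWN_DISHES:
    guess = suggest(query)
    if guess:
        print(f"Did you mean '{guess}'?")  # Did you mean 'biryani'?
```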

Breaking a habit
A change in the UI breaks the habit of individuals using the feature, leading to an undesired result.
Resolution: Implement AI systems that don’t break habits. This can be done by assigning a dedicated location in the interface for low-confidence AI output, allowing users to revert to defaults, or retaining a specific interaction pattern.

Miscalibrated input
When a system improperly weighs the importance of an action or choice. E.g. liking a specific recipe calibrates the AI to show recipes by the same author instead of similar recipes.
Resolution: Explain how the system works, its limitations and what is missing. Allow users to give feedback to correct it.

Relevance errors

Relevance errors are often system-level errors and are of the following types:

Low confidence
When the AI system gives low-confidence outputs, leading to reduced accuracy. E.g. a hotel price prediction algorithm cannot give accurate predictions due to a changing political climate.
Resolution: Explain why this happened and what can be done, e.g. “We do not have sufficient information; maybe try again in a few days.”

Irrelevance
When the system outputs a high confidence answer that is irrelevant.
Resolution: Allow users to provide feedback to improve the model.

Recap

When designing AI/ML systems, there are three key considerations that product creators need to keep in mind:

  • Building trust with users of the AI system
  • Designing feedback — Helping users improve the AI model
  • Handling errors and failures — What to do when things go wrong
Source: https://woktossrobot.com/aiguide/

Further reading and references

https://woktossrobot.com/aiguide/references.html

Use this flowchart to decide whether you need to invest in AI/ML capabilities for your product.

A compilation of best practices for designers, managers and HCI practitioners to build human-centred AI products.

Kore
Designer | Author of Designing Human-Centric AI Experiences | https://akor.in/