People expect information from an authoritative source to be accurate and precise. And that’s what AI engineers strive to deliver in their AI models. However, for the foreseeable future, AI models will sometimes fall short. This gap presents an opportunity for UX designers.
UX design can improve the usefulness of an AI model by contributing a user perspective to the traditional engineering concerns of AI model accuracy and precision. That is, UX designers can understand users’ goals and propose ways to adjust the model in light of those goals.
How can UX do this? The place to start is engaging in a productive dialog with AI engineers.
Talking to AI engineers about usefulness
UX designers need to dig below the screen and engage with AI engineers on their turf. To do so, UX designers and AI engineers need a lingua franca. Everyday terms like precision and accuracy have specialized meanings to AI engineers, and engineers use terms like output node score that are not understood by the uninitiated. UX designers need to learn just enough about AI concepts to engage in meaningful discussions with AI engineers. For suggested reading, see the Further Reading section.
However, it’s not a one-way street. UX designers need to help the engineers understand the user’s perspective starting with an everyday understanding of the accuracy and precision of an answer:
- Accuracy: Refers to how close an answer is to the correct value.
- Precision: Refers to how specific (exact or detailed) the answer is.
In the AI context, the term answer refers to the value output by the AI model given some input.
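These two everyday definitions can be made concrete with a small sketch. The pi example and helper functions below are our own illustration, not from the article: an answer can be accurate without being precise, and precise without being accurate.

```python
# Sketch (the pi example is our own): accuracy is how close an answer is to
# the truth; precision is how much detail the answer spells out.
TRUE_VALUE = 3.14159265  # suppose this is the correct answer

def accuracy_error(answer: str) -> float:
    """Accuracy: distance from the correct value (smaller is more accurate)."""
    return abs(float(answer) - TRUE_VALUE)

def precision_digits(answer: str) -> int:
    """Precision: how many decimal places the answer provides."""
    return len(answer.split(".")[1]) if "." in answer else 0

# "3.14" is accurate but not very precise.
# "3.20000" is very precise but less accurate.
print(accuracy_error("3.14"), precision_digits("3.14"))        # small error, 2 digits
print(accuracy_error("3.20000"), precision_digits("3.20000"))  # larger error, 5 digits
```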
Next, UX designers need to bring the user perspective to the accuracy-precision trade-off. While AI engineers strive to maximize both accuracy and precision, there comes a point where further improvement requires prioritizing one over the other: one adjustment to the model yields more accuracy, while another yields more precision. From a UX perspective, this priority should be established based on user needs. In some cases, user needs are best met by prioritizing accuracy over precision. In other cases, the opposite is true. In sum, the UX perspective asks, “What is most useful for addressing this user need?” rather than “What results in the best scores for our engineering metrics?”
Finally, to successfully engage with AI engineers, UX designers need to tailor this guidance to the type of AI model being discussed. In the remainder of this article, we’ll do this for three common types of AI models.
Classification models

Is this credit card transaction fraudulent? Does this photo contain a car? What model of car is in this photo? These examples illustrate a common application of AI — classifying items.
Let’s specialize our definitions for use with classification models and compare the perspectives of AI engineers and users:
- Accuracy: How close the category selected by the AI is to the correct category.
For AI engineers, classification accuracy is often binary — the classification of an item is either correct or incorrect.
For users, however, it can sometimes be useful to consider multiple levels of incorrectness such as a close miss vs. wildly inaccurate.
- Precision: How specific the category selected by the AI is.
AI engineers and users think of classification precision in a similar way: high precision means being able to classify an item into a narrow, specific category.
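The gap between the two views of accuracy can be sketched in code. In this sketch, the grouping of categories into broader groups and the score values are our own assumptions; the category names come from the case study below.

```python
# Sketch (grouping and score values are our own assumptions): a binary
# engineering view of accuracy vs. a graded user view of "how wrong" a miss is.
BROAD_GROUP = {
    "Restaurant meals": "Food and drink",
    "Coffee and snacks": "Food and drink",
    "Auto expenses": "Transport",
    "Taxi and ride share expenses": "Transport",
}

def binary_score(predicted: str, actual: str) -> float:
    """Engineer's view: a classification is either correct or incorrect."""
    return 1.0 if predicted == actual else 0.0

def graded_score(predicted: str, actual: str) -> float:
    """User's view: a close miss (same broad group) is less wrong than a wild one."""
    if predicted == actual:
        return 1.0
    if BROAD_GROUP[predicted] == BROAD_GROUP[actual]:
        return 0.5  # close miss, e.g. coffee vs. restaurant
    return 0.0      # wildly inaccurate, e.g. auto expenses vs. restaurant

print(binary_score("Coffee and snacks", "Restaurant meals"))  # 0.0
print(graded_score("Coffee and snacks", "Restaurant meals"))  # 0.5
```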
In classification models, a key design decision is how broad or narrow to make the categories. In other words, how precise are the categories and hence how precise are the answers?
This decision needs to be guided by the trade-off between precision and accuracy: narrower categories give more precise answers but are harder to classify correctly, while broader categories are easier to classify correctly but give less precise answers.
Next, we’ll see how this trade-off plays out in a case study.
Classification model case study
Classifying expenses is a best practice in personal finance. This leads to insights that help with budgeting and saving.
Imagine an AI that reads personal credit card statements and assigns each expense one of the following categories:
- Auto expenses (gas, toll roads, etc.)
- Taxi and ride share expenses
- Restaurant meals
- Coffee and snacks
This AI can classify expenses with high accuracy except that it often confuses two categories:
- Restaurant meals
- Coffee and snacks
Classification models can suffer from a variety of shortfalls. The problem illustrated here is a localized problem of confusion between two categories.
A UX solution to consider is combining the two categories into a single, broader category (say, Food and drink) prior to presenting results to the user.
This solution results in higher accuracy at the expense of lower precision. But is this better? The answer rests on assessing user needs in terms of these questions:
- What value would users gain from increasing the classification accuracy?
- What value would users lose from decreasing the precision of the categories?
Eventually, AI engineers should be able to improve the model to eliminate the confusion between these categories. However, with this stop-gap UX solution, users can get value from the model in the meantime. As well, AI engineers have the option to expend engineering effort on higher priority improvements like introducing a category for public transit expenses.
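The stop-gap can be sketched as a thin presentation layer over the model's raw output. This is a minimal sketch: the merged category name and the sample (predicted, actual) pairs are our own assumptions, made up to show how merging affects measured accuracy.

```python
# Sketch (merged name and sample data are our own assumptions): merge the two
# confusable categories before presenting results, and measure the effect.
MERGE = {
    "Restaurant meals": "Food and drink",
    "Coffee and snacks": "Food and drink",
}

def display_category(raw_category: str) -> str:
    """Presentation layer: map confusable categories to one broader category."""
    return MERGE.get(raw_category, raw_category)

# Hypothetical (predicted, actual) pairs where the model confuses the two
# food categories but gets everything else right.
results = [
    ("Restaurant meals", "Coffee and snacks"),
    ("Coffee and snacks", "Restaurant meals"),
    ("Auto expenses", "Auto expenses"),
    ("Taxi and ride share expenses", "Taxi and ride share expenses"),
]

def accuracy(pairs, view=lambda category: category):
    """Fraction of items classified correctly, as seen through a given view."""
    return sum(view(p) == view(a) for p, a in pairs) / len(pairs)

print(accuracy(results))                         # 0.5 with the raw categories
print(accuracy(results, view=display_category))  # 1.0 after merging
```

Note the model itself is untouched; only the categories shown to the user change, trading precision for accuracy exactly as the case study describes.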
Regression models

Which farming practices result in the best crop yield? What crop yield would result from a particular farming practice? Which job candidate is the best fit for the job? These examples illustrate another common type of AI — determining a quantitative target value. This type of AI is often referred to as regression.
Let’s specialize our definitions for use with regression models:
- Accuracy: How close the quantitative value from the AI is to the correct value.
- Precision: How many significant digits are provided in the answer from the AI.
For regression models, there is close correspondence between the technical definitions and the everyday definitions. A key issue with answer usefulness in regression models is that very precise answers (point estimates) are not usually the type of information people find most useful. We’ll explore this point in our case study.
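A small sketch makes the point. The day-counts below are our own made-up numbers; the sketch shows that reporting at a coarser precision can leave accuracy essentially untouched while matching what the user actually needs.

```python
# Sketch (the numbers are our own): a hyper-precise regression answer vs. the
# coarser precision a user may actually need.
true_failure_days = 33.0   # hypothetical ground truth
prediction_days = 33.17    # a precise point estimate from the model

def to_nearest_month(days: float) -> int:
    """Round to the precision the user needs (treating a month as 30 days)."""
    return round(days / 30)

print(abs(prediction_days - true_failure_days))  # accurate: error well under a day
print(to_nearest_month(prediction_days))         # 1 -- "about one month out"
```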
Regression model case study
Predicting when a water pump will fail is something you’d want to do if you were a reliability engineer in a municipal water works department. This information would help you plan what maintenance to perform when.
Imagine an AI that could read data from temperature and vibration sensors and then predict when each pump will fail. The raw output for one pump might be a point estimate such as: motor failure in pump 2 predicted in 33 days and 4 hours.
This prediction is certainly precise, and let’s assume it’s accurate. Nevertheless, it’s not as useful as it could be.
Regression models can generate highly precise answers like 30 days, 33 days, or even 33 days and 4 hours. On purely mathematical criteria, a hyper-precise answer might be the correct or best answer.
However, consider the needs of our reliability engineer whose question is something like: In which month should I replace the motor in pump 2? How likely is the motor to fail if I wait one month? Two months? Three months?
To address this need, the amount of precision needed is the nearest month. This user gets no incremental value from a prediction expressed to the nearest day or hour.
As well, this reliability engineer would probably be better served by a probability distribution of failure times. An even less precise monthly breakdown might be better still. This type of report enables the reliability engineer to gauge the risk of waiting longer to perform the maintenance.
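One way to produce such a breakdown is to pair the model's point estimate with an uncertainty estimate. The sketch below assumes a normal distribution of failure times and uses made-up numbers; a real model would estimate the distribution itself.

```python
import math

# Sketch (the normal distribution and the numbers are our own assumptions):
# turn a point estimate plus an uncertainty into a monthly failure report.
mean_days = 63.0    # hypothetical predicted time to failure
sigma_days = 20.0   # hypothetical uncertainty in that prediction

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_failed_by(day: float) -> float:
    """Probability the pump has failed by a given day."""
    return normal_cdf((day - mean_days) / sigma_days)

def prob_fails_in_month(month: int) -> float:
    """Probability the pump fails during month 1, 2, 3, ... (30-day months)."""
    return prob_failed_by(30 * month) - prob_failed_by(30 * (month - 1))

for month in range(1, 4):
    print(f"Month {month}: {prob_fails_in_month(month):.0%}")
```

This trades the hyper-precise point estimate for exactly the risk-per-month view the engineer's questions call for.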
We’ll end with a few related points:
- Probability distributions are not well understood by some audiences, so the information design needs to consider what level of detail and sophistication is appropriate for a given audience.
- Changing the information displayed is sometimes simply a matter of formatting the output. However, in other cases, it requires changes to the AI model.
- For this and other reasons, discussions of what information users would find most useful are best settled up front, before assumptions get baked irretrievably into the AI model.
Evaluating options

Which mobile phone should I buy? Should I sell my investment now or later? These examples illustrate another common application of AI — evaluating options to identify the best one.
This type of AI differs from classification and regression models in several ways. First, there is no single right answer. Which option is considered best is a matter of judgement — a conclusion reached by weighing multiple conflicting factors. And what is considered best depends on the person and situation.
Second, options can be evaluated at multiple levels of detail. The evaluation could:
- Identify the best option only, e.g., Option B is the best.
- Rank order all the options, e.g., Option B, then C, then A.
- Quantify the quality of each option, e.g., Option B is 2.4 times better than Option C.
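The three levels can be derived from the same raw scores. In this sketch, the option names come from the examples above, and the score values are made up so that the output reproduces them.

```python
# Sketch (score values are made up to reproduce the examples above):
# the same raw scores reported at three levels of detail.
scores = {"Option A": 0.3, "Option B": 1.2, "Option C": 0.5}

def best(scores: dict) -> str:
    """Level 1: identify the best option only."""
    return max(scores, key=scores.get)

def ranking(scores: dict) -> list:
    """Level 2: rank order all the options."""
    return sorted(scores, key=scores.get, reverse=True)

def quality_ratio(scores: dict, a: str, b: str) -> float:
    """Level 3: quantify quality, e.g. how many times better one option is."""
    return scores[a] / scores[b]

print(best(scores))     # the best option only
print(ranking(scores))  # all options, best first
print(round(quality_ratio(scores, "Option B", "Option C"), 1))  # e.g. 2.4
```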
With this in mind, we can specialize our definitions for the context of evaluating options:
- Accuracy: How close the evaluation from the AI is to the correct evaluation.
For any model, we need to state whether the accuracy pertains to: identifying the best option only, rank ordering all the options, or quantifying the quality of each option.
- Precision: How specific the value from the AI is.
As with classification models, high precision means being able to evaluate a narrow, specific option.
With this type of AI, the usefulness of the information depends on the contextual information provided. Simply presenting the answer, no matter how accurate and precise, is of limited usefulness, as we’ll see in our case study.
Evaluating options case study
When treating a patient, physicians often need to choose among multiple treatment options. An AI could help select the best option for each patient.
Imagine an AI that could read a patient record and evaluate four treatment options, scoring each on a scale where .99 is the best possible score.
From this raw output, a physician could discern that the AI considers Treatment C the best option because it received the highest score. However, this output is not as useful as it could be because it does not address some things the physician wants to know:
- While Treatment C is the “best”, is it “highly recommended” or is it “the best of a bad lot” or is it something in between?
- Treatment A also seems pretty good. In practical terms, how much better is Treatment C?
How might this output be presented in a more useful manner in terms of our focus on accuracy and precision? To borrow a phrase, the problem here is that there is too much data and not enough information. That is, the treatment scores (e.g., .79) are more precise than needed but important information is not provided.
One solution is grouping the options into meaningful levels of advisability. That is, a small set of rating categories could be defined such as Recommended and Not Recommended; then options could be assigned one of the ratings.
This coarser-grained (less precise) rating scheme is more aligned with the reality that treatments C and A are about equally advisable, and either would be effective for this patient.
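The extra layer can be as simple as a threshold over the raw scores. In this sketch, the .79 score comes from the case study; the other scores and the threshold value are our own assumptions.

```python
# Sketch (the threshold and all scores except .79 are our own assumptions):
# convert raw, overly precise scores into coarser, more useful ratings.
RECOMMEND_THRESHOLD = 0.7  # hypothetical cutoff, chosen with clinical input

def rating(score: float) -> str:
    """Map a precise score to a coarse, meaningful level of advisability."""
    return "Recommended" if score >= RECOMMEND_THRESHOLD else "Not Recommended"

treatment_scores = {
    "Treatment A": 0.76,
    "Treatment B": 0.31,
    "Treatment C": 0.79,
    "Treatment D": 0.12,
}

for name, score in sorted(treatment_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating(score)}")  # C and A are Recommended; B and D are not
```

With these assumed numbers, treatments C and A land in the same rating, which communicates that either would be a reasonable choice.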
We’ll end with some further UX considerations for this type of AI:
- The set of options needs to be designed. Issues include: How narrow should they be? Should any options be excluded — if so, how should this be communicated to users?
- The rating scheme also needs to be designed. An alternative to the one above could be a 4-level scheme like: Best Option | Recommended | Acceptable | Not Recommended.
- Using a rating scheme requires an additional layer of intelligence. This layer converts the raw precise score (.79) into a less precise but more useful rating (Recommended).
- There’s a subtle but important difference between evaluating a list of options and recommending an option. In our case study, the final treatment selection would be left to the physician and patient who would decide based on criteria not considered by the AI model. For example, the physician might choose a treatment because they have more experience with it.
- Besides the overall score, the evaluation model likely also has additional details about the treatments that could be displayed to help the physician make the final selection, such as a rating for side-effects.
We’ve seen that useful AI information arises not only from technical quality (high accuracy and precision) but also from user experience. UX can mitigate the inevitable technical limitations and provide useful accuracy and precision. To do this, UX designers need to consider accuracy and precision in light of user needs and in the context of the particular type of AI, be it regression, classification, or something else.
Further reading

If you’re looking for some introductory books, here are some suggestions:
Make Your Own Neural Network by Tariq Rashid — A gentle journey that covers concepts, mathematics, and implementation using the Python programming language.
Data Science for Business by Foster Provost and Tom Fawcett — A great (but not-so-gentle) explanation of statistics and algorithms related to regression and classification, along with rigorous machine learning and quality testing practices — after reading this book, you’ll never overfit a model again.
Finally, here are two books that provide big-picture accounts of what AI means for business and society:
Michael Zuliani and Paul McInerney work in UX research and design roles at IBM Studios | Toronto. The above article is personal and does not necessarily represent IBM’s positions, strategies or opinions.