Design for AI: How algorithm choice affects UX

Mark Bailey
Jan 31, 2018 · 7 min read

Machine learning is one of the strategies many companies are using as they try to improve and differentiate their products by augmenting the user’s experience. Building artificial intelligence into apps can improve what can be done for customers and users, but it can also affect how customers and users interact with the machine. In this article I cover how you, as the designer, can help improve the decision process, even when it is the engineers who are deciding which machine learning algorithm to use.

photo courtesy of Sarah Pflug

BACKGROUND RESEARCH

Like every other product, you must first know who your competition is. A competitive analysis will tell you what other companies are working in this space. Since machine learning is on the bleeding edge, you will also want to treat the research papers being published as competition. What is a research paper or conference topic right now can turn into a product very quickly. The people presenting these papers are being hired by companies directly, which costs less than buying up the startup that person would otherwise create upon graduation.

Knowing the competitive offerings allows you to push back if the development team is pushing for an uncompetitive path. I have known a billion-dollar retailer to go with an older, less accurate rule-based language processor instead of the deep learning language processor that has become the standard over the last few years. The developers recommended what they were familiar with, and there was no pushback from the product manager or UX because they were not familiar with the space. Not surprisingly, the product is stumbling and having a hard time competing.

The next area to look at is the users. The first question to ask is: is AI necessary? If you are still reading, I am guessing it is. Still, make sure you have a baseline set of users who were tested without the AI. Then, on the off chance that the AI version tests as slower or scores lower on user satisfaction, you can show the difference so the AI can be improved or scrapped.

Also, because this area is changing so fast, a lot of the time the users being interviewed and tested will be part of the early adopter group. They are happy to put up with a lot of annoyances that general population users will not, and the general population users you are ultimately designing for do not exist yet. For new interfaces, like AI, there are not a lot of best practices and heuristics to fall back on, so one of the most basic heuristics applies.

“The less the user needs to change their behavior and understanding, the better”.

Even early adopters are happier when they don’t need to learn something new. To minimize the changes required for the user to adapt to the software, you can’t overstate the importance of building up accurate personas and journey maps. Once you know what the user’s goal is and what is influencing their decisions on the way to that goal, you can help the engineers with the algorithm choice.

DESIGN REQUIREMENT MEETINGS

Now that you know your competitors and users, how does that affect the algorithm choice? Algorithms for machine learning are split up into two main groups: ‘deep learning’ and ‘shallow’ algorithms. Deep learning works from tons of data, so even after it is optimized, it can be much slower than any number of specialized algorithms. If it must be smart, then slow might be OK. However, if it is just an improvement to one step in a process being done by millions of customers a day, then speed might be more important and a specialized shallow learning algorithm might work better. That being said, if the algorithm is spitting out useless info, then you might need some videos of users cursing at their screens to convince product managers it is worth spending the extra money to allow for deep learning.

Within the shallow learning algorithms there are three groups: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is what is used most of the time. It uses a training dataset that you can already split into different groups: there needs to be some set of data you already know the answers to, so that the different algorithms can be judged on how accurately they guess the right answer. Unsupervised learning is for when you don’t know what is in the dataset. Most of the time it is used just to group the dataset into sets of similar items. Reinforcement learning tries to maximize or minimize something.
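To make the first two groups concrete, here is a minimal sketch in plain Python. The order totals, the "casual" and "bulk" labels, and the function names are all made up for illustration, and these toy methods stand in for whatever algorithm the engineers would really use. The supervised half learns from examples with known answers; the unsupervised half gets the same numbers with no answers and just splits them into two similar groups.

```python
# Supervised: labeled examples (hypothetical order totals -> customer type).
labeled = [(5, "casual"), (8, "casual"), (7, "casual"),
           (40, "bulk"), (55, "bulk"), (48, "bulk")]

def fit_centroids(examples):
    """Average the values seen for each known label."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(value, centroids):
    """Pick the label whose average is closest to the new value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

# Unsupervised: the same numbers without labels, split into two groups.
def two_groups(values, rounds=10):
    """Repeatedly assign values to the nearer of two centers, then recenter.

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)  # starting guesses for the centers
    for _ in range(rounds):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)

centroids = fit_centroids(labeled)
guess_small = classify(10, centroids)
guess_large = classify(45, centroids)
groups = two_groups([value for value, _ in labeled])
```

The design point for a UX conversation: the supervised version can be graded, because you know the right answers, while the unsupervised version only tells you which items look alike and leaves naming the groups to you.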

Deep learning is similar to reinforcement learning in that it is trying to reach a goal. The difference is that, with the extra data, deep learning can make predictions about the outcome. Within deep learning there are two groups. Rule-based learning is how it sounds: rules are input into the machine before it is trained on the data. This type is favored by OpenAI, and building in the rules of the game is also part of how DeepMind’s AlphaGo beat the Go champion. Reactionary learning is similar to instinct: it relies more on the machine to come up with its own reactions. Google depends on this style more.

Speaking of how accurate the algorithm is: does it need to be mostly accurate all of the time? Or can it be very accurate most of the time but, when it is wrong, really wrong? The algorithms can be adjusted differently for those two circumstances.
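One way to see the difference between those two circumstances is to look at which error measurement you ask the engineers to report. The sketch below is an illustration under made-up numbers, not a recommendation of a specific metric: it compares a model that is always a little off with one that is usually perfect but occasionally far off. Mean absolute error treats them as equal, while root mean squared error and worst-case error expose the rare big miss.

```python
import math

def error_profile(errors):
    """Summarize a list of prediction errors three different ways."""
    mean_abs = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    worst = max(abs(e) for e in errors)
    return {"mean_abs": mean_abs, "rmse": rmse, "worst": worst}

# Hypothetical errors from two models on the same ten predictions.
steady = [2, -2, 2, -2, 2, -2, 2, -2, 2, -2]  # always a little off
spiky = [0, 0, 0, 0, 0, 0, 0, 0, 0, 20]       # usually perfect, once far off

steady_profile = error_profile(steady)
spiky_profile = error_profile(spiky)
```

If a single bad answer would ruin the user’s trust, the measurement that punishes big misses is the one to ask for. If users shrug off an occasional miss, the average may be enough.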

One thing to make sure of is that there is no rush to use a specific learning algorithm. Even for specialized tasks, like making recommendations, there are multiple algorithms that work better in different situations. So make sure to document the situation and influences as much as possible for the user journey.

When choosing an algorithm, there are two things that need to be measured. The first is precision: when the algorithm makes a guess, how often is it right? The second is recall (also called sensitivity): out of the total items you want found, how many did it find? Depending on your user’s goals, either one can be more important than the other. For example, if you are detecting spam, it is better to have high precision and lower recall, since it is better that a few spam messages end up in the inbox than that a real email ends up in the spam folder. But if you are detecting a disease from a medical scan, then it is better to have lower precision and high recall, because it is better to have a few people need to be retested than to have someone slip through with the disease. The F-beta measurement gives a single number in which beta is basically a slider between precision and recall; F2, for example, weights recall more heavily than precision. This is what you will want to define for the algorithms to achieve.
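The precision and recall trade-off above can be sketched with the F-beta score, where a beta above 1 favors recall (F2 uses beta = 2) and a beta below 1 favors precision. The counts below are invented for the two examples in the text, and the function name is just a placeholder.

```python
def precision_recall_fbeta(true_pos, false_pos, false_neg, beta=1.0):
    """Return (precision, recall, f_beta) from raw counts."""
    guessed = true_pos + false_pos   # everything the algorithm flagged
    wanted = true_pos + false_neg    # everything it should have flagged
    precision = true_pos / guessed if guessed else 0.0
    recall = true_pos / wanted if wanted else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta

# Hypothetical spam filter: cautious, so few real emails get flagged.
spam = precision_recall_fbeta(true_pos=80, false_pos=2, false_neg=20, beta=0.5)

# Hypothetical medical screen: flags aggressively, so few cases are missed.
scan = precision_recall_fbeta(true_pos=95, false_pos=40, false_neg=5, beta=2.0)
```

Agreeing on the beta value with the engineers up front turns "what matters more to our users" into a number the algorithms can be tuned against.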

COMPARING ALGORITHMS

There is a good chance the developers will come up with a few algorithms that kind of work. A lot of AI is still at the point of throwing a lot of stuff at the wall and seeing what sticks. They might come to you with multiple algorithms to compare.

A normal machine learning process is three steps. The algorithm takes some data (with known answers) and predicts an outcome (without access to those answers). It then compares that prediction to the answers and measures the error, or loss. From that measurement, or grade, it ‘learns’ by adapting itself. Then it goes through the whole process over and over. It would seem the easiest way to choose an algorithm is to compare them based on the grade they are giving themselves. The problem is that the grading process inside each algorithm is customized, so the grades are not comparable.
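The three steps can be sketched as a tiny training loop. This is a toy under stated assumptions: the data is made up (y is roughly 3 times x), the "model" is a single adjustable number, and the grade is mean squared error. Real algorithms differ in exactly this grading step, which is why their self-reported grades cannot be compared directly.

```python
# Toy data with known answers: y is roughly 3 * x (values are made up).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.9, 9.2, 11.8]

def train(xs, ys, learning_rate=0.01, rounds=200):
    """Fit y = weight * x by repeating predict, grade, adapt."""
    weight = 0.0  # the single number the model is allowed to adjust
    losses = []
    for _ in range(rounds):
        # Step 1: predict without looking at the answers.
        predictions = [weight * x for x in xs]
        # Step 2: compare predictions to the answers and measure the loss.
        errors = [p - y for p, y in zip(predictions, ys)]
        loss = sum(e * e for e in errors) / len(xs)
        losses.append(loss)
        # Step 3: adapt the weight in the direction that shrinks the loss.
        gradient = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        weight -= learning_rate * gradient
    return weight, losses

weight, losses = train(xs, ys)
```

The list of losses is the grade the algorithm gives itself each round; it should shrink as the weight settles near 3.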

The best way to test algorithms against each other is using a ‘confusion matrix’, a table counting how often each predicted answer matched or missed the actual answer, built from a final test set. Think of it as kind of the final exam at the end of the term for the algorithms. (The test set does not need to be big; usually the data is split 60% training, 20% validation, and 20% testing.) The normal data sets used to train an algorithm are split between training and validation. But just as students can learn to study to a test instead of learning the material, algorithms can learn to pass only the validation data set. Using a final set of data the algorithms never get to see during training makes sure this does not happen. It also allows you to compare which of the different algorithms best achieves your goals, since they are all being tested on the same data and you can specify the requirements to measure.
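Here is a rough sketch of that final exam, assuming a simple yes/no prediction. The data, the two "models" (reduced to simple cutoffs so the example stays short), and all function names are hypothetical. The point is that both models are graded on the same held-out 20% using the same table of counts.

```python
import random

def split_60_20_20(rows, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 6 // 10
    n_val = len(rows) * 2 // 10
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, validation, test

def confusion_matrix(actual, predicted):
    """Count the four possible outcomes for yes/no (True/False) answers."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for was_true, guessed_true in zip(actual, predicted):
        if was_true and guessed_true:
            counts["true_pos"] += 1
        elif guessed_true:
            counts["false_pos"] += 1
        elif was_true:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    return counts

# Made-up data: a score from 0 to 99; the true answer is "yes" from 40 up.
rows = [(score, score >= 40) for score in range(100)]
train, validation, test = split_60_20_20(rows)

# Two hypothetical, already-trained models, reduced to simple cutoffs.
def cautious_model(score):
    return score > 50

def eager_model(score):
    return score > 30

# Grade both models on the same test set, which neither saw in training.
actual = [label for _, label in test]
cautious = confusion_matrix(actual, [cautious_model(s) for s, _ in test])
eager = confusion_matrix(actual, [eager_model(s) for s, _ in test])
```

In this setup the cautious model never raises a false alarm and the eager model never misses a real case, so which one "wins" depends entirely on the user goal you documented earlier.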

Remember, this is the testing the developers will be (or should be) doing. Their goal is to get the algorithm as close as possible to a theoretical point. This is the point you will be pushing them to reach, and it is defined by the goals you discovered the users are trying to reach.

CONCLUSION

This is just scratching the surface of the problems you can run into. There are so many different areas of AI that I can only cover some of them. Not to mention that everyone is working on a customized version of an algorithm to get it to work for their own specific needs. If you come across a design problem or solution with AI, you are free to contact me and let me know so I can help or get the word out about good solutions.

Written by Mark Bailey

UX research and design specializing in AI and machine learning apps. Portfolio at http://DesignForAI.com