Multimodal UIs for People’s Actual Needs: MLUX x Clinc x Capgemini

Bob Stark
Machine Learning and UX
Dec 13, 2019

Best Practices for Multimodal User Interfaces

Alex Shye and Himi Khan from Clinc sharing Conversational States on slides.

Thanks to recent advances in artificial intelligence and machine learning, multimodal user interfaces — for example, using both text and speech input — are increasingly practical.

However, designing and deploying these systems remains a significant challenge. Himi Khan and Alex Shye from Clinc, a conversational AI platform, shared their expertise and insights with us!

Put User Needs First

As with any complex, AI-based UI, starting from the needs of the users is critical for multimodal UIs. It is important to first confirm that a multimodal UI is useful at all. For example, if the UI will run on a smartphone, then speech recognition would very likely be desirable. But where do you go from there?

For example, Clinc uses a three-step process:

  1. User experience — what are the users’ problems? What kind of technology do we want to build?
  2. Technology — what models do we need to build, and how do we hook them together? What other technical problems need to be solved?
  3. Productization

Support “Messy” Input

In general, people don’t speak in the structured grammars used by traditional speech recognition systems. For example, to order a pizza, it is somewhat unnatural to say explicitly: “Order a pizza of medium size with the following toppings: mushrooms, sausage.” It’s more natural to say something like “I want a medium pizza with mushrooms and sausage.” Furthermore, many people use slang, abbreviations, and filler words (“um”, “uh”), which a system should detect and either use or filter out. With text input, misspellings and other typos are an additional barrier. A strict grammar would reject these inputs, but a conversational interface should handle them.
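As a tiny illustration of this kind of input cleanup, the sketch below strips filler words and expands a few abbreviations before further processing. The word lists and function name are hypothetical, and a real system would also handle spelling correction and slang:

```python
import re

# Hypothetical word lists; a real system would use much larger ones
# plus spelling correction and slang handling.
FILLERS = {"um", "uh", "er", "hmm"}
ABBREVIATIONS = {"plz": "please", "thx": "thanks", "med": "medium"}

def normalize_utterance(text: str) -> str:
    """Lowercase the input, drop filler words, and expand abbreviations."""
    tokens = re.findall(r"[a-z']+", text.lower())
    cleaned = []
    for token in tokens:
        if token in FILLERS:
            continue  # drop fillers instead of letting them confuse the parser
        cleaned.append(ABBREVIATIONS.get(token, token))
    return " ".join(cleaned)

print(normalize_utterance("Um, I want a med pizza with, uh, mushrooms plz"))
# -> "i want a medium pizza with mushrooms please"
```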

The Clinc platform takes an unstructured approach to detecting statements, for example one that orders a pizza. A statement that would begin such a conversation is detected based on a large number of example ways to order pizza, crowdsourced from a system like Amazon’s Mechanical Turk. The resulting examples range from “I would like to order a pizza” all the way down to “pizza please.” Even if these statements are preceded or surrounded by other remarks, such as the user mentioning how hungry they are, the platform picks out only the important parts of the statement.
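For intuition, here is a minimal sketch of detecting such a statement by comparing it against crowdsourced examples with TF-IDF similarity (using scikit-learn). This is an illustrative stand-in, not Clinc’s actual model; the example phrases, threshold, and function name are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical crowdsourced examples for a single "order a pizza" intent;
# a production system would collect thousands of these.
EXAMPLES = [
    "i would like to order a pizza",
    "can i get a pizza",
    "pizza please",
    "i want a medium pizza with mushrooms and sausage",
]

vectorizer = TfidfVectorizer().fit(EXAMPLES)
example_vectors = vectorizer.transform(EXAMPLES)

def looks_like_pizza_order(utterance: str, threshold: float = 0.3) -> bool:
    """Flag the utterance if it is similar enough to any known example."""
    vec = vectorizer.transform([utterance])
    return cosine_similarity(vec, example_vectors).max() >= threshold

print(looks_like_pizza_order("i'm so hungry, get me a pizza"))  # True
print(looks_like_pizza_order("what's the weather tomorrow"))    # False
```

Note how the hunger remark in the first utterance does not block detection: similarity to any one example is enough, which loosely mirrors picking out only the important parts of a statement.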

Keep the Context

Having to repeat a statement to a system gets tiresome quickly. For example, if you forgot a topping on your pizza, repeating that you would like a medium pizza with mushrooms and sausage seems unnecessary. Instead, such systems should keep the context of the user interaction across inputs. Then, ideally, saying something along the lines of “add sausage” should be sufficient.

Once a conversation begins with a detected statement as described above, Clinc’s platform supports this kind of context by modeling the conversation as a graph with “slots” that need to be filled by the user’s input. For example, the size of the pizza and its toppings might be slots that, if not mentioned in the opening statement, need to be specified before the pizza can be ordered. After asking for a pizza, the user might say “make that a large” to specify the size. As with the introductory phrases described above, crowdsourcing is also used to label possible follow-up statements, mapping values to slots: “large” is a size, and “sausage” is a topping. Altogether, these features allow bots on the Clinc platform not only to understand messy language, but to interact with users across multiple steps by saving context throughout the conversation.
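To make slot filling concrete, here is a hedged sketch of a conversation state that keeps context across several utterances. All names here are hypothetical, and this is not Clinc’s API; in particular, a real system would extract slot values with trained models rather than keyword lists:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative slot-filling state for a pizza order; not Clinc's API.
@dataclass
class PizzaOrder:
    size: Optional[str] = None
    toppings: list = field(default_factory=list)

    def missing_slots(self):
        """Slots the bot still needs to ask about before ordering."""
        missing = []
        if self.size is None:
            missing.append("size")
        if not self.toppings:
            missing.append("toppings")
        return missing

SIZES = {"small", "medium", "large"}
TOPPINGS = {"mushrooms", "sausage", "pepperoni", "onions"}

def update_order(order: PizzaOrder, utterance: str) -> PizzaOrder:
    """Fill slots from a follow-up utterance, keeping earlier context."""
    for word in utterance.lower().split():
        word = word.strip(",.")
        if word in SIZES:
            order.size = word            # "make that a large" fills the size
        elif word in TOPPINGS:
            order.toppings.append(word)  # "add sausage" fills a topping
    return order

order = PizzaOrder()
update_order(order, "I want a pizza with mushrooms")
update_order(order, "make that a large")
update_order(order, "add sausage")
print(order)                  # PizzaOrder(size='large', toppings=['mushrooms', 'sausage'])
print(order.missing_slots())  # []
```

Because the order object persists across turns, “add sausage” never forces the user to repeat the size or the earlier toppings, which is the behavior described above.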

Big thank you to our speakers and sponsors for sharing their expertise with us!

Himi Khan: Head of Business Development & Partnerships at Clinc

Alex Shye: Product and Engineering Lead at Clinc

Karen Chiang: Senior Product Consultant at Talview

Thank you to Capgemini and their Applied Information Exchange for sponsoring this event!

About the Machine Learning and User Experience (“MLUX”) Meetup

We’re excited about creating a future of human-centered smart products, and we believe the first step is to bring UX and Data Science/Machine Learning folks together to learn from each other at regular meetups, tech talks, panels, and events in the SF Bay Area, NYC, and (soon) Seattle.

Interested in learning more? Join our meetup, be the first to know about our events by joining our mailing list (December 2019 newsletter), watch past events on our YouTube channel, and follow us on Twitter (@mluxsf) and LinkedIn.
