Tales from the world’s most-used chatbot: Microsoft Xiaoice’s lead scientist on cutting edge generative chatbot design

Julian Harris · Published in Speaking Naturally · 6 min read · Nov 12, 2018

If you like this piece, CognitionX recently launched a weekly news briefing on natural language technology. Sign up here

While EMNLP ’18 in Brussels was deliberately dominated by machine translation, buried in the 228-page handbook was a talk by the lead scientist of what has to be the most innovative chatbot in the world today.

Wei Wu’s 152-page tutorial was this:

Wu’s recent achievement with the XiaoIce team is the launch of a fully generative chatbot in Indonesia, built on full dialogue-generation technology. The chatbot now has more than 1.5 million users on LINE Indonesia.

Xiaoice is the most commonly used chatbot in the world.

XiaoIce, pronounced “shi-OW-ice”, can be thought of as a brand for multiple products, a little like the IBM Watson brand. As such, it’s the engine that serves over 500 MILLION users worldwide, currently mostly in Asia.

Yes. Xiaoice is a HALF A BILLION USER chatbot.

Xiaoice is at least a year or two ahead of Western chatbot platforms like Alexa and Google Home in many respects.

Oh and did I mention that Xiaoice is light years ahead of any Western chatbot? I watched the whole 90-minute Xiaoice 6 launch in September. Here are some YouTube jump-tos:

And this chatbot tutorial goes into real depth as to how it works.

Knowing how this works and what they’ve learned… this is chatbot gold! But I couldn’t make it to Brussels. Having been in Brussels just two weeks earlier, my waistline could not cope with more amazing beer, chocolates and waffles. (And I was busy. And I didn’t know about the conference.)

A bit of sleuthing and I found the slides. The 152-page PDF can be found here. This post is an attempt to summarise it in a nontechnical way. So here goes.

BTW I have talked with Microsoft through a few channels, so some of this information comes directly from them rather than from the tutorial.

Deep Chit-Chat: Deep Learning for ChatBots

By Dr Wei Wu and Dr Rui Yan

Socialbots vs Taskbots

Xiaoice is first and foremost a “socialbot”, designed for long-term empathy, trust and companionship. Compare socialbots against taskbots:

From CognitionX’s Business of Natural Language Computing: a primer on chatbots and voicebots (90% of reviewers strongly recommended it to colleagues) Download sample

Social bots then:

  • Chit-chat: casual and non-goal oriented
  • Open domain: the topic of the conversation could be anything. Users
    may jump from topic to topic in the conversation
  • Relevance & Diversity: the research focuses on automatic replying with
    relevant (to the context) and diverse (informative) responses to make conversations engaging

The Xiaoice socialbot has 5 variations. Note the 245m user number conflicts with the 660m user number Microsoft has cited elsewhere.

Source: Dr Rui Yan

Microsoft is racing against Amazon, Google and Apple to be the primary way that people access knowledge and services at home and in their personal lives. The diagram below shows how Xiaoice draws in services when it makes sense.

The tutorial starts getting really hands-on, and to keep this accessible to a nontechnical audience, I’ll cover the high-level themes.

The rest of the tutorial: state-of-the-art conversations

The rest of the tutorial talks about the different parts of state-of-the-art socialbots. Socialbots are chatbots that can talk about anything (rather than a taskbot — see diagram above). It gets very technical so I will attempt to summarise to the best of my fairly limited understanding of the domain.

The summary of my summary

My interpretation of the workshop is that it covers the main topics, along with what their state-of-the-art success looks like with trillions of samples of data to learn from:

  • How conversational chatbots are harder than simple Q&A
  • How having a system create responses is harder than pulling canned ones from a database, though they’re having some success with this
  • Good progress is being made on emotions, personality and common sense.

The components

The following concepts are summarised first; my definitions are pretty loose but hopefully you get the idea:

  • Ways of preparing words and sentences for machine learning: Word2Vec (CBOW, “predict a word from its surrounding words”, and Skip-gram, “predict the surrounding words from a word”), GloVe and FastText
  • Ways of efficiently finding patterns, traditionally with images, but they work with text too: Convolutional Neural Networks, aka CNNs
  • Ways of efficiently finding patterns over time and helping with memory: Recurrent Neural Networks, aka RNNs, including Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs) for helping understand context; Seq2seq for creating new text automatically; and improvements to context through “encoder-attention-decoder” structures
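To make the Word2Vec idea concrete: both CBOW and Skip-gram learn from (word, context) pairs pulled from a sliding window over sentences. This is my own toy sketch of that pair extraction, not code from the tutorial:

```python
# Minimal sketch of how Skip-gram training pairs are extracted
# from a sentence with a sliding context window.
def skipgram_pairs(tokens, window=2):
    """For each word, yield (center, context) pairs within its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "xiaoice chats with millions of users".split()
print(skipgram_pairs(sentence, window=1))
```

Skip-gram predicts each context word from the center word; CBOW runs the same pairs in the other direction, predicting the center word from its context.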

Notably:

  • The components are considered standard machine learning components today.
  • All of the applications of these are based on academic research papers no more than 4 years old.

Responding to messages

There are two main ways for a chatbot to respond:

  • Retrieval of previous answers (also sometimes called extraction, or pre-baked)
  • Creating / synthesising new ones (generative)

Both are covered, clearly delineating between:

  • Question and answer-style systems: you ask a question, you get a response; no memory of previous discussion required (“single-turn”)
  • Conversational systems: much harder than Q&A-style, these “multi-turn” systems also need context. And this is really hard.
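To give a flavour of the retrieval approach, here is a deliberately tiny sketch of my own (nothing like XiaoIce’s actual system, which matches against hundreds of millions of candidates with learned representations): score a small set of canned responses against the user’s message by word overlap and return the best match.

```python
# Toy retrieval-based responder: pick the canned response whose
# prompt has the highest bag-of-words cosine similarity to the message.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two strings as bags of words."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

canned = {
    "hello there, how are you today?": "hi! I'm doing great, thanks.",
    "what's the weather like?": "I don't know, but I hope it's sunny!",
}

def respond(message):
    """Return the response paired with the best-matching canned prompt."""
    best = max(canned, key=lambda prompt: cosine(prompt, message))
    return canned[best]

print(respond("hello, how are you?"))
```

A generative system would instead synthesise a reply word by word (e.g. with seq2seq), which is far harder to keep relevant and coherent, especially across multiple turns.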

Other aspects of conversational richness

The workshop also covers other pretty interesting human-like concepts that are live today in several Xiaoice socialbots:

  • Topics: finding what’s being talked about
  • Emotions: classification, memory, expression
  • MojiTalk: using emojis in conversation
  • Personas (personality assignment). Consistency is really critical. My personal experience with Microsoft Zo is that it keeps changing where it’s from so isn’t consistently consistent.
  • Then a bunch of other topics: knowledge graph-based conversations, reinforcement learning in conversations (hi Tay!), when to involve a human in the loop, and generative adversarial networks (GANs) in NLP and conversation. Finally, this year saw the introduction of a “common sense model”.

Evaluation metrics

You need evaluation metrics to determine how good things are. A lot of the industry uses BLEU and ROUGE, metrics that have their roots in machine translation (a sequence of words in one language mapped to another sequence of words, possibly of a different length). However, for conversations, which are inherently hierarchical, these apparently don’t work as well. This workshop covers a couple of newer metrics, ADEM and RUBER.
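My own toy illustration of why word-overlap metrics struggle with chit-chat (this is just clipped unigram precision, the core ingredient of BLEU-1, not full BLEU): a perfectly good reply can share almost no words with the reference reply, so it scores zero.

```python
# Clipped unigram precision, the simplest building block of BLEU.
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate words that appear in the reference (clipped)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "i love rainy days"
print(unigram_precision("i love rainy days", reference))
print(unigram_precision("me too the rain is lovely", reference))
```

The second reply is a fine conversational response, but it gets a score of zero because it reuses none of the reference’s words; metrics like ADEM and RUBER try to judge quality without leaning on exact word overlap.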

What next



Ex-Google Technical Product guy specialising in generative AI (NLP, chatbots, audio, etc). Passionate about the climate crisis.