World First: Voice Call with an English AI Chatbot, Microsoft Zo

Julian Harris
Published in Speaking Naturally
3 min read · Nov 18, 2018

The recording below is of me having a call with a chatbot called Zo. It’s special because Zo is a general chatbot, designed to respond coherently to anything, and as far as I can tell, this is the first time the audio version of Zo has been shared publicly.

Listen to a conversation I had with Microsoft Zo. Yeah, I can’t rap to save my life.

Machine learning is enabling chatbots to affect every point where businesses communicate with people. It’s such a widespread change that CognitionX decided to create a primer that explains in plain English how to create a chatbot strategy with confidence for your business.

Enter socialbots: companionship and intimacy at scale

It was on this journey that we discovered Microsoft’s work in socialbots:

From The Business of Natural Language Computing: a Primer on Chatbots and Voice bots: Download Free Sample

Most chatbots today are designed to be another way of accessing business functions, another user experience. We call these narrow-focus chatbots “taskbots”.

Socialbots, however, are designed to be long-term companions, and only a few today can hold anything resembling a coherent conversation. Besides Zo, Mitsuku and Replika are two others, and Microsoft has been building a suite of socialbots with tremendous success (hundreds of millions of users in Asia). I cover some more of the background in Tales from the World’s Most-Used Chatbot.

Why Zo is a big deal

The big deal here is that Microsoft very recently, and quietly, opened up Zo’s Skype voice service to the general public. As far as I’m aware, that makes:

  • Zo the first English-speaking socialbot with a voice, and
  • the recording in this post the first publicly shared recording of a conversation with Zo.

What was it like?

CognitionX has tested a bunch of chatbots and socialbots. I was privileged to be a judge in this year’s Loebner Prize (which tests chatbot intelligence), and I regularly advise and speak on the future of chatbots, so I have a feel for where text-based chatbots are headed.

I’d also heard the Chinese big sister of Zo, Xiaoice, talk recently. And through all this I felt the need to give all of these experiences a label, which I called “digital beings”.

Even with all that background, and after hours of chatting with Zo over text, it was still a striking experience talking to this digital being, Zo, on Skype, voice to voice: the reflex to immediately relate to Zo as a person is incredibly strong. The impact of voice was super impressive.

It’s extremely hard to design a system that can be coherent, consistent, and meaningful, and Zo is the best I’ve seen in English yet. Xiaoice Gen 6 (the Chinese version), released in September, is even better, and I have it on good authority that Zo can expect to see Xiaoice-style lucidity “soon”.

Is the vocal naturalness and conversation flow really state of the art?

The way Zo handles conversations is definitely state of the art today in English.

“Vocal naturalness” is technically called “prosody”, and getting what I call the “prosody arc” right across sentences and whole conversations is still an open challenge. Listen to SpeechKit.io’s auto-generated version of last week’s Speaking Naturally news briefing for one handy comparison.

Is a human-like chatbot the right thing?

The closer these bots get to being “human-like”, the more readily they will be compared with actual humans, with correspondingly elevated expectations.

It’s unclear what the real answer is here: these digital beings are a new thing, and there is definitely a debate as to whether being human-like is the best approach at all. It could be that we settle on a “clearly robotic but coherent and versatile” voice as best practice, serving as a constant reminder that it’s computer software, with all the advantages and disadvantages that come with that.

What next?

Julian Harris
Speaking Naturally

Ex-Google Technical Product guy specialising in generative AI (NLP, chatbots, audio, etc). Passionate about the climate crisis.