Nadav Gur
Published in The Vanguard
Jul 17, 2017


If you haven’t been living under a rock for the last couple of years, then you know that AI is now all the rage and personal assistants like Alexa, Google Assistant and Siri are having their moment in the sun. If you’re working in or around tech, maybe you’re wondering “how can my company build its own personal assistant / bot / voice interface?” Look no further: in the next few minutes I will explain the basics of the natural-language-interaction stack, and also where to get it!

Note: While this is a review of voice assistant tech, most messaging bots use much of the same technology.

What Is A Virtual Personal Assistant?

Her’s Samantha Is The Quintessential Voice Assistant

Frankly, “Virtual Personal Assistant” is a bastardized term. While the vision is a virtual agent that figures stuff out / does stuff for you, and as such needs to be proactive, make decisions etc., Alexa, Siri and their ilk are actually mostly reactive, i.e. you ask them to do something and then they do it (maybe). So, while general Natural Language Understanding is a very wide subject with many different problems, in this case it is reduced to making requests and receiving responses in a finite set of “domains” or subjects that the VPA needs to handle — e.g. for Alexa, mostly playing music, setting reminders, reading the news, etc. What the VPA needs to do is understand your command, see if it needs additional information, and then execute it (which may mean retrieving information) and tell you the results. In the simple case, this process is broken down into a number of consecutive steps — a “pipeline”.

The Natural Language Interaction Pipeline

The Voice Interaction Pipeline

First, the spoken audio needs to be converted to text — Automated Speech Recognition (ASR).

Then, the text is “parsed” by a Natural Language Understanding (NLU) engine to extract the request (“intent”) and any associated data (“entities”).

Next, this input needs to be reasoned about — was the user’s request complete or does the VPA need to ask another question? Can it go ahead and execute the request? What is the response or output that is needed? This is also where context comes into play — whether it’s things that were part of the conversation before, or other elements of context like location, time, etc. This is done by a Dialog Manager (DM).

Last, if there needs to be a spoken response, that text needs to be converted to speech — this is done by a Text-To-Speech (TTS) engine.
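To make the pipeline concrete, here is a minimal sketch of how the four stages might be chained in code. The function names (transcribe, parse, decide, synthesize) are hypothetical placeholders rather than any real API; each would wrap one of the services discussed below.

```python
# Hypothetical glue code for the voice interaction pipeline.
# transcribe / parse / decide / synthesize are placeholder functions,
# each standing in for an ASR, NLU, DM or TTS service described below.

def handle_turn(audio_in: bytes, context: dict) -> bytes:
    text = transcribe(audio_in)                  # ASR: audio -> text
    intent, entities = parse(text, context)      # NLU: text -> intent + entities
    reply, context = decide(intent, entities, context)  # DM: reason, act, update context
    return synthesize(reply)                     # TTS: text -> audio
```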

Speech Recognition — ASR

ASR has been around for decades, but over the last few years it’s been improving quickly. This is due to a combination of improvements in hardware (Moore’s Law), an explosion of available data to train / test on (transcribed audio/video on the internet) and, lately, the realization that neural networks, specifically “Deep Learning”, handle this problem really well. What this means is that a technology that was virtually monopolized by one company only 5 years ago (Nuance) is now being offered, in a SaaS model, by every big cloud player — Amazon, Google, Microsoft, IBM and others, and with much better performance. So if your application is cloud-based, you can use such services to take your user’s voice and turn it into text. Cloud-based speech recognition is now widely available and the market is highly competitive, but it’s not really commoditized yet — if your user interacts with the VPA several times a day, the cost per user per year is meaningful.
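As an illustration, a cloud ASR call can be as simple as the sketch below, here using Google’s Cloud Speech-to-Text Python client; treat the class names and signatures as indicative only, since they vary across client-library versions.

```python
from google.cloud import speech  # pip install google-cloud-speech

client = speech.SpeechClient()

# Load a short audio clip recorded from the user.
with open("utterance.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)  # best hypothesis per segment
```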

While the quality of speech recognition is improving quickly, in particular cases, e.g. name recognition, results may not be as good. They can be improved by providing the ASR engine with context, i.e. what the user is probably going to say next. This requires some smarts on the Dialog Management side — more on that below.
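Most cloud ASR services accept such hints in some form; for example, Google’s API lets you attach a list of likely phrases to the recognition config, as a variation on the previous sketch (field names may differ in other services or library versions).

```python
# Bias recognition toward names the user is likely to say next,
# e.g. contacts surfaced by the Dialog Manager.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[
        speech.SpeechContext(phrases=["Michael Caine", "Michael Phelps"])
    ],
)
```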

A particularly hairy problem is speech recognition that’s not cloud-based, i.e. local or offline speech recognition. This is useful in various mobile, IoT and Connected Car applications and is a separate topic I may address in another post.

Conversational NLU

The role of the conversational NLU engine (NLU for short) is to take the user’s utterance, already in text format, and detect the Intent and Entities.

Example Input & Output In a Conversational NLU Engine
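In the spirit of that example: given an utterance like “Remind me to call Michael at 5 pm”, a typical NLU engine returns something along these lines. The exact shape and field names vary by vendor; this output is illustrative only.

```python
utterance = "Remind me to call Michael at 5 pm"

# Illustrative output; real engines (wit.ai, api.ai, LUIS, Rasa) each
# use their own field names and confidence conventions.
nlu_result = {
    "intent": {"name": "set_reminder", "confidence": 0.92},
    "entities": [
        {"entity": "task", "value": "call Michael"},
        {"entity": "time", "value": "17:00"},
    ],
}
```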

While historically various techniques have been used, over the last few years relatively simple systems based on supervised machine learning have emerged — usually accessible as a SaaS web service. These include wit.ai (now owned by Facebook), api.ai (now owned by Google), Microsoft’s LUIS and the open-source Rasa. All of them involve the developer / product manager specifying the Intents and Entities, supplying a data-set of examples, training a model and then accessing it as a web service.
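For instance, with the open-source Rasa NLU the train-then-parse cycle looks roughly like the sketch below. This is based on the classic rasa_nlu Python API; module paths and config formats differ across versions, and the file names here are placeholders.

```python
from rasa_nlu.training_data import load_data
from rasa_nlu import config
from rasa_nlu.model import Trainer

# Training data: a file of example utterances labeled with intents and entities.
training_data = load_data("data/nlu_examples.json")

trainer = Trainer(config.load("nlu_config.yml"))
interpreter = trainer.train(training_data)   # returns an Interpreter
trainer.persist("./models/")                 # save the trained model for serving

print(interpreter.parse("play something by Coldplay"))
# -> a "play_music" intent with an "artist" entity, given suitable training data
```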

For simple applications with a few use-cases or intents, this is pretty straightforward. But it does get tricky in several situations. Ambiguity is one. As the number of different things the VPA needs to be able to understand grows, language becomes ambiguous and requires more context to be understood. For instance, “Play Coldplay” means play music (by Coldplay), but “Play Minecraft” means launch a game. Some NLU engines allow the application to provide context (e.g. “state”) when parsing an utterance, which may help. Often, it’s the Dialog Management layer that needs to sort things out.
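A minimal sketch of that kind of disambiguation on the DM side might look like the following; the catalogs, context key and fallback behavior are made up for illustration.

```python
# Hypothetical disambiguation of a generic "play" intent.
MUSIC_ARTISTS = {"coldplay", "adele"}
INSTALLED_GAMES = {"minecraft"}

def resolve_play(entity_value: str, context: dict) -> str:
    name = entity_value.lower()
    if name in MUSIC_ARTISTS or context.get("active_app") == "music":
        return "play_music"
    if name in INSTALLED_GAMES:
        return "launch_game"
    return "ask_user"   # fall back to a clarification question
```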

Dialog Management

As we’ve seen, off-the-shelf components can be used to take speech (audio) and turn it into structured Intent / Entity data. Next up, there is a need to reason about it — for instance:

  • If it’s a new Intent — do we have all the Entities needed to execute it? If not, the DM needs to (a) remember the Intent and (b) ask the user for the missing details
  • If it’s an answer to such a request — do we now have all the info needed?
  • If an Entity received is incomplete or ambiguous — how do we resolve it? For instance, the user may say “Call Michael” and we need to figure out whether it’s Michael Caine or Michael Phelps, or maybe it’s the Michael whose call the user missed an hour ago
  • What if the text wasn’t understood at all, or doesn’t make sense in the current context? Maybe providing the ASR or NLU with more context would help them correctly understand the speech or text

etc.

Reasoning is the Dialog Manager’s (DM) job. It needs a model of the relevant conversations (usually some kind of conversation scripts), the ability to manage context data and orchestrate ASR, NLU & TTS, and interfaces to the application’s actions — say the music player, the reminder database, contacts, etc.
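A toy version of the slot-filling part of that job, with a made-up schema of required entities per intent and made-up prompts, could look like this:

```python
# Toy slot-filling dialog manager: schema and prompts are entirely illustrative.
REQUIRED = {
    "set_reminder": ["task", "time"],
    "play_music":   ["artist"],
}

def next_step(intent: str, entities: dict, context: dict):
    # Merge newly extracted entities with anything remembered from earlier turns.
    slots = dict(context.get("pending_slots", {}), **entities)
    missing = [s for s in REQUIRED.get(intent, []) if s not in slots]
    if missing:
        # Remember what we have so far and ask for the first missing piece.
        context.update(pending_intent=intent, pending_slots=slots)
        return ("ask_user", f"What {missing[0]} should I use?")
    context.pop("pending_intent", None)
    context.pop("pending_slots", None)
    return ("execute", {"intent": intent, "slots": slots})
```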

Once these external applications or services are involved, they may in turn require more interaction — for instance asking the user for missing information, or telling the user about something that happened (e.g. a push message), which may start a new conversation. A powerful DM enables easy design / change / testing of diverse conversations and easy integration of different functional app / back-end components. Because the DM is where context is being managed, it is in the best position to cue other elements of the system, like ASR & NLU, with the context information they need to optimize performance, and to manage their operation. This is why the DM turns from being a link in the interaction pipeline into the orchestrator of the entire system — the ghost in the machine.

Advanced VPA Requires A Dialog Manager In The Heart Of The System

When an action is concluded or more info is needed from the user, the DM creates a textual response (“natural language generation” or NLG) and sends it over to TTS to turn it into speech (see below). Finally, if this is not a voice-only device but a device with a screen (e.g. Echo Show), a messaging app, or even a robot that can move around or roll its eyes, the DM also needs to render the visual part of the response.
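In simple systems this NLG step is often just template filling, along the lines of the sketch below (templates invented for illustration); richer systems vary the wording based on context and modality.

```python
# Minimal template-based NLG: pick a template per outcome and fill in the slots.
TEMPLATES = {
    "reminder_set": "OK, I'll remind you to {task} at {time}.",
    "ask_time":     "When should I remind you?",
}

def generate(response_key: str, **slots) -> str:
    return TEMPLATES[response_key].format(**slots)

print(generate("reminder_set", task="call Michael", time="5 pm"))
```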

Current open bot platforms (e.g. ChatFuel, Octane.ai, msg.ai etc.) are rudimentary dialog managers, focused on very specific kinds of interaction (messaging apps) and simple dialogs. Google’s Api.ai combines the NLU engine stack with dialog management, supporting very linear dialogs. Mostly these engines use a state-machine model, which doesn’t scale well when there is a diverse set of intents / entities and makes it virtually impossible to handle dialogs where the user “surprises” the VPA by changing the context of a conversation mid-way, without losing context altogether. Servo’s platform is the only one on the market today that was built from the ground up to handle a rich set of domains / intents, proactive conversations (e.g. push), the use of context across interactions, and multiple modalities.

Speech Synthesis or Text-To-Speech

TTS is a relatively mature technology. Traditionally it is based on sampling voice recordings and re-assembling phonemes to create the audio required. Many off-the-shelf SaaS services or software libraries are available on the market, including free, open-source ones. They differ from each other by the quality of the output and the level of customization possible (voice, speed / intonation etc.). Several companies offer services enabling the creation of a unique, “branded” voice using a voice actor (e.g. Acapela Group).
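As an example, Amazon Polly exposes TTS as a one-call SaaS API via boto3; the sketch below is indicative, and the voice choice and output handling are just one possible setup.

```python
import boto3  # pip install boto3; assumes AWS credentials are configured

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text="Your reminder is set for 5 pm.",
    OutputFormat="mp3",
    VoiceId="Joanna",   # one of Polly's stock voices
)

# The synthesized audio comes back as a stream; save it for playback.
with open("reply.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```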

Conclusion — It’s Engineering, Not Voodoo

Recent advances in artificial intelligence enablers have led to great improvement in user experience and utility — at last making Intelligent Assistants tools, not toys. Some industry actors would have you believe that building them involves taking billions of data points and throwing them at some magical deep learning framework to have intelligence arise. In reality, they are a feat of engineering — purposefully architecting together the components in the right way. Some of these components are now widely available (at a price) — ASR, NLU, TTS. Some are still very much the secret sauce “do engines” behind assistants like Alexa, Siri and others.


I am busy electrifying. Founder / CEO of WorldMate (acquired by CWT), Desti (acquired by Nokia). Did time at a VC and a startup studio. Opinionated.