Building a Conversational AI Agent — Part 2: How does conversational AI work?

Gal Lellouche
NexC.co
8 min read · Jan 3, 2021

Intro

After getting questions and comments that I jumped ahead before explaining the basics, I have restructured the series, and this is now part 2. This piece is part of a series about nexC’s journey with conversational AI.

In this part, I will explain the basics you need to understand in order to design & build a conversational agent.

Second Part — Introduction to conversational AI

This part reviews the basic concepts and terms related to the field of conversational AI. My goal is to give you the basic knowledge required to make good decisions (parts 1 and 3) and to understand your team’s goals and language.

Conversational AI is built on one of two concepts:

  1. End-to-end (E2E) AI model — used mostly for chit-chat or at the research stage. Known implementations are Meena, Blender, and Mitsuku.
  2. Conversational AI pipeline (I couldn’t find any official definition; let me know if you find one) — usually a three-step pipeline built from NLU (natural language understanding), a dialogue manager, and NLG (natural language generation).

I will start with the more common and widely used concept, and review E2E models at the end.

The basics of conversational AI — Natural Language Processing

Natural Language Processing (NLP) is the technology used to help computers understand humans. It’s a branch of AI that deals with the interaction between humans and machines using natural language. The objective is to read, decipher, understand, respond to, and make sense of human language. NLP is an umbrella term for many tasks, some of which are reviewed later in this article, such as information retrieval, translation, sentiment analysis, entity recognition, and more.

If you have no previous knowledge in this area, you can continue, but I highly recommend reading more about it.

I will also answer the $$$ question and give you a head start for part 3: you can build a complete level 1 or 2 agent just by understanding the concepts in this article. But if you want to go further and push the limits, you must have the right knowledge or the right NLP team.

Conversational AI pipeline

First, we must separate the conversational AI from the conversational UI (not to be confused with the conversational experience, which is the result of both). The conversational UI is the interface between the user and the machine. For example, on most sites it will be the little widget with the text-powered bot, and in smart home appliances it will be a voice-powered assistant such as Alexa.

The UI can be custom or use common channels such as Facebook Messenger, Slack, WhatsApp, and more. It can use voice or text for input and output. Since it’s supposed to be an independent part of the architecture, most frameworks and bots simply use a transcription service and a voice synthesizer to normalize the input and output.

A word from us: this can work for some solutions, but not for others. People interact differently in each channel or input type, and sometimes different conversational agents are required to handle them.

An architecture of a conversational AI system

The main pipeline parts

In order to understand the flow, let’s process a message:

  1. The user writes to the chatbot: “I’m looking for a laptop”
  2. The channel sends the message to the backend
  3. The backend sends the text into the Natural Language Understanding (NLU) process, which translates the text into structured data
  4. The structured data is the input to the dialogue manager, which decides which action to perform
  5. The action produces structured data that is sent to the Natural Language Generation (NLG) process, which turns it back into plain text
  6. This text is sent back to the UI as a response
  7. The user gets a message: “Great, let’s help you find the best laptop…”

Understanding this pipeline is 70% of understanding conversational AI architecture.
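To make the flow concrete, here is a minimal sketch of the pipeline in Python. Every function is a hard-coded stand-in for a real model or service, and all the names are illustrative assumptions:

```python
# A minimal, illustrative conversational AI pipeline.
# Each function is a stand-in for a real model or service.

def nlu(text: str) -> dict:
    """Translate plain text into structured data (intent + entities)."""
    if "laptop" in text.lower():
        return {"intent": "search_product", "entities": {"category": "laptop"}}
    return {"intent": "unknown", "entities": {}}

def dialogue_manager(nlu_result: dict, state: dict) -> dict:
    """Decide which action to perform, given the NLU result and the state."""
    if nlu_result["intent"] == "search_product":
        state["current_category"] = nlu_result["entities"]["category"]
        return {"action": "start_product_search", "category": state["current_category"]}
    return {"action": "fallback"}

def nlg(action_result: dict) -> str:
    """Turn the structured action result back into plain text."""
    if action_result["action"] == "start_product_search":
        return f"Great, let's help you find the best {action_result['category']}..."
    return "Sorry, I didn't understand that."

state = {}
print(nlg(dialogue_manager(nlu("I'm looking for a laptop"), state)))
# -> Great, let's help you find the best laptop...
```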

Now let’s dive into the components.

NLU — Natural language understanding

NLU is a subset of NLP that covers all the tasks relevant to understanding human language. NLU encompasses one of the narrow but especially complex challenges of AI: transforming unstructured input into structured data.

A good example can be:

Text: “Alexa set a reminder to call mom tomorrow at 9 am”

NLU result:
  1. Intent: set a reminder
  2. Entities:
     - Reminder: call mom
     - Date: tomorrow at 9 am (4.1.2021 9:00)

In this simple example, I used the two most common parts of the NLU process:

  1. Intent classification — a text classification model that classifies the input text into a specific user intent. Some agents handle multiple intents in the same sentence, which dramatically increases the complexity of this model (a short sketch follows this list).
  2. Entity extraction (NER) — the task of identifying entities in the input. The idea is to identify the parts of the text that influence the required action, for example, the date of the reminder and its content.
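To illustrate the first part, intent classification can be as simple as a standard text classifier. This is a sketch assuming scikit-learn and a tiny made-up dataset; real agents train on far more examples, usually inside an NLP framework:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny, made-up training set: text -> intent label
texts = [
    "set a reminder to call mom tomorrow",
    "remind me to buy milk at 6 pm",
    "what's the weather today",
    "will it rain tomorrow",
]
intents = ["set_reminder", "set_reminder", "get_weather", "get_weather"]

# TF-IDF features + logistic regression = a simple intent classifier
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, intents)

print(classifier.predict(["please remind me to call dad"]))
# expected: ['set_reminder']
```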

In some systems, you will have to add more models to the NLU process, such as sentiment analysis and part-of-speech (POS) tagging. But intents and entities are the basics of most conversational agents.

In order to build a good agent, you will have to build a dataset of texts annotated with intents and entities to feed into the models’ training step.

Annotating text with intents and entities in the Rasa framework
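For a sense of what such a dataset looks like, here is one annotated example in a generic, hypothetical structure (each framework, Rasa included, has its own format):

```python
# One hypothetical annotated training example: intent plus entity spans.
training_data = [
    {
        "text": "set a reminder to call mom tomorrow at 9 am",
        "intent": "set_reminder",
        "entities": [
            {"entity": "reminder", "value": "call mom", "start": 18, "end": 26},
            {"entity": "date", "value": "tomorrow at 9 am", "start": 27, "end": 43},
        ],
    },
]
```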

Dialogue manager

A dialogue manager (DM) is the component responsible for saving the current state and turning the NLU result into an action. Unlike the NLU, which returns the same result for the same input regardless of the conversation state, the DM interprets these results according to the current conversation.

The most important part of the DM is state management. The state is the data that represents the agent’s current knowledge about the conversation. The state will contain the conversation history and some predefined variables (such as current_product, which we save if the user is currently looking at or talking about a specific product).

In some frameworks (part 3) and models, the DM is processed as part of the NLU, but you should be aware of the difference, since it has a high impact on the ability to build a truly contextual agent (level 3 and higher). For example, let’s assume the user is in the middle of a search for a laptop. If they suddenly want to dive into a specific product and ask a question about it, the system must understand the current state; otherwise it will fall into the loop of “I can’t understand your answer”.

The DM result is an action to execute. For the previous example, it will be: set_a_reminder(“call mom”, “4.1.2021 9:00”)
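Here is a minimal sketch of a DM, assuming made-up state fields and action names; real dialogue managers are usually learned models or policy engines rather than if-statements:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConversationState:
    """The agent's current knowledge about the conversation."""
    history: list = field(default_factory=list)  # past NLU results
    current_product: Optional[str] = None        # product being discussed, if any

def dialogue_manager(nlu_result: dict, state: ConversationState) -> dict:
    """Map the NLU result plus the current state to the next action."""
    state.history.append(nlu_result)
    intent = nlu_result["intent"]
    if intent == "set_reminder":
        entities = nlu_result["entities"]
        return {"action": "set_a_reminder",
                "content": entities["reminder"], "date": entities["date"]}
    if intent == "ask_about_product" and state.current_product:
        # Context matters: the question refers to the product already in focus
        return {"action": "answer_product_question", "product": state.current_product}
    return {"action": "fallback"}

# The same question only makes sense because of the saved state
state = ConversationState(current_product="ThinkPad X1")
print(dialogue_manager({"intent": "ask_about_product", "entities": {}}, state))
# -> {'action': 'answer_product_question', 'product': 'ThinkPad X1'}
```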

Action server

The action server is a simple backend service that receives an action to commit from the DM and returns a structured response. In most managed services, it will be the webhook connection used to fulfill the required action. The server allows you to connect your agent to your company’s knowledge; it’s the main gateway between the conversational service and the company’s services.
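As a sketch, the action server is often just an HTTP endpoint that receives the action payload and returns structured results. This example assumes Flask and a made-up /webhook contract, not any specific framework’s API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    """Receive an action from the DM and return a structured response."""
    payload = request.get_json()
    if payload["action"] == "set_a_reminder":
        # Here you would call your company's actual reminder service
        return jsonify({"action": "set_a_reminder", "status": "ok",
                        "content": payload["content"], "date": payload["date"]})
    return jsonify({"action": payload["action"], "status": "unsupported"})

if __name__ == "__main__":
    app.run(port=5055)
```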

Natural Language Generation (NLG)

NLG is another subset of NLP. It’s the “process of producing meaningful phrases and sentences in the form of natural language”, as defined in Artificial Intelligence: Natural Language Processing Fundamentals. If NLU is the reading, then NLG is the writing. In our domain, it is mostly the process of transforming structured data into a human-language response.

For example, after setting the reminder, the response:

reminder(content=“call mom”, date=“4.1.2021 9:00”)

is translated to: “Ok, I’ve set a reminder for tomorrow at 9 AM”.

Most of the available frameworks today don’t contain a real language generation model, but a rule-based one. A rule-based NLG works by defining templates whose placeholders are replaced by the structured data (a short sketch follows the list), such as:

  1. “Ok, I’ve set a @action @date”
  2. “Sorry, I couldn’t set a @action to @date”
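A minimal sketch of such template filling, with Python format placeholders standing in for the @-placeholders above:

```python
# Rule-based NLG: pick a template and fill it with the structured data.
TEMPLATES = {
    "ok": "Ok, I've set a {action} for {date}",
    "error": "Sorry, I couldn't set a {action} for {date}",
}

def generate(status: str, action: str, date: str) -> str:
    template = TEMPLATES["ok" if status == "ok" else "error"]
    return template.format(action=action, date=date)

print(generate("ok", "reminder", "tomorrow at 9 AM"))
# -> Ok, I've set a reminder for tomorrow at 9 AM
```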

Summarizing the Pipeline

Now that we understand all the parts, let’s repeat the most common flow using the new concepts.

  1. The user provides input, which is translated into plain text.
  2. The system feeds this text into the NLU, which returns structured data containing the user intent and the entities.
  3. The DM takes the NLU data together with the current state and transforms them into an action for the backend.
  4. The response from the action is translated back into plain text using the NLG model.

That’s all the basics you need to understand to design and build your first agent.

As an appendix, I will add a short review of E2E systems. Although they are most common in academia, we did implement some of their concepts in our models and design.

E2E models

End-to-end models are AI models that perform the entire pipeline in a single model. As I mentioned before, they are used mainly for chit-chat and in research. The main reason they are less common is the large amount of data needed to create a good domain-specific conversational model.

In most cases, you will find them as chatbots that impersonate characters using texts from TV shows or books, like Chandler from Friends.

The concepts are usually the same but implemented with completely different AI models. Usually, the NLU part is called the encoder and the NLG part the decoder, because these parts transform the input and output of human language.

End-to-end conversational AI model (with reinforcement learning as the DM)

The implementation can be relatively simple using pre-trained models such as GPT-2 or a seq2seq model. The newer and more innovative research uses huge models built just like the conversational AI pipeline:

An example that we found very interesting and informative is the HRED model and its improvement, KBRD. We also use a reinforcement learning model as part of the DM (as in the illustration), and I might elaborate on it in the following parts.
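To give a feel for how little glue code a pre-trained E2E model needs, here is a sketch using DialoGPT (a GPT-2 model fine-tuned on conversational data) through the Hugging Face transformers library. It illustrates the E2E idea in general, not the specific models mentioned above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# DialoGPT: GPT-2 fine-tuned on conversations (single-turn chit-chat here)
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# The whole NLU -> DM -> NLG pipeline collapses into one generate() call
input_ids = tokenizer.encode("I'm looking for a laptop" + tokenizer.eos_token,
                             return_tensors="pt")
reply_ids = model.generate(input_ids, max_length=100,
                           pad_token_id=tokenizer.eos_token_id)
reply = tokenizer.decode(reply_ids[0, input_ids.shape[-1]:],
                         skip_special_tokens=True)
print(reply)
```

Note that there is no explicit intent, state, or template anywhere; everything is implicit in the model’s weights.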

Summary

That’s it: we covered most of the technical concepts you need to speak the language. The main points:

  1. NLU — extracts intents and entities from the text.
  2. DM — uses the conversation state together with the intent and entities to predict the next action.
  3. NLG — generates the response to the user.

OK, that wraps up Part 2. Thanks for sticking around!

If you learned something useful, read the other parts:

  1. Part 1 — What am I solving?
  2. Part 2 — You are reading it!
  3. Part 3 — How to choose a conversational framework?
  4. Part 4 — Rasa 101 (in planning)
  5. Part 5 — Customizing the agents (in planning)

If you want to add your thoughts on the topic or ask questions about it, feel free to write a comment or contact me directly.


Building a Hyper-Personalized AI solution for Legal Professionals | Entrepreneur, Founder and CTO. Founder of nexC and Task Sheriff (acquired by Sage)