Chatbots with Machine Learning: Building Neural Conversational Agents

Using machine learning approaches to build smart chatbots

Dmitry Persiyanov
Cube Dev
12 min read · Sep 12, 2017


Have you ever talked to Siri, Alexa, or Cortana to set an alarm, call a friend, or arrange a meeting? Many people would agree that despite their usefulness for common, routine tasks, it’s difficult to get conversational agents to talk about general, sometimes philosophical, topics.

The Statsbot team invited a data scientist, Dmitry Persiyanov, to explain how to fix this issue with neural conversational models and build chatbots using machine learning.

Interacting with machines via natural language is one of the requirements for general artificial intelligence. This field of AI is called dialogue systems, spoken dialogue systems, or chatbots. The machine needs to provide you with an informative answer, maintain the context of the dialogue, and (ideally) be indistinguishable from a human.

In practice, the last requirement is not yet attainable, but luckily, humans are ready to talk with robots as long as they are helpful, sometimes funny, and interesting interlocutors.

There are two major types of dialogue systems: goal-oriented (Siri, Alexa, Cortana, etc.) and general conversation (Microsoft Tay bot).

The former help people to solve everyday problems using natural language, while the latter attempt to talk with people on a wide range of topics.

In this post, I will give you a comparative overview of general conversation dialogue systems based on deep neural networks. I will describe the main architecture types and ways to improve them. There will also be plenty of links to papers, tutorials, and implementations.

I hope this post will eventually become the entry point for everyone who wants to create chatbots with machine learning. If you read this post till the end, you will be ready to train your own conversational model. Ready?

Go ahead :)

I’m going to refer to recurrent neural networks and word embeddings, so you should know how they work in order to follow the article easily. For those who need to refresh their knowledge, I’ve collected great tutorials at the end of the article.

Generative and selective models

General conversation models can be divided into two major types: generative and selective (or ranking) models. Hybrid models are also possible. What they have in common is that they consume several sentences of dialogue context and predict a reply for that context.

Throughout this post, when I say “network consumes a sequence of words” or “words are passed to RNN,” I mean that word embeddings are passed to the network, not word ids.

Note on dialogue data representation

Before going deeper, we should discuss what dialogue datasets look like. All models described below are trained on (context, reply) pairs. The context is the several sentences (maybe just one) that precede the reply. A sentence is just a sequence of tokens from the vocabulary.

For a better understanding, look at the example below. Here is a raw dialogue between two people, from which a batch of three (context, reply) samples was extracted:

- Hi!
- Hi there.
- How old are you?
- Twenty-two. And you?
- Me too! Wow!

Note the “<eos>” (end-of-sequence) token at the end of each sentence in the batch. This special token helps the network recognize sentence boundaries and update its internal state accordingly.

Some models may use additional meta information from data, such as speaker id, gender, emotion, etc.
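To make this representation concrete, here is a minimal sketch (plain Python; the helper name and the context-window size are my own choices, not from the original table) that turns the raw dialogue above into (context, reply) pairs with <eos> markers:

```python
EOS = "<eos>"

def build_samples(utterances, max_context_turns=3):
    """Extract (context, reply) pairs from a raw dialogue.

    Each reply is preceded by up to `max_context_turns` utterances of
    context, and every utterance is terminated with the <eos> token.
    """
    samples = []
    for i in range(1, len(utterances)):
        context_turns = utterances[max(0, i - max_context_turns):i]
        context = " ".join(turn + " " + EOS for turn in context_turns)
        reply = utterances[i] + " " + EOS
        samples.append((context, reply))
    return samples

dialogue = ["Hi!", "Hi there.", "How old are you?",
            "Twenty-two. And you?", "Me too! Wow!"]
for context, reply in build_samples(dialogue):
    print(repr(context), "->", repr(reply))
```

Exactly which pairs you extract and how much context you keep per pair is a design choice.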

Now, we are ready to move on to discussing generative models.

Generative models

We start with the simplest conversational model, based on the paper “A Neural Conversational Model.”

[Figure: the seq2seq architecture. The encoder RNN reads the context tokens A, B, C; the decoder RNN generates the reply tokens W, X, Y, Z.]

For modeling dialogue, this paper adopted the sequence-to-sequence (seq2seq) framework, which emerged in the neural machine translation field and was successfully adapted to dialogue problems. The architecture consists of two RNNs with different sets of parameters. The left one (corresponding to the A-B-C tokens) is called the encoder, while the right one (corresponding to the <eos>-W-X-Y-Z tokens) is called the decoder.

How does the encoder work?

The encoder RNN consumes the sequence of context tokens one at a time and updates its hidden state. After processing the whole context sequence, it produces a final hidden state, which incorporates the meaning of the context and is used for generating the answer.
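Here is a minimal encoder sketch. PyTorch is my choice here, not the paper’s; vocab_size, embedding_dim, and hidden_size are hypothetical parameters:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_size, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word ids of the context
        embedded = self.embedding(token_ids)  # (batch, seq_len, embedding_dim)
        _, final_hidden = self.rnn(embedded)  # (1, batch, hidden_size)
        return final_hidden                   # the context representation
```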

How does the decoder work?

The goal of the decoder is to take context representation from the encoder and generate an answer. For this purpose, a softmax layer over vocabulary is maintained in the decoder RNN. At each time step, this layer takes the decoder hidden state and outputs a probability distribution over all words in its vocabulary.

Here is how reply generation works:

  1. Initialize the decoder hidden state with the final encoder hidden state (h_0).
  2. Pass the <eos> token as the first input to the decoder and update the hidden state (h_1).
  3. Sample the first word (w_1) from the softmax layer (using h_1), or take the one with max probability.
  4. Pass this word as input, update the hidden state (h_1 -> h_2), and generate a new word (w_2).
  5. Repeat step 4 until the <eos> token is generated or the maximum answer length is exceeded.
Reply generation in the decoder, for those who prefer formulas to words:

h_t = RNN(h_{t-1}, w_{t-1}; theta)
p_hat_t = softmax(g(h_t; phi))
w_t ~ p_hat_t

Here, w_t is the sampled word at time step t; theta are the decoder parameters, phi are the dense layer parameters, g represents the dense layers, and p_hat_t is the probability distribution over the vocabulary at time step t.

Using argmax while generating a reply, one will always get the same answer given the same context (argmax is deterministic, while sampling is stochastic).
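A sketch of this generation loop, continuing the PyTorch encoder above (the Decoder module and the greedy/sampling switch are my own framing of steps 1-5):

```python
class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)  # softmax layer over vocabulary

    def step(self, token_id, hidden):
        embedded = self.embedding(token_id)        # (batch, 1, embedding_dim)
        output, hidden = self.rnn(embedded, hidden)
        return self.out(output[:, -1]), hidden     # logits: (batch, vocab_size)

def generate_reply(encoder, decoder, context_ids, eos_id, max_len=20, greedy=True):
    # Assumes batch size 1 for simplicity.
    hidden = encoder(context_ids)              # step 1: init from final encoder state
    token = torch.tensor([[eos_id]])           # step 2: <eos> as the first input
    reply = []
    for _ in range(max_len):                   # step 5: cap the answer length
        logits, hidden = decoder.step(token, hidden)
        probs = torch.softmax(logits, dim=-1)
        if greedy:
            token = probs.argmax(dim=-1, keepdim=True)  # deterministic
        else:
            token = torch.multinomial(probs, 1)         # stochastic sampling
        if token.item() == eos_id:             # step 5: stop at <eos>
            break
        reply.append(token.item())             # steps 3-4: emit word, feed it back
    return reply
```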

The process I’ve described above is only the model inference part, but there is also the model training part, which works in a slightly different way — at each decoding step, we use the correct word y_t instead of the generated one (w_t) as the input. In other words, at training time, the decoder consumes a correct reply sequence, but with the last token removed and the <eos> token prepended.

[Figure: decoder inference phase. The output at the previous time step is fed as input at the current time step.]

The goal is to maximize the probability of the correct next word at each time step. Put more simply, we ask the network to predict the next word in the sequence given a correct prefix. Training is performed via maximum likelihood, which leads to the classical cross-entropy loss:

L(theta, phi) = -sum_t log p_hat_t(y_t)

Here, y_t is the correct word in the reply at time step t.
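One training step with teacher forcing might look like this, reusing the Encoder and Decoder modules above (the optimizer and tensor shapes are assumptions):

```python
criterion = nn.CrossEntropyLoss()

def training_step(encoder, decoder, context_ids, reply_ids, eos_id, optimizer):
    """Maximum-likelihood training step with teacher forcing."""
    hidden = encoder(context_ids)                 # (1, batch, hidden_size)
    batch_size = reply_ids.size(0)
    # Decoder input: the correct reply with the last token removed
    # and the <eos> token prepended.
    eos_col = torch.full((batch_size, 1), eos_id, dtype=torch.long)
    decoder_input = torch.cat([eos_col, reply_ids[:, :-1]], dim=1)

    embedded = decoder.embedding(decoder_input)
    outputs, _ = decoder.rnn(embedded, hidden)    # (batch, seq_len, hidden_size)
    logits = decoder.out(outputs)                 # (batch, seq_len, vocab_size)

    # Cross-entropy between predicted distributions and the correct words y_t.
    loss = criterion(logits.reshape(-1, logits.size(-1)), reply_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```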

Modifications of generative models

Now we have a basic understanding of the sequence-to-sequence framework. How do we add more generalization power to such models? There are a bunch of ways:

  1. Add more layers to the encoder and/or decoder RNNs.
  2. Use a bidirectional encoder. The decoder cannot be made bidirectional because of its left-to-right generation structure.
  3. Experiment with embeddings. You can initialize word embeddings with pre-trained ones or learn them from scratch together with the model.
  4. Use a more advanced reply generation procedure: beam search. Instead of generating a reply "greedily" (by taking the argmax for the next word), it considers the probabilities of longer chains of words and chooses among them.
  5. Make your encoder and/or decoder convolutional. Convnets can work much faster than RNNs because they parallelize efficiently.
  6. Use an attention mechanism. Attention was initially introduced in neural machine translation papers and has become a very popular and powerful technique.
  7. Pass the final encoder state to the decoder at every time step. The decoder sees the final encoder state only once and may then forget it. A good idea is to pass it to the decoder along with the word embedding.
  8. Use different encoder/decoder state sizes. The model described above requires the encoder and decoder to have the same hidden state size (because we initialize the decoder state with the final encoder state). You can get rid of this requirement by adding a projection (dense) layer from the final encoder state to the initial decoder state.
  9. Use characters, or byte pair encoding, instead of words for building the vocabulary. Character-level models are worth considering: they work faster because of the smaller vocabulary, and they can handle words that are not in their vocabulary. Byte Pair Encoding (BPE) is the best of both worlds: the idea is to find the most frequent pairs of tokens in a sequence and merge them into one token (see the sketch after this list).
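To illustrate the BPE merge idea from point 9, here is a simplified sketch (plain Python, not a production tokenizer; the helper names are mine):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")   # start from characters
for _ in range(4):                  # apply four merge operations
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)                       # frequent character pairs became single tokens
```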

Problems with generative models

Later, I’ll give you links to popular implementations so you can train your own dialogue models. But first, I’d like to warn you about some common problems with generative models that you may face.

Generic responses

Generative models trained via maximum likelihood tend to assign high probability to generic replies such as “Okay,” “No,” “Yes,” and “I don’t know” for a wide range of contexts. Some works deal with this problem, for example, by replacing the plain likelihood objective with one that promotes diversity in replies.

Reply inconsistency / how to incorporate metadata

The second major problem with seq2seq models is that they can generate inconsistent replies for paraphrased contexts that carry the same meaning, for example, giving different answers to “Where do you live now?” and “In which city do you live now?”

The most cited work dealing with this issue is “A Persona-Based Neural Conversation Model.” The authors attached a speaker id to each utterance and generated answers conditioned not only on the encoder state, but also on a speaker embedding. Speaker embeddings are learned from scratch along with the model.


Using this idea, you can augment your model with whatever metadata you have. For example, if you know the tense of an utterance (past/present/future), you can generate replies in different tenses at inference time! You can adjust the personality of the replier (gender, age, mood) or reply properties (tense, sentiment, question/not question, etc.), as long as you have such data to train the model on.
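A hedged sketch of how such conditioning might look, concatenating a learned speaker embedding with each word embedding before the decoder RNN (a common implementation choice, not necessarily the paper’s exact one):

```python
class PersonaDecoder(nn.Module):
    def __init__(self, vocab_size, num_speakers, embedding_dim, persona_dim, hidden_size):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.speaker_embedding = nn.Embedding(num_speakers, persona_dim)  # learned from scratch
        self.rnn = nn.GRU(embedding_dim + persona_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids, speaker_id, hidden):
        # token_ids: (batch, seq_len); speaker_id: (batch,)
        words = self.word_embedding(token_ids)        # (batch, seq_len, embedding_dim)
        persona = self.speaker_embedding(speaker_id)  # (batch, persona_dim)
        persona = persona.unsqueeze(1).expand(-1, words.size(1), -1)
        outputs, hidden = self.rnn(torch.cat([words, persona], dim=-1), hidden)
        return self.out(outputs), hidden
```

The same trick works for any categorical metadata: replace the speaker id with a tense, sentiment, or mood id.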

For your practice

I promised you links to seq2seq models implementations in different frameworks, and here they are.

TensorFlow

Keras

Papers & guides

Diving into selective models

Having covered generative models, let’s understand how selective neural conversational models work (they are often referred to as DSSMs, which stands for deep semantic similarity model).

Instead of estimating the probability p(reply | context; w), selective models learn a similarity function, sim(reply, context; w), where the reply is one of the elements in a predefined pool of possible answers.

The intuition is that the network takes context and a candidate reply as inputs and returns the confidence of how appropriate they are to each other.

The selective (or ranking, or DSSM) network consists of two “towers”: one for the context and one for the reply. Each tower may have any architecture you want. A tower takes its input and embeds it into a semantic vector space (vectors C and R). Then the similarity between the context and reply vectors is computed, e.g., using cosine similarity: C^T R / (||C|| ||R||).

At inference time, we can calculate the similarity between given context and all possible answers and choose the one with maximum similarity.
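A minimal two-tower sketch (continuing in PyTorch; the mean-pooled embedding tower is just a placeholder, since each tower may have any architecture):

```python
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, vocab_size, embedding_dim, semantic_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.proj = nn.Linear(embedding_dim, semantic_dim)

    def forward(self, token_ids):
        pooled = self.embedding(token_ids).mean(dim=1)  # crude sentence encoder
        return self.proj(pooled)                        # semantic vector

def similarity(context_tower, reply_tower, context_ids, reply_ids):
    C = context_tower(context_ids)
    R = reply_tower(reply_ids)
    return F.cosine_similarity(C, R, dim=-1)            # C^T R / (||C|| ||R||)
```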

In order to train the model, we use a triplet loss. The triplet loss is defined on triplets (context, reply_correct, reply_wrong) and is equal to:

L = max(0, M - sim(context, reply_correct) + sim(context, reply_wrong))

where M is a margin hyperparameter. It’s very similar to the max-margin loss in SVMs.

What is reply_wrong? It is also called a “negative” sample (reply_correct is the “positive” one), and in the simplest case it is a random reply from the pool of answers. By minimizing such a loss, we learn the similarity function in a ranking fashion, where absolute score values aren’t informative. That’s fine: at the inference phase we only need to compare the scores of all replies and choose the one with the maximum score.
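The loss above in code (a one-liner sketch; margin plays the role of M):

```python
def triplet_loss(sim_positive, sim_negative, margin=0.5):
    """max(0, M - sim(context, reply_correct) + sim(context, reply_wrong))"""
    return torch.clamp(margin - sim_positive + sim_negative, min=0).mean()
```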

You may dive deeper into DSSMs on a dedicated Microsoft project page. There are not as many open-source implementations as there are for generative models, but you may refer to this tutorial, which implements a selective model in TensorFlow.

Sampling schemes in selective models

You may ask: why should we take just a random sample from the dataset? Maybe it’s a good idea to use a more complex sampling scheme? It is. If you look closer, you may realize that the number of possible triplets is O(n³), so it’s important to choose negatives properly, because we can’t go through all of them (big data, you know).

For example, we could sample K random negative replies from the pool, score them, and choose the one with the maximum score as our negative. This scheme is called “hard negative” mining. If you want to dig deeper, read the paper “Sampling Matters in Deep Embedding Learning.”
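A sketch of hard negative mining on top of the two-tower model above (K and the candidate pool tensor are hypothetical):

```python
def hard_negative(context_tower, reply_tower, context_ids, pool_ids, k=16):
    """Sample K random negatives, score them, and return the hardest one."""
    idx = torch.randint(len(pool_ids), (k,))
    candidates = pool_ids[idx]                    # (k, seq_len)
    C = context_tower(context_ids).expand(k, -1)  # broadcast the context vector
    scores = F.cosine_similarity(C, reply_tower(candidates), dim=-1)
    return candidates[scores.argmax()]            # the highest-scoring wrong reply
```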

Generative vs selective: pros and cons

At this moment, we have an understanding of how both generative and selective models work. But which type should you choose? That fully depends on your needs. The comparison below summarizes the trade-offs discussed above:

Generative: replies are generated word by word, so they can be novel, but the model risks producing generic or inconsistent answers.
Selective: replies come from a predefined, curated pool, so they are always well-formed, but the model can never say anything outside that pool.

The hardest part is evaluation

One of the most important questions is how to evaluate neural conversational models. There are many automatic metrics which are used to evaluate chatbots with machine learning:

  • Precision/recall/accuracy for selective models
  • Perplexity/loss value for generative models
  • BLEU/METEOR scores from machine translation

But some recent research works have shown that all such metrics are poorly correlated with human judgment of how appropriate a reply is for a given context.

For example, suppose your dataset contains the context “Is Statsbot disrupting the way we work with data?” with the reply “It surely does.” Now suppose your model answers this context with something like “It’s definitely true.” All the metrics above will give this answer a low score, yet we can see it is as good as the one in the data.
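You can observe this effect with, e.g., NLTK’s sentence-level BLEU (a quick sketch; the smoothing is needed for such short sentences):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["it", "surely", "does"]
hypothesis = ["it", "'s", "definitely", "true"]

# BLEU measures n-gram overlap with the reference, so a perfectly
# adequate paraphrase that uses different words scores close to zero.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(score)  # very low, despite the reply being fine
```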

Therefore, the most reliable approach today is to perform human evaluation of your models on your target metric and then choose the best one. Yes, this is an expensive process (you need something like Amazon Mechanical Turk to evaluate the models), but at the moment we don’t have anything better. In any case, the research community is moving in this direction.

Why don’t we see them in our smartphones?

So, are we finally ready to create the most powerful and intelligent conversational model, a general artificial intelligence? If that were so, companies like Apple, Amazon, and Google, with their thousands of researchers, would have already deployed it alongside their personal assistant products.

Despite a lot of work in this area, neural dialogue systems are not yet ready to talk with humans in an open domain and provide informative, funny, or helpful answers. But for closed domains (technical support or Q&A systems, for example), there are success stories.

Tutorials on RNN and word embeddings

Recurrent Neural Networks

Word embeddings

Conclusion

Conversational models may seem difficult to grasp at first (and not only at first). I advise you to read the original papers I linked to. There is also a curated collection containing many essential papers on dialogue systems.

When you’re ready to practice, choose a simple architecture, take one of the popular datasets or mine your own (from Twitter, Reddit, or wherever), and train a conversational model on it.
