ChatBots, the AI for all (Part 2: Dive in NLP & Deep Learning)

Youssef Fenjiro
6 min read · Jan 5, 2019


Deep NLP (Natural Language Processing) is a critical piece of any human-facing artificial intelligence, and chatbots are one of its applications. In this blog, we will see how to implement a chatbot using a deep NLP model called Seq2Seq, which was initially designed for machine translation but has since been adopted for tasks such as text summarization, image captioning, and conversational modeling (chatbots). We will start by preparing the dataset used to train the Seq2Seq model, then discuss recurrent neural networks (RNNs), the basic building blocks of this model, and finally present the Seq2Seq architecture and explain how it works:

- Preparing dataset for NLP

- Recurrent neural networks (RNN)

- Seq2Seq model

I- Dataset preparation for NLP

To train a deep learning NLP network in supervised mode, we need a labeled dataset, so that the chatbot's Seq2Seq model can learn how to process questions and generate the corresponding answers. Several public conversational datasets can be used for this purpose.

In our example, we use the Cornell Movie-Dialogs Corpus, which contains 220,579 conversational exchanges (304,713 utterances) between 10,292 pairs of movie characters (involving 9,035 characters in total), extracted from 617 movies.
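As a rough illustration, here is a minimal Python sketch of how the corpus could be turned into question/answer pairs. It assumes the raw movie_lines.txt and movie_conversations.txt files and their “ +++$+++ ” field separator; the parsing details are simplified:

```python
# Minimal sketch: building (question, answer) pairs from the Cornell Movie-Dialogs Corpus.

def load_lines(path="movie_lines.txt"):
    """Map each line ID to its utterance text."""
    id2line = {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        for row in f:
            parts = row.split(" +++$+++ ")
            if len(parts) == 5:
                id2line[parts[0]] = parts[4].strip()
    return id2line

def load_pairs(id2line, path="movie_conversations.txt"):
    """Turn each conversation into consecutive (question, answer) pairs."""
    pairs = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for row in f:
            # The last field looks like "['L194', 'L195', 'L196']"
            ids = row.split(" +++$+++ ")[-1].strip().strip("[]").replace("'", "").split(", ")
            for q_id, a_id in zip(ids[:-1], ids[1:]):
                if q_id in id2line and a_id in id2line:
                    pairs.append((id2line[q_id], id2line[a_id]))
    return pairs

questions, answers = zip(*load_pairs(load_lines()))
```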

1- Text preprocessing

Cleaning:

First, from the two corpus files we build the “Questions” and “Answers” lists, then we clean the text by replacing short-form terms (contractions) with their corresponding long forms.
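A minimal cleaning sketch, reusing the questions and answers lists from the loading sketch above; the contraction list is illustrative, not exhaustive:

```python
import re

# Example cleaning step: lowercase the text and expand common short forms.
CONTRACTIONS = {
    "i'm": "i am", "he's": "he is", "she's": "she is", "that's": "that is",
    "what's": "what is", "where's": "where is", "won't": "will not",
    "can't": "cannot", "n't": " not", "'ll": " will", "'ve": " have",
    "'re": " are", "'d": " would",
}

def clean_text(text):
    text = text.lower()
    for short, long in CONTRACTIONS.items():
        text = text.replace(short, long)
    # Drop most punctuation so only words remain
    return re.sub(r"[-()\"#/@;:<>{}+=~|.?,!]", "", text)

clean_questions = [clean_text(q) for q in questions]
clean_answers = [clean_text(a) for a in answers]
```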

Filtering:

Remove infrequent words: count each word's occurrences and replace any word that appears fewer times than a certain threshold (for example, 20) with the tag <OUT>.
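A possible implementation, continuing from the cleaned lists above (the threshold of 20 is just an example):

```python
from collections import Counter

THRESHOLD = 20  # minimum number of occurrences to keep a word

# Count how often each word appears across questions and answers
word_counts = Counter(
    word for sentence in clean_questions + clean_answers for word in sentence.split()
)

def filter_rare(sentence):
    # Replace every rare word by the <OUT> tag
    return " ".join(
        word if word_counts[word] >= THRESHOLD else "<OUT>"
        for word in sentence.split()
    )

filtered_questions = [filter_rare(q) for q in clean_questions]
filtered_answers = [filter_rare(a) for a in clean_answers]
```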

Padding:

For the Seq2Seq model, question and answer sentences must have the same length, which is why we apply padding: a “<PAD>” token is appended whenever a sentence is shorter than the fixed length.
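A simple padding sketch, continuing from the filtered lists above (the maximum length of 25 is an example value):

```python
MAX_LEN = 25  # example fixed sentence length

def pad_sentence(sentence, max_len=MAX_LEN, pad_token="<PAD>"):
    # Truncate long sentences and pad short ones with the <PAD> token
    words = sentence.split()[:max_len]
    return words + [pad_token] * (max_len - len(words))

padded_questions = [pad_sentence(q) for q in filtered_questions]
padded_answers = [pad_sentence(a) for a in filtered_answers]
```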

2- Tokenizing

Since deep learning models understand only mathematics and numbers, the input word sequences must be encoded into vectors of numbers before being fed to the Seq2Seq model. We use a two-step process to convert text into numbers that a neural network can use.

The first step is tokenizing, which converts text words into integer tokens: the text is split into smaller parts (words and punctuation marks) called tokens; two dictionaries are created, one for “Questions” and another for “Answers”, because their vocabularies differ; and start <SOS> and end <EOS> tokens are added at the beginning and end of each utterance.
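A possible tokenizing sketch, continuing from the padded lists above (the special-token ids and helper names are assumptions):

```python
SPECIALS = ["<PAD>", "<OUT>", "<SOS>", "<EOS>"]

def build_vocab(sentences):
    # Reserve ids for the special tokens, then add every new word
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for sentence in sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

questions_vocab = build_vocab(padded_questions)  # one dictionary per side,
answers_vocab = build_vocab(padded_answers)      # since the vocabularies differ

def encode(sentence, vocab, add_markers=False):
    ids = [vocab.get(w, vocab["<OUT>"]) for w in sentence]
    # Answers get <SOS>/<EOS> markers so the decoder knows where to start and stop
    return [vocab["<SOS>"]] + ids + [vocab["<EOS>"]] if add_markers else ids

encoded_questions = [encode(q, questions_vocab) for q in padded_questions]
encoded_answers = [encode(a, answers_vocab, add_markers=True) for a in padded_answers]
```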

3- Word Embedding (Encoding corpus words)

The second step is to convert the integer tokens (words) into vectors of floating-point numbers. Many methods exist, such as bag-of-words (e.g. TF-IDF or count vectorization), LDA, LSA, and word embedding. The last one, word embedding, is recommended since it does not suffer from drawbacks such as high-dimensional vectors that grow with the corpus size.

Word embedding encodes every word in a predefined, fixed vector space of N dimensions (e.g. N = 300), regardless of the size of the corpus. The word vectors encode the semantic relationships between words: two words have similar meanings if their vectors are close (e.g. as measured by cosine similarity).
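As an illustration, here is a minimal PyTorch sketch of a trainable embedding layer; the vocabulary size, dimension, and toy batch are example values, not the article's actual settings:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 10_000, 300  # example sizes

# Trainable lookup table mapping integer tokens to 300-dimensional vectors
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=0)

token_ids = torch.randint(0, VOCAB_SIZE, (2, 25))  # toy batch of 2 padded sentences
word_vectors = embedding(token_ids)                # shape: (2, 25, 300)

# Semantic closeness between two (trained) word vectors via cosine similarity
similarity = nn.functional.cosine_similarity(word_vectors[0, 0], word_vectors[0, 1], dim=0)
```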

II- Recurrent Neural Networks (RNN)

An RNN is a deep network that extracts temporal features while processing sequences of inputs such as text, audio, or video. It is used when we need history/context to produce an output based on previous inputs, for example in video tracking, image captioning, speech-to-text, translation, and stock forecasting.

An RNN cell uses its internal memory to maintain information about previous inputs and updates its hidden state accordingly, which allows it to make a prediction for every element of a sequence.

RNNs have shown great success in many NLP tasks. The most widely used type of RNN is the LSTM, which captures long-term dependencies much better than a vanilla RNN can (the latter suffers from the vanishing gradient problem). The GRU is a newer variant with a less complex structure (fewer parameters) than the LSTM; its training is a bit faster and needs less data, but it may give slightly lower results.
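A toy PyTorch comparison of the three recurrent cells (input size, hidden size, and sequence length are illustrative):

```python
import torch
import torch.nn as nn

EMBED_DIM, HIDDEN = 300, 512
x = torch.randn(2, 25, EMBED_DIM)  # batch of 2 sequences of 25 word vectors

rnn = nn.RNN(EMBED_DIM, HIDDEN, batch_first=True)
lstm = nn.LSTM(EMBED_DIM, HIDDEN, batch_first=True)  # gated cell state helps with long-term dependencies
gru = nn.GRU(EMBED_DIM, HIDDEN, batch_first=True)    # fewer parameters than the LSTM

out_rnn, h_rnn = rnn(x)      # h_rnn: final hidden state
out_lstm, (h, c) = lstm(x)   # the LSTM also returns a cell state c
out_gru, h_gru = gru(x)
```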

III- Seq2Seq architecture and functioning

Almost all tasks in NLP can be performed using sequence-to-sequence mapping models: machine translation, summarization, question answering, and many more. The encoder-decoder model is a recurrent neural network architecture for sequence-to-sequence prediction problems in NLP: it takes a sequence as input and generates another sequence as output. It is composed of two sub-modules (a code sketch follows the list):

· Encoder: processes the input sequence to detect important patterns and compress it into a smaller, fixed-length “context vector”. This feature vector holds the information representing the input and becomes the initial state of the first recurrent layer of the decoder.

· Decoder: generates a sequence of its own that represents the output, producing the closest match to the intended output during training, or to the actual input during testing or after going live.
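A minimal PyTorch sketch of such an encoder-decoder; the vocabulary and layer sizes are illustrative, not the exact architecture used here:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)

    def forward(self, src):
        outputs, state = self.lstm(self.embed(src))
        return outputs, state            # state = (h, c): the "context vector" for the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tgt, context):
        outputs, _ = self.lstm(self.embed(tgt), context)  # initialized with the encoder state
        return self.out(outputs)         # scores over the answer vocabulary

encoder, decoder = Encoder(vocab_size=8000), Decoder(vocab_size=8000)
src = torch.randint(0, 8000, (2, 25))    # toy question batch
tgt = torch.randint(0, 8000, (2, 27))    # toy answer batch with <SOS>/<EOS>
_, context = encoder(src)
logits = decoder(tgt, context)           # shape: (2, 27, 8000)
```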

To improve the performance and accuracy of the model, two additional techniques can be used:

Attention mechanism:

To perform well on long input or output sequences, we use the attention mechanism, which tells the model which specific parts of the input sequence to focus on when decoding, by providing a richer context from the encoder instead of using only the raw “context vector”. See the example below:
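For instance, a minimal sketch of dot-product (Luong-style) attention; the tensor shapes are assumptions:

```python
import torch

def attention(decoder_state, encoder_outputs):
    # decoder_state: (batch, hidden); encoder_outputs: (batch, src_len, hidden)
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = torch.softmax(scores, dim=1)                                      # attention weights
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)       # (batch, hidden)
    return context, weights

# Toy call: batch of 2, 25 encoder outputs, hidden size 512
context, weights = attention(torch.randn(2, 512), torch.randn(2, 25, 512))
```

At each decoding step, the weights show how strongly the model attends to each input position, and the weighted context vector replaces the single raw context vector.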

Beam search:

Beam search is an algorithm that builds a search tree and, at each level of the tree, keeps only the N best candidates (a limited set of nodes), extending them greedily to find the best overall path.
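A toy Python sketch of the idea; the step function, beam width, and step limit are illustrative assumptions:

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_steps=20):
    beams = [([start_token], 0.0)]               # (sequence, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))  # finished beams are kept as-is
                continue
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        # Keep only the beam_width best partial sequences at each level of the tree
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Dummy step function that always prefers the end token
best = beam_search(lambda seq: [("<EOS>", math.log(0.9)), ("hi", math.log(0.1))],
                   "<SOS>", "<EOS>")
```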

Conclusion

The Seq2Seq model makes it possible to build a more realistic and human-like chatbot. The dataset is also a crucial element in this equation: the larger and more diversified it is, the better the user experience and perception.
