Building a Chatbot using Deep Learning with TensorFlow and Keras

Subrata Maji
24 min read · Oct 2, 2020


(Image credit: https://www.information-age.com/)

Abstract

Dialogue generation, or intelligent conversational agent development using artificial intelligence or machine learning techniques, is an interesting problem in the field of Natural Language Processing. Dialogue/conversational agents are predominantly used by businesses, government organizations and non-profit organizations. They are frequently deployed by financial organizations like banks and credit card companies, and by businesses like online retail stores and start-ups. Among current chatbots, many are developed using rule-based techniques, simple machine learning algorithms or retrieval-based techniques, which do not generate good results. In this experiment I have built a chatbot using the seq2seq architecture, both without and with an attention mechanism. RNN-LSTM is used for the encoder and decoder to preserve time dependencies. Beam search is also used so that the model doesn't always predict only the single most probable output.

Introduction

(Image credit: https://chatbotsmagazine.com/)

What is a Chatbot

A chatbot is an artificial intelligence (AI) software that can simulate a conversation (or a chat) with a user in natural language through messaging applications, websites, mobile apps or the telephone.

A deep learning chatbot learns right from scratch through a process called “Deep Learning.” In this process, the chatbot is created using machine learning algorithms. A deep learning chatbot learns everything from its data and human-to-human dialogue.

How a chatbot works

The basic operations that occur during a human-chatbot interaction are listed below:

1. The human asks a question using text or speech.
2. The chatbot understands the content and the human's intention. This is the most important part.
3. The chatbot provides a response in the same medium as the request.
4. The human understands the chatbot's response, and the cycle continues.

Types of chatbot

There are mainly two kinds of bots, explained below:

1. Generative — In the generative model, the chatbot doesn’t use any sort of predefined repository. This is an advanced form of chatbot that uses deep learning to respond to queries.

2. Retrieval Based — In this form, the chatbot has a repository of responses that it uses to answer queries. An appropriate response is chosen based on the question, and the chatbot replies with it.

Use cases of Chatbots:

1. Customer Service — website support, IT helpdesk, in-app support, etc.

2. Sales — booking and pre-qualifying leads.

3. Marketing — product recommendation, starting a proactive conversation, etc.

Objectives

Business Problem

Chatbots are becoming more and more important in the lives of small and medium businesses. They are typically used as support for customer service, form-filling help, or processing large volumes of data. Chatbots are being made to ease the pain that industries are facing today. However, today's chatbots need to be intelligent, purposeful and accurate to survive.

That's why one should create a chatbot that can understand the intent of a customer's request, query or complaint and respond with accurate, precise information that the customer can understand. It needs to understand the customer's demands even when the sentence is complex. It has to be intelligent enough to predict what the customer is looking for.

(Image credit: https://www.callcentrehelper.com/)

End Goal

The end goal is to use deep learning and NLP to make these chatbots intelligent and able to understand the customer's intent most of the time.

There are some other features these chatbots should have: they should be accessible all the time, able to continue longer conversations and keep providing answers to customers, flexible enough to work across many situations and industries, and, most important, ensure customer satisfaction.

Chatbots should reduce human dependency and process large volumes of requests for industry and customer interaction; this way they can save time and money and deliver better productivity for the industry.

Constraints

1. Handling long and complex sentences.
2. Quicker response time.
3. Limitations of current NLP in handling problems like the mixing of local languages and slang.

How to build a chatbot

There are many ways to build chatbots, usually for a specific use rather than general use. Below are a few ways to create a chatbot:

1. Pattern Matching — One of the easiest ways to build. Such chatbots use a knowledge base which contains documents, and each document comprises a particular <pattern> and <template>. When the bot receives an input that matches the <pattern>, it sends the message stored in the <template> as a response. The <pattern> can either be a phrase like "What's your name?" or a pattern "My name is *", where the '*' is a regular expression. Typically, these <pattern> <template> pairs are manually inserted (see the first sketch after this list).

2. Natural Language Processing — Use of Bag of Words (BOW) with non-deep or deep networks. BOW vectors are large, reserving a place for each word present in the vocabulary. A sentence is then encoded with 0s and 1s, where 0 means the word is not present in the sentence and 1 means it is present. This method is fine for "Yes" or "No" types of responses. A few problems with this are the fixed input length (which is also very large and sparse), no sequence information being preserved, and very limited outputs (see the second sketch after this list).

3. Seq2seq Architecture — Another way is to use encoder-decoder models. This method can preserve the sequence information of the sentence. It uses embeddings, which are an upgrade over BOW encoding. We use a many-to-many seq2seq model where the input and output dimensions differ. Because the RNN weights are shared across time steps, the model still works even when the input and output lengths change. The problem with this model is that if the input is too long, the encoder cannot preserve the context of the sentence. An attention model can handle this, as it gives importance to a few particular input words when generating each word in the decoder.
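As an illustration of approach 1, here is a minimal, hypothetical pattern-template matcher (knowledge_base and respond are made-up names for this sketch):

import re

# A tiny hand-written knowledge base of <pattern>, <template> pairs
knowledge_base = [
    (r"what'?s your name\??", "I am a demo support bot."),
    (r"my name is (.*)", "Nice to meet you, {0}!"),
]

def respond(message):
    # Return the template of the first pattern that matches the message
    for pattern, template in knowledge_base:
        match = re.fullmatch(pattern, message.strip().lower())
        if match:
            return template.format(*match.groups())
    return "Sorry, I did not understand that."

print(respond("My name is Alex"))  # -> Nice to meet you, alex!

And a quick sketch of the BOW encoding from approach 2 (the vocabulary here is made up):

vocab = ["password", "reset", "order", "refund", "help"]

def bow_encode(sentence, vocab):
    # 1 if the vocabulary word occurs in the sentence, 0 otherwise
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(bow_encode("please help me reset my password", vocab))
# -> [1, 1, 0, 0, 1]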

Use of Deep Learning

(Image credit: http://connect.creativevirtual.com/)

Although there are many chatbots currently available, the majority of them are limited in functionality, domain, context and coherence. They often fail in long conversations and have reduced relevancy in dialogue generation. Most of these chatbots are developed for restricted domains, and the majority use simple rule-based techniques. They perform well in question-answering sessions and in very structured conversational modes, but they fail to emulate real human conversation and lack flexibility.

The retrieval model seldom makes mistakes as it is completely based on retrieving data. However, it has its own set of limitations: it can seem too rigid and the responses may not feel "human."

On the other hand, a deep learning chatbot can easily adapt its style to the questions and demands from its customers. However, even this type of chatbot can’t imitate human interactions without mistakes.

Generative models perform better with complex queries, but the generative model of chatbots is also harder to perfect, as knowledge in this field is fairly limited. Even though there is still a lot of experimentation going on in this field, deep learning has certainly shown a lot of promise and is already providing good results.

Source of Dataset

I am using the Twitter customer support dataset for training the chatbot. You can get this dataset from the link below:

The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact.

One of the main reasons to use this dataset is that all conversations are limited to 140 characters, they are real-life conversations, and they talk about problems and solutions.

Related Works

There have been many recent developments and experiments in conversational agent systems. Many advanced chatbots are using advanced Natural Language Processing (NLP) techniques and deep learning techniques like Deep Neural Networks (DNN).

A ten-minute introduction to sequence-to-sequence learning in Keras

Author: Francois Chollet

One of the most effective methods of building a chatbot is to use seq2seq models. This blog post is from Keras and teaches how to work with the seq2seq architecture.

Sequence-to-sequence learning (Seq2Seq) is about training models to convert sequences from one domain (e.g. sentences in English) to sequences in another domain (e.g. the same sentences translated to French). This can be used for machine translation or for free-form question answering (generating a natural language answer given a natural language question).

This model has two parts: encoder and decoder. Both consist of RNN (Recurrent Neural Network) layers, which are useful for keeping the sequential information of the input text.

The encoder takes inputs and returns hidden states for each time step, and at the end it passes the hidden states to the decoder, where they act as context for the entire given text at the next step. Another RNN layer acts as the "decoder": it is trained to predict the next characters of the target sequence, given previous characters of the target sequence.
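Condensed from that post, the training model looks roughly like this (num_tokens and latent_dim are illustrative values):

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

num_tokens, latent_dim = 1000, 256

# Encoder: keep only the final LSTM states as the context
encoder_inputs = Input(shape=(None, num_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: predicts the next token given the previous ones, seeded with the encoder states
decoder_inputs = Input(shape=(None, num_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(num_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')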

For the inference model, pass the entire text through the encoder and get the hidden state. Feed only one word to the decoder: the first word of the line. Get the hidden state and decoder output for the current time step and feed those into the next time step.

Chatbot using Seq2Seq LSTM models

This one is from Google Colab. It is an IPython notebook with complete code from the start through to prediction.

Official description: "In this notebook, we will assemble a seq2seq LSTM model using Keras Functional API to create a working Chatbot which would answer questions asked to it. Messaging platforms like Allo have implemented chatbot services to engage users. The famous Google Assistant, Siri, Cortana and Alexa may have been built using similar models."

Explaining the architecture below:

First, pre-process the data: build question-answer pairs. Remove unwanted data types produced while parsing the data. Append <START> and <END> to all the answers. Create a Tokenizer and load the whole vocabulary (questions + answers) into it.

Second, seq2seq model: the model requires three arrays, namely encoder input data, decoder input data and decoder output data. The model will have Embedding, LSTM and Dense layers.

1. encoder_input_data: Tokenize the questions and pad them to their maximum length.
2. decoder_input_data: Tokenize the answers and pad them to their maximum length.
3. decoder_output_data: Tokenize the answers and remove the first element from all the tokenized answers. This is the <START> element which we added earlier.
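A sketch of building those three arrays (tokenizer, questions, answers and the max lengths are assumed from earlier steps of that notebook):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Encoder input: tokenized, padded questions
encoder_input_data = pad_sequences(tokenizer.texts_to_sequences(questions), maxlen=max_q_len, padding='post')
# Decoder input: tokenized, padded answers (with <START> ... <END>)
decoder_input_data = pad_sequences(tokenizer.texts_to_sequences(answers), maxlen=max_a_len, padding='post')
# Decoder output: the same answers shifted left by one (drop <START>)
tokenized_answers = [seq[1:] for seq in tokenizer.texts_to_sequences(answers)]
decoder_output_data = pad_sequences(tokenized_answers, maxlen=max_a_len, padding='post')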

Last, inference model: create inference models which help in predicting answers. Encoder inference model: takes the question as input and outputs the LSTM states (h and c). Decoder inference model: takes two inputs, one being the LSTM states (the output of the encoder model), the second being the answer input sequences (the ones not having the <start> tag). It outputs the answers for the question which we fed to the encoder model, along with its state values.

Architecture

We will use a bidirectional many-to-many seq2seq model to build our chatbot.

Seq2seq model

Sequence-to-sequence learning (Seq2Seq) is about training models to convert sequences from one domain (e.g. sentences in English) to sequences in another domain (e.g. the same sentences translated to French).

(Image credit: https://www.javatpoint.com/tensorflow-types-of-rnn)

There are various seq2seq models like:

1. One to many: It deals with a fixed size of information as input that gives a sequence of data as output. Example: Image Captioning takes the image as input and outputs a sentence of words.

2. Many to one: It takes a sequence of information as input and outputs a fixed size of the output. Example: sentiment analysis where any sentence is classified as expressing the positive or negative sentiment.

3. Many to many: It takes a sequence of information as input, processes it recurrently and outputs a sequence of data. Example: machine translation, where the RNN reads a sentence in English and then outputs the sentence in French.

4. Bidirectional many-to-many: Synced sequence input and output. Notice that in every case there are no pre-specified constraints on the sequence lengths, because the recurrent transformation (green) is fixed and can be applied as many times as we like. Example: video classification, where we wish to label every frame of the video.

This architecture has 2 components: i. Encoder ii. Decoder.

The encoder takes inputs and returns hidden states for each time step, and at the end of the sentence it passes the hidden states to the decoder, where they act as context for the entire given input sequence. The decoder is trained to predict the next word of the target sequence, given previous words of the target sequence. Both are composed of RNN-LSTM layers.

RNN and LSTM

During any conversation or translation humans don’t start their thinking from scratch every second. We don’t throw everything away and start thinking from scratch again. Our thoughts have persistence. But traditional neural networks can’t do this, and it seems like a major shortcoming. RNN or Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

One of the problems with RNNs is that in practice they don't seem to be able to learn long-term dependencies. That is, if an output word depends on a word that came long before, the RNN struggles. LSTMs came to the rescue: Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies.

(Image credit: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

LSTM Networks have 4 main components:

1. Forget gate: This gate decides what information should be thrown away or kept. Information from the previous hidden state and information from the current input is passed through the sigmoid function. Values come out between 0 and 1. The closer to 0 means to forget, and the closer to 1 means to keep.

2. Input gate: To update the cell state, there is an input gate. Pass the previous hidden state and current input into a sigmoid function, which decides which values will be updated by transforming the values to be between 0 and 1: 0 means not important, 1 means important. Also pass the hidden state and current input into the tanh function to squish values between -1 and 1 to help regulate the network. Multiply the tanh output with the sigmoid output; the sigmoid output decides which information is important to keep from the tanh output.

3. Cell state: To calculate the cell state, first the current cell state gets pointwise multiplied by the forget vector. This can drop values in the cell state if they get multiplied by values near 0. Then take the output from the input gate and do a pointwise addition, which updates the cell state to the new values that the neural network finds relevant. That gives us our new cell state.

4. Output gate: The output gate decides what the next hidden state should be. Remember that the hidden state contains information on previous inputs. The hidden state is also used for predictions. First, pass the previous hidden state and the current input into a sigmoid function. Then pass the newly modified cell state to the tanh function. Multiply the tanh output with the sigmoid output to decide what information the hidden state should carry. The output is the hidden state. The new cell state and the new hidden state are then carried over to the next time step.
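For reference, the four components map onto the standard LSTM update equations (a common formulation; here \sigma is the sigmoid and \odot the element-wise product):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)  (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)  (input gate)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)  (candidate values)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t  (cell state)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)  (output gate)
h_t = o_t \odot \tanh(C_t)  (hidden state)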

Attention Mechanism

The problem with plain seq2seq models is that all the information about the input sentence is encoded into a fixed-length vector. This causes problems when translating long sentences. If you look at how humans translate, we don't take the entire sentence into our mind at once and try to translate it entirely. We usually take a small part of the input, translate it, then go back to the original input again; there is a back-and-forth structure. In order to solve this, an attention mechanism was introduced into encoder-decoder models.

Just like the everyday meaning of "attention", when we look at a picture or hear a song, we usually focus more on some parts and pay less attention to the rest. The attention mechanism in deep learning follows the same flow, paying greater attention to certain parts when processing the data. Attention is one component of a network's architecture.

In the attention paper, the authors improved the traditional encoder by introducing a bidirectional RNN encoder. The motivation is that if a relevant word appears in the input later than the position the decoder is currently generating for, a unidirectional encoder cannot give importance to the proper word in the input sequence. This is a real-life scenario: in language translation it is normal for a word to appear in the input much later than its counterpart appears in the decoder output. The decoder should predict the word from the vocabulary that has the highest probability given the context vector and hidden state, but the difference here is that the context vector is not the same for all output words.

The context vector depends on the concatenated hidden states of the encoder and the weights alpha over the Tx input positions, where Tx is the length of the input sequence. So the context vector is a weighted sum of the alphas and the hidden states.
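In the notation of Bahdanau et al., the context vector for output position i is:

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)

where h_j is the (concatenated bidirectional) encoder hidden state at position j, s_{i-1} is the previous decoder state, and a is a small feed-forward network learned jointly with the rest of the model.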

In this model type, a variable-length alignment vector a_t, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state h_t with each source hidden state h̄_s.

Here, score is referred to as a content-based function, for which three different alternatives are considered:
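The three scoring alternatives from Luong et al. (2015) are:

score(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top} \bar{h}_s & \text{(dot)} \\
h_t^{\top} W_a \bar{h}_s & \text{(general)} \\
v_a^{\top} \tanh(W_a [h_t; \bar{h}_s]) & \text{(concat)}
\end{cases}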

The attention weights can be calculated by applying a softmax layer on the scores. Given the alignment vector as weights, the context vector c_t is computed as the weighted average over all the source hidden states.

The BLEU score is a measure of translation quality. From the plot we can see that the two non-attention models, RNNenc-30 and RNNenc-50, perform well for inputs of 20-30 words, after which their performance goes down. The attention-based model RNNsearch-30 shows a slight improvement, and RNNsearch-50 performs consistently for any number of input words.

There are a few nice heat maps for better visualization:

Each pixel in the heat map above shows the weight alpha: white and light blocks have high values of alpha, while black and dark pixels have lower values. The x-axis of the plot is for English words and the y-axis for French words. To translate a word, along with the correct word the model gives attention to some other words to get the correct prediction.

So to conclude, I am going to use a seq2seq model with bidirectional LSTM, and to improve the result I will add an attention mechanism to the model. In the inference function, using beam search instead of greedy search will also help with handling out-of-vocabulary words.

Performance metrics:

Categorical cross-entropy: to train the model, with a custom loss function that ignores the loss on padding.

BLEU score: To check the performance of the model.
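For BLEU, one convenient option is NLTK's implementation (a sketch; the notebook may compute it differently):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "your package will arrive tomorrow".split()
candidate = "your package arrives tomorrow".split()
# Smoothing avoids zero scores on short sentences with missing n-grams
score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
print(score)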

Exploratory Data Analysis

Now that we have decided on the model and metrics, let's check the data in detail and explore it with some basic EDA.

import pandas as pd

# Loading the data into a dataframe
tweets_df = pd.read_csv('twcs.csv', encoding='utf-8')
# Check the size of the data
tweets_df.shape
# out: (2811774, 7)

Details about all the columns and datatypes.

From this dataset we just need the text column, which we then restructure into conversations between customers and companies.

# Final shape of our data after making it question-answer conversations
data.shape
# out: (794299, 6)

Let's check the count of companies we have in our data, selecting the top 50 companies and replacing all others with "other" so that we can get a better visualization.

Statistics:

We need to check statistics of various counts for both questions and answers and then visualize them using PDFs and boxplots.

Let's make a utility function for that; a sketch follows below.
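The notebook's plotting utility is not reproduced here; a minimal sketch might look like this (plot_stats is a hypothetical name):

import matplotlib.pyplot as plt
import seaborn as sns

def plot_stats(series, title):
    # Plot the PDF (kernel density estimate) and boxplot of a numeric Series
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    sns.kdeplot(x=series, ax=ax1)
    ax1.set_title('PDF of ' + title)
    sns.boxplot(x=series, ax=ax2)
    ax2.set_title('Boxplot of ' + title)
    plt.show()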

Pdf and Boxplot:

Number of sentences present in each tweet:

from textblob import TextBlob

# Number of sentences in each question
data['qsn_len'] = data['question'].astype('str').apply(lambda x: len(TextBlob(x).sentences))
# Number of sentences in each answer
data['ans_len'] = data['answer'].astype('str').apply(lambda x: len(TextBlob(x).sentences))

Questions:

1. The number of sentences in questions ranges from 1 to 107.
2. Most texts have fewer than 5 sentences; the 99th percentile of rows has 5 or fewer sentences.
3. A few lines have 107 sentences, which is unusual. But as the data is restricted to 140 characters, we don't need to worry much about that.

Answers:

1. In answers, the number of sentences ranges from 1 to 11.
2. Most of the lines have fewer than 5 sentences.
3. The maximum number of sentences is only 11, which is normal.

Number of words present in each tweet:

# Word counts in questions
data['qsn_words'] = data['question'].astype('str').apply(lambda x: len(x.strip().split()))
# Word counts in answers
data['ans_words'] = data['answer'].astype('str').apply(lambda x: len(x.strip().split()))

Questions:

1. The minimum count is 1, meaning 1 word in the line.
2. The maximum is 112 words; such lines consist of many very short tokens.
3. Most of the lines have a word count of less than 50.
4. In the PDF we can see the values lie between 0 and 60 words per line, with an IQR of 14-24.

Answers:

1. The minimum is 1 and the maximum is 66. We can observe that the answer column is a lot more sensible than the question column.
2. Most of the lines have fewer than 46 words.
3. In the PDF, 0-50 words account for the highest percentage of occurrences.
4. The IQR is 14-24, the same as for questions, so both are somewhat similar in structure.

Number of mentions

In tweets, many words start with "@" to mention another person. In these tweets we also have mentions of users, companies and many others. Let's look at the counts of mentions in each tweet:

import re

# Mentions in questions
data['qsn_mention'] = data['question'].apply(lambda x: len(re.findall(r"@\S+", x)))
# Mentions in answers
data['ans_mention'] = data['answer'].apply(lambda x: len(re.findall(r"@\S+", x)))

1. The maximum number of mentions in questions is 25, and in answers it is 16.
2. Generally, users use more mentions than the customer support executives.

These mentions do not add any meaning to the text, so we will remove all mentions and user IDs from the text.

Number of words with hashtags “#”

Many words carry hashtags to create trends on Twitter. Let's check their stats:

1. The maximum value found in questions is 19, and in answers 6.
2. Hashtags are used a little less than mentions on Twitter.
3. But some people tweet using only hashtag words, putting hashtags before every word in their tweets.
4. So we will not remove hashtagged words, as they are generally part of the conversation and carry meaning.

Data Cleaning

As mentioned earlier, I am using the Twitter customer support dataset for training the chatbot. As the data is unstructured text coming from social media, we need to be extra careful during data preparation. A few important things to do here are: lowercasing, removing special characters, removing emojis and emoticons, etc.

Let's start cleaning our data.

Decontraction

We are going to expand words like I'm to I am, can't to can not, and so on; a sketch is given below.
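A minimal decontraction sketch (the mapping in the actual notebook is likely longer):

import re

def decontract(text):
    # Expand common English contractions
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    return text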

Removal of emojis

It's very common to use emojis in social media posts, and this data has many of them. There are also emoticons, which are not special symbols but ordinary keyboard characters used to draw faces, like :) :D and many others. We have to remove those too. A snippet like the one below can do this task.
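A sketch of emoji and emoticon removal (the Unicode ranges cover the common emoji blocks; the emoticon regex is an approximation):

import re

def remove_emojis(text):
    # Strip Unicode emojis from the common pictograph ranges
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags
        "]+", flags=re.UNICODE)
    text = emoji_pattern.sub('', text)
    # Strip common keyboard emoticons like :) :D ;-(
    text = re.sub(r"[:;=8][\-o\*']?[\)\(\]\[dDpP/\\]", '', text)
    return text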

Now we can clean the entire data using a function like the sketch below.
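A plausible end-to-end cleaning function combining the steps above (the exact notebook version may differ):

def clean_text(text):
    text = str(text).lower()                      # lowercasing
    text = re.sub(r'@\S+', '', text)              # remove @mentions / user ids
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
    text = decontract(text)                       # expand contractions
    text = remove_emojis(text)                    # strip emojis and emoticons
    text = re.sub(r'[^a-z0-9#\s]', ' ', text)     # drop special characters, keep hashtags
    return re.sub(r'\s+', ' ', text).strip()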

As discussed during EDA, we need to check a few more stats on the cleaned data, so let's do that.

Number of words in Text

1. The maximum number of words is 68 in questions and 63 in answers.
2. Questions have a higher number of words than answers.
3. In questions, lengths of 5-25 words have the highest volume; in answers it is 10-25.
4. We can use a max_len of 50 for our text, which covers over the 99th percentile of values.

Count of common words

1. There are not many common words between questions and answers.

2. There are a lot of pairs with 0 common words, i.e. none at all. Counts of 1 and 2 also occur quite often.

3. Counts of 3-5 common words appear noticeably often too.

Frequency of words in corpus

To get a better visualization I am applying a log to the counts, because the range is very large and most of the values are close to 0.

1. Looking for the most frequent words in questions and answers.

2. After looking at the words, we can see that most words occur only a few times, while some words are present a huge number of times.

3. The top 25 words are all stopwords, and among the top 50 most common words, 45 are stopwords.

Data Preprocessing

Now that we have our data cleaned, we need to bring it into the format we can feed to the models, i.e. the encoder and decoder.

After looking at a few data points and checking the vocabulary length, I found there are lots of spelling mistakes, making the vocabulary size about 150,000+ for questions. To get a good result we need to correct the misspelled words.

# Using SymSpell to correct spelling
import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")
bigram_path = pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

# Utility function
def correct_spellings(text):
    """For a given sentence, return the sentence with word spellings corrected."""
    suggestions = sym_spell.lookup_compound(text, max_edit_distance=2)
    return suggestions[0].term

After correcting spellings, the question vocabulary reduced to 50,000 words and the answer vocabulary to 28,000. This also handles the problem of rare words in the dataset.

We are going to use the teacher forcing technique for training. We send a special token for the start of the sentence and another for the end, and on seeing the start token the model starts predicting the next word, and so on.

# Adding start and end tokens to the decoder input
data['clean_answer'] = '<start> ' + data['clean_answer'].astype(str) + ' <end>'

For the max length we select the 95th percentile value, which is 39, and create the encoder input, decoder input and decoder output datasets. We then need to tokenize the text and pad it to the max length.

MAXLEN = 39

# Taking data with word counts between 3 and 39
data = data[(data['qsn_len'] > 2) & (data['qsn_len'] <= MAXLEN)]
data = data[(data['ans_len'] > 2) & (data['ans_len'] <= MAXLEN)]
# Decoder output data: the answers without the <start> token
data['answer_out'] = data['clean_answer'].apply(lambda x: " ".join(x.split()[1:]))
# Selecting the necessary columns
data = data[['clean_question', 'clean_answer', 'answer_out']].copy()
data.rename(columns={'clean_question': 'question', 'clean_answer': 'answer_inp'}, inplace=True)

Now we have to tokenize this text into numbers. I am using the tf.keras Tokenizer() for this. Then, to improve performance, we load pre-trained embedding vectors, using a FastText model to generate the embedding vector of every word. The advantage of FastText is that we can also get vectors for out-of-vocabulary words. A sketch of the tokenization step is below.
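A sketch of the tokenize-and-pad step, assuming data and MAXLEN from above (variable names are illustrative):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Separate tokenizers for questions and answers (filters='' keeps <start>/<end> intact)
qsn_tokenizer = Tokenizer(filters='')
qsn_tokenizer.fit_on_texts(data['question'])
encoder_input = pad_sequences(qsn_tokenizer.texts_to_sequences(data['question']), maxlen=MAXLEN, padding='post')

ans_tokenizer = Tokenizer(filters='')
ans_tokenizer.fit_on_texts(data['answer_inp'])
decoder_input = pad_sequences(ans_tokenizer.texts_to_sequences(data['answer_inp']), maxlen=MAXLEN, padding='post')
decoder_output = pad_sequences(ans_tokenizer.texts_to_sequences(data['answer_out']), maxlen=MAXLEN, padding='post')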

Building Model

Now that we have all the data ready along with the pre-trained embedding matrix, we can move on to building the model. I will be using model sub-classing, with an Encoder, an Attention layer, a Decoder, and another class that ties the final model together.

Encoder

The encoder model takes the questions and, after creating embedding vectors of the inputs, passes them through bidirectional LSTM layers to get the context of the input sentence. We use a bidirectional LSTM because a word predicted by the decoder can depend on words after the current time step in the encoder. We also mask all the pads. A sketch is below.
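A minimal sub-classing sketch of such an encoder (layer sizes and names are illustrative, not the exact notebook code):

import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, emb_dim, lstm_units, embedding_matrix):
        super().__init__()
        # Pre-trained embeddings; mask_zero masks the padding tokens
        self.embedding = tf.keras.layers.Embedding(vocab_size, emb_dim, weights=[embedding_matrix], mask_zero=True, trainable=False)
        self.bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units, return_sequences=True, return_state=True))

    def call(self, inputs):
        x = self.embedding(inputs)
        out, fh, fc, bh, bc = self.bilstm(x)
        # Concatenate forward and backward states for the decoder
        return out, tf.concat([fh, bh], axis=-1), tf.concat([fc, bc], axis=-1)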

Attention

Now, the biggest problem with LSTM layers is that they do not perform well when the sentence is very long, because they squeeze the entire sentence's context into a single vector. With attention, the model can predict a word based on some particular input words, and for the next word it goes back to the input again and attends to other words. A sketch is below.
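A sketch of Bahdanau-style additive attention (one common choice; the notebook may use a different scoring function):

class Attention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, decoder_state, encoder_output):
        # (batch, units) -> (batch, 1, units) so it broadcasts over time steps
        state = tf.expand_dims(decoder_state, 1)
        score = self.V(tf.nn.tanh(self.W1(encoder_output) + self.W2(state)))
        weights = tf.nn.softmax(score, axis=1)  # attention weights alpha
        context = tf.reduce_sum(weights * encoder_output, axis=1)
        return context, weights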

OneStepDecoder

We use this layer as an intermediate layer of the decoder. It performs the entire work of the decoder on only a single word, so after getting outputs for all the words we can combine them and return them as the final model's predicted output.

This layer first gets the embedding vector for the word, then passes it through LSTM layers and finally through a dense layer, where the number of neurons is the same as the answer vocabulary size; the dense layer returns a probability over the entire vocabulary, with which we will later build our inference function. A sketch is below.
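A sketch of the one-step decoder (again illustrative; note the LSTM units must match the size of the concatenated encoder states):

class OneStepDecoder(tf.keras.Model):
    def __init__(self, vocab_size, emb_dim, lstm_units, att_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, emb_dim)
        self.attention = Attention(att_units)
        self.lstm = tf.keras.layers.LSTM(lstm_units, return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)  # scores over the answer vocab

    def call(self, token, encoder_output, h, c):
        x = self.embedding(token)  # (batch, 1, emb_dim)
        context, weights = self.attention(h, encoder_output)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, h, c = self.lstm(x, initial_state=[h, c])
        return self.dense(output), h, c, weights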

Decoder

The decoder takes the encoder output and the encoder's final states and returns the predicted sentence. The initial state for the decoder is the encoder's final state. With the help of the OneStepDecoder layer it gets outputs for all the words, where the current hidden states and output are used to predict the next time step's word.

Encoder_decoder

Now that we have built all the layers, we create another class to build the model, passing all the required parameters as arguments and calling the layers created above. A combined sketch of the decoder loop and the model wrapper is below.
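A combined sketch of the decoder loop and the full model wrapper (hypothetical names; the notebook's version may differ in details):

class Decoder(tf.keras.Model):
    def __init__(self, one_step_decoder, maxlen):
        super().__init__()
        self.one_step = one_step_decoder
        self.maxlen = maxlen

    def call(self, decoder_input, encoder_output, h, c):
        outputs = tf.TensorArray(tf.float32, size=self.maxlen)
        for t in tf.range(self.maxlen):
            # Feed one target word at a time (teacher forcing)
            logits, h, c, _ = self.one_step(decoder_input[:, t:t + 1], encoder_output, h, c)
            outputs = outputs.write(t, logits)
        return tf.transpose(outputs.stack(), [1, 0, 2])  # (batch, maxlen, vocab)

class EncoderDecoder(tf.keras.Model):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def call(self, inputs):
        question, answer_inp = inputs
        enc_out, h, c = self.encoder(question)
        return self.decoder(answer_inp, enc_out, h, c)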

Custom loss function

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def custom_lossfunction(real, pred):
    # Custom loss function that will not consider the loss for padded zeros.
    # Refer https://www.tensorflow.org/tutorials/text/nmt_with_attention#define_the_optimizer_and_the_loss_function
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

After training the model for 10 epochs we can see the graphs below in TensorBoard:

As we can see, the loss is still decreasing, so let's train for 10 more epochs:

Inference

After training the model, we work on the inference function, which predicts the output sentence for a given input sentence. First tokenize and pad the input sequence; then the encoder returns the context vector for the sentence. The starting word is the "start" token, and then the current time step's output and states are fed back to the model to predict the next time step's word.

But instead of just picking the single word with the highest probability, as in greedy search, here we keep several possibilities (beam search) and pick based on those. A greedy sketch of the decoding loop is below.
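A sketch of the decoding loop in its greedy form, assuming the encoder and one-step decoder from the sketches above (beam search instead keeps the top-k partial sequences at each step rather than just the argmax):

def predict_answer(question_text):
    seq = pad_sequences(qsn_tokenizer.texts_to_sequences([question_text]), maxlen=MAXLEN, padding='post')
    enc_out, h, c = encoder(seq)
    token = tf.constant([[ans_tokenizer.word_index['<start>']]])
    words = []
    for _ in range(MAXLEN):
        logits, h, c, _ = one_step_decoder(token, enc_out, h, c)
        next_id = int(tf.argmax(logits, axis=-1)[0])  # greedy pick
        word = ans_tokenizer.index_word.get(next_id, '')
        if word == '<end>':
            break
        words.append(word)
        token = tf.constant([[next_id]])
    return ' '.join(words)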

I am also capturing the attention weights to plot them, which gives a nice visualization of which input words are being used to predict each output word. Let's look at a few samples.

Result Analysis

After predicting on the entire validation dataset and computing the BLEU score, we find:

Average BLEU score: 0.4310485454195498

The BLEU score is not that high, so we will now do some analysis on the results and try to find the possible causes why the model is not scoring better.

Below are the steps we are going to perform to analyze the results:

1. Divide the data into good, medium and bad based on the BLEU score of the sentence.
2. Check whether the number of words in a sentence affects performance.
3. Look into the number of common words between question and answer, and between question and prediction, and whether it has any effect.
4. Frequency of words in the corpus, for both questions and answers.
5. Number of very high-frequency words in a sentence.

Providing a few EDA plots which show some effect on the model's performance:

  • Number of words in a sentence, for answers
  • Frequency of words in questions

Plotting the frequency on a log scale, as otherwise it is difficult to visualize.

  • High-frequency words in sentences

I have done detailed EDA on all the features mentioned above; you can check the notebook on my GitHub.

Conclusion:

To conclude our EDA and explain the factors affecting the model's performance, we have the following details:

1. The number of words in the answer sentences impacts the model: it works well for sentences with fewer words, and as the count increases its performance goes down.

2. The vocabulary is one of the main factors behind the model's performance. A higher number of words in both the question and answer vocabularies reduces the BLEU score. The vocabulary has 80% rare words that the model is not able to learn, and then 5% frequent words which highly impact the result.

3. The model is only able to predict 15% of the words from the entire answer vocabulary; for all the other words, it substitutes high-frequency words.

4. When the number of frequent words in the answer increases and the number of frequent words in the predicted sentence decreases, the model gives "bad" BLEU scores.

Future Work

As the result of the experiment is not that satisfactory, the error analysis suggests a few things that can be done in future to improve the performance:

1. Train the model for more epochs.

2. Clean the data further and remove more of the rare and overly frequent words.

3. Select another dataset that doesn't have this many rare words in it.

4. Use newer and better techniques like transformers.

References

GitHub

You can find my entire work, all the code and IPython notebooks, on my GitHub profile.

LinkedIn
