Building A Generative Chat-bot Using Deep Learning and NLP …

Heman Oberoi
10 min read · Sep 14, 2020


Chat-bots are easily one of the most well-known examples of artificial intelligence.

Chat-bots will be a primary channel for interaction in the near future, as they have great potential compared to traditional methods of interaction.

Introduction

As the name suggests, this case study is all about building a generative chat-bot using Natural Language Processing techniques and Deep Learning concepts.

So what exactly is a chat-bot?

'Chat-bot' is a combination of two words: 'chat' and 'bot'.

Bots are basically automated programs: they run according to their instructions without a human user needing to start them up, and they often imitate or replace a human user's behavior. Chat refers to online conversation via text or speech.

So a chat-bot is a software program that uses text or speech to simulate interactions with customers automatically, instead of direct communication with a live human. In other words, a chat-bot is a service that can have a conversation with you just like a real person.

The goal of a chat-bot is to mimic human conversations in a quicker and more accurate manner. Chat-bots can be used through a variety of different mediums like SMS, live chat, or even social media.

What is a generative chat-bot?

A generative chat-bot generates its responses on its own, unlike a retrieval-based chat-bot, which chooses from predefined responses. Generative chat-bots are trained on a large number of previous conversations, from which responses to the user are generated, so they require a huge amount of conversational data to train.

Objective: The objective of this case study is to develop a generative chat-bot that generates a reply based on the context given by the user.

Business Benefits:

Chat-bots are bringing a new way for businesses to communicate with the world and, most importantly, with their customers. They benefit businesses in the following ways:

Customer Service: Chat-bots can take customer service to a whole new level for the following reasons:

  • They can be available 24/7 and can respond to customers immediately.
  • Their responses are consistent every time.
  • They can handle an endless number of conversations at the same time.
  • Chat-bots are a smarter way to ensure that customers receive the instant response that they demand.
  • By employing chat-bots we can avoid the problems caused by human errors.

Cost Cutting: Businesses that offer customer support hire live agents, which means infrastructure costs, training costs, and lost time. In such scenarios, chat-bots can be the best way to handle conversations. Hence chat-bots have become a cost-effective way for businesses to connect with prospective customers.

Increased Customer Engagement: In the area of sales, chat-bots can be great promoters that hook customers and increase engagement by encouraging them to try different products based on their preferences. For example, if you're having a sale or have new products, a chat-bot can be used to communicate these offers to customers, which in turn can increase revenue.

Consumer Analytics: Chat-bots can be useful for consumer analytics, as they provide in-depth customer insight, and the data derived from these conversations can shape business decisions.

Virtual Assistants: Speech-based chat-bots are used by many consumers and have become an important part of the user experience for performing basic personalized tasks. Popular speech chat-bots/virtual assistants include Amazon's Alexa and Google Assistant.

There are plenty of chat-bot benefits across business functions such as customer service, sales, and marketing. Their fast response times and ability to resolve simple requests remain distinct advantages.

Chat-bots can’t replace human agents entirely, but they certainly do take a load off of them.

Data Set Used: Amazon QA

This data set contains question and answer data from Amazon, totaling around 1.4 million answered questions, but we will consider only a few categories:

  • Appliances
  • Arts, Crafts and Sewing
  • Automotive
  • Cell Phones and Accessories
  • Clothing, Shoes and Jewelry
  • Electronics
  • Grocery and Gourmet Food

We will also use product metadata, as the nature of the questions/answers varies across products.

Performance Metric: Perplexity

Perplexity is one of the ways to evaluate language models.

The word “Perplexed” refers to confusion whereas the meaning of the word “Perplexity” is the inability to deal with or understand something.

So the perplexity metric measures how confused/uncertain the model is when predicting the output text. A low perplexity indicates that the model performs well, whereas a high perplexity denotes a model that performs poorly.

Perplexity is simply the exponential of the cross-entropy, and entropy is a measure of information.
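
Concretely, if the model's average per-token cross-entropy loss is L (measured in nats), its perplexity is exp(L). A minimal sketch:

```python
import numpy as np

def perplexity(avg_cross_entropy):
    """Perplexity is the exponential of the average per-token
    cross-entropy (measured in nats)."""
    return np.exp(avg_cross_entropy)

# e.g. an average loss of 4.0 nats per token corresponds to a perplexity of ~54.6
print(perplexity(4.0))
```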

Reading Data:

Merging Q/A and Product Meta Data:
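
The exact loading code is in the linked notebooks; the sketch below is a rough approximation, assuming the public Amazon QA/metadata dumps (gzipped files with one Python dict literal per line, joined on the `asin` product id). The file names are assumptions for the Appliances category.

```python
# Load one category of the Amazon QA dump and join it with the product
# metadata on the 'asin' product id.
import ast
import gzip
import pandas as pd

def load_dict_lines_gz(path):
    """Parse a gzipped file containing one Python-dict literal per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return pd.DataFrame(ast.literal_eval(line) for line in f)

qa = load_dict_lines_gz("qa_Appliances.json.gz")      # question/answer pairs
meta = load_dict_lines_gz("meta_Appliances.json.gz")  # product metadata

# Left-join so every Q/A pair keeps whatever metadata is available.
qa_meta = qa.merge(meta, on="asin", how="left")
print(qa_meta.shape)
```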

Exploratory Data Analysis

Exploratory data analysis is an approach to analyze data sets to summarize their main characteristics, often with visual methods.

Checking Missing values :

Checking if any column in the dataframe has NaN values.
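
A one-liner sketch, using the `qa_meta` name assumed in the loading sketch above:

```python
# Missing values per column, as counts and as percentages.
print(qa_meta.isnull().sum())
print((qa_meta.isnull().mean() * 100).round(2))
```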

Question Data

No. of Questions Vs Question Type:

Histogram of Question Word Count:

Box Plot of Question Word Count by Product Category:

Answer Data

No. of Answers Vs Answer Type:

Histogram of Answer Word Count:

Box Plot of Answer Word Count by Question Type:

Box Plot of Answer Word Count by Product Category:

Word Cloud for Answers:

Similar EDA on the product metadata can be found here.

Summary of EDA

  • Around 40%-50% of the product metadata is missing.
  • Price is missing for around 70% of the data.
  • There are two types of questions: Yes/No type and open-ended type.
  • Most answers are short to medium length, i.e. around 40 words.
  • Outliers with an exceptionally large number of words in the answer should be removed using some threshold.
  • All columns containing text should be processed, e.g. conversion to lower case and removal of special characters.

Data Pre-Processing

Following Pre-Processing Steps have been performed:

  • Removal of Data Points Which Do Not Have All Product Information
  • Filtering Data Based on Answer Word Count
  • Processing the Price Column
  • Lower-Case Conversion
  • Text De-Contraction
  • Adding Space Between a Word and the Punctuation
  • Removal of Special Characters
  • Removal of Extra Spaces
  • Adding Start and End Tokens to Questions and Answers

Code for performing these steps can be found here.
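
The linked notebook has the full implementation; the sketch below only illustrates the flavour of the text-level steps (the de-contraction map and the <start>/<end> token names are assumptions):

```python
import re

# A small, illustrative de-contraction map (the real list is longer).
CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not",
                "'re": " are", "'ll": " will", "'ve": " have", "'s": " is"}

def preprocess(text, add_tokens=False):
    text = str(text).lower()                          # lower-case conversion
    for pattern, repl in CONTRACTIONS.items():        # text de-contraction
        text = text.replace(pattern, repl)
    text = re.sub(r"([?.!,])", r" \1 ", text)         # space around punctuation
    text = re.sub(r"[^a-z0-9?.!,]+", " ", text)       # remove special characters
    text = re.sub(r"\s+", " ", text).strip()          # remove extra spaces
    if add_tokens:                                    # mark decoder sequences
        text = "<start> " + text + " <end>"
    return text

print(preprocess("Won't this fit a 15\" laptop?", add_tokens=True))
```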

Train-Validation-Test Split

Splitting Data into Train, Validation and Test Set.
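
A minimal sketch of a three-way split; the 80/10/10 ratios and the `qa_meta` dataframe name are assumptions carried over from the sketches above.

```python
from sklearn.model_selection import train_test_split

# First carve off 20%, then split that half-and-half into validation and test.
train_df, temp_df = train_test_split(qa_meta, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
print(len(train_df), len(val_df), len(test_df))
```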

Vectorization:

Label Encoding Categorical Variables:
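
A sketch of label encoding a categorical metadata column; `main_cat` is an assumed column name, and the encoder is fitted on the training split only.

```python
from sklearn.preprocessing import LabelEncoder

cat_encoder = LabelEncoder()
train_df["main_cat_enc"] = cat_encoder.fit_transform(train_df["main_cat"].astype(str))
# Categories unseen in training would need separate handling before transform.
val_df["main_cat_enc"] = cat_encoder.transform(val_df["main_cat"].astype(str))
test_df["main_cat_enc"] = cat_encoder.transform(test_df["main_cat"].astype(str))
```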

Tokenization, Padding:

What is Tokenization?

Tokenization is the process of splitting a string/sentence into a list of tokens/words and then assigning a unique number to each word in the vocabulary.

What is Padding ?

Different texts/sentences can have different lengths, i.e. different word counts, but the model needs inputs of the same size. So padding is done by adding zeros to each sequence to give every text the same length.
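
A minimal sketch with the Keras Tokenizer and pad_sequences; the maximum length and the `question` column name are assumptions, and `filters=""` keeps the <start>/<end> tokens intact.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(filters="", oov_token="<unk>")
tokenizer.fit_on_texts(train_df["question"])          # fit on training text only

train_q = tokenizer.texts_to_sequences(train_df["question"])
train_q = pad_sequences(train_q, maxlen=30, padding="post")   # zero-pad to a fixed length
print(train_q.shape)
```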

We are using the fastText model to fetch the embedding weights for each word, as it reduces the problem of out-of-vocabulary words.

We then initialize the Embedding layer with these weights and set it to non-trainable, i.e. transfer learning, which reduces training time and helps the model generalize well.
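
A sketch of building the embedding matrix and the frozen Embedding layer; using the official `fasttext` package and the public `cc.en.300.bin` model file is an assumption about the exact tooling.

```python
import numpy as np
import fasttext
from tensorflow.keras.layers import Embedding

ft = fasttext.load_model("cc.en.300.bin")
vocab_size = len(tokenizer.word_index) + 1             # +1 for the padding index 0
embedding_dim = 300

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    # fastText builds vectors from sub-words, so even out-of-vocabulary
    # words still get a meaningful embedding.
    embedding_matrix[idx] = ft.get_word_vector(word)

embedding_layer = Embedding(vocab_size, embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False,            # transfer learning: keep weights frozen
                            mask_zero=True)
```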

Similarly, other text features such as description, title, feature, and brand can be tokenized and padded.

Modelling :

A Sequence-to-Sequence (Seq2Seq) model is a good choice for this particular problem, as we have a sequence of words for both input and output, i.e. questions and answers respectively.

Along with Seq2Seq Model we will use the concept of Attention so that the model can perform better for long sentences.

Instead of encoding the input sequence into a single fixed context vector, the attention model develops a context vector that is filtered specifically for each output time step, so that the model can pay attention to the relevant parts of the input sequence.

Seq2Seq + Attention Model Architecture :

The Seq2Seq model involves two recurrent neural networks: one to encode the input sequence, called the encoder, and a second, called the decoder, to decode the encoded input sequence into the target sequence.

Encoder:
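
A minimal sketch of the encoder as a Keras subclassed layer: an embedding (which could be initialised with the fastText weights above) followed by an LSTM that returns both the full sequence of hidden states (for attention) and the final states (for the decoder). The layer sizes and the choice of LSTM over GRU are assumptions.

```python
import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, enc_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                                   mask_zero=True)
        self.lstm = tf.keras.layers.LSTM(enc_units,
                                         return_sequences=True,
                                         return_state=True)

    def call(self, input_ids):
        x = self.embedding(input_ids)                   # (batch, T_in, emb_dim)
        enc_output, state_h, state_c = self.lstm(x)     # enc_output: (batch, T_in, units)
        return enc_output, state_h, state_c
```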

Attention:
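
A sketch of Bahdanau-style (additive) attention, one common way to implement the mechanism described above (the exact scoring function used in the notebooks is an assumption here): score every encoder time step against the current decoder state, softmax the scores into weights, and return the weighted sum of encoder outputs as the context vector.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects encoder outputs
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder state
        self.V = tf.keras.layers.Dense(1)        # scalar score per time step

    def call(self, decoder_state, encoder_output):
        # decoder_state: (batch, units) -> (batch, 1, units) for broadcasting
        state_with_time = tf.expand_dims(decoder_state, 1)
        score = self.V(tf.nn.tanh(self.W1(encoder_output) + self.W2(state_with_time)))
        attention_weights = tf.nn.softmax(score, axis=1)           # (batch, T_in, 1)
        context_vector = tf.reduce_sum(attention_weights * encoder_output, axis=1)
        return context_vector, attention_weights
```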

One-Step Decoder:

The One-Step Decoder is used to process only one time step at a time.
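
A sketch of the one-step decoder, reusing the BahdanauAttention sketch above: embed a single target token, build a context vector with attention over the encoder outputs, and run one LSTM step to produce logits over the output vocabulary.

```python
import tensorflow as tf

class OneStepDecoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, dec_units, att_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.attention = BahdanauAttention(att_units)
        self.lstm = tf.keras.layers.LSTM(dec_units, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)   # logits over the vocabulary

    def call(self, target_token, encoder_output, state_h, state_c):
        x = self.embedding(target_token)                          # (batch, 1, emb_dim)
        context, att_weights = self.attention(state_h, encoder_output)
        # Concatenate the context vector with the embedded input token.
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state_h, state_c = self.lstm(x, initial_state=[state_h, state_c])
        logits = self.fc(output)                                  # (batch, vocab_size)
        return logits, state_h, state_c, att_weights
```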

Decoder:

The Decoder runs the One-Step Decoder over every time step of the target sequence and concatenates the single-step outputs into one tensor.
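
A sketch of that loop, reusing the OneStepDecoder above; feeding in the ground-truth previous word at each step (teacher forcing) is an assumption about the training setup.

```python
import tensorflow as tf

class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, dec_units, att_units):
        super().__init__()
        self.one_step = OneStepDecoder(vocab_size, embedding_dim, dec_units, att_units)

    def call(self, target_ids, encoder_output, state_h, state_c):
        all_logits = []
        for t in range(target_ids.shape[1]):
            token = target_ids[:, t:t + 1]                        # (batch, 1)
            logits, state_h, state_c, _ = self.one_step(token, encoder_output,
                                                        state_h, state_c)
            all_logits.append(logits)
        return tf.stack(all_logits, axis=1)                       # (batch, T_out, vocab)
```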

Model Class:
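
A sketch tying the pieces together: the encoder runs over the question, and the decoder runs over the answer shifted by one token (the answer without <end> as decoder input, the answer without <start> as the target).

```python
import tensorflow as tf

class Seq2SeqAttention(tf.keras.Model):
    def __init__(self, in_vocab, out_vocab, embedding_dim, units):
        super().__init__()
        self.encoder = Encoder(in_vocab, embedding_dim, units)
        self.decoder = Decoder(out_vocab, embedding_dim, units, units)

    def call(self, inputs):
        question_ids, answer_in_ids = inputs          # answer_in = answer without <end>
        enc_output, state_h, state_c = self.encoder(question_ids)
        return self.decoder(answer_in_ids, enc_output, state_h, state_c)
```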

Loss Function :

Loss Function which will be minimized here is Sparse Categorical Cross-entropy.

Why Sparse Categorical Cross-entropy ?

First of all, we use categorical cross-entropy because the output is categorical in nature, i.e. one of n words, where n is the vocabulary size.

Secondly, we use the sparse variant because our tokens are label (integer) encoded rather than one-hot encoded.

We will mask our loss function so that the padded zeros do not affect the loss.
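
A sketch of the masked loss: compute sparse categorical cross-entropy from logits and zero out the positions where the target is the padding token (id 0), then average only over the real tokens.

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def masked_loss(y_true, y_pred):
    loss = loss_object(y_true, y_pred)                    # (batch, T_out)
    mask = tf.cast(tf.not_equal(y_true, 0), loss.dtype)   # 1 for real tokens, 0 for padding
    loss = loss * mask
    # Average only over the non-padded tokens.
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
```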

Callbacks:

Callbacks are functions that can be defined to automate tasks after every training epoch, giving you more control over the training process.

Callback for printing perplexity after every epoch :
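
A sketch of a custom callback that reports perplexity as exp(loss) after each epoch, assuming the model's loss is the masked per-token cross-entropy above:

```python
import numpy as np
import tensorflow as tf

class PerplexityCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        train_ppl = np.exp(logs.get("loss", 0.0))
        val_ppl = np.exp(logs.get("val_loss", 0.0))
        print(f"epoch {epoch + 1}: train perplexity = {train_ppl:.2f}, "
              f"val perplexity = {val_ppl:.2f}")
```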

Checkpoint Callback for saving the best model weights :

Callback for Tensorboard :

Tensorboard is used to visualize various aspects of the model such as:

  • Visualizing metrics such as loss and accuracy
  • Histograms of weights, biases, or other tensors as they change over time
  • Visualizing the model graph (ops and layers) and much more.
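
Minimal sketches of the other two callbacks described above, a checkpoint that keeps only the best weights by validation loss and a TensorBoard logger (the file paths are assumptions):

```python
import tensorflow as tf

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_weights.h5",
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=True)

tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs", histogram_freq=1)
```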

Defining Data and Model Parameters:

Training Model:
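
A sketch of compiling and training the model with the masked loss and callbacks defined above. The batch size, epoch count, optimizer settings, and the array names `train_ans_in`/`train_ans_out` (the shifted decoder input and target) and their validation counterparts are assumptions.

```python
model = Seq2SeqAttention(in_vocab=vocab_size, out_vocab=vocab_size,
                         embedding_dim=300, units=256)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=masked_loss)

history = model.fit(
    x=[train_q, train_ans_in], y=train_ans_out,            # decoder input vs. shifted target
    validation_data=([val_q, val_ans_in], val_ans_out),
    batch_size=64, epochs=20,
    callbacks=[PerplexityCallback(), checkpoint_cb, tensorboard_cb])
```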

Train and Val. Loss

Perplexity Achieved:

  • Train → 56.51
  • Validation → 86.65
  • Test → 79.13

Inference:

Function for Predicting Answers:

The Seq2Seq Model can use two mechanisms for predicting the output:

  1. Greedy Search:

The decoder generates probabilities for each word at each time step, so one way is to choose greedily, i.e. pick the most probable word at each time step. This does not necessarily give us the sentence with the highest combined probability.
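
A sketch of greedy decoding, assuming the Seq2SeqAttention/Decoder sketches above: start from the <start> token and repeatedly feed the most probable word back in until <end> is produced or a length limit is hit.

```python
import numpy as np
import tensorflow as tf

def greedy_decode(model, question_ids, tokenizer, max_len=40):
    # question_ids: a tokenized, padded question of shape (1, T_in).
    enc_output, state_h, state_c = model.encoder(question_ids)
    token = np.array([[tokenizer.word_index["<start>"]]])
    words = []
    for _ in range(max_len):
        logits, state_h, state_c, _ = model.decoder.one_step(
            token, enc_output, state_h, state_c)
        next_id = int(tf.argmax(logits[0]).numpy())        # most probable word
        word = tokenizer.index_word.get(next_id, "")
        if word == "<end>":
            break
        words.append(word)
        token = np.array([[next_id]])                      # feed the prediction back in
    return " ".join(words)
```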

Random Testing on Test Data using Greedy Search:

2. Beam Search:

Unlike greedy search, which commits to the single most probable word as the sequence is constructed, beam search keeps the k most likely candidates, where k is a user-specified parameter that controls the number of beams, or parallel searches, through the sequence of probabilities.

At each step, beam search expands every candidate with its most probable next words, re-ranks the expanded candidates, and finally chooses the proposal with the maximum combined probability.
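
A sketch of beam search decoding under the same model assumptions as the greedy sketch: keep the k highest-scoring partial answers, expand each with its k best next words, and re-rank by summed log-probability (no length normalisation here, which is a simplification).

```python
import numpy as np
import tensorflow as tf

def beam_search_decode(model, question_ids, tokenizer, k=3, max_len=40):
    enc_output, state_h, state_c = model.encoder(question_ids)
    start_id = tokenizer.word_index["<start>"]
    end_id = tokenizer.word_index["<end>"]
    # Each beam: (token ids so far, summed log-probability, decoder states).
    beams = [([start_id], 0.0, state_h, state_c)]
    for _ in range(max_len):
        candidates = []
        for ids, score, h, c in beams:
            if ids[-1] == end_id:                    # finished beams carry over unchanged
                candidates.append((ids, score, h, c))
                continue
            token = np.array([[ids[-1]]])
            logits, h2, c2, _ = model.decoder.one_step(token, enc_output, h, c)
            log_probs = tf.nn.log_softmax(logits[0]).numpy()
            for next_id in np.argsort(log_probs)[-k:]:       # k best next words
                candidates.append((ids + [int(next_id)],
                                   score + log_probs[next_id], h2, c2))
        # Keep only the k best hypotheses overall.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
        if all(b[0][-1] == end_id for b in beams):
            break
    best_ids = beams[0][0][1:]                               # drop <start>
    words = [tokenizer.index_word.get(i, "") for i in best_ids if i != end_id]
    return " ".join(w for w in words if w)
```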

Random Testing on Test Data using Beam Search:

Error Analysis:

Error analysis is done to better understand the behavior of the model: where is the model making errors? Are these errors dependent on something? Etc.

Computing BLEU Score for each True Answer/Predicted Answer:
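
A sketch of scoring each prediction with sentence-level BLEU using nltk (the `predicted` column name is an assumption, and a smoothing function keeps short answers from collapsing to a zero score):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu(true_answer, predicted_answer):
    reference = [true_answer.split()]          # BLEU expects a list of references
    hypothesis = predicted_answer.split()
    return sentence_bleu(reference, hypothesis, smoothing_function=smooth)

test_df["bleu"] = [bleu(t, p) for t, p in zip(test_df["answer"], test_df["predicted"])]
test_df["good_prediction"] = test_df["bleu"] >= 0.65   # threshold described below
```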

Filtering Good/Bad Prediction Using Some Threshold :

Using a BLEU score of 0.65 as the threshold, i.e. answers with a BLEU score ≥ 0.65 will be considered good predictions and those with < 0.65 bad predictions.

Predictions By Question Type:

Predictions by Answer Type:

Predictions by Answer Word Count:

Predictions by Product’s Main Category:

  • Product’s main category does not tell much about good or bad predictions.

Summary of Error Analysis:

  • The model predicts better for Yes/No type questions than for open-ended questions.
  • Short answers with fewer than 10 words are more often correctly predicted by the model than long answers.
  • Good predictions also include more questions whose answer is "yes" than any other answer type.
  • Whether a prediction is good or bad does not depend on the product category.

Note:

  • Predictions having a BLEU score ≥ 0.65 are considered good/correct predictions.
  • Predictions having a BLEU score < 0.65 are considered bad predictions.

For Step by Step Complete Code

Refer to these Jupyter Notebooks

Future Work :

  1. The model could be trained for more epochs.
  2. Training parameters such as batch size and learning rate can be tuned for better performance.
  3. BERT could be used to get pre-trained sentence embeddings.
  4. More Data could be used to train the model.

Well, That’s All Folks!

Thanks For Reading.

I Hope You Learned Something New.

You can also find and connect with me on LinkedIn and GitHub.

Don’t forget to give your 👏 !

