An Exposé on Retrieval-Based ChatBots.

Einstein EBEREONWU
8 min read · Jun 3, 2023


In this article, I will explain how I built a retrieval-based chatbot in Python. The chatbot in this case is a FAQs ChatBot for Babcock University, Nigeria. The aim of the chatbot is to answer questions that current and future students or the general public may have concerning Babcock University.

Photo by Andrea De Santis on Unsplash

Project Background

Credit goes out to Arije Gbejesuga for coming up with such an amazing project idea and contributing to the data-gathering processes.

Before reading the rest of this article, if you have little to no knowledge of Machine Learning and you would love to learn, I recommend you scroll down to the “Conclusion and Recommendations for further reading” section and follow the first link to gain access to a free resource that will get you started.

All said and done, it’s time to get into the interesting stuff!

Content

  • Overview of ChatBots and Types.
  • A Deeper Dive Into Retrieval-Based ChatBots.
  • The Data Gathering Process.
  • The Data-Wrangling Process.
  • The Model Training & Evaluation Process.
  • Potential Model Improvement Techniques.
  • Extracting Response from Look-Up Table.
  • Demo of the System.
  • Conclusion and Recommendations for further reading.
  • References.

Overview of ChatBots and Types

A chatbot is a software program that imitates human communication through text or verbal interactions. There are two major categories of chatbots, namely retrieval-based chatbots and generative chatbots.

A retrieval-based (or lookup-based) chatbot, as the name implies, is a program that retrieves responses from a predefined set stored in a lookup table. A generative chatbot, on the other hand, is a more advanced type of chatbot capable of producing new responses on its own, without a lookup table, based on patterns learned from huge training datasets.

Generative chatbots usually require far more training data than retrieval-based chatbots because they need a deeper understanding of context, semantics, and the probability of words appearing relative to other words in order to generate responses. This article, however, focuses on the lookup-based approach, as it is best suited for this use case.

A Deeper Dive Into Retrieval-Based ChatBots

Retrieval-based chatbots are no less capable simply because they rely on lookup tables. For a retrieval-based chatbot to work well, it must identify the correct tag or category a prompt falls under, whether or not that exact prompt appears in the training dataset. This leads us to the format of the training data.

The Data Gathering Process

The data used in this project was collected through a diverse range of methods:

  • Google Forms: A form was created to gather data in a specific format. Participants entered their questions as well as suitable tags for the questions.
  • Web scraping: We scraped data from blogs, the Babcock University website, and various other places using BeautifulSoup and Requests; some of this data was tagged manually while the rest was tagged automatically.
  • Existing data on Kaggle: Kaggle is a repository of datasets; we found valuable data there and combined it with the data we had already gathered.
  • Anticipated questions: We also wrote out questions we expected people to have and labeled them appropriately.

Some of the datasets came in CSV file format while others came in JSON format. I was able to convert all the datasets into one JSON file using a Python script in order to get a richer dataset.
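The original merge script isn't reproduced here, but a sketch along these lines does the job (the file locations and the "query"/"tag" column names are placeholders, not the project's actual schema):

```python
import glob

import pandas as pd

frames = []

# CSV sources: assumed to have "query" and "tag" columns.
for path in glob.glob("data/*.csv"):
    frames.append(pd.read_csv(path)[["query", "tag"]])

# JSON sources: assumed to be lists of {"query": ..., "tag": ...} records.
for path in glob.glob("data/*.json"):
    frames.append(pd.read_json(path)[["query", "tag"]])

# Combine everything, drop duplicates, and write one richer JSON file.
combined = pd.concat(frames, ignore_index=True).drop_duplicates()
combined.to_json("training_data.json", orient="records", indent=2)
```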

Screenshot showing training data in JSON data format.

Datasets can come in many formats, e.g. JSON, CSV, HTML, etc. What matters is that you can build a data frame out of the data for proper analysis and wrangling. The dataset above is stored in JSON (JavaScript Object Notation) format; it contains a variety of possible prompts along with the tags that identify the category of intent each query belongs to. Retrieval-based models are trained to predict intent, which in turn is used as a key to retrieve suitable responses from a lookup table.
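To make the format concrete, here is a minimal illustration of the shape such a file can take (the queries and tags below are invented for demonstration, not taken from the actual dataset):

```json
[
  {"query": "How much is tuition per semester?", "tag": "tuition_fees"},
  {"query": "What are the school fees like?", "tag": "tuition_fees"},
  {"query": "Where is Babcock University located?", "tag": "location"},
  {"query": "How do I apply for admission?", "tag": "admissions"}
]
```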

Lookup table

Above is an image of a lookup table that consists of tags/intents as the keys and arrays of responses as the values. With the buildup so far, you should have a clear picture of the inner workings of retrieval-based chatbots.
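A minimal illustration of such a lookup table is shown below (the entries are made up for demonstration purposes):

```json
{
  "tuition_fees": [
    "Tuition varies by programme; please check the bursary page for the current fee schedule.",
    "School fees are published per session on the university website."
  ],
  "location": [
    "Babcock University is located in Ilishan-Remo, Ogun State, Nigeria."
  ]
}
```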

The Data-Wrangling Process

Image showing importing of all necessary libraries.

After importing all the necessary libraries, including but not limited to Pandas, NLTK, Scikit-learn, Matplotlib, and Seaborn, I read in the dataset and store it in a variable, check the data description and information, and convert categorical data to string type where necessary.
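Since the original snippet was shared as an image, here is an approximate reconstruction of those first steps (the file name and column names carry over the assumptions from the merge script above):

```python
import pickle
import string

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Read the combined dataset into a data frame.
df = pd.read_json("training_data.json")

# Inspect the data, then make sure the text columns are strings.
print(df.describe())
df.info()
df["query"] = df["query"].astype(str)
df["tag"] = df["tag"].astype(str)
```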

Data wrangling comes next, which includes removing punctuation, converting all text to lower case, stemming or lemmatizing words, and any other necessary data-preparation steps.

Data Transformation Code Snippet

The snippet above performs all the data-wrangling steps just mentioned. I also saved the transformer used to convert the tags from strings to their encoded formats for future use; the pickle library was used for this.
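An approximate reconstruction, continuing from the imports above, might look like this (a LabelEncoder stands in here for whatever tag transformer was actually used):

```python
nltk.download("wordnet")  # needed once for the lemmatizer
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and lemmatize each word."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(lemmatizer.lemmatize(word) for word in text.split())

df["clean_query"] = df["query"].apply(clean_text)

# Encode the string tags as integer labels and save the encoder for later.
encoder = LabelEncoder()
df["label"] = encoder.fit_transform(df["tag"])

with open("tag_encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)
```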

Snapshot of data frame showing original and transformed queries.

This image shows the original queries and tags on the left-hand side of the data frame, while the right-hand side contains the fully transformed labels and the cleaned queries waiting to be vectorized. Before splitting the data into training and test sets, it is advisable to shuffle the entire dataset using the sample function so that neither split is biased by the order in which the data was gathered. frac is an argument of the sample function that takes a float value specifying the fraction of the data frame to return at random; passing 1.0 returns 100% of the data frame in random order, which is basically a complete shuffle.
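In code, the shuffle is a one-liner (random_state is optional, added here for reproducibility):

```python
# Return 100% of the rows in random order, i.e. a full shuffle.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
```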

The Model Training & Evaluation Process

Training

Image showing queries vectorization, vectorizer saving, train & test split, and finally model training.

In this image, the vectorizer for the queries is instantiated first, then fitted on the clean queries using .fit. The fitted vectorizer is saved to a pickle file so it can be loaded and reused wherever it is needed.

The next step is to identify the dependent and independent variables, apply the train_test_split function, and finally fit an algorithm of your choice, usually chosen based on the nature of your dataset.
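Put together, those steps look roughly like the sketch below. The original post does not name the algorithm used, so a linear SVM appears here purely as an example of a classifier that handles sparse text features well:

```python
from sklearn.svm import LinearSVC

# Vectorize the cleaned queries and save the fitted vectorizer.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean_query"])
y = df["label"]

with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Independent variables (X) vs. dependent variable (y), then split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the classifier and save it for the bot script to load later.
model = LinearSVC()
model.fit(X_train, y_train)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```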

Evaluation

The very first evaluation step is to predict tags on both the training and test datasets and then compare the model's accuracy scores on the two sets.

Test and Train accuracy.

In the image above, the model has an accuracy above 90% with little to no difference between the train and test datasets, meaning it is performing well without overfitting. However, this wasn't always the case: I had to apply a variety of improvement techniques, such as adjusting the n-grams to bigrams and trigrams and increasing the dataset size, which are discussed in more detail below. I also plotted a heatmap of the confusion matrix to gain a deeper understanding of the model's classification performance across the various tags.

Heatmap of the model’s confusion matrix.
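Evaluation code along these lines produces both the accuracy comparison and the heatmap (a sketch, assuming the variables from the training snippet above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Compare accuracy on the training and test sets.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Train accuracy: {train_acc:.2%} | Test accuracy: {test_acc:.2%}")

# Heatmap of the confusion matrix for a per-tag view of performance.
cm = confusion_matrix(y_test, model.predict(X_test))
sns.heatmap(cm, cmap="Blues")
plt.xlabel("Predicted tag")
plt.ylabel("True tag")
plt.show()
```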

Potential Model Improvement Techniques

In order to improve the accuracy of an NLP model, several things can be done, such as increasing the dataset size, adjusting the test size, and adjusting the n-gram range, amongst others (you can read more on n-grams here). Hyper-parameter tuning is another means of improving model accuracy.
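As an illustration of those last two points, scikit-learn's GridSearchCV can search over the n-gram range and a classifier's regularisation strength at the same time (a sketch, reusing the example classifier from earlier):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Vectorizer and classifier chained so both can be tuned together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],  # unigrams, bigrams, trigrams
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(df["clean_query"], df["label"])
print(search.best_params_, search.best_score_)
```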

Extracting Response from Look-Up Table.

Python bot script.

The image above shows a separate Python script within the same project directory, containing all the functions needed to process new text, predict its tag, and retrieve an appropriate response from the lookup table. The first few lines import the necessary libraries; the next set of lines loads all the saved transformers, the model, and the lookup table.

The pre_process function accepts a new string of text and performs the same data-transformation steps applied to the training data, so that the text arrives in a format the model accepts.

The predict_tag function expects the text vector produced by pre_process; it passes the vector to the model for prediction and returns the predicted tag.

The final function, generate_response, takes in a tag, looks it up in the lookup table, and selects a random response from that tag's array of responses.
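The original script was shared as an image, so the sketch below reconstructs its three functions under the assumptions made throughout this article (the file names, the LabelEncoder, and the cleaning steps):

```python
import json
import pickle
import random
import string

from nltk.stem import WordNetLemmatizer

# Load the saved vectorizer, tag encoder, model, and lookup table.
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("tag_encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
with open("lookup_table.json") as f:
    lookup_table = json.load(f)

lemmatizer = WordNetLemmatizer()

def pre_process(text):
    """Clean new text exactly like the training data, then vectorize it."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    text = " ".join(lemmatizer.lemmatize(word) for word in text.split())
    return vectorizer.transform([text])

def predict_tag(vector):
    """Predict the intent label and decode it back to its tag string."""
    label = model.predict(vector)[0]
    return encoder.inverse_transform([label])[0]

def generate_response(tag):
    """Pick a random response from the tag's array in the lookup table."""
    return random.choice(lookup_table[tag])
```

Chaining the three gives the full round trip: generate_response(predict_tag(pre_process("Where is the school located?"))).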

Demo of the System

Below is a demo video of the bot. I prepared REST API endpoints using Flask to communicate with the front end, so the bot can be seamlessly integrated into any system by making API calls; a minimal sketch of such an endpoint follows the demo. The video below shows how the bot performs in real time.

Demo of Babcock FAQ ChatBot in action.
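The exact endpoints aren't shown in the post, but a minimal Flask wrapper around the bot script could look like this (the /chat route, the payload shape, and the bot module name are assumptions):

```python
from flask import Flask, jsonify, request

# "bot" is a placeholder name for the script sketched above.
from bot import generate_response, predict_tag, pre_process

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    # Expect a JSON body like {"message": "..."}.
    message = request.get_json().get("message", "")
    tag = predict_tag(pre_process(message))
    return jsonify({"tag": tag, "response": generate_response(tag)})

if __name__ == "__main__":
    app.run(debug=True)
```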

Conclusion and Recommendations for further reading.

With this basic introduction to retrieval-based chatbots, you should now be able to build your very own chatbot for any purpose, as long as you have the required data. The References section below lists the resources I put together to help you out, covering topics such as saving models with pickle, web scraping, structuring a machine learning project, and serving models with Flask.

References

TechTarget. (2021, November 1). What is a chatbot and why is it important? SearchCustomerExperience. https://www.techtarget.com/searchcustomerexperience/definition/chatbot

Nguyen, K. (2022, July 24). N-gram language model. Medium. https://medium.com/mti-technology/n-gram-language-model-b7c2fc322799

EBEREONWU, E. (2023, March 8). Saving Your Machine Learning Model In Python: pickle.dump(). Medium. https://medium.com/mlearning-ai/saving-your-machine-learning-model-in-python-pickle-dump-b01ae60a791c

EBEREONWU, E. (2022, December 5). Web Scraping with MS Excel and Python: Static Site Contents. Medium. https://medium.com/@einsteinmunachiso/web-scraping-with-ms-excel-and-python-static-site-contents-4903ea08b85

EBEREONWU, E. (2022, September 13). Building The Perfect Machine Learning Project: Steps to follow. Medium. https://medium.com/mlearning-ai/building-the-perfect-machine-learning-project-steps-to-follow-197a49650aad

EBEREONWU, E. (2023, March 8). Building A Simple Web-App For Your ML Models & More: Flask, HTML & CSS. Medium. https://blog.devgenius.io/building-a-simple-web-app-for-your-model-flask-html-css-dd6cbd74d1ed
