Building Your First QA Chatbot With Python

Hands on AI
Published in The Startup · 6 min read · Nov 25, 2020

In this tutorial we will create a simple and cool chatbot that can answer your questions about any text you feed it. Familiarity with NLTK and Python programming is expected.

See on Github

First, install NLTK and scikit-learn (needed later for TF-IDF and cosine similarity) by running the following command in your Python/Anaconda command prompt:

pip install nltk scikit-learn

Second, create a new Jupyter notebook.

Now, let's load the NLTK packages:

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) #downloads packages

Our QA bot needs some data so that it can answer questions about it.

You can create a new text file directly from the Jupyter window, just as you would create a new Python notebook.

To keep things simple, I have copied only the introduction of the Wikipedia entry on “History of India”.

Our plain text file looks like this:

Figure: Plain text file created in a Jupyter notebook

Let us load this file into our Python notebook.

with open('history_india', 'r', errors='ignore') as f:
    raw = f.read()
raw = raw.lower()  # converts to lowercase so that "The" and "the" or "When" and "when" count as the same word

The data is loaded, and we are ready to build our first chatbot, which will answer most of your questions on the history of India.

But first, let's give our data some structure.

sent_tokens = nltk.sent_tokenize(raw)  # converts to a list of sentences
word_tokens = nltk.word_tokenize(raw)  # converts to a list of words

Next, we trim the words to their root form using a lemmatizer. WordNet is a semantically oriented dictionary of English included in NLTK: it groups words into sets of synonyms and records how those sets relate to one another. The WordNet lemmatizer uses this dictionary to map an inflected word to its dictionary form, so “empires” becomes “empire”. Called without a part-of-speech tag, as it is here, it treats every word as a noun, so a verb form such as “running” is left unchanged.

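import string

lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    # Lemmatize each token (treated as a noun by default)
    return [lemmer.lemmatize(token) for token in tokens]

# Translation table that deletes every punctuation character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # Lowercase, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))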

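The function LemTokens takes a list of tokens and returns their lemmas. LemNormalize takes raw text, lowercases it, strips punctuation, tokenizes it, and passes the tokens to LemTokens.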
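To see what the punctuation table does on its own, here is a minimal sketch that uses only the standard library. The helper name simple_normalize is hypothetical, and str.split stands in for nltk.word_tokenize, so the tokens may differ slightly from what NLTK would produce.

```python
import string

# Same construction as in the tutorial: map every punctuation code point to None
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def simple_normalize(text):
    # Hypothetical helper: lowercase, drop punctuation, split on whitespace
    # (str.split is a stand-in for nltk.word_tokenize)
    return text.lower().translate(remove_punct_dict).split()

tokens = simple_normalize("When was the British Indian Empire partitioned?")
print(tokens)
# ['when', 'was', 'the', 'british', 'indian', 'empire', 'partitioned']
```

Note that apostrophes count as punctuation too, so "India's" becomes "indias" after this step.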

Our data is processed and loaded. Great!

Now we have to generate answers from our bot for input questions.

To achieve this we will use the concept of document similarity. We will define a function, response, that compares the user's query with every sentence in our text and returns the sentence that is most similar to it.

If no sentence shares any meaningful word with your query, the bot reminds itself (actually, reminds you) that it needs to be trained on more data.

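from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def response(user_response):
    robot_response = ''
    # Temporarily add the query so it is vectorized together with the text;
    # the main loop removes it again after each answer
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the query (last row) to every sentence, itself included
    vals = cosine_similarity(tfidf[-1], tfidf)
    # The top score is the query matched with itself, so take the second best
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        robot_response = robot_response + "I think I need to read more about that..."
        return robot_response
    else:
        robot_response = robot_response + sent_tokens[idx]
        return robot_response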
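The least obvious step in response is the [-2] index: the query is appended to sent_tokens, so the highest similarity is always the query compared with itself, and the runner-up is the real answer. The sketch below isolates that logic in plain Python; pick_answer and the score lists are hypothetical and only illustrate the idea.

```python
def pick_answer(scores):
    # scores[i] is the similarity of the query to entry i; in this sketch the
    # query itself is the last entry and holds the top score
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    runner_up = order[-2]          # best match other than the query itself
    if scores[runner_up] == 0:
        return None                # nothing in the text overlaps with the query
    return runner_up

# Hypothetical scores for three sentences plus the query (last position)
print(pick_answer([0.10, 0.00, 0.45, 1.00]))  # 2
print(pick_answer([0.00, 0.00, 0.00, 1.00]))  # None
```

A return value of None corresponds to the "I think I need to read more about that..." reply.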

Response generation, as seen above, is built on three ideas: bag of words, TF-IDF, and cosine similarity.

In short, bag of words, as the name suggests, is a bag of words. Haha, not really.

‘Bag of words’ represents a document by the frequency of the words in it. It involves two things: a vocabulary of known words and a measure of the presence of those words.

Bag of words is concerned only with whether, and how often, a word occurs in the document; it does not know where in the text the word occurs.

The intuition behind bag of words is that similar documents contain similar words.
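As a small illustration, assuming two made-up sentences rather than the tutorial's data, a bag of words can be built with nothing more than collections.Counter. The two sentences use the same words in a different order, so they end up with identical vectors.

```python
from collections import Counter

docs = [
    "the empire ruled the north",
    "the north ruled the empire",
]

# Vocabulary: every distinct word across the documents, in sorted order
vocab = sorted(set(word for doc in docs for word in doc.split()))

def bag_of_words(doc):
    # Count each vocabulary word in the document; word order is discarded
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)    # ['empire', 'north', 'ruled', 'the']
print(vectors)  # [[1, 1, 1, 2], [1, 1, 1, 2]]
```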

The bag-of-words approach has a limitation: highly frequent words dominate the document even though they may carry little information about the text. We use the TF-IDF approach to fix this, scoring words on their relevance and not just on their frequency of occurrence.
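The effect can be sketched with the textbook formula, tf × log(N / df). This is an illustration on made-up token lists, not the exact computation inside scikit-learn: TfidfVectorizer by default uses a smoothed IDF and normalizes each vector, so its numbers will differ. The function name tf_idf is hypothetical.

```python
import math

docs = [
    ["the", "indus", "valley", "civilization"],
    ["the", "mughal", "empire"],
    ["the", "vedic", "period"],
]

def tf_idf(term, doc, docs):
    # Term frequency: share of the document's tokens that are this term
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents containing the term
    # (assumes the term occurs in at least one document)
    df = sum(1 for d in docs if term in d)
    # Textbook inverse document frequency; zero for a term found in every document
    idf = math.log(len(docs) / df)
    return tf * idf

common = tf_idf("the", docs[0], docs)
rare = tf_idf("indus", docs[0], docs)
print(common, rare)
```

Both words occur once in the first document, but "the" appears in every document and scores 0, while "indus" appears in only one and scores higher.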

Read more about TF-IDF approach here.

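Lastly, we use the cosine similarity score to compare the TF-IDF vector of the query with the TF-IDF vector of each sentence. The score is close to 1 when two texts share many of their important words and 0 when they share none. For example, a question about the Indus Valley scores higher against a sentence that mentions the Indus Valley Civilization than against one about the Mughal Empire.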
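For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal pure-Python sketch with made-up vectors; the tutorial itself relies on scikit-learn's cosine_similarity, and the name cosine is only used for this illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

query = [1, 1, 0, 0]
sentence_a = [2, 2, 0, 0]  # same words as the query, just more of them
sentence_b = [0, 0, 3, 1]  # no words in common with the query

print(cosine(query, sentence_a))  # approximately 1
print(cosine(query, sentence_b))  # 0.0
```

Only the direction of the vectors matters, not their length, which is why a long sentence and a short query can still score close to 1.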

Finally, we add a greeting capability to our bot, and we are all set to ask questions.

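import random

GREETING_INPUTS = ("namastey", "namaskaram", "hello", "hi", "whats up", "hey")
GREETING_RESPONSES = ["namastey", "namaskaram", "hello", "hi", "whats up", "hey"]

def greeting(sentence):
    # Return a random greeting if any word of the sentence is a known greeting;
    # otherwise fall through and return None.
    # Matching is word by word, so the two-word entry "whats up" never matches as input.
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)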
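Because greeting checks one whitespace-separated word at a time, a two-word entry such as "whats up" cannot match, and neither can "hey," with a trailing comma. One possible variant, sketched here under the hypothetical name greeting_with_phrases, strips punctuation and looks for whole phrases instead. It is not part of the original bot.

```python
import random
import string

GREETING_INPUTS = ("namastey", "namaskaram", "hello", "hi", "whats up", "hey")
GREETING_RESPONSES = ["namastey", "namaskaram", "hello", "hi", "whats up", "hey"]

def greeting_with_phrases(sentence):
    # Hypothetical variant: drop punctuation and collapse whitespace first
    cleaned = sentence.lower().translate(str.maketrans('', '', string.punctuation))
    padded = " " + " ".join(cleaned.split()) + " "
    for phrase in GREETING_INPUTS:
        # Padding with spaces ensures "hi" does not match inside "history"
        if " " + phrase + " " in padded:
            return random.choice(GREETING_RESPONSES)
    return None

print(greeting_with_phrases("history of india"))  # None
```

With this variant, "whats up" and "Hey, bot!" both return a greeting, while ordinary questions still return None and fall through to response.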

Now we are ready to launch. Run the code below and start interacting with your first handmade QA bot.

flag = True
print("Spongebot: Namaskaram I am Spongebot. I am an expert on history of India, you can ask me anything. Go ahead!")
while flag:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye!!':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("Spongebot: Anytime")
        else:
            if greeting(user_response) != None:
                print("Spongebot: " + greeting(user_response))
            else:
                print("Spongebot: ", end="")
                print(response(user_response))
                # response() appended the query to sent_tokens; take it out again
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("Spongebot: take care..")

Transcript of my conversation with Spongebot
Spongebot: Namaskaram I am Spongebot. I am an expert on history of India, you can ask me anything. Go ahead!
tell me a fact about history of india
Spongebot: dissatisfaction with the company rule in india led to the indian rebellion of 1857, which rocked parts of north and central india, and led to the dissolution of the company.
tell me more about north and central india
Spongebot: dissatisfaction with the company rule in india led to the indian rebellion of 1857, which rocked parts of north and central india, and led to the dissolution of the company.
what do you know about indus valley civilization
Spongebot: [2] by 4,500 bce, settled life had spread more widely,[2] and began to gradually evolve into the indus valley civilization, an early civilization of the old world, which was contemporaneous with ancient egypt and mesopotamia.
is is todays pakistan
Spongebot: this civilization flourished between 2,500 bce and 1900 bce in what today is pakistan and north-western india and was noted for its urban planning, baked brick houses, elaborate drainage, and water supply.
what can you tell me about aryans
Spongebot: I think I need to read more about that...
anything you know about aryan tribes
Spongebot: around the same time, indo-aryan tribes moved into the punjab from regions further northwest in several waves of migration.
what about the vedic period
Spongebot: the resulting vedic period was marked by the composition of the vedas, large collections of hymns of these tribes whose postulated religious culture, through synthesis with the preexisting religious cultures of the subcontinent, gave rise to hinduism.
nice, what about buddhism
Spongebot: the urbanization was accompanied by the rise of new ascetic movements in greater magadha, including jainism and buddhism.
have you heard of the golden age of india
Spongebot: this period, witnessing a hindu religious and intellectual resurgence, is known as the classical or golden age of india.
what is the most significant event of indias history
Spongebot: [11][12]

the most significant event between the 7th and 11th centuries was the tripartite struggle centered on kannauj that lasted for more than two centuries between the pala empire, rashtrakuta empire, and gurjara-pratihara empire.
any word on indias gdp
Spongebot: the early modern period began in the 16th century, when the mughal empire conquered most of the indian subcontinent,[22] becoming the biggest global economy and manufacturing power,[23] with a nominal gdp that valued a quarter of the world gdp, superior to the combination of europe's gdp.
when did india gain independence from british rule
Spongebot: india was afterward ruled directly by the british crown, in the british raj.
when did british empire leave india
Spongebot: india was afterward ruled directly by the british crown, in the british raj.
when was british indian empire partitioned
Spongebot: india was afterward ruled directly by the british crown, in the british raj.
when was pakistan formed
Spongebot: the british indian empire was partitioned in august 1947 into the dominion of india (present day republic of india) and dominion of pakistan (present day islamic republic of pakistan and people's republic of bangladesh), each gaining its independence.
thank you
Spongebot: Anytime

I hope you had fun talking to your chatbot and that it inspires you to build your own chatbots for various applications.

#chatbot #qabot #faqbot #machinelearning #nlp #python #handsonai #nltk
