How to build a Simple Chatbot with Python and NLTK

Mar 29, 2023

Have you ever considered creating your own chatbot? Chatbots have become a popular tool for both organizations and individuals as artificial intelligence and natural language processing have grown in popularity.

Whether you’re looking to improve customer service, automate repetitive tasks, or simply have a virtual assistant to talk to, building a chatbot is a great way to get started with both fields.

I’ll walk you through the process of creating a simple chatbot with Python and the Natural Language Toolkit (NLTK). You’ll learn how to preprocess and tokenize text, how to train a machine learning model, and how to use that model to generate replies to user input.

So, if you’re up for this task, let’s get started!

Step 1: Install the Required Libraries

Before we start building our chatbot, we need to install NLTK for text processing and scikit-learn for machine learning.

To install these libraries, open your terminal or command prompt and enter the following commands:

pip install nltk
pip install scikit-learn
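If you want to confirm that both installs worked before moving on, a quick check like the one below can help. Treat it as a sketch: the helper name installed_version is my own, and it relies on the standard library's importlib.metadata module, which assumes Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Report the status of the two libraries this tutorial depends on
for name in ("nltk", "scikit-learn"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If either line says "not installed", rerun the matching pip command in the same environment that runs your Python scripts.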

Step 2: Import NLTK and Download Required Corpora

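Next, we must import the NLTK library and download the required corpora. Corpora are large collections of text data that are used to train natural language processing models. In this case, we will be using the NLTK movie_reviews corpus, a collection of movie reviews labeled as positive (pos) or negative (neg), to train our chatbot. We also need the Punkt tokenizer models and the stopwords list, because the preprocessing code in Step 3 depends on them. You can download everything by running the following commands in a Python script or in a Python shell:

import nltk

nltk.download('movie_reviews')
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this tokenizer data instead
nltk.download('stopwords')  # English stopword list used in Step 3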
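Calling nltk.download every time the script runs is harmless but noisy. One possible way to download only what is missing is sketched below. The helper ensure_nltk_resource is a hypothetical name of mine, and the resource paths you pass to it are assumed to follow NLTK's usual data layout (for example corpora/movie_reviews).

```python
import nltk

def ensure_nltk_resource(resource_path, package_name):
    """Download an NLTK data package only if it is not already installed.

    Returns True if the resource was already present, otherwise the
    result of nltk.download().
    """
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.data.find raises LookupError when the resource is missing
        return nltk.download(package_name, quiet=True)
```

For this tutorial you would call it as ensure_nltk_resource('corpora/movie_reviews', 'movie_reviews'), and likewise with ('tokenizers/punkt', 'punkt') and ('corpora/stopwords', 'stopwords').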

Step 3: Preprocess and Tokenize the Data

After downloading the movie_reviews corpus, we must preprocess and tokenize the data. Preprocessing is the process of cleaning and normalizing text data, whereas tokenization is the process of breaking text into individual words or tokens.

from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Build the stopword and punctuation set once, not on every call
stop_words = set(stopwords.words('english') + list(string.punctuation))

def preprocess_text(text):
    # Lowercase the text, then tokenize it into individual words
    tokens = word_tokenize(text.lower())
    # Remove stopwords and punctuation
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Return the filtered tokens as a single space-separated string
    return ' '.join(filtered_tokens)

# Example usage
text = "This is an example sentence. We're going to tokenize it."
preprocessed_text = preprocess_text(text)
print(preprocessed_text)

This code lowercases the input text, tokenizes it, removes stopwords and punctuation, and returns the filtered tokens as a single space-separated string.
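To see each step in isolation without downloading anything, here is a stripped-down sketch that uses only the standard library. The regular expression and the tiny SAMPLE_STOPWORDS set are my own stand-ins for word_tokenize and NLTK's English stopword list, so the results will differ from preprocess_text on real text.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer)
SAMPLE_STOPWORDS = {"this", "is", "an", "we", "are", "to", "it", "the", "a"}

def simple_preprocess(text):
    # Step 1: lowercase so "This" and "this" count as the same word
    lowered = text.lower()
    # Step 2: tokenize by pulling out runs of letters, digits, and apostrophes,
    # which also drops punctuation such as periods and commas
    tokens = re.findall(r"[a-z0-9']+", lowered)
    # Step 3: drop stopwords, which carry little meaning on their own
    return [token for token in tokens if token not in SAMPLE_STOPWORDS]

print(simple_preprocess("This is an example sentence. We are going to tokenize it."))
# ['example', 'sentence', 'going', 'tokenize']
```

The order matters: lowercasing first means the stopword check only has to deal with lowercase words.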

Step 4: Extract Features from the Text Data

Now that we have preprocessed our text data, we need to extract features that our machine learning model can use to make predictions. We will be using a bag-of-words approach, which involves counting the frequency of each word in the text. We will use the CountVectorizer class from the scikit-learn library to create the bag-of-words model.

from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object to extract features
vectorizer = CountVectorizer()
# Load the raw text of every review, then fit the vectorizer to the preprocessed text
corpus = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]
vectorizer.fit([preprocess_text(text) for text in corpus])

This code creates a CountVectorizer object and fits it to the preprocessed movie review data, which builds the vocabulary. The fitted vectorizer will be used to convert user input into a numerical feature vector that our machine learning model can use.
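The bag-of-words idea is easier to see on a tiny example than on 2,000 reviews. The sketch below uses two made-up sentences (my own, not from the corpus) and a separate vectorizer_demo object so it does not interfere with the vectorizer above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration
docs = ["great movie great cast", "boring movie"]

vectorizer_demo = CountVectorizer()
counts = vectorizer_demo.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index
vocabulary = sorted(vectorizer_demo.vocabulary_, key=vectorizer_demo.vocabulary_.get)
print(vocabulary)        # ['boring', 'cast', 'great', 'movie']
print(counts.toarray())  # [[0 1 2 1]
                         #  [1 0 0 1]]

# Words never seen during fitting are ignored at transform time
unseen = vectorizer_demo.transform(["great unknown film"])
print(unseen.toarray())  # [[0 0 1 0]]
```

Each row is one document and each column is one vocabulary word. The last call matters for a chatbot: any word in the user's message that never appeared in the training text simply contributes nothing to the feature vector.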

Step 5: Train a Machine Learning Model

Now that we have our features, we can train a machine learning model to make predictions based on user input. We will use a Naive Bayes classifier, which is a popular text classification algorithm. It learns to predict the category of a review, either pos or neg, from its word counts.

The following is the code for training the classifier:

from sklearn.naive_bayes import MultinomialNB
import random

# Create a list of (preprocessed text, category) tuples
corpus = [(preprocess_text(movie_reviews.raw(fileid)), category)
          for category in movie_reviews.categories()
          for fileid in movie_reviews.fileids(category)]

# Shuffle the corpus to ensure a random distribution
random.shuffle(corpus)

# Split the corpus into features and labels
texts, labels = zip(*corpus)

# Vectorize the texts, then train a Multinomial Naive Bayes classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB()
clf.fit(X, labels)
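The training code fits the classifier on every review, so it gives no indication of how well the model handles text it has not seen. A common way to check is to hold back part of the data for testing. Here is a minimal sketch of that idea. The function name holdout_accuracy, the 25% test size, and the toy sentences are all assumptions of mine, chosen so the example runs without the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def holdout_accuracy(texts, labels, test_size=0.25, seed=0):
    """Train on part of the data and return accuracy on the held-out rest."""
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels)
    # Fit the vocabulary on the training texts only, then reuse it for the test texts
    vec = CountVectorizer()
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    model = MultinomialNB().fit(X_train, train_labels)
    return model.score(X_test, test_labels)

# Made-up toy data so the sketch runs on its own
toy_texts = ["great fun movie", "loved the acting",
             "wonderful and moving story", "brilliant cast great script",
             "boring dull movie", "hated the plot",
             "terrible and slow story", "awful cast weak script"]
toy_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

print(holdout_accuracy(toy_texts, toy_labels))
```

With the real data you could call holdout_accuracy(list(texts), list(labels)) using the texts and labels from the training code. The number from the toy data is not meaningful because there are only eight sentences.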

Step 6: Build your Chatbot

Now that we have trained our Naive Bayes classifier, we can use it to build our chatbot. We will define a function that accepts user input, preprocesses and vectorizes it, predicts a category (pos or neg) with the classifier, and replies with a random movie review from that category.


# Define a function to generate a chatbot response
def generate_response(user_input):
    # Preprocess the user input and convert it to a feature vector
    preprocessed_input = preprocess_text(user_input)
    input_vector = vectorizer.transform([preprocessed_input])
    # Use the classifier to predict a category ('pos' or 'neg')
    predicted_category = clf.predict(input_vector)[0]
    # Choose a random movie review from the predicted category
    reviews_in_category = movie_reviews.fileids(predicted_category)
    review_id = random.choice(reviews_in_category)
    review_text = movie_reviews.raw(review_id)
    # Return the review text as the chatbot response
    return review_text
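Because the reply depends entirely on the predicted category, it can be useful to look at how confident the classifier is. The sketch below trains a toy classifier on four made-up sentences (my own, not from movie_reviews) and uses predict_proba to show the probability of each category. The helper name category_probabilities is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (made up for illustration, not taken from movie_reviews)
toy_texts = ["great fun movie", "loved great acting",
             "boring dull movie", "hated dull plot"]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_vectorizer = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

def category_probabilities(text):
    """Return a {category: probability} dict for one piece of text."""
    probs = toy_clf.predict_proba(toy_vectorizer.transform([text]))[0]
    return {str(category): float(p) for category, p in zip(toy_clf.classes_, probs)}

# Known positive words push the probability toward 'pos'
print(category_probabilities("great fun"))
# No known words: the model falls back to the class priors, which are equal here
print(category_probabilities("hello chatbot"))
```

The second call shows a limitation worth knowing about: a greeting with no words from the training vocabulary gives the classifier nothing to work with, so the predicted category reflects only how common each category was in the training data.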

The generate_response function can now be used to generate responses to user input. For example:

user_input = input("Hi! How can I help you today? ")
response = generate_response(user_input)
print(response)

With this code, your chatbot will prompt the user for input and reply with a movie review from the category that the classifier predicts for that input.
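A single prompt ends after one reply. To keep the conversation going, you could wrap the call in a loop, as in the sketch below. The chat_loop function and its read and write parameters are my own additions. They default to input and print, but accepting them as arguments lets the loop be demonstrated with scripted messages.

```python
def chat_loop(respond, read=input, write=print):
    """Keep answering messages with respond() until the user types quit or exit."""
    write("Hi! Type 'quit' to exit.")
    while True:
        user_input = read("> ").strip()
        if user_input.lower() in {"quit", "exit"}:
            write("Goodbye!")
            break
        if not user_input:
            # Ignore empty messages instead of sending them to the model
            continue
        write(respond(user_input))

# Scripted demonstration with a stand-in responder that echoes in uppercase
scripted_messages = iter(["hello", "", "quit"])
chat_loop(lambda message: message.upper(),
          read=lambda prompt: next(scripted_messages))
```

To talk to the real chatbot interactively, call chat_loop(generate_response) and leave the other arguments at their defaults.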

Congratulations! You have now created your very own chatbot using Python and NLTK. This is only the beginning.

There are other ways to modify and customize your chatbot to make it more useful and entertaining.

You can try:

— Adding more training data:

ChatterBot’s official documentation on training data:

Training - ChatterBot 1.0.8 documentation (chatterbot.readthedocs.io)

A list of pre-built conversation datasets for ChatterBot:

GitHub - gunthercox/chatterbot-corpus: A multilingual dialog corpus (github.com)

— Integrating with other APIs and platforms:

A tutorial on integrating a chatbot with Slack:

What is Slack Chatbot and how to create it? (2023 Tutorial) (chatimize.com)

— Giving the chatbot a personality:

A tutorial on creating a personality for your chatbot using RiveScript:

RiveScript::Tutorial - Learn to write RiveScript code (www.rivescript.com)

I hope you found this article informative and helpful, and that it inspired you to explore the fascinating world of natural language processing and chatbots. Don’t hesitate to experiment and try out new things, and remember that the most important thing is to have fun and enjoy the learning process.

Thank you for reading, and happy coding!