When NLP meets Shrek

Donkey! Let’s build a chatbot using Markov Chains!

Ade Balogun
Geek Culture
7 min read · Feb 4, 2022


Made by me, the author.

We are living during an explosion of invention. I often think about how magical it is that anyone with a computer and an internet connection can create something never before conceived. While modern computing has made a lot of this possible, I think open-source code is the greatest enabler of this movement. Today, we’re going to use the awesome power of the internet and the past several decades of research in natural language processing (or NLP) to make something beautiful. We’re going to make a Shrek bot.

But first, some background…

We need a bit of a crash course on NLP. NLP is a subfield of computer science that focuses on how computers can understand, process, and reproduce language. This is usually achieved with a model that simulates some aspect of human language. These models can be structured in different ways; usually, though, a language model is produced by decomposing a source document into its constituent parts and extracting meaning from those parts. Once meaning or a pattern is identified at this low level, the parts can be recombined to produce the more complex meaning contained in human speech.

Stay with me now, we’re going to talk briefly about Markov chains (not as bad as it sounds). In short, a Markov chain describes a process in which the probability of the next outcome depends only on the N previous outcomes, which makes it a natural fit for natural language processing. We’ll be using Markov chains to structure our very simple language model. NLP is a vast field with many more layers of depth to explore, but hopefully this simple example will spark your curiosity about how technology is creating more conversant chatbots and algorithms.
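To make the Markov property concrete, here is a tiny, hand-rolled sketch (a toy sentence of my own, not project code) that counts word-to-word transitions and turns them into the kind of probabilities our edges will represent:

```python
from collections import defaultdict, Counter

# Toy corpus: an order-1 Markov chain only looks at the single previous word.
words = "the ogre sees the donkey and the ogre sighs".split()

# Count how often each word follows each other word.
transitions = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    transitions[prev][nxt] += 1

# Turn counts into probabilities: P(next word | previous word).
for prev, counts in transitions.items():
    total = sum(counts.values())
    print(prev, {nxt: round(n / total, 2) for nxt, n in counts.items()})
# e.g. "the" -> {'ogre': 0.67, 'donkey': 0.33}
```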

To build this model, we’ll need to design a graph. A graph is a type of data structure that is useful for defining the relationships between one chunk of data and other chunks of data within the structure. These chunks of data are called “vertices”, and they are connected to each other by “edges”. The edges between vertices can vary in strength (their weight) or not exist at all. In a directed graph, each edge runs in only one direction: an edge that “flows” from vertex A to vertex B does not let us travel from B back to A unless a separate edge points the other way.

Storytime!

Let’s use an analogy to better understand graphs. Imagine we’re leaving Lord Farquaad’s castle for our Swamp. Those are the two points on the map we’re carrying: our origin and destination, respectively. Now let’s imagine that there are other stops along the way: maybe a witch’s hut, the town square, a church, or an abandoned castle. Some of these stops might help us get to our destination while others might be dead ends, in other words, not connected to the destination. If we are late for an appointment at the Swamp, our task might be to minimize how long it takes to reach our destination, in which case we would travel along the fastest path, or set of connected stops, to the Swamp.

An example of an undirected graph, produced by me
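To make the analogy concrete, here is a toy version of that map written as code (the stop names come from the story above; none of this is project code):

```python
# A toy graph of the stops from the story above. Each key is a vertex, and
# its list holds the vertices it shares an edge with. Listing every
# connection under both stops makes this graph undirected, like the figure.
stops = {
    "castle":           ["witch_hut", "town_square"],
    "witch_hut":        ["castle", "swamp"],
    "town_square":      ["castle", "church"],
    "church":           ["town_square"],   # a dead end on the way to the Swamp
    "swamp":            ["witch_hut"],     # our destination
    "abandoned_castle": [],                # not connected to anything at all
}

print(stops["castle"])  # the stops reachable directly from the castle
```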

This analogy is another way of describing how our language model will work. We will break a training text up into its individual words. These words are like the metaphorical stops from our example and will be the vertices of the graph. The edges that connect the words to each other describe how likely it is that word A is followed by word B in the training text. Note that an edge can exist in each direction between two words, each with its own weight: the likelihood of word B following word A is stored separately from the likelihood of word A following word B. Returning to the Farquaad analogy, our origin will be the seed word, or prompt, for the sentence. Our model will then randomly choose another word to follow based on the weights of the edges connected to the current word. This will continue until we reach our metaphorical destination. The destination can be decided by some arbitrary condition; perhaps we’ll end the sentence after 20 words or when we reach a period.

Let’s quickly recap what we have so far. We have an origin which is our seed word. For simplicity, this word will be the first word of our sentence, and it will prompt our model to generate the rest of the sentence. We also have our destination which is unknown, but it will be determined by some condition that we set. In our case, we decided to end the sentence after 20 words or when a period is suggested by the model. To connect the individual words between our origin and destination, we have our weighted edges which will be calculated during the training step using our Markov model. With this in mind, let’s start building!
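Here is a rough sketch of that generation loop, assuming the graph is stored as a plain dictionary that maps each word to a dictionary of follower counts (the real logic lives in api/markov.py and api/graph.py and may differ in its details):

```python
import random

def generate_sentence(word_graph, seed_word, max_words=20):
    """Walk the word graph from a seed word, picking each next word at
    random, weighted by how often it followed the current word in training."""
    sentence = [seed_word]
    current = seed_word
    while len(sentence) < max_words:
        followers = word_graph.get(current, {})   # {next_word: count, ...}
        if not followers:                         # dead end: no known followers
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        current = random.choices(words, weights=weights, k=1)[0]
        sentence.append(current)
        if current == ".":                        # our arbitrary stopping condition
            break
    return " ".join(sentence)
```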

Cleaning our data

First, we’ll need to put together a training set. This could be literally any body of text that we want. However, we don’t want to train on just any collection of words. Remember that the edges between the vertices in our word graph will be defined by probability. These probabilities are determined by the training set; we can create more coherent, Shrek-like results by shaping our training set. This shaping process can get as granular as we want. We can decide that we only want simple sentences that are always structured as “[subject][verb][optional: prepositional phrase/object/etc.]”. We can optimize for logical sentences of a certain length. We can omit certain words or phrases from the training set as a way to control the conversation topics of our finished model. However, these steps take some admittedly unglamorous work, so I’ll leave those decisions to you, dear reader.

For now, let’s just write some code. You can clone this project here. The repository contains three key files: api/scraper.py, api/markov.py, and api/graph.py. We will walk through what each of them does and how they come together for the final product. First, we need data. Fortunately, someone has gone through the trouble of transcribing Shrek the movie. I’ve put the source document in the repository, but let’s talk through how we would clean it up if we had scraped it from the web. Assuming we managed to scrape the script for the movie from some webpage, we would want to remove text that is non-conversational. For example, we can’t have director’s notes, stage directions, or sound effects present in the training data. We might write a quick script to delete any of these phrases. We’ll also want to remove characters like parentheses, brackets, tabs, newlines, and quotation marks.

While it technically would be better practice for our simple model to remove whole sentences that contain these characters, we’ll only remove the characters themselves for simplicity (think about why we might want to remove the whole sentence: if we remove these characters, doesn’t the meaning of the sentence change?). With our data cleaned, we can start constructing the model!
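As a rough sketch of that cleanup step (the regular expressions and the filename below are illustrative assumptions on my part, not the exact logic in api/scraper.py):

```python
import re

def clean_script(raw_text):
    """Strip non-conversational artifacts from a scraped movie script."""
    # Drop bracketed stage directions and parentheticals, e.g. "(sighs)" or "[thunder]".
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", raw_text)
    # Remove quotation marks, tabs, and newlines left over from the page.
    text = re.sub(r'["\t\n]+', " ", text)
    # Collapse the runs of spaces created by the removals above.
    return re.sub(r" {2,}", " ", text).strip()

with open("shrek_script.txt") as f:   # hypothetical filename
    cleaned = clean_script(f.read())
```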

Building our Word Graph

First, we’ll format the cleaned text so that our algorithm can read it. We will start by converting our source text into an array of words, being careful not to change the order of those words. We might also decide that we don’t want to change the casing of the words from the original document. This means that our graph will contain a vertex for “The” as well as a vertex for “the”; think about the implications of this: are the kinds of words that follow the capitalized “The” different from those that follow the lowercase “the”? You can play with this and see how it affects the results.
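A minimal version of that step might look like the sketch below. Splitting punctuation into its own tokens is an assumption on my part, suggested by the standalone commas and periods in the sample outputs later on; the repository may handle this differently:

```python
import re

def tokenize(text):
    """Split the cleaned text into an ordered list of tokens, keeping the
    original casing. Punctuation becomes its own token so the model can
    learn to end a sentence with a period."""
    return re.findall(r"[\w']+|[.,!?;]", text)

print(tokenize("The ogre roared. Donkey laughed, and the ogre laughed."))
# ['The', 'ogre', 'roared', '.', 'Donkey', 'laughed', ',', 'and', 'the', 'ogre', 'laughed', '.']
```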

Next, we’ll design our word graph as an adjacency list. An adjacency list is a way of representing a graph in which each vertex is mapped to the set of vertices it connects to. In other words, we can give our data structure a word and get back an array of the words that can follow it. We can use a hashmap (or dictionary, in Python terms) to store this relationship. I’ve already built these classes in graph.py. Our word graph class will hold the set of all the words in our training set. Each word will have an associated array of words, and each word in that array will be weighted by the number of times it appears directly after the input word.
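The real classes live in graph.py; the sketch below is a simplified version of the same idea, with class and method names of my own choosing rather than the repository’s:

```python
from collections import defaultdict, Counter

class WordGraph:
    """Adjacency list: each word maps to a Counter of the words that follow it."""

    def __init__(self):
        self.adjacency = defaultdict(Counter)

    def add_edge(self, word, follower):
        # Seeing 'follower' right after 'word' strengthens that edge by one.
        self.adjacency[word][follower] += 1

    def followers(self, word):
        # {follower: count, ...}; empty if the word was never seen in training.
        return dict(self.adjacency.get(word, {}))


def train(tokens):
    """Build the graph from an ordered list of tokens."""
    graph = WordGraph()
    for current, nxt in zip(tokens, tokens[1:]):
        graph.add_edge(current, nxt)
    return graph
```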

Now that we’ve defined our data structure, all we have to do is fill it with data! This is the training process, and since our corpus is so small, it’ll be quick. Once our model is built and stored in an instance of WordGraph, we can pickle the model for future use so we don’t have to re-train.
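Persisting the trained model is only a few lines with the standard library; the filename and the toy graph below are illustrative stand-ins:

```python
import pickle
from collections import Counter

# A stand-in for a trained word graph: each word maps to follower counts.
word_graph = {"Donkey": Counter({",": 3, "!": 2}), ",": Counter({"They": 1})}

# Save the trained model so future runs can skip the training step...
with open("shrek_word_graph.pkl", "wb") as f:
    pickle.dump(word_graph, f)

# ...and load it back whenever we want to generate text.
with open("shrek_word_graph.pkl", "rb") as f:
    word_graph = pickle.load(f)
```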

Results

Now we have our Shrek-text generator! If you’ve cloned the project, open a terminal and navigate to the top directory of the project. Create a virtual environment by typing virtualenv env (you may first need to install the virtualenv command with pip3 install virtualenv). Then type source env/bin/activate && pip3 install -r requirements.txt and hit enter. Finally, run python3 run.py, which should start the script in the terminal! Here are some of my results:

PROMPT: “Donkey”

RESPONSE: “Donkey , They never make it , ? You know , Well you’ve got to .”

RESPONSE: “Donkey . I’m a kemp wearing girl , what a mentally abused shading from the sun goes .”

RESPONSE: “Donkey , — Oh , Now , they ? no . I’m gonna see him ! .”

As you can see, our model mostly mimics human speech, though it lacks some of the coherence of the real thing. Still, we’ve accomplished a lot with such a light lift! Let’s imagine all the ways we might improve this model. What if we used the two previous words to measure probability, instead of only one? What if we labeled our words as nouns, adjectives, verbs, etc.? What if we considered sentence length or the rarity of certain words when calculating our weights? What if we added more source texts (say, the script from Shrek 2)? Hopefully, you feel empowered to continue exploring these concepts on your own! Leave a comment if you build anything cool!

Ade Balogun

Environmental Engineering, Columbia 2020. MSc. Candidate @ Imperial College London.