I, Storytelling Bot

How a little bot spins narratives through machine learning

Fan Yeng-Loon
Analytics Vidhya
12 min read · Dec 28, 2019


Photo by Dustin Lee on Unsplash

TL;DR: I built a simple bot that generates new text from a seed text, either randomly chosen or entered by the user. The final candidate was a deep learning model built with the Keras/TensorFlow libraries: an LSTM using word tokenisation and pre-trained Word2vec embeddings (Gensim). The model was trained on a dataset built from free short stories available from Pathfinder.

The complete code and notebooks can be found here.

Introduction

This was my final project at the Metis Data Science Bootcamp. I wanted to try something different from my previous projects, which dealt with prediction and image classification. Hence, under my instructor’s guidance, I turned my attention to creativity through machine learning: a text generation bot.

Text generation is one of many NLP (Natural Language Processing) applications. It can be both challenging and fun. And off we go…

Photo by Kelly Sikkema on Unsplash


Model Building and Selection

Below you will find the steps I took to build towards the final working model. To avoid repetition, I sometimes illustrate a step with details from just one of the three models.


Step 0. Project Folder Formatting

I prefer my code (and documentation) to be as organized as possible, and thus I leveraged some of the suggestions from here and here.

I have packaged commonly used modules into libraries stored here:

  • nlplstm_class.py — Library of classes to encapsulate GPU usage, as well as for NLP using Keras/Tensorflow LSTM models
  • data_common.py — Data loading, data saving, data pre-processing and text generation related functions
  • text_viz_common.py — Functions that visualize generated text using spaCy library

Without going into too much detail, the customized LSTM classes kept things sane for me: I could at any time train all of my models on a GPU on Google Colab or another cloud platform (for speed), then separately reload and predict with the trained models on my laptop without a powerful GPU (for convenience, or as a cheaper option).
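
For illustration, here is a minimal sketch (not the author’s actual class) of how such a wrapper can switch between GPU and CPU LSTM implementations in Keras 2.x, where CuDNNLSTM is the GPU-only fast path and plain LSTM runs anywhere:

from keras.layers import LSTM, CuDNNLSTM

class TFModelLSTMBase:  # hypothetical base class, for illustration only
    def __init__(self, use_gpu=False):
        self.use_gpu = use_gpu

    def select_LSTM(self, units, **kwargs):
        # pick the CuDNN-accelerated layer when a GPU is available;
        # weights trained with CuDNNLSTM can be reloaded into a
        # compatible plain LSTM layer for CPU-only inference
        if self.use_gpu:
            return CuDNNLSTM(units, **kwargs)
        return LSTM(units, **kwargs)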


Step 1. Raw Data Sourcing

In the past, I have enjoyed a couple of Pathfinder games with my son. He was always one of the players, while I ran these games as the DM (dungeon master). A DM wears a lot of hats, one of which is to provide narratives to keep the game going.

Needless to say, I saw this as a machine learning opportunity: let’s teach a bot to generate random, but useful and relevant, short narratives based on the Pathfinder stories it has learnt.

Fortunately, I was able to download free short adventures from the Pathfinder website, which became the dataset I used to train my model: all 15,000 usable words.

You can find the downloaded pdfs under “Pathfinder Beginner Box” subfolders here.


Step 2. Data Preparation

I applied the following mini-steps (a mix of manual and scripted work) to produce a single dataset suitable for the machine learning processes down the pipeline.

  • Convert PDFs into text files (you can do this with any free online converters)
  • Strip off chunks of text that do not contain any valuable story elements
  • Concatenate text files into a single raw text file here
  • Retain only proper sentences using this (a rough sketch of such a filter appears below) and save it here.
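
The sentence-retention script itself is linked above; purely as an illustration, a minimal filter in that spirit might look like this (retain_proper_sentences is a hypothetical name):

import re

def retain_proper_sentences(text):  # hypothetical, for illustration only
    # split on sentence-ending punctuation followed by whitespace
    candidates = re.split(r'(?<=[.!?])\s+', text)
    # keep only sentences that start with a capital letter and end with
    # terminal punctuation
    return [s for s in candidates if s and s[0].isupper() and s[-1] in '.!?']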

The resulting “textgen_pathfinder.txt” would be used as the source file for creating the dataset required to train the models below.


Step 3. Data Preprocessing

After a couple of initial exploratory experiments, I decided to try out three different LSTM models before settling on the best candidate:

  • Model 1: an LSTM using character tokenisation
  • Model 2: an LSTM using word tokenisation
  • Model 3: an LSTM using word tokenisation plus pre-trained Word2vec embeddings

I will take a closer look at how the third model was built, since it employs more or less the same techniques (tokenisation, embedding, LSTM/RNN, classification) as the first two. First, we need to import these libraries:

from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
import string
import textwrap
import pickle
from lib.nlplstm_class import (TFModelLSTMCharToken, TFModelLSTMWordToken, TFModelLSTMWord2vec)
from lib.data_common import (load_doc, save_doc, clean_doc, prepare_char_tokens)
from lib.data_common import (build_token_lines, prepare_text_tokens, load_word2vec)
from lib.data_common import pathfinder_textfile, fixed_length_token_textfile

Next, we load the document and tokenize it with this function:

def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # replace '-' with a space ' '
    doc = doc.replace('-', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
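
To see what the cleaning does, here is a quick example (the sample sentence is mine):

tokens = clean_doc("The PCs--heroes all--arrive in Sandpoint.")
print(tokens)
# ['the', 'pcs', 'heroes', 'all', 'arrive', 'in', 'sandpoint']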

Next, we organize these tokens into fixed-length lines and save them:

def build_token_lines(tokens, length=50):
    # add 1 so each line holds `length` input words plus 1 target word
    length += 1
    lines = list()
    for i in range(length, len(tokens)):
        # select sequence of tokens
        seq = tokens[i-length:i]
        # convert into a line
        line = ' '.join(seq)
        # store
        lines.append(line)
    return lines

Effectively, these fixed-length lines will be the dataset used to train the model.

Let’s use the above paragraph as a toy example dataset of 14 words, with the fixed length set at 5 words. This toy model will then be trained to predict the next word by learning from the previous 5 words. The resulting toy training dataset would look like this:
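
Here is that toy example reconstructed in code, using the previous paragraph’s 14 words (after cleaning) and a fixed length of 5; each resulting line holds 5 input words plus 1 target word:

toy_tokens = clean_doc('Effectively, these fixed-length lines will be '
                       'the dataset used to train the model.')
toy_lines = build_token_lines(toy_tokens, length=5)
for line in toy_lines[:3]:
    print(line)
# effectively these fixed length lines will
# these fixed length lines will be
# fixed length lines will be the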

That is how our actual dataset will look too. We then transform the fixed-length text tokens into a format ready for LSTM training:

import numpy as np

def prepare_text_tokens(lines):
    # integer encode sequences of words
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    sequences = tokenizer.texts_to_sequences(lines)
    # vocabulary size
    vocab_size = len(tokenizer.word_index)
    # split into input words (X) and target word (y)
    npsequences = np.array(sequences)
    X, y = npsequences[:, :-1], npsequences[:, -1]
    # one-hot encode the target word over the vocabulary
    y = to_categorical(y, num_classes=vocab_size+1)
    seq_length = X.shape[1]

    return X, y, seq_length, vocab_size, tokenizer
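
Putting these helpers together (a minimal sketch using the filenames and functions imported earlier; the argument order of save_doc is assumed), the word-level preprocessing pipeline looks roughly like this:

# load the cleaned source text and tokenize it
doc = load_doc(pathfinder_textfile)
tokens = clean_doc(doc)
# organize tokens into fixed-length lines and save them for reuse
lines = build_token_lines(tokens, length=50)
save_doc(lines, fixed_length_token_textfile)
# encode the lines into training inputs and one-hot targets
X, y, seq_length, vocab_size, tokenizer = prepare_text_tokens(lines)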

The third model is special in that it uses pre-trained weights to jump-start its training. So, we need to build this set of weights by training a Gensim Word2vec model on our fixed-length lines of tokens:

from gensim.models import Word2Vec

def load_word2vec(lines):
    # split tokens up per line for Gensim Word2vec consumption
    sentences = [line.split() for line in lines]
    print('\nTraining word2vec...')
    # workers=1 ensures a fully deterministically-reproducible run, per the Gensim docs
    # (Gensim 3.x API: in Gensim 4+, size/iter/wv.syn0 become vector_size/epochs/wv.vectors)
    word_model = Word2Vec(sentences, size=300, min_count=1, window=5, iter=100, workers=1)
    pretrained_weights = word_model.wv.syn0
    vocab_size, embedding_size = pretrained_weights.shape
    print('Result embedding shape:', pretrained_weights.shape)
    return vocab_size, embedding_size, pretrained_weights


Step 4. Model Training

Once we have gathered all the necessary components from the previous step, we can define the network for the third model:

The first layer we add to the model is an embedding layer. ‘vocab_size’ is the number of unique words in our dataset. Most articles I have come across suggest setting the embedding dimension (‘embedding_size’) somewhere in the range of 50–300. This is also where we load the previously trained ‘pretrained_weights’ into the model.

self.model.add(Embedding(input_dim=vocab_size,
                         output_dim=embedding_size,
                         weights=[pretrained_weights]))

Now we add two LSTM layers, regularized with a couple of dropout layers.

self.model.add(self.select_LSTM(embedding_size, return_sequences=True))
self.model.add(Dropout(0.2))
self.model.add(self.select_LSTM(embedding_size))
self.model.add(Dropout(0.2))

We complete the model with two Dense layers, the latter being the output layer.

self.model.add(Dense(embedding_size, activation='relu'))
self.model.add(Dense((vocab_size+1), activation='softmax'))
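
Assembled from the fragments above, the full define method might look roughly like this sketch (the author’s actual class wraps a few more details):

from keras.models import Sequential
from keras.layers import Embedding, Dropout, Dense

def define(self, vocab_size, embedding_size, pretrained_weights):
    # sketch: the complete model, assembled from the fragments above
    self.model = Sequential()
    self.model.add(Embedding(input_dim=vocab_size,
                             output_dim=embedding_size,
                             weights=[pretrained_weights]))
    self.model.add(self.select_LSTM(embedding_size, return_sequences=True))
    self.model.add(Dropout(0.2))
    self.model.add(self.select_LSTM(embedding_size))
    self.model.add(Dropout(0.2))
    self.model.add(Dense(embedding_size, activation='relu'))
    self.model.add(Dense((vocab_size+1), activation='softmax'))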

With that in mind, let’s create an object of the above class, supplying the required parameters:

# create new object that is an LSTM model using word tokenization
# and pre-trained Word2vec model from Gensim to generate text
textgen_model_3 = TFModelLSTMWord2vec(use_gpu=True)
textgen_model_3.define(vocab_size=vocab_size,
                       embedding_size=embedding_size,
                       pretrained_weights=pretrained_weights)

Fundamentally, this is a classification exercise, hence ‘categorical_crossentropy’ will be used:

# compile model
textgen_model_3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

We can now train the model and save the weights, so that we can reload these weights from a different cloud platform or a local machine:

# fit model
history = textgen_model_3.fit(X, y, batch_size=128, epochs=50)
# serialize model weights to HDF5 and save model training history
textgen_model_3.save_weights_and_history(fname_prefix="./model/pathfinder_wordtoken_w2v_model_50_epoch")
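
Later, on a CPU-only machine, the same weights can be reloaded. This is a sketch only; it assumes the wrapper exposes a load method symmetric to save_weights_and_history, whose exact name may differ:

# redefine the same architecture, this time without a GPU
textgen_model_3 = TFModelLSTMWord2vec(use_gpu=False)
textgen_model_3.define(vocab_size=vocab_size,
                       embedding_size=embedding_size,
                       pretrained_weights=pretrained_weights)
# reload the trained weights (assumed method name)
textgen_model_3.load_weights(fname_prefix="./model/pathfinder_wordtoken_w2v_model_50_epoch")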

This is the complete script of the preprocessing and model training processes for all 3 models:

Colab notebooks for each of the models are also available here.

Last but not least, we also need to save each model after a successful training run:

The corresponding Colab notebook is available here.


Step 5. Sample Text Prediction

Now that we have trained our models and saved copies of each, let’s run some sample text predictions with all three models and see how each of them performs.

Model 1 generates text by predicting one character at a time, returning the complete generated text only when it reaches the required ‘n_chars’:

Models 2 and 3 generate text by predicting the next word each time; the routine returns the generated text once it has reached the required ‘n_words’. A sketch of the loop follows:
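
Here is a minimal sketch of such a word-level generation loop; it assumes the model wrapper exposes Keras’s predict, and it uses the sample_predict temperature helper discussed next:

from keras.preprocessing.sequence import pad_sequences

def generate_seq_of_words(model, tokenizer, seq_length, seed_text, n_words, temperature):
    # sketch: generate n_words, one word per prediction
    in_text = seed_text
    result = []
    for _ in range(n_words):
        # encode the running text and trim it to the training sequence length
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict a probability distribution over the vocabulary
        preds = model.predict(encoded, verbose=0)[0]
        # temperature-sample the index of the next word
        yhat = sample_predict(preds, temperature)
        # map the sampled index back to a word
        out_word = next((w for w, i in tokenizer.word_index.items() if i == yhat), '')
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)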

Both text generation routines employ a rather ingenious helper function to add a degree of randomness to each prediction. This is achieved via a hyperparameter known as ‘temperature’ or ‘diversity’.
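
As a sketch, the classic temperature-sampling trick (popularized by the Keras text-generation example, and presumably close to the author’s sample_predict) looks like this:

import numpy as np

def sample_predict(preds, temperature=1.0):
    # sketch of a classic temperature-sampling helper
    preds = np.asarray(preds).astype('float64')
    if temperature <= 0:
        # zero temperature degenerates to a plain argmax
        return int(np.argmax(preds))
    # rescale the log-probabilities by the temperature and re-normalize
    preds = np.log(preds + 1e-10) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw a single sample from the rescaled distribution
    probas = np.random.multinomial(1, preds, 1)
    return int(np.argmax(probas))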

A temperature near zero makes the sampling stick closely to the highest-probability prediction; the higher the temperature, the more random the sampled outcomes become. This is best illustrated by testing the helper function on a toy probability array:

preds=[0.05, 0.1, 0.35, 0.5]
print([preds[sample_predict(preds,0.05)] for _ in range(10)])
print([preds[sample_predict(preds,1)] for _ in range(10)])
print([preds[sample_predict(preds,5)] for _ in range(10)])

Without the ‘temperature’, the prediction would always return ‘0.5’, the largest probability. Adding temperature introduces an exciting level of uncertainty that you can control.

[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] 
[0.5, 0.35, 0.35, 0.5, 0.5, 0.5, 0.1, 0.35, 0.5, 0.35]
[0.05, 0.05, 0.35, 0.5, 0.05, 0.1, 0.5, 0.5, 0.35, 0.1]

For model 1, this is how I ran it:

temperature_table = [0, 0.7]
for temperature in temperature_table:
    generated = generate_seq_of_chars(textgen_model_1,
                                      num_unique_char, char2indices, indices2char,
                                      char_seed_text, maxlen, 300, temperature)
    print(">> generated text (temperature: {})".format(temperature))
    print(textwrap.fill('%s' % (generated), 80))
    print()

And the results were as follows:

>> generated text (temperature: 0)
comes a dissong creck abread to eaper the ring to hear the sandpoint in the wind a small captain now a points and styyengess-demer, and a scccentions from a for the rap, the beliening of shobthess gropp and pcs who elemental in surprised to hel make a for gite and stealsh with a worken of golds wit
>> generated text (temperature: 0.7)
combat—yagg, and is she robb as magnimar’s hork samp, and as not a points and following the beat of gold, in simpating the mapical mumber she wastreaks he enter the pcs may of sandpoint’s strypeled betore and to searing the maps nom a can grack fagilies she remares staight acamem, and for sceeters

For models 2 and 3, I ran this:

temperature_table = [0, 1.0]
for temperature in temperature_table:
    generated = generate_seq_of_words(textgen_model_2, tokenizer,
                                      seq_length, word_seed_text, 100, temperature)
    print(">> generated text (temperature: {})".format(temperature))
    print(textwrap.fill('%s' % (generated), 80))
    print()

And the results for model 2 were:

>> generated text (temperature: 0)
or knowledge religion check defeating the undead is easier if the pcs extinguish the candle of the development section with the skeletons defeated the pcs can deal with the candle of night with a successful dc knowledge arcana or knowledge religion check the pcs learn this minor magic item cannot be extinguished save by snuffing the flame with live flowing blood hazelindra adds that the pcs can
keep the candle as long as they do not tell the academy of her connection to this situation the cemetery is half a mile west of the town and is accessible via a
>> generated text (temperature: 1.0)
or knowledge religion check defeating the undead is easier if the pcs extinguish the candle of the development section with the skeletons defeated the pcs can deal with the candle of night with a successful dc knowledge arcana or knowledge religion check the pcs learn this minor magic item cannot be extinguished save by snuffing the flame with live flowing blood hazelindra adds that the pcs can
keep the candle as long as they do not tell that about a long plum sized ruby calling it the fire of versade savasha versade has decided to display it publicly for the

And for model 3:

>> generated text (temperature: 0)
or knowledge religion check defeating the undead is easier if the pcs extinguish the candle of the development section with the skeletons defeated the pcs can deal with the candle of night with a successful dc knowledge arcana or knowledge religion check the pcs learn this minor magic item cannot be extinguished save by snuffing the flame with live flowing blood in order for the pcs to put out
its flame and prevent more undead from rising from graves along their path back to sandpoint they must douse the candle in blood from an open wound dealing at least points
>> generated text (temperature: 1.0)
or knowledge religion check defeating the undead is easier if the pcs extinguish the candle of the development section with the skeletons defeated the pcs can deal with the candle of night with a successful dc knowledge local or knowledge religion check defeating the undead is easier if the pcs extinguish the candle of the development section with the skeletons defeated the pcs can deal with the candle of night with a successful dc knowledge arcana or knowledge religion check the pcs learn this minor magic item cannot be extinguished save by snuffing the flame with live flowing blood in

Here’s the complete code for generating text at various temperatures with all three models:

The Colab version is available here.


Step 6. Model Selection

When deciding on the final candidate model, I weighed several factors: time spent training, model complexity, model accuracy, coherence of the generated text, and so on.

In the end, I settled on model 3, since it attained similar accuracy/loss to the other two models faster, thus using fewer resources. Judging from the sample text generations, its output was also decently comprehensible overall.


Step 7. Text Visualization

Now that we have a final candidate model, let’s add visualization value using spaCy’s NER (named entity recognition) feature. In particular, I had to define new named entities to recognize domain-specific names from the Pathfinder fantasy world.

The first step was to collect lists and dictionaries of new named entity types:

god_name_list = ['Erastil', 'Aroden', 'Desna', 'Sarenrae']
race_name_list = ['Azlanti', 'Varisian', 'Thassilonian', 'Korvosan', 'Magnimarian']
...
sp_name_list = ['Burning', 'Hands']
entity_names = {'GOD': god_name_list, 'RACE': race_name_list, 'ORG': org_name_list,
                'MOB': mob_name_list, 'PER': per_name_list, 'LOC': loc_name_list,
                'SP': sp_name_list}

god_labels = ['Erastil', 'Aroden', 'Desna', 'Sarenrae']
race_labels = ['Azlanti', 'Varisian', 'Thassilonian', 'Korvosan', 'Magnimarian']
...
sp_labels = ['Burning Hands']
entity_labels = {'GOD': god_labels, 'RACE': race_labels, 'ORG': org_labels,
                 'MOB': mob_labels, 'PER': per_labels, 'LOC': loc_labels,
                 'SP': sp_labels}

Next, we register the new named entities with a spaCy matcher, with the help of PhraseMatcher:

from spacy.matcher import PhraseMatcher

def get_matcher(nlp, entity_labels):
    matcher = PhraseMatcher(nlp.vocab)
    for entity, label_list in entity_labels.items():
        # compile each name into a spaCy pattern and register it under its label
        entity_patterns = [nlp(text) for text in label_list]
        matcher.add(entity, None, *entity_patterns)

    return matcher
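
Assuming a standard spaCy English pipeline, wiring the matcher up might look like:

import spacy

# load a stock English pipeline and build the Pathfinder matcher
nlp = spacy.load('en_core_web_sm')
matcher = get_matcher(nlp, entity_labels)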

With that done, we can now use the updated matcher with displacy.render to highlight important names unique to Pathfinder.

from spacy import displacy
from spacy.tokens import Span

doc = nlp(revised_text)
matches = matcher(doc)
spans = []
for match_id, start, end in matches:
    # get the unicode ID, i.e. 'COLOR'
    rule_id = nlp.vocab.strings[match_id]
    # get the matched slice of the doc
    span = doc[start : end]
    spans.append(Span(doc, start, end, label=rule_id))
doc.ents = spans
print()
print('-'*95)
options = {"ents": ['GOD','MOB','PER','LOC','RACE','ORG','SP'],
           "colors": {'GOD':'#f2865e','MOB':'#58f549','PER':'#aef5ef',
                      'LOC':'pink','RACE':'#edcb45','ORG':'#d88fff', 'SP':'pink'}}
print('Snaug_bot:')
if using_notebook:
    displacy.render(doc, style='ent', jupyter=True, options=options)
else:
    displacy.render(doc, style='ent', options=options)

The complete functions to achieve these can be found here.


Step 8. Final Model Text Generation

Hurrah! We are now ready to spin up the final working model. I added a loop to check for user input: if you just hit <enter>, it randomly picks a seed text from the original dataset and uses it to generate, hopefully, something cool. To quit, type ‘quit’ and hit <enter>.

text_input = 'random'
while True:
    text_input = input("Enter seeding text or hit <ENTER> to automate or 'quit' to exit: ")

    if text_input == 'quit':
        break
    else:
        if text_input == '':
            text_input = 'random'
        generate_and_visualize(lines, textgen_model, tokenizer,
                               seq_length, nlp, matcher, entity_names, entity_labels,
                               text_input=text_input)

The complete colab notebook can be found here and here’s an example output:


Conclusions

This has certainly been a fun project to work on, as I got to explore many different and sometimes diverse areas of ML. It’s pretty amazing to watch this simple app seemingly creating narratives all on its own. The only ingredients we added were a few short stories with a common background, a fairly routine NLP LSTM model for text generation, and some funky spaCy features to brighten things up a little.

All in all, we could definitely improve this model with the following extended goals:

  • Automate some of the manual and semi-auto processes
  • Attempt sentence and document embedding
  • Look into generating proper sentences with punctuation and all
  • Add a helper function to match up user input text closer to seeding text
  • Explore other text generation models out there

I hope you have enjoyed reading this article and perhaps even find something useful for your own projects! :)
