Shakespeare’s Famous Play, Rewritten by an AI Poet

Darren Su
12 min read · Apr 2, 2023
Illustration: Chad Hagen

Shakespearean plays are some of the most celebrated works in the literary world, renowned for their intricate plots, unforgettable characters, and eloquent language.

“All the world’s a stage, and all the men and women merely players.”

While Shakespeare himself is considered one of the greatest writers in history, modern technology is catching up to human creativity, bringing new and exciting possibilities for the art of playwriting.

Natural Language Processing (NLP) is one such technology that has captured the attention of writers and scholars alike, providing the means to generate scripts that capture the essence of Shakespeare’s style.

In this article, we’ll try to recreate that same feel by training an NLP model to write Shakespearean scripts.

But first and foremost, what exactly are we building?

Natural Language Processing

We’re going to develop a model that falls in the NLP subset of AI. NLP stands for “Natural Language Processing” and branches into two parts: NLG (Natural Language Generation) and NLU (Natural Language Understanding).

NLP Diagram

As you can most likely guess from the names, NLG focuses on generating text while NLU focuses on understanding it. Together, these two parts can form an AI that generates text by analyzing data.

Still, “NLP model” is an extremely vague term, given how many different models fall under that category. So let’s narrow this down a bit.

Recurrent Neural Network

We’re using an RNN (Recurrent Neural Network), or more specifically an LSTM (Long Short-Term Memory) network, which sits at the intersection of DL (Deep Learning) and NLP.

A Recurrent Neural Network (RNN) is a type of artificial neural network that can process sequential data, such as English.

RNN Diagram

Using a feedback loop, the model can retain its previous output as part of the input for the next time step, thus enabling it to learn and remember patterns in the input data over time.

This makes RNNs particularly useful in text prediction, since they can use everything they have read so far to predict the next character, not just the single character that came before it.
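To make the feedback loop concrete, here is a minimal NumPy sketch of a plain RNN step. It is not part of the model built later in this article; the sizes, weight names (W_xh, W_hh, b_h) and random inputs are made up purely for illustration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a plain RNN: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3                 # hypothetical toy sizes
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))    # 5 time steps of made-up input
h = np.zeros(hidden_size)                      # the hidden state starts out empty
for x_t in sequence:
    # The hidden state produced at this step is fed back in at the next step
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```

After the loop, h summarizes the whole sequence, which is what lets the network condition its next prediction on more than the latest input.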

However, RNNs aren’t exactly perfect. Traditional RNNs run into the “vanishing gradient problem” when processing long sequences or when stacked into many layers.

Vanishing Gradient Problem

This occurs when the partial derivatives of the loss function with respect to the weights (the gradients used to update them) approach zero. The weight updates then become negligible, so the affected layers effectively stop learning.

Graph Of Loss Function Derivative Value

Let’s look at an example (assume the sigmoid output is 0.5):

Output (Y) = 0.5 * … 0.5 * x + b

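See, each time the signal is passed on to the next layer of the RNN, it is multiplied by another factor below 1 and becomes smaller and smaller. The same thing happens to the gradient on its way back: if the model has lots of layers, the gradient will be practically nonexistent by the time it reaches the earliest ones, so their weights cannot improve.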
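A few lines of Python show how quickly this happens. This is only a rough sketch: it assumes every layer contributes the same factor of 0.5, as in the example above, and the choice of 20 layers is arbitrary.

```python
factor = 0.5        # assumed per-layer factor, taken from the example above
signal = 1.0
history = []
for layer in range(1, 21):
    signal *= factor            # every layer shrinks the value by the same factor
    history.append(signal)

# Value remaining after 1, 10 and 20 layers
print(history[0], history[9], history[19])
```

After 20 layers less than one millionth of the original value is left, which is why the earliest layers barely receive any update.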

This jeopardizes the whole network, as the whole point of training is for the weights to improve.

To counter this problem, we’re going to use an LSTM (Long Short-Term Memory) network.

Long Short-Term Memory Network

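An LSTM is a form of RNN. Unlike a plain RNN, however, it can retain information over long, arbitrary intervals, which makes it far less sensitive to long sequences. The LSTM keeps this information in the “cell state”, using gates built from “sigmoid” and “tanh” functions to decide what to store and what to forget.

It then releases that information whenever it becomes relevant. Because the cell state is updated mostly by addition, gradients also flow back through it more easily during backpropagation.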

LSTM Layer Diagram

Let’s go through the parts in the diagram one by one:

  1. Cell State: This is the memory unit of the LSTM. It acts as a conveyor belt, passing information from one time-step to the next. Information is added to or removed from it only through the gates, using simple element-wise operations, so whatever the gates leave alone travels along it largely unchanged.
  2. Input Gate: The input gate decides what information to add to the cell state. It is controlled by a sigmoid function, which outputs values between 0 and 1, determining how much of the new input should be added to the cell state.
  3. Forget Gate: The forget gate decides what information to remove from the cell state. It is controlled by a sigmoid function, which outputs values between 0 and 1, determining how much of the existing cell state should be removed.
  4. Output Gate: The output gate decides what information to output from the cell state. It is controlled by a sigmoid function and a tanh function. The sigmoid function decides which parts of the cell state to output, and the tanh function determines the values of those parts.
  5. Hidden State: The hidden state is the output of the LSTM at each time-step. It is determined by the input, previous hidden state, and the current cell state.
  6. Peephole connections: These are optional connections that allow the gates to look into the cell state to make their decisions.
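The parts above can be put together in a short NumPy sketch of a single LSTM time step. It is a simplified illustration rather than the TensorFlow cell used later: it omits peephole connections, packs all four gate weights into one hypothetical matrix W, and uses made-up toy sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (no peephole connections)."""
    hidden = h_prev.shape[-1]
    # Compute the pre-activations of all four parts in one matrix product
    z = np.concatenate([x_t, h_prev], axis=-1) @ W + b
    i = sigmoid(z[..., 0 * hidden:1 * hidden])   # input gate: how much new info to add
    f = sigmoid(z[..., 1 * hidden:2 * hidden])   # forget gate: how much old state to keep
    o = sigmoid(z[..., 2 * hidden:3 * hidden])   # output gate: how much state to expose
    g = np.tanh(z[..., 3 * hidden:4 * hidden])   # candidate values for the cell state
    c_t = f * c_prev + i * g                     # updated cell state
    h_t = o * np.tanh(c_t)                       # new hidden state (the step's output)
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3                   # hypothetical toy sizes
W = rng.normal(scale=0.1, size=(input_size + hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):     # 5 made-up time steps
    h, c = lstm_step(x_t, h, c, W, b)
```

Note that the cell state is updated with a multiplication by the forget gate and an addition, rather than being squashed through an activation at every step, which is what helps information and gradients survive over many time steps.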

Together, these parts produce a network that is far less burdened by the vanishing gradient problem. Of course, you can’t stack piles of layers and expect no vanishing gradient at all, but an LSTM removes most of the risk.

Now that we know what we’re going to build, let’s get straight into building the actual model!

Time To Get Building!

import numpy as np
import tensorflow as tf
from tensorflow.python.ops.numpy_ops import np_config
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import random
import time
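import os

# The placeholders and sessions used below are TF1-style graph code, so eager
# execution has to be switched off when running under TensorFlow 2.x.
tf.compat.v1.disable_eager_execution()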

Preprocessing Data

The required libraries are imported, including NumPy for numerical computations, TensorFlow for building and training the model, Matplotlib and Seaborn for visualizations, Tqdm for a progress bar, Random for generating random numbers, and Time for measuring time.

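def get_vocab(file, lower = False):
    # Read the whole file into a single string
    with open(file, 'r') as fopen:
        data = fopen.read()
    if lower:
        data = data.lower()
    # Unique characters, sorted so the character-to-index mapping is stable between runs
    vocab = sorted(set(data))
    return data, vocab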

The get_vocab() function reads a text file and returns the contents of the file as well as a vocabulary list consisting of unique characters in the text.

def embed_to_onehot(data, vocab):
    # One row per character, one column per vocabulary entry
    onehot = np.zeros((len(data), len(vocab)), dtype = np.float32)
    for i in range(len(data)):
        onehot[i, vocab.index(data[i])] = 1.0
    return onehot

The embed_to_onehot() function converts the input text into one-hot encoded vectors, which are required to train the LSTM model.
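To see what this encoding looks like, here is a small self-contained sketch that applies the same function to a made-up five-character string instead of the full Shakespeare file. The names toy_text and toy_vocab are illustrative only.

```python
import numpy as np

def embed_to_onehot(data, vocab):
    # One row per character, one column per vocabulary entry
    onehot = np.zeros((len(data), len(vocab)), dtype=np.float32)
    for i in range(len(data)):
        onehot[i, vocab.index(data[i])] = 1.0
    return onehot

toy_text = "to be"                    # made-up sample text
toy_vocab = sorted(set(toy_text))     # [' ', 'b', 'e', 'o', 't']
encoded = embed_to_onehot(toy_text, toy_vocab)

# Each row holds a single 1.0, so argmax recovers the original characters
decoded = "".join(toy_vocab[i] for i in encoded.argmax(axis=1))
```

Every row of encoded contains exactly one 1.0, in the column of the character it represents, so the encoding can be reversed with an argmax.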

One-hot encoding
text, text_vocab = get_vocab('/content/shakespeare.txt', lower = False)

The text and text_vocab variables are obtained by calling the get_vocab() function on the shakespeare.txt file.

learning_rate = 0.01
batch_size = 128
sequence_length = 64
epoch = 10000
num_layers = 5
size_layer = 512
possible_batch_id = range(len(text) - sequence_length - 1)

The hyperparameters for the model are set.

Building The Layers

class Model:
    def __init__(self, num_layers, size_layer, dimension, sequence_length, learning_rate):

The class of the model is defined, which takes several parameters as inputs:

  • num_layers: the number of LSTM layers to use in the network.
  • size_layer: the number of hidden units in each LSTM layer.
  • dimension: the dimensionality of the input and output vectors.
  • sequence_length: the length of the input and output sequences.
  • learning_rate: the learning rate for the optimizer used to train the network.

        def lstm_cell():
            # Keep in mind that all of this is still inside the class's __init__ method
            return tf.compat.v1.nn.rnn_cell.BasicLSTMCell(size_layer, state_is_tuple=False)

The LSTM cells are created using the BasicLSTMCell() function.

Stacked LSTM
        self.rnn_cells = tf.compat.v1.nn.rnn_cell.MultiRNNCell(
            [lstm_cell() for _ in range(num_layers)], state_is_tuple=False)

Multiple LSTM layers are stacked together using the MultiRNNCell() function.

        self.X = tf.compat.v1.placeholder(tf.float32, (None, None, dimension))
        self.Y = tf.compat.v1.placeholder(tf.float32, (None, None, dimension))
        self.hidden_layer = tf.compat.v1.placeholder(tf.float32,
                                                     (None, num_layers * 2 * size_layer))

Three placeholders are defined:

X: a 3D tensor of shape (batch_size, sequence_length, dimension), which holds the input sequences.

Y: a 3D tensor of shape (batch_size, sequence_length, dimension), which holds the target output sequences.

hidden_layer: a 2D tensor of shape (batch_size, num_layers * 2 * size_layer), which holds the initial hidden and cell states of the LSTM cell.

        self.outputs, self.last_state = tf.compat.v1.nn.dynamic_rnn(
            self.rnn_cells, self.X,
            initial_state = self.hidden_layer,
            dtype = tf.float32)

A dynamic RNN is created using the dynamic_rnn function from TensorFlow, which takes as input:

  • the multi-layer LSTM cell created earlier.
  • the input sequences X.
  • the initial hidden and cell states in “hidden_layer”.
  • the data type to use (float32 in this case).

The outputs of the dynamic RNN are two tensors:

  • outputs: a 3D tensor of shape (batch_size, sequence_length, size_layer), which holds the hidden states of the LSTM cell at each time step.
  • last_state: a 2D tensor of shape (batch_size, num_layers * 2 * size_layer), which holds the final hidden and cell states of the LSTM cell.

        rnn_W = tf.Variable(tf.random.normal((size_layer, dimension)))
        rnn_B = tf.Variable(tf.random.normal([dimension]))
        self.logits = (tf.matmul(tf.reshape(self.outputs, [-1, size_layer]), rnn_W) + rnn_B)

The output of the LSTM is passed through a fully connected layer to obtain the logits, which are used to calculate the loss using the softmax cross-entropy function.

        y_batch_long = tf.reshape(self.Y, [-1, dimension])
        self.cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits = self.logits, labels = y_batch_long))
        # The optimizer has to be created after self.cost, since it minimizes that tensor
        self.optimizer = tf.compat.v1.train.RMSPropOptimizer(learning_rate, 0.9).minimize(self.cost)

RMSProp Formula

An optimizer is defined using the RMSPropOptimizer algorithm with a learning rate of “learning_rate”.
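For intuition, the core RMSProp update can be sketched in plain NumPy. This is a simplified version that minimizes a toy function, f(w) = w², with the same learning rate (0.01) and decay (0.9) as above. TensorFlow's implementation may differ in details such as momentum and where the small epsilon constant is applied, and the name rmsprop_step is made up for this example.

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, learning_rate=0.01, decay=0.9, eps=1e-10):
    """One simplified RMSProp update for a single parameter."""
    # Running average of the squared gradient
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    # Scale the step by the root of that average, so step sizes stay comparable
    w = w - learning_rate * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square

w, mean_square = 1.0, 0.0
for _ in range(500):
    grad = 2.0 * w                    # gradient of the toy loss f(w) = w**2
    w, mean_square = rmsprop_step(w, grad, mean_square)
```

Dividing by the running root-mean-square of the gradient keeps the step size roughly constant whether the raw gradient is large or tiny, which is helpful in recurrent networks where gradient magnitudes vary a lot.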

        self.correct_pred = tf.equal(
            tf.argmax(self.logits, 1), tf.argmax(y_batch_long, 1))

The “correct_pred” tensor is created to check how many predictions were correct by comparing the predicted class labels (the maximum value of the logits) with the actual class labels in Y.

        self.accuracy = tf.reduce_mean(tf.cast(self.correct_pred, tf.float32))

The “accuracy” tensor is created to calculate the percentage of correct predictions.
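The cost and accuracy computations can be mirrored in NumPy on a tiny hand-made batch, which may make the TensorFlow ops easier to follow. The logits and labels below are invented values for two predictions over a three-character vocabulary, and the helper names are illustrative.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum first for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def softmax_cross_entropy(logits, onehot_labels):
    # Negative log-probability assigned to the correct class, per row
    probs = softmax(logits)
    return -np.sum(onehot_labels * np.log(probs + 1e-12), axis=-1)

logits = np.array([[2.0, 0.5, -1.0],      # made-up scores, row 1 favors class 0
                   [0.1, 0.2, 3.0]])      # row 2 favors class 2
labels = np.array([[1.0, 0.0, 0.0],       # true class 0 (prediction is right)
                   [0.0, 1.0, 0.0]])      # true class 1 (prediction is wrong)

losses = softmax_cross_entropy(logits, labels)
cost = losses.mean()                                       # like tf.reduce_mean over the batch
correct_pred = logits.argmax(axis=1) == labels.argmax(axis=1)
accuracy = correct_pred.astype(np.float32).mean()          # fraction of correct predictions
```

The wrong prediction in the second row receives a much larger loss than the correct one in the first, and the accuracy for this batch is one correct prediction out of two.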

        seq_shape = tf.shape(self.outputs)
        self.final_outputs = tf.reshape(
            tf.nn.softmax(self.logits), (seq_shape[0], seq_shape[1], dimension))

The “final_outputs” tensor is created to hold the final predictions, which are obtained by applying the softmax function to the logits.

More Preprocessing & Setting Parameters

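tf.compat.v1.reset_default_graph()
sess = tf.compat.v1.InteractiveSession()
model = Model(num_layers, size_layer, len(text_vocab), sequence_length, learning_rate)
sess.run(tf.compat.v1.global_variables_initializer())

The TensorFlow graph is reset, and an interactive session (called sess here) is created. The Model class is instantiated with the hyperparameters defined earlier, and the global variables are initialized.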

split_text = text.split()
tag = split_text[np.random.randint(0, len(split_text))]

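The split_text variable contains the text split into a list of words, and tag is one word picked at random from that list. It is used later as the starting text for generation.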

def train_random_sequence():
    LOST, ACCURACY = [], []

The function starts by defining two empty lists, LOST and ACCURACY, which will be used to store the loss and accuracy values during training.

    pbar = tqdm(range(epoch), desc = 'epoch')

The function then creates a tqdm progress bar object that will be used to visualize the progress of training over the specified number of epochs.

    for i in pbar:
        last_time = time.time()
        init_value = np.zeros((batch_size, num_layers * 2 * size_layer))

The range() function specifies the number of epochs, and the desc argument provides a description for the progress bar.

The function initializes a zero matrix for the initial hidden state of the model, with dimensions (batch_size, num_layers * 2 * size_layer).

        batch_x = np.zeros((batch_size, sequence_length, len(text_vocab)))
        batch_y = np.zeros((batch_size, sequence_length, len(text_vocab)))

The num_layers, size_layer, and batch_size variables are hyperparameters that determine the architecture of the model.

The function initializes two zero matrices for the input and output sequences of the model, with dimensions (batch_size, sequence_length, len(text_vocab)).

        batch_id = random.sample(possible_batch_id, batch_size)

The sequence_length variable determines the length of each sequence input to the model, and the len(text_vocab) variable specifies the size of the vocabulary for the language model.

The function randomly selects batch_size indices from the list of possible batch indices, which is a list of indices that span the entire text corpus.

        for n in range(sequence_length):
            # Inputs are the characters at position n, targets the characters one step later
            id1 = embed_to_onehot([text[k + n] for k in batch_id], text_vocab)
            id2 = embed_to_onehot([text[k + n + 1] for k in batch_id], text_vocab)
            batch_x[:,n,:] = id1
            batch_y[:,n,:] = id2

For each element in the sequence length, the function creates two one-hot encoded matrices, id1 and id2, that represent the input and output sequences for that element.

The matrices are created by embedding the characters at the specified indices into one-hot encoded vectors using the embed_to_onehot() function.

The resulting matrices are then assigned to the corresponding positions in the batch_x and batch_y matrices.
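The key idea in this step is that each target sequence is simply the input sequence shifted one character forward. The sketch below shows that relationship on a short made-up string, leaving out the one-hot encoding. The names toy_text, starts, inputs and targets are illustrative and not part of the training code.

```python
import random

toy_text = "to be, or not to be"       # made-up sample text
sequence_length = 4                     # toy values, much smaller than in the article
batch_size = 3
possible_starts = range(len(toy_text) - sequence_length - 1)

random.seed(0)                          # fixed seed so the sketch is repeatable
starts = random.sample(possible_starts, batch_size)

# Input: sequence_length characters starting at k; target: the same window shifted by one
inputs = [toy_text[k : k + sequence_length] for k in starts]
targets = [toy_text[k + 1 : k + sequence_length + 1] for k in starts]

for x, y in zip(inputs, targets):
    print(repr(x), '->', repr(y))
```

At every position the model sees one character of the input window and is asked to predict the character at the same position in the target window, which is the next character in the text.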


        # sess is assumed to be the interactive session created together with the model
        last_state, _, loss = sess.run(
            [model.last_state, model.optimizer, model.cost],
            feed_dict = {model.X: batch_x,
                         model.Y: batch_y,
                         model.hidden_layer: init_value})

The function then runs a TensorFlow session to update the model parameters for the current batch of data.

The session runs three TensorFlow operations:

  • model.last_state
  • model.optimizer
  • model.cost.

The last_state operation returns the final hidden state of the model, which is then used as the initial hidden state for the next batch of data.

The optimizer operation updates the model parameters using the backpropagation algorithm to minimize the cost (or loss) function.

The cost operation returns the value of the cost function for the current batch of data.

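        # sess is assumed to be the interactive session created together with the model
        accuracy = sess.run(model.accuracy, feed_dict = {model.X: batch_x,
                                                         model.Y: batch_y,
                                                         model.hidden_layer: init_value})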

The function then calculates the accuracy of the model for the current batch of data using the model.accuracy TensorFlow operation.

        ACCURACY.append(accuracy); LOST.append(loss)
        init_value = last_state
        pbar.set_postfix(cost = loss, accuracy = accuracy)
    # Hand both histories back once every epoch has run
    return LOST, ACCURACY

The current values of loss and accuracy are then appended to the LOST and ACCURACY lists, respectively.

Training the Model

LOST, ACCURACY = train_random_sequence()

The train_random_sequence() function trains the LSTM model using randomly selected sequences of characters from the input text.

The function loops over the specified number of epochs, and for each epoch, it selects a random batch of sequences from the input text.

Representing the Output

plt.figure(figsize = (15, 5))
# Left panel: loss per epoch
plt.subplot(1, 2, 1)
EPOCH = np.arange(len(LOST))
plt.plot(EPOCH, LOST)
plt.xlabel('epoch'); plt.ylabel('loss')
# Right panel: accuracy per epoch
plt.subplot(1, 2, 2)
plt.plot(EPOCH, ACCURACY)
plt.xlabel('epoch'); plt.ylabel('accuracy')
plt.show()

Matplotlib’s pyplot module (plt) is used to plot the loss and accuracy for each epoch. This gives a visual representation of how well the model is learning to predict over time, and shows whether we’re experiencing any problems (such as the vanishing gradient problem).

def generate_based_sequence(length_sentence, argmax = False):
    sentence_generated = tag
    onehot = embed_to_onehot(tag, text_vocab)
    init_value = np.zeros((1, num_layers * 2 * size_layer))
    # Warm up the hidden state by feeding in the seed word one character at a time
    # (sess is assumed to be the interactive session created together with the model)
    for i in range(len(tag)):
        batch_x = np.zeros((1, 1, len(text_vocab)))
        batch_x[:, 0, :] = onehot[i, :]
        last_state, prob = sess.run(
            [model.last_state, model.final_outputs],
            feed_dict = {model.X: batch_x, model.hidden_layer: init_value})
        init_value = last_state

    # Generate new characters, feeding each one back in as the next input
    for i in range(length_sentence):
        if argmax:
            char = np.argmax(prob[0][0])
        else:
            char = np.random.choice(range(len(text_vocab)), p = prob[0][0])
        element = text_vocab[char]
        sentence_generated += element
        onehot = embed_to_onehot(element, text_vocab)
        batch_x = np.zeros((1, 1, len(text_vocab)))
        batch_x[:, 0, :] = onehot[0, :]
        last_state, prob = sess.run(
            [model.last_state, model.final_outputs],
            feed_dict = {model.X: batch_x, model.hidden_layer: init_value})
        init_value = last_state

    return sentence_generated

This function generates a sequence of text of a specified length based on a given initial tag.

It uses a neural network model to predict the next character in the sequence based on the previous character(s) and generate the text accordingly.

The argmax argument determines whether the model should choose the character with the highest probability or a random character based on the predicted probabilities.

Lastly, the generated sentence is returned as output.
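The difference between the two selection modes is easiest to see on a hand-made probability vector. The sketch below uses a hypothetical three-character vocabulary and made-up probabilities rather than real model output.

```python
import numpy as np

vocab = ['a', 'b', 'c']                 # hypothetical tiny vocabulary
prob = np.array([0.6, 0.3, 0.1])        # made-up next-character probabilities

# argmax = True: always pick the single most likely character
greedy = vocab[int(np.argmax(prob))]

# argmax = False: draw characters in proportion to their probabilities
rng = np.random.default_rng(0)
samples = rng.choice(len(vocab), size=1000, p=prob)
counts = np.bincount(samples, minlength=len(vocab))
```

Greedy selection returns the same character every time for a given state, which tends to make generated text repetitive, while sampling also lets less likely characters through and so produces more varied output.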


Finally, we can print out a generated sequence!

Drum Roll…..
I am ward keeper born well needs is of least,
i, leave now of heart:
And marry in carre all beams was about!

As I swear, Duke of Edward; I keep where.
far have wise;
I knew me

Sink where you're here sought beseech


Nay, here?

I am no chrowine.
What; how she shall love for wherefore,
I waker here as I send, and

Let's make an even as fear, agains great was men
Now, sir, mag, as came that I shy,
We are enemies aloof fear to resume,
Nay, might fear for your tasks, anoud?

Your remember!

Nor! Never lively!

If I save a request'd that old make
Nor lawful say he's from village
Is in Italy; see

Some and so much against nothings
Will hear these armer-marresteser interim?