**Shakespearean** plays are some of the **most celebrated works** in the literary world, renowned for their **intricate** plots, **unforgettable** characters, and **eloquent** language.

“All the world’s a stage, and all the men and women merely players.”

While **Shakespeare himself** is considered **one of the greatest writers** in history, modern **technology** is **catching up** to human **creativity**, bringing **new and exciting possibilities** for the art of **playwriting**.

**Natural Language Processing** (NLP) is one such technology that has captured the attention of **writers and scholars** alike, providing the means to **generate scripts** that capture the essence of Shakespeare’s style.

In this article, we’ll try to recreate that same feel by training an NLP model to write Shakespearean scripts.

But first and foremost, what exactly are we building?

# Natural Language Processing

We’re going to develop **a model** that falls within the **AI subset** of **NLP**. NLP stands for “**Natural Language Processing**” and splits into two parts, **NLG** and **NLU**: Natural Language Generation and Natural Language Understanding, respectively.

As you can most likely guess from the names, NLG is focused on **generating text** while NLU focuses on **understanding text**. Together, these two parts can form an **AI that generates text by analyzing data**.

Still, “an NLP model” is **extremely vague**, given just how many models fall under that category. So let’s **narrow** this down a bit.

# Recurrent Neural Network

We’re using an RNN (**Recurrent Neural Network**), or more specifically an LSTM (**Long Short-Term Memory**) network, which falls under the intersection between **DL (Deep Learning)** and **NLP**.

A Recurrent Neural Network (RNN) is a type of **artificial neural network** that can process **sequential data**, such as English text.

Using a **feedback loop**, the model can **retain** its **previous output** as part of the **input for the next time step**, thus enabling it to **learn and remember** patterns in the **input data over time**.

This makes RNNs **particularly useful in text prediction**, since they can use the entire preceding text to **predict the next character**, not just a **single word** of context.
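That feedback loop can be sketched in a few lines of NumPy. This is a minimal illustration of the mechanism only, with random weights and a made-up toy size, not the model we build below:

```python
import numpy as np

size = 3                                     # toy hidden size (assumption, for illustration)
rng = np.random.default_rng(0)
W = rng.normal(size=(size, 2 * size))        # one weight matrix over [input, previous hidden]

h = np.zeros(size)                           # hidden state carried between time steps
for x in rng.normal(size=(5, size)):         # five time steps of sequential input
    h = np.tanh(W @ np.concatenate([x, h]))  # previous output feeds back into the next input

print(h.shape)
```

Because `h` is fed back in at every step, the final hidden state depends on the whole sequence, not just the last input.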

However, RNNs aren’t **exactly** perfect. Traditional RNNs **encounter** the “**vanishing gradient problem**” when **processing** lots of data through **lots of layers**.

# Vanishing Gradient Problem

This occurs when the **partial derivatives** of the **loss function** with respect to the **weights** near zero, rendering the **loss signal ineffective**: it is no **longer large enough to matter** when updating the **weights**.

Let’s look at an example (**assume the sigmoid output is 0.5**):

Output (Y) = 0.5 * … * 0.5 * x + b

See, as the **gradient** is passed back through each **layer of the RNN**, it slowly becomes **smaller and smaller**. If the model has **lots of layers**, the gradient will be **practically nonexistent** by the time it reaches the **earliest layers**, not allowing them to **improve**.

This jeopardizes the whole network, as the **whole point** of training an AI is to improve.
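Here is a minimal sketch of that shrinkage in plain Python, using the 0.5 per-layer factor from the example above (an assumed stand-in for a sigmoid-style factor, not a measured value):

```python
gradient = 1.0
per_layer_factor = 0.5        # assumed sigmoid-like factor per layer (from the example above)

for layer in range(20):       # propagate back through a deep 20-layer unrolled network
    gradient *= per_layer_factor

print(gradient)               # 9.5367431640625e-07, i.e. 0.5 ** 20: far too small to learn from
```

After only twenty layers the gradient has shrunk by a factor of about a million, which is why early layers stop improving.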

To counter this problem we’re going to use an LSTM (**Long Short-Term Memory**) network.

# Long Short-Term Memory Network

An LSTM is a form of RNN. However, unlike a plain RNN, it **selectively remembers data** over long intervals, making it far less **sensitive** to deep models and long sequences. The LSTM stores **filtered** (using “**sigmoid**” and “**tanh**” functions) **data** in the “**cell state**”, while **forgetting** less useful information.

It then “**releases**” that data through its **output gate** at each time step, which keeps gradients flowing during **backpropagation**.

Let’s go through the parts of an LSTM cell one by one:

- Cell State: This is the memory unit of the LSTM. It acts as a conveyor belt, passing information from one time-step to the next. It can add or remove information as necessary and has a separate way of passing information without the interference of gates.
- Input Gate: The input gate decides what information to add to the cell state. It is controlled by a sigmoid function, which outputs values between 0 and 1, determining how much of the new input should be added to the cell state.
- Forget Gate: The forget gate decides what information to remove from the cell state. It is controlled by a sigmoid function, which outputs values between 0 and 1, determining how much of the existing cell state should be removed.
- Output Gate: The output gate decides what information to output from the cell state. It is controlled by a sigmoid function and a tanh function. The sigmoid function decides which parts of the cell state to output, and the tanh function determines the values of those parts.
- Hidden State: The hidden state is the output of the LSTM at each time-step. It is determined by the input, previous hidden state, and the current cell state.
- Peephole connections: These are connections that allow the input and forget gates to look into the cell state to make their decisions.
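The gate descriptions above map onto the standard LSTM equations. Here’s a NumPy sketch of a single time step with random weights and a made-up toy size, illustrating the data flow rather than TensorFlow’s exact internals (biases and peephole connections are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

size = 4                                     # toy hidden size (assumption, for illustration)
rng = np.random.default_rng(0)
x_t, h_prev, c_prev = rng.normal(size=size), np.zeros(size), np.zeros(size)

# One random weight matrix per gate, each acting on [x_t, h_prev]
W_i, W_f, W_o, W_g = (rng.normal(size=(size, 2 * size)) for _ in range(4))
z = np.concatenate([x_t, h_prev])

i = sigmoid(W_i @ z)        # input gate: how much new info to add
f = sigmoid(W_f @ z)        # forget gate: how much old state to keep
o = sigmoid(W_o @ z)        # output gate: how much state to expose
g = np.tanh(W_g @ z)        # candidate values for the cell state

c_t = f * c_prev + i * g    # cell state: the "conveyor belt"
h_t = o * np.tanh(c_t)      # hidden state: the step's output

print(c_t.shape, h_t.shape)  # (4,) (4,)
```

The additive update `c_t = f * c_prev + i * g` is the key: gradients can flow through the cell state without being repeatedly squashed, which is what stalls the vanishing gradient.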

Together, these parts produce a network largely **unburdened** by the **vanishing gradient problem**. The LSTM is a **reliable model** to **evade** this problem; of course, you can’t pile on **endless layers** and expect **no** vanishing gradient at all, but the **LSTM** removes **most** of the risk.

Now that we know what we’re going to **build**, let’s get **straight into** building the actual model!

# Time To Get Building!

```python
import numpy as np
import tensorflow as tf
from tensorflow.python.ops.numpy_ops import np_config
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import random
import time
import os

sns.set()
```

# Preprocessing Data

The required libraries are imported, including NumPy for numerical computations, TensorFlow for building and training the model, Matplotlib and Seaborn for visualizations, Tqdm for a progress bar, Random for generating random numbers, and Time for measuring time.

```python
def get_vocab(file, lower = False):
    with open(file, 'r') as fopen:
        data = fopen.read()
    if lower:
        data = data.lower()
    vocab = list(set(data))
    return data, vocab
```

The `get_vocab()` function reads a text file and returns the contents of the file as well as a vocabulary list consisting of the unique characters in the text.
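For example, with a short inline string standing in for the file contents (an assumption to keep this self-contained), the `list(set(...))` trick yields the unique characters:

```python
data = "to be or not to be"          # toy stand-in for the file contents (assumption)
vocab = list(set(data))              # unique characters, same as get_vocab builds
print(sorted(vocab))                 # [' ', 'b', 'e', 'n', 'o', 'r', 't']
```

Note that `set()` returns its elements in arbitrary order, so the vocabulary index of each character can differ between runs.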

```python
def embed_to_onehot(data, vocab):
    onehot = np.zeros((len(data), len(vocab)), dtype = np.float32)
    for i in range(len(data)):
        onehot[i, vocab.index(data[i])] = 1.0
    return onehot
```

The `embed_to_onehot()` function converts the input text into one-hot encoded vectors, which are required to train the LSTM model.
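As a quick illustration with a made-up four-character vocabulary (an assumption, not the real Shakespeare vocabulary), each character becomes a row with a single 1.0 at its vocabulary index:

```python
import numpy as np

def embed_to_onehot(data, vocab):
    # Each character in `data` becomes a row with a 1.0 at its vocab index
    onehot = np.zeros((len(data), len(vocab)), dtype=np.float32)
    for i in range(len(data)):
        onehot[i, vocab.index(data[i])] = 1.0
    return onehot

vocab = ['a', 'b', 'c', 'd']          # toy vocabulary (assumption, for illustration)
encoded = embed_to_onehot('bad', vocab)
print(encoded.shape)                   # (3, 4): three characters, four vocab entries
print(encoded[0])                      # 'b' -> [0. 1. 0. 0.]
```

Every row sums to exactly 1.0, which is what lets the softmax output later be compared against it as a probability distribution.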

```python
text, text_vocab = get_vocab('/content/shakespeare.txt', lower = False)
```

The `text` and `text_vocab` variables are obtained by calling the `get_vocab()` function on the `shakespeare.txt` file.

```python
learning_rate = 0.01
batch_size = 128
sequence_length = 64
epoch = 10000
num_layers = 5
size_layer = 512
possible_batch_id = range(len(text) - sequence_length - 1)

tf.compat.v1.disable_eager_execution()
np_config.enable_numpy_behavior()
```

The hyperparameters for the model are set.

# Building The Layers

```python
class Model:
    def __init__(self, num_layers, size_layer, dimension, sequence_length, learning_rate):
```

The class of the model is defined, which takes several parameters as inputs:

- num_layers: the number of LSTM layers to use in the network.
- size_layer: the number of hidden units in each LSTM layer.
- dimension: the dimensionality of the input and output vectors.
- sequence_length: the length of the input and output sequences.
- learning_rate: the learning rate for the optimizer used to train the network.

```python
        def lstm_cell():
            # Keep in mind all of this is still inside the class's __init__
            return tf.compat.v1.nn.rnn_cell.BasicLSTMCell(size_layer, state_is_tuple=False)
```

The LSTM cells are created using the `BasicLSTMCell()` function.

```python
        self.rnn_cells = tf.compat.v1.nn.rnn_cell.MultiRNNCell(
            [lstm_cell() for _ in range(num_layers)], state_is_tuple=False)
```

Multiple LSTM layers are stacked together using the `MultiRNNCell()` function.

```python
        self.X = tf.compat.v1.placeholder(tf.float32, (None, None, dimension))
        self.Y = tf.compat.v1.placeholder(tf.float32, (None, None, dimension))
        self.hidden_layer = tf.compat.v1.placeholder(tf.float32,
            (None, num_layers * 2 * size_layer))
```

Three placeholders are defined:

- `X`: a 3D tensor of shape (batch_size, sequence_length, dimension), which holds the input sequences.
- `Y`: a 3D tensor of shape (batch_size, sequence_length, dimension), which holds the target output sequences.
- `hidden_layer`: a 2D tensor of shape (batch_size, num_layers * 2 * size_layer), which holds the initial hidden and cell states of the LSTM cells.

```python
        self.outputs, self.last_state = tf.compat.v1.nn.dynamic_rnn(
            self.rnn_cells,
            self.X,
            initial_state = self.hidden_layer,
            dtype = tf.float32)
```

A dynamic RNN is created using the `dynamic_rnn()` function from TensorFlow, which takes as input:

- the multi-layer LSTM cell created earlier.
- the input sequences X.
- the initial hidden and cell states in “hidden_layer”.
- the data type to use (float32 in this case).

The outputs of the dynamic RNN are two tensors:

- outputs: a 3D tensor of shape (batch_size, sequence_length, size_layer), which holds the hidden states of the LSTM cell at each time step.
- last_state: a 2D tensor of shape (batch_size, num_layers * 2 * size_layer), which holds the final hidden and cell states of the LSTM cell.

```python
        rnn_W = tf.Variable(tf.random.normal((size_layer, dimension)))
        rnn_B = tf.Variable(tf.random.normal([dimension]))
        self.logits = (tf.matmul(tf.reshape(self.outputs, [-1, size_layer]), rnn_W) + rnn_B)
```

The output of the LSTM is passed through a fully connected layer to obtain the logits, which are used to calculate the loss using the softmax cross-entropy function.

```python
        y_batch_long = tf.reshape(self.Y, [-1, dimension])
        self.cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits = self.logits, labels = y_batch_long))
        self.optimizer = tf.compat.v1.train.RMSPropOptimizer(learning_rate, 0.9).minimize(self.cost)
```

An optimizer is defined using the `RMSPropOptimizer` algorithm with a learning rate of `learning_rate` and a decay of 0.9. Note that the cost must be defined before the optimizer can minimize it.

```python
        self.correct_pred = tf.equal(
            tf.argmax(self.logits, 1), tf.argmax(y_batch_long, 1))
```

The “correct_pred” tensor is created to check how many predictions were correct by comparing the predicted class labels (the maximum value of the logits) with the actual class labels in Y.

```python
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_pred, tf.float32))
```

The “accuracy” tensor is created to calculate the percentage of correct predictions.

```python
        seq_shape = tf.shape(self.outputs)
        self.final_outputs = tf.reshape(
            tf.nn.softmax(self.logits), (seq_shape[0], seq_shape[1], dimension))
```

The “final_outputs” tensor is created to hold the final predictions, which are obtained by applying the softmax function to the logits.

# More Preprocessing & Setting Parameters

```python
tf.compat.v1.reset_default_graph()
sess = tf.compat.v1.InteractiveSession()
model = Model(num_layers, size_layer, len(text_vocab), sequence_length, learning_rate)
sess.run(tf.compat.v1.global_variables_initializer())
```

The TensorFlow graph is reset, and an interactive session is created. The `Model` class is instantiated with the hyperparameters defined earlier, and the global variables are initialized.

```python
split_text = text.split()
tag = split_text[np.random.randint(0, len(split_text))]
print(tag)
```

The `split_text` variable contains the text split into a list of words, and `tag` is a random word picked from it, which will later seed the text generation.

```python
def train_random_sequence():
    LOST, ACCURACY = [], []
```

The function starts by defining two empty lists, LOST and ACCURACY, which will be used to store the loss and accuracy values during training.

```python
    pbar = tqdm(range(epoch), desc = 'epoch')
```

The function then creates a tqdm progress bar object that will be used to visualize the progress of training over the specified number of epochs.

```python
    for i in pbar:
        last_time = time.time()
        init_value = np.zeros((batch_size, num_layers * 2 * size_layer))
```

The range() function specifies the number of epochs, and the desc argument provides a description for the progress bar.

The function initializes a zero matrix for the initial hidden state of the model, with dimensions (batch_size, num_layers * 2 * size_layer).

```python
        batch_x = np.zeros((batch_size, sequence_length, len(text_vocab)))
        batch_y = np.zeros((batch_size, sequence_length, len(text_vocab)))
```

The num_layers, size_layer, and batch_size variables are hyperparameters that determine the architecture of the model.

The function initializes two zero matrices for the input and output sequences of the model, with dimensions (batch_size, sequence_length, len(text_vocab)).

```python
        batch_id = random.sample(possible_batch_id, batch_size)
```

The sequence_length variable determines the length of each sequence input to the model, and the len(text_vocab) variable specifies the size of the vocabulary for the language model.

The function randomly selects batch_size indices from the list of possible batch indices, which is a list of indices that span the entire text corpus.

```python
        for n in range(sequence_length):
            id1 = embed_to_onehot([text[k + n] for k in batch_id], text_vocab)
            id2 = embed_to_onehot([text[k + n + 1] for k in batch_id], text_vocab)
            batch_x[:, n, :] = id1
            batch_y[:, n, :] = id2
```

For each element in the sequence length, the function creates two one-hot encoded matrices, id1 and id2, that represent the input and output sequences for that element.

The matrices are created by embedding the characters at the specified indices into one-hot encoded vectors using the embed_to_onehot() function.

The resulting matrices are then assigned to the corresponding positions in the batch_x and batch_y matrices.
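To see the input/target offset concretely, here’s a toy sketch with a made-up string standing in for the Shakespeare corpus (an assumption, for illustration): each target character is simply the input character shifted one position forward.

```python
text = "hello world"                       # toy corpus (assumption, for illustration)
sequence_length = 4
start = 2                                  # a sample batch index into the corpus

# Input is text[start : start + L]; the target is the same window shifted by one.
input_chars  = [text[start + n] for n in range(sequence_length)]
target_chars = [text[start + n + 1] for n in range(sequence_length)]

print(''.join(input_chars))   # "llo "
print(''.join(target_chars))  # "lo w"
```

At every position the model is thus trained to predict the very next character of the corpus, which is exactly what lets it generate text one character at a time later.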

# Backpropagation

```python
        last_state, _, loss = sess.run(
            [model.last_state, model.optimizer, model.cost],
            feed_dict = {model.X: batch_x,
                         model.Y: batch_y,
                         model.hidden_layer: init_value})
```

The function then runs a TensorFlow session to update the model parameters for the current batch of data.

The session runs three TensorFlow operations:

- model.last_state
- model.optimizer
- model.cost

The last_state operation returns the final hidden state of the model, which is then used as the initial hidden state for the next batch of data.

The optimizer operation updates the model parameters using the backpropagation algorithm to minimize the cost (or loss) function.

The cost operation returns the value of the cost function for the current batch of data.

```python
        accuracy = sess.run(model.accuracy, feed_dict = {model.X: batch_x,
                                                         model.Y: batch_y,
                                                         model.hidden_layer: init_value})
```

The function then calculates the accuracy of the model for the current batch of data using the model.accuracy TensorFlow operation.

```python
        ACCURACY.append(accuracy); LOST.append(loss)
        init_value = last_state
        pbar.set_postfix(cost = loss, accuracy = accuracy)
    return LOST, ACCURACY
```

The current values of loss and accuracy are then appended to the LOST and ACCURACY lists, respectively, and the final hidden state is carried over as the initial state for the next batch.

# Training the Model

```python
LOST, ACCURACY = train_random_sequence()
```

The `train_random_sequence()` function trains the LSTM model using randomly selected sequences of characters from the input text.

The function loops over the specified number of epochs, and for each epoch, it selects a random batch of sequences from the input text.

# Representing the Output

```python
plt.figure(figsize = (15, 5))
plt.subplot(1, 2, 1)
EPOCH = np.arange(len(LOST))
plt.plot(EPOCH, LOST)
plt.xlabel('epoch'); plt.ylabel('loss')
plt.subplot(1, 2, 2)
plt.plot(EPOCH, ACCURACY)
plt.xlabel('epoch'); plt.ylabel('accuracy')
plt.show()
```

Matplotlib’s `plt` functions are used to plot the loss and accuracy for each epoch. This gives a visual representation of how well the model is learning to predict over time, and whether we’re experiencing any problems (such as the vanishing gradient problem).

```python
def generate_based_sequence(length_sentence, argmax = False):
    sentence_generated = tag
    onehot = embed_to_onehot(tag, text_vocab)
    init_value = np.zeros((1, num_layers * 2 * size_layer))
    # Warm up the hidden state by feeding the seed word character by character
    for i in range(len(tag)):
        batch_x = np.zeros((1, 1, len(text_vocab)))
        batch_x[:, 0, :] = onehot[i, :]
        last_state, prob = sess.run(
            [model.last_state, model.final_outputs],
            feed_dict = {model.X: batch_x, model.hidden_layer: init_value},
        )
        init_value = last_state
    # Generate new characters one at a time, feeding each back in
    for i in range(length_sentence):
        if argmax:
            char = np.argmax(prob[0][0])
        else:
            char = np.random.choice(range(len(text_vocab)), p = prob[0][0])
        element = text_vocab[char]
        sentence_generated += element
        onehot = embed_to_onehot(element, text_vocab)
        batch_x = np.zeros((1, 1, len(text_vocab)))
        batch_x[:, 0, :] = onehot[0, :]
        last_state, prob = sess.run(
            [model.last_state, model.final_outputs],
            feed_dict = {model.X: batch_x, model.hidden_layer: init_value},
        )
        init_value = last_state
    return sentence_generated
```

This function generates a sequence of text of a specified length based on a given initial tag.

It uses a neural network model to predict the next character in the sequence based on the previous character(s) and generate the text accordingly.

The `argmax` argument determines whether the model should always choose the character with the highest probability or sample a random character according to the predicted probabilities.

Lastly, the generated sentence is returned as output.
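A tiny NumPy sketch of the two sampling modes, using a made-up probability vector rather than real model output (both the vocabulary and the probabilities are assumptions for illustration):

```python
import numpy as np

vocab = ['a', 'b', 'c', 'd']               # toy vocabulary (assumption)
prob = np.array([0.1, 0.6, 0.2, 0.1])      # pretend this came from the softmax (assumption)

# argmax = True: always pick the single most likely character
greedy = vocab[np.argmax(prob)]
print(greedy)                              # 'b'

# argmax = False: sample, so less likely characters still appear sometimes
rng = np.random.default_rng(0)
sampled = vocab[rng.choice(len(vocab), p = prob)]
print(sampled)
```

Greedy decoding tends to loop on the same phrases, while probabilistic sampling keeps the output varied; that is why the article calls `generate_based_sequence(1000, False)` below.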

```python
print(generate_based_sequence(1000, False))
```

Finally, we can print out a generated sequence!

```
CAMILLI AEFILIUR:
I am ward keeper born well needs is of least,
i, leave now of heart:
And marry in carre all beams was about!
Dead!

DUKE VINCENTIO:
As I swear, Duke of Edward; I keep where.
far have wise;
I knew me

FRIAR LAUTfRDAND:
Sink where you're here sought beseech

TIANIN:
Now!

DUKE VINCENTIO:
Nay, here?

YORL:
I am no chrowine.
What; how she shall love for wherefore,
I waker here as I send, and

HORTHUMBRY:
Let's make an even as fear, agains great was men
Now, sir, mag, as came that I shy,
We are enemies aloof fear to resume,
Nay, might fear for your tasks, anoud?

'CAPETEN:
Your remember!

PRINCE:
Nor! Never lively!

KING HENRY VI:
If I save a request'd that old make
Nor lawful say he's from village
Is in Italy; see

POMDEY:
Some and so much against nothings
Will hear these armer-marresteser interim?
```