Getting Started with AI: Building an RNN from scratch and practicing resilience

Ada Choudhry
13 min read · Jul 13, 2023

Learning AI without extensive prior knowledge, using the top-down approach (also the least boring way to learn challenging stuff, imo)

I’ve struggled with programming. Yet, for as long as I can remember, I have loved building things: crafts, model houses, even the basic programs in high school and now research projects. Still, I had a love-hate relationship with coding.

I loved its possibilities. Yet, I was never able to gain momentum in building programs.

If I started with the very basic concepts, I would quickly get bored and lose momentum. If I jumped into video tutorials for building projects, I would get baffled by the new terminology and get overwhelmed. I would initially like the challenging concepts that would come with theoretical classes on Machine Learning, yet the asynchronous nature of these courses didn’t help me be accountable.

I talked to my TKS director, Steven Ritchie, and he advised me to follow tutorials about concepts that interested me, instead of sitting through classes of linear algebra where I wondered how I would actually use this.

Thankfully, in high school, I had learned the basics of Java and Python, so I knew the basic syntax and concepts of these languages. Having accountability in school helped a lot with practicing sample programs. So, I wasn’t totally lost in what was happening in tutorials.

My advice to my past self would be to find something interesting and then find the easiest way to build it (to help me get started). And. Then. Just. Do. It.

This approach is called the top-down approach, where you get familiar with the macro picture first and as you develop knowledge about that subject, go deeper and understand all the intricacies of how things work.

There are several benefits to this approach:

  1. Because you start with the big picture first, you don’t get overwhelmed with new terminologies and ideas.
  2. It helps to build a practical approach to learning, as when you start to go deeper into a particular subject, you understand how it fits in the bigger picture and why it’s important. You learn theory, not because it is a requisite on your resume, but because you need it to understand what you’re building more clearly. Knowing why you’re learning it helps overcome boredom and stagnancy. It keeps us interested in what we’re learning, so it doesn’t feel like a chore.
  3. Because you know why you’re learning it, it helps you remember it better.

So for the macro picture, I had:

  • Done an Explore on AI in TKS: Explores are written and visual content in TKS that give students a broad overview of a field. You could also watch YouTube videos to understand, at the macro level, how a neural network works!

Here is what I’ve learned about myself through this process:

  1. I learn better from written tutorials, as I can pause and think about the code.
  2. I need to have a desire to build what I’m building, to truly enjoy the process. When I was learning about AI because everyone else was doing it, the process was a bit draining, and I couldn’t stick with it. But as I was going deeper into synthetic biology, I realized how efficient research could be made through the use of AI. I was also building a project to help improve mental healthcare access and wanted to build an AI chatbot. Because of these reasons, I had the desire to build an AI myself.

Why did I choose RNNs?

As I wanted to eventually build a chatbot, I chose RNNs for their ability to work with sequential or time-series data. This gives them the ability to solve temporal problems such as natural language processing (NLP), language translation, and speech recognition. I had also built a recommendation deck for Alexa where we had proposed a similar AI, so I was curious to learn how it worked internally.

In the past, I had built a Generative Adversarial Network (GAN) but my experience was stressful as I was building it on an Android Tablet. But this time, my experience was smoother on a PC.

While building it, I googled a lot of new terms, scanned related articles, and used ChatGPT to explain lines of code I was not able to understand. I also made sure not to copy-paste code, and instead tried to recall and write it on my own. This gave me more time to think about the structure and meaning behind my lines.

I am writing this article to compile everything I have learned through the tutorial as a way to cement my understanding so that I can recall basic concepts next time as well.

The Process

I am going to be building a Shakespearean chatbot using a tutorial from Towards Data Science and this article is a compilation of what I have learned in the process.

Architecture of RNNs

The key feature of RNNs is their ability to have context about the information they are dealing with, as they can utilize information from previous time steps or positions in the sequence.

Diagram of an RNN’s architecture (Source: ResearchGate)

The basic architecture of an RNN consists of recurrent units or cells that form a chain-like structure. Each cell takes an input, produces an output, and has a hidden state that serves as its memory. The hidden state from one time-step is fed back into the cell as input for the next time step, allowing the network to capture and utilize information from previous steps.

The basic structure of RNNs can be broken down into the following parts (a small code sketch of the hidden-state update follows the list):

  1. Input Layer: The input is a sequence of data points in which each data point corresponds to a specific time step or position in the sequence. At each time step t, the RNN receives an input vector x(t).
  2. Recurrent Units: Recurrent units are the building blocks of RNNs. They maintain an internal state, also known as the hidden state or memory, which allows the network to retain information from previous time steps. The recurrent units process the input data sequentially, taking into account the current input and the previous hidden state, and produce an updated hidden state for the next time step. The recurrent unit can be a simple (vanilla) RNN cell, or a gated cell such as the Long Short-Term Memory (LSTM) cell or the Gated Recurrent Unit (GRU).
  3. Hidden State: The hidden state of an RNN is a vector that represents the network’s memory or internal state at a particular time step. It captures information from previous time steps and influences the computation for the current time step. The hidden state is updated at each time step based on the current input and the previous hidden state, allowing the network to capture dependencies and patterns in the sequential data.
  4. Output Layer: The output layer of an RNN produces the desired output or prediction based on the processed sequential data. The output can vary depending on the task at hand. For example, in language modeling, the output layer can predict the next word in a sentence. In sentiment analysis, the output layer can predict the sentiment of a given text. The output layer can consist of one or more units, depending on the specific requirements of the task.
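To make the hidden-state update concrete, here is a minimal NumPy sketch of a single step of a plain (vanilla) recurrent cell. The dimensions and weights are made up for illustration, and the model in this tutorial uses a GRU layer rather than this simple cell:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # New hidden state: mix the current input with the previous hidden state
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
    # Output at this time step, computed from the hidden state
    y_t = h_t @ W_hy + b_y
    return h_t, y_t

# Toy sizes: 8-dim inputs, a 16-dim hidden state, 4-dim outputs
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(8, 16))
W_hh = rng.normal(size=(16, 16))
W_hy = rng.normal(size=(16, 4))
b_h, b_y = np.zeros(16), np.zeros(4)

h = np.zeros(16)                      # initial hidden state (the "memory")
for x_t in rng.normal(size=(5, 8)):   # a sequence of 5 input vectors
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
```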

The tutorial is divided into six parts:

  1. Importing libraries
  2. Vectorizing the text
  3. Creating the dataset
  4. Building the Model
  5. Compiling and Training
  6. Generating Text

Importing Libraries

The main libraries we use are TensorFlow and NumPy.

  • TensorFlow: TensorFlow is an open-source machine learning framework developed by Google. It provides a comprehensive set of tools and libraries for building and deploying various types of machine learning models, including neural networks. TensorFlow supports both deep learning and traditional machine learning algorithms.
  • NumPy: NumPy (Numerical Python) is a fundamental library in Python for scientific computing. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently. NumPy serves as the foundation for many other scientific computing libraries in Python.

The data on Shakespeare’s plays is downloaded from TensorFlow’s website.
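In code, this setup is only a few lines. A minimal sketch (the URL is the Shakespeare text file hosted by TensorFlow that the tutorial uses):

```python
import tensorflow as tf
import numpy as np

# Download the Shakespeare text and read it in as one long string
path_to_file = tf.keras.utils.get_file(
    'shakespeare.txt',
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
```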

Vectorizing the text

In NLP, text is represented as numbers (vectors) so that the machine can understand and weigh tokens. In this tutorial, we vectorize the Shakespeare text in four steps (a short code sketch follows the list):

  1. Firstly, we load the text’s characters into a set. Sets are unordered collections which do not allow duplicates, so this gives us all the unique characters in the text.
  2. Then, we create a mapping char2idx in which each unique character is given an index. For example, a might be given 1, b might be given 2, and so on. This gives us a way to encode text as integers.
  3. Encoding: All the characters in the Shakespeare text are converted into integers based on this mapping, and the result is stored in a NumPy array called text_as_int.
  4. Decoding: To help with decoding later, all the unique characters are also stored in a NumPy array called idx2char.
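A minimal sketch of these four steps, assuming text holds the downloaded Shakespeare string from earlier:

```python
# 1. Unique characters in the text (a set drops duplicates), sorted for a stable order
vocab = sorted(set(text))

# 2. Map each unique character to an integer index
char2idx = {ch: i for i, ch in enumerate(vocab)}

# 3. Encode the whole text as an array of integers
text_as_int = np.array([char2idx[c] for c in text])

# 4. Keep the reverse lookup (index -> character) for decoding later
idx2char = np.array(vocab)
```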

Creating the dataset

To create a dataset, we take the encoded NumPy array and convert it into a TensorFlow Dataset object so we can split the data into batches. The length of each input sequence is limited to 100 characters.

Our dataset now contains sequences of characters, but we need to turn each sequence into an (input, target) pair to feed it into the RNN.

So, we create a custom mapping function to split each sequence into two parts:

  • input_text = chunk[:-1]: This line slices the chunk sequence, excluding the last element. It creates a new sequence, input_text, which contains all elements of chunk except the last one. The purpose of this is to obtain the input sequence for the model.
  • target_text = chunk[1:]: This line slices the chunk sequence, excluding the first element. It creates a new sequence, target_text, which contains all elements of chunk except the first one. The purpose of this is to obtain the target sequence for the model.

By applying the map() method to the sequences dataset with the split_input_target function, each element (sequence) is transformed into a tuple of (input_text, target_text). The resulting dataset contains these tuples, where each tuple represents an input and target pair used to train the RNN model. The input sequence is fed to the model, and the target sequence is used for training by comparing the predicted outputs with the target outputs.

Finally, we shuffle our dataset and split it into batches of 64 sequences.
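A sketch of that pipeline; the shuffle buffer size is an assumption borrowed from the tutorial’s defaults:

```python
seq_length = 100  # each training example is limited to 100 characters

# Wrap the encoded text in a tf.data pipeline and cut it into sequences
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

def split_input_target(chunk):
    input_text = chunk[:-1]   # everything except the last character
    target_text = chunk[1:]   # everything except the first character
    return input_text, target_text

dataset = sequences.map(split_input_target)

# Shuffle and group into batches of 64 sequences
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
```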

Building the Model

The model we have created has three layers:

  1. Embedding Layer: This layer maps the input tokens to dense vectors of embedding_dim dimensions. The vocab_size parameter specifies the total number of unique tokens in the vocabulary (here, the 65 unique characters), and the embedding_dim parameter determines the size of the embedding vectors. An embedding is a learned representation for text in which tokens with similar meaning have a similar representation.
A visual representation of word embedding. Source: Jaron Collis

  2. GRU Layer: This layer is a Gated Recurrent Unit (GRU) layer: a recurrent layer that processes sequential data and captures temporal dependencies. The rnn_units parameter specifies the number of units (neurons) in the GRU layer. The return_sequences=True argument makes the layer return the output at every time step rather than just the final output. The stateful=True argument allows the layer to maintain its internal state across batches during training. The recurrent_initializer parameter specifies the initialization method for the recurrent weights.

  3. Dense Layer: This is a fully connected (dense) layer that performs the final prediction. It has vocab_size units, one per unique character in the vocabulary, and produces the output distribution over the vocabulary.

These layers are stacked using a model container called Sequential which is provided by TensorFlow Keras. It allows you to stack layers sequentially, defining the flow of data through the model.
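Wrapped in a function (so the model can be rebuilt later with a different batch size for generation), the model looks roughly like this; embedding_dim = 256 and rnn_units = 1024 are the values used in the tutorial:

```python
vocab_size = len(vocab)   # 65 unique characters in the Shakespeare text
embedding_dim = 256
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=BATCH_SIZE)
```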

Compiling and Training

In this tutorial, we have chosen Adam as the optimizer and sparse categorical cross-entropy as the loss function.

  • Optimizer: In machine learning, an optimizer is an algorithm or method used to update the parameters or weights of a model in order to minimize the loss function and improve its performance during training. The optimizer plays a crucial role in the training process by determining how the model’s parameters are adjusted based on the computed gradients of the loss function. Adam (Adaptive Moment Estimation) is a popular optimizer that adapts the learning rate dynamically based on estimates of the first and second moments of the gradients.
  • Loss function: It is a mathematical function that quantifies the discrepancy between the predicted output of a model and the actual target output. The goal of the loss function is to measure how well the model is performing and to guide the learning process by providing a measure of the error between the predicted and true values. Since our output is always one of the 65 characters, this is a multiclass classification problem. Categorical cross-entropy is used for multi-class classification: it measures the dissimilarity between the predicted class probabilities and the true class labels, encouraging the model to assign high probability to the correct class and low probabilities to the others. Because our labels are plain integers such as 0 or 1 rather than one-hot encoded vectors, we use the sparse variant, sparse categorical cross-entropy.

To execute this, we define a small custom loss function (a sketch of the code follows the list below). Its two arguments are:

  • labels: This represents the true class labels of the input samples. It is assumed to be integer-encoded, where each value corresponds to the class index.
  • logits: This represents the predicted logits from the model. Logits are the raw, unnormalized output of the model before applying any activation function.
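A sketch of that loss function and the compile step, following the tutorial:

```python
def loss(labels, logits):
    # labels: integer class indices; logits: raw, unnormalized model outputs
    return tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
```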

The function uses tf.keras.losses.sparse_categorical_crossentropy to calculate the loss between the labels and logits. This loss function specifically handles the case of integer-encoded class labels and expects the input to be logits (unnormalized scores) rather than probabilities.

The from_logits=True argument indicates that the logits input is not normalized with a softmax activation function. This is necessary when calculating the cross-entropy loss directly from the logits. The loss function internally applies the softmax activation function to the logits before calculating the cross-entropy loss.

By defining and using this custom loss function, you can incorporate it into your model during the training process and optimize the model’s parameters to minimize this loss, thereby improving the model’s performance on the multiclass classification task.

To be able to load weights later and keep track of training performance, we also set up a checkpoint directory and save the training history to a variable named history.
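A sketch of the checkpoint and training setup; the directory name and epoch count are placeholders taken from the tutorial:

```python
import os

# Save the model's weights at the end of every epoch
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix, save_weights_only=True)

EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
```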

Generating Text

After training the model for a number of epochs, we load the latest checkpoint. Then we call the function we used to create the model, passing in the vocab size, embedding dimension, RNN units, and batch size. Finally, we summarize the model.
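In code, that step looks roughly like this; the batch size of 1 is carried over from the tutorial, since we feed one sequence at a time during generation:

```python
# Rebuild the model for a batch size of 1 and load the most recent weights
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()
```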

Then, we need to give our model a few inputs and instructions to generate the text:

  • The number of characters to generate,
  • The vectorized input (converted from string to numbers),
  • An empty variable to store the result,
  • A temperature value to manually adjust the variability of the predictions,
  • Devectorizing the output and feeding it back into the model for the next prediction,
  • Joining all the generated characters into a final string.

Running this loop gives us the final prediction, which we can use to print the generated text.
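Putting those instructions together, the generation loop looks roughly like this; the number of characters and the start string are arbitrary choices:

```python
def generate_text(model, start_string, num_generate=1000, temperature=1.0):
    # Vectorize the start string (string -> numbers) and add a batch dimension
    input_eval = tf.expand_dims([char2idx[s] for s in start_string], 0)

    text_generated = []   # empty container for the result
    model.reset_states()
    for _ in range(num_generate):
        predictions = tf.squeeze(model(input_eval), 0)
        # Lower temperature -> more predictable text; higher -> more surprising
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        # Feed the prediction back in as the next input
        input_eval = tf.expand_dims([predicted_id], 0)
        # Devectorize (number -> character) and store it
        text_generated.append(idx2char[predicted_id])

    # Join all generated characters into the final string
    return start_string + ''.join(text_generated)

print(generate_text(model, start_string=u"ROMEO: "))
```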

Results

1st Iteration: Epochs = 10, Temperature = 1

The results are not quite what I expected! That must be due to the low number of epochs I ran on my computer, as training was taking a lot of time. I will definitely monitor the results once I run it with more epochs, as the training progress from the previous run was not stored. This goes to show that a lot of programming is trial and error!

Skills I practiced:

  1. Figure it out: Definitely the skill you have to use while programming. But you don’t have to figure everything out alone. There are many resources on the internet that can help such as ChatGPT and StackOverflow. Once you’ve tried everything out, you can reach out to people you know who might be familiar with the work and can help you figure it out. Having someone to get clarity from, especially when starting out, can be immensely valuable. It also helps with accountability.
  2. Patience: I needed it when my laptop was taking hours to train the model.
  3. Resilience: I was resilient in training my model a couple of times in a few days to make sure I got valuable output.

TL;DR

  • The top-down approach is valuable in learning new things that are overwhelming and broad. It helps gain practical skills and keeps you engaged (even during linear algebra).
  • RNNs are a type of neural network that can keep context about the information you feed them. They are used in natural language processing, language translation, and speech recognition, to name a few applications.

So this was it! The tutorial is done! I am proud of building an RNN, and I definitely have a lot more to learn in the future. This tutorial helped me see abstract concepts in action, and I will work through another tutorial in the future to gain more clarity on the libraries we use in machine learning.

Until then, keep building, keep learning!
