The Intuition of Recurrent Neural Networks

Arjun Mahes · Published in Geek Culture · Feb 7, 2022
remember those memories? ML does too 🙌 | Photo by Jon Tyson on Unsplash

This is an introductory article. I suggest you know the workings of gradient descent, some linear algebra, and basic deep learning concepts 🙂. However, it's quite simple with many visuals, so approach it with any level of knowledge!

There is a magic in recurrent neural networks that I have always cherished. I remember viewing Andrej Karpathy’s TypingAssistant LSTM model a while back. Within a few dozen minutes of training his model (with rather arbitrarily chosen hyperparameters), it started to generate text that was on the edge of making sense.

How is it possible to extract the essence of previous terms to guess future words?

I was shocked. From my machine learning experience, I was familiar with a model’s ability to find and extract relationships in structured data, but what is the magic behind understanding the theme, grammar, and style of a sentence?

This magic comes from adding a subtle sense of memory (a hidden state carried from step to step) to the traditional neural network. These are the recurrent neural networks we see today; Sepp Hochreiter and Juergen Schmidhuber later built on the idea with the LSTM.

Through this article, you will build an intuition for how RNNs and LSTMs work and survey their varieties.

TL;DR

  • RNNs Simplified: What do the abbreviations stand for? How do RNNs and feed-forward nets differ? What does the architecture look like?
  • Going Deeper into the Concept: What are the varieties of RNNs? How do LSTMs function? What are your next steps?

RNNs Simplified

Introductory Example

Let’s assume you programmed an object-detection model where the input is a documentary of wild animals.

The results are accurate, yet in some cases the model classifies wolves as dogs, since they fall under the same family.

In this case, a neural network with memory will perform well. If the computer understands that the given documentary revolves around wild animals (from previous object-detection runs), it can simply use this knowledge to correctly predict wolf species. This is what an RNN does.

In an RNN, the current output depends not only on the present input but also on past inputs.

There is immense power in this type of neural network. By remembering the meanings of previous words, it can identify context for sentiment analysis, translate between languages, and handle other natural language tasks.

Grasping Memory from Past Inputs

In a traditional dense-layered feed-forward network, there are three distinct types of layers.

  • Input Layer: the number of neurons matches the number of input features.
  • Hidden Layers: where the deep learning computation is done. Hyperparameters such as the size and number of layers depend on the specific problem.
  • Output Layer: produces the solution, often as probabilities after a softmax function.
Weights (w_ii) are parameters that are tuned through gradient descent | Source
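To make this concrete, here is a rough NumPy sketch of a dense feed-forward pass; the layer sizes (4 inputs, 8 hidden neurons, 3 output classes) are arbitrary choices for illustration, not the network in the diagram:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Arbitrary illustrative sizes: 4 input features, 8 hidden neurons, 3 classes.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

x = rng.normal(size=4)        # input layer: one example with 4 features
h = np.tanh(W1 @ x + b1)      # hidden layer: learned non-linear combinations
y = softmax(W2 @ h + b2)      # output layer: probabilities after softmax
print(y, y.sum())             # the probabilities sum to 1
```

Notice that nothing here remembers previous inputs: each forward pass starts from scratch.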

RNNs, on the other hand, let memory persist by introducing a loop mechanism. This loop recurrently passes the prior hidden state into the current step, acting as a highway for memory.

Contrary to feed-forward networks, RNNs allow information loops.

If we were to visualize this information loop, the diagram would look as follows.

Information is recurrently recycled back into the hidden layer

Here, the hidden state is another way to describe this highway of information. Notice that the highway is continually updated as each new piece of information passes through the RNN.
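As a minimal sketch of the loop (the weight names W_x and W_h and the sizes are illustrative assumptions), one recurrent step in NumPy could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 5, 3

# Parameters shared across every time step.
W_x = rng.normal(size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden (the loop)
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # Mix the current input with the previous hidden state (the "highway").
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)      # empty memory before the first input
for t in range(4):             # feed four dummy inputs through the same cell
    x_t = rng.normal(size=input_size)
    h = rnn_step(x_t, h)       # the hidden state is updated at every step
print(h)
```

The same weights are reused at every step; only the hidden state changes.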

To get a better glimpse of the loop mechanism, let’s use a text-prediction NLP model as a guide. This model predicts the next word in a sentence based on the meanings of prior words.

RNNs are used for NLP because they are well suited to predicting and analyzing sequences of information.

Be sure to watch this demo to understand what Text Prediction models do. 😉

In our model, the first step is to split the input sentence into individual words, which are transformed into numerical vectors for computer readability. These vectors are fed into the neural network one at a time, in sequence.

If the sentence is “Soup is Delicious!”, the vectorized form of “Soup” is calculated and then passed to the hidden layer.

word to vector transformation occurs through the one-hot encoding process
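A toy one-hot encoding for this sentence might look like the following; the three-word vocabulary here is an assumption purely for illustration:

```python
import numpy as np

# Hypothetical tiny vocabulary, just for this example.
vocab = {"soup": 0, "is": 1, "delicious": 2}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0       # a 1 at the word's index, 0 everywhere else
    return v

print(one_hot("soup"))         # [1. 0. 0.]
print(one_hot("delicious"))    # [0. 0. 1.]
```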

The second word, “is”, is passed in the same way, along with the previous hidden vector y_soup. Their combination forms the vector y_is, which holds the meaning of the current word as well as of the previous words.

After repeating this addition for each word in the sentence, we arrive at a vector that captures the meaning of, and the dependencies within, the sentence as a whole, which can then be used to predict further words.

y_sentence can then be used for paraphrasing and other NLP uses | tanh is the activation function used

Note that the hidden layers shown above are not separate layers. They are a single layer, unrolled over time, through which the hidden state is cycled. If each step used its own separate layer, the weights would no longer be shared across time steps and gradient descent could not learn one consistent update rule for the sequence.
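Putting the pieces together, here is a rough sketch of that single layer unrolled over the sentence, reusing the toy vocabulary above; the sizes and weight names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"soup": 0, "is": 1, "delicious": 2}
vocab_size, hidden_size = len(vocab), 4

# One shared set of weights: the single layer that the hidden state cycles through.
W_x = rng.normal(size=(hidden_size, vocab_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def one_hot(word):
    v = np.zeros(vocab_size)
    v[vocab[word]] = 1.0
    return v

h = np.zeros(hidden_size)                            # no context before the first word
for word in ["soup", "is", "delicious"]:
    h = np.tanh(W_x @ one_hot(word) + W_h @ h + b)   # y_soup, y_is, y_sentence, ...
print(h)                                             # a vector summarizing the whole sentence
```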

With that, we know the inner workings of the recurrent neural network! However, there are a few places where this logic can go wrong.

RNN Varieties

In fact, using the RNN logic shown above for text prediction will not work as expected, because y_sentence gives an uneven distribution of influence to each word.

For instance, later words such as “delicious!” and “is” in “Soup is delicious!” have more effect on the model’s future text prediction. Thus, our model has a bias toward recent, short-term words.

each word’s degree of influence on the model’s prediction is biased toward recent words

Introducing long-term memory into the equation can balance the degree of influence of each word. It also combats the vanishing gradient problem, permitting longer sequences of words from which accurate meaning can still be extracted.
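As a rough numerical illustration of the vanishing-gradient problem (not a derivation), the gradient that reaches an early word is a long product of per-step factors; if each factor is below 1, as the tanh derivative often is, the product shrinks toward zero:

```python
# Hypothetical per-step gradient factor, e.g. a recurrent weight times
# the derivative of tanh (which is at most 1).
step_factor = 0.8

for steps_back in [1, 5, 20, 50]:
    grad = step_factor ** steps_back          # repeated multiplication through time
    print(f"{steps_back:>2} steps back -> gradient scale ~ {grad:.6f}")

# Early words end up with a vanishingly small influence on the weight updates.
```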

The architecture employing both short-term and long-term memory is called the LSTM.

Long Short Term Memory

LSTMs are powerful tools. They are widely used for extended sequences of data, such as paragraphs and DNA, and together with traditional RNNs they laid the foundation for the natural language processing field.

Although the architecture of an LSTM differs from a classic RNN’s, it is considered an RNN itself because of its memory capabilities.

The LSTM architecture is composed of four gates, through which the long-term and short-term memory derived from an input sequence are passed to obtain a prediction along with updated long-term and short-term memories.

note that the text-prediction process includes more steps, such as word embeddings, character embeddings, and others.

The four gates noted include (a code sketch of how they fit together follows this list):

  • Forget Gate: Useless data is discarded from the long-term memory, and useful data is kept.
  • Learn Gate: In the same way, the current event in the sequence and the short-term memory are joined, and useless data is removed.
  • Remember Gate: The information preserved by the Forget and Learn gates is combined to form an updated long-term memory, which is retained for future events.
  • Use Gate: The outputs of the Forget and Learn gates are combined to produce the prediction, which also becomes the updated short-term memory.
and the process continues…
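As a rough sketch of how these four intuitive gates map onto the standard LSTM equations (the weight names and sizes are assumptions for illustration; real framework implementations organize the computation differently):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

# One weight matrix per gate, each acting on [short-term memory, current event].
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_size, hidden_size + input_size))
                      for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])   # short-term memory + current event
    f = sigmoid(W_f @ z)                # Forget gate: what to drop from long-term memory
    i = sigmoid(W_i @ z)                # Learn gate: how much new information to keep...
    c_tilde = np.tanh(W_c @ z)          # ...and the candidate new information itself
    c = f * c_prev + i * c_tilde        # Remember gate: updated long-term memory
    o = sigmoid(W_o @ z)                # Use gate: what to expose for the prediction
    h = o * np.tanh(c)                  # updated short-term memory / output
    return h, c

h = c = np.zeros(hidden_size)
for t in range(3):                      # walk through a short dummy sequence
    h, c = lstm_step(rng.normal(size=input_size), h, c)
print(h, c)
```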

Next Steps and Future Readings

Now that we comprehend the ideas of memory and recurrent neural networks, the next step is to understand the mathematical operations behind them.

Consider reading this article to understand more about the functions inside each of the four gates. If you are curious about LSTM and RNN projects, build your own using this image-captioning tutorial!

Good luck on your journey to machine learning heights!

Before you go…

Machine learning has always been a passion of mine. It is used everywhere, whether in Snapchat filters or spam classifiers. Today, it’s more of a lifestyle than a buzzword.

That is why I got into the field of data science. I have been hooked since the beginning, and I hope I always will be.

If you enjoyed reading through this article, feel free to connect with me on my socials 🤗 (P.S. I’m looking for some problems to solve: send me some over if you get the time).
LinkedIn | Newsletter | Twitter
