The intuition behind recurrent neural networks

Serban Liviu
25 min read · Sep 14, 2020


Why a new post about RNN?

Why yet another article about recurrent neural nets? Aren't there enough of them on the internet already? Well, yeah. There are. But for some reason, I feel like most of them, or at least a large part of them, do not really try to give an actual intuition for why RNNs work the way they work. A lot of articles show the internals of RNNs by presenting, say, the formulas for LSTMs or GRUs, and you kind of have to take them on faith. Other articles show example code, but again you go through it, you train the network to do whatever it does, and it still feels like magic at the end of the day. At least that's my feeling. The kind of questions I had about RNNs, when I was learning about them, were things like: what exactly does that recurrent memory hold, and how is it so versatile with the information it holds? What does it actually look like? What do those floating point values in those vectors and matrices really represent?

To give you an actual example, take the classic sentiment analysis problem, where you train an RNN to classify sentences that are reviews for some product into good reviews or bad reviews. You input a natural language sentence and the RNN tells you whether it is a good review or a bad one. Here is an easy example of a good review: "I bought an iPhone XR a few days ago. It is awesome." This one could technically be detected even with classic algorithmic tricks, without any RNN or machine learning at all.

And this is a rather tricky one, representing a negative review: "Honda Accords and Toyota Camrys are nice sedans, but hardly the best car on the road". I hope you'd agree that no matter how complex or sophisticated a hand-written algorithm is, you will still end up having to handle lots of edge cases and exceptions, and then exceptions to exceptions, until you just give up. Natural language is just so complex and vague in nature. The only realistic way to solve this is with a recurrent network. And there are a ton of repos with example code that solve exactly this kind of problem. If you play with them, then bang, you might have a working sentiment analysis network. But the question still remains: how do its internals really work? From the outside it just looks like a bunch of vector and matrix multiplications. It looks like magic, right? Well, not really. It can be explained using linear algebra, and this is what we are going to explore in this post.

In this article I'm not going to walk you through a full working example; I think I've given the reason in the previous paragraph. There is already a ton of code out there with comments and explanations, so there is no point in this post containing yet another version of an overused RNN Python example. I just want to try, at least, to give some basic intuition about the inner workings of RNNs, using small code sketches only where they make the linear algebra concrete. But I'll provide links to any repos with code worth looking into along the way.

Before you start

It is desirable to have some basic knowledge of neural nets before reading this post. Otherwise some of the material might look weird or might not make any sense whatsoever. I'm talking about things like back-propagation. If you want to refresh some basic knowledge, here are some useful resources for you:

The Essence of Linear Algebra series by 3Blue1Brown

Andrew Ng’s Coursera course

And for more advanced stuff, check out the Neural Networks and Deep Learning book by Michael Nielsen

What is a recurrent neural net?

A recurrent neural net is a network that, unlike classic feedforward networks, can process variable length sequences of data. Below is a classic feedforward net:

The input for this specific network will always be a vector of size (5,1). Feedforward nets are very good when it comes to fixed length data such as images. But if we want to process text, for instance, there is no obvious way for a feedforward net to handle inputs of varying length.

PS: I know that 1-D convolutions can process text even though they are feedforward in nature, and that transformers are not recurrent yet are quite good with text as well. But that's beyond the scope of this post. Maybe in a future one.

Recurrent nets are used in tasks such as prediction, like word prediction or sentence completion. Just type "How to" in a Google search bar and see it for yourself. This type of RNN is called a sequence to vector model, because you input a variable length sequence of words (actually, of vectors), say three words, and then you check what the fourth word predicted by the RNN is.

Another type of problem that can be solved by sequence to vector models is the classification of variable length data, as in sentiment analysis. You input a sentence that represents a review for a product, and the RNN has to classify it as a good review or a bad review. We discussed that in the intro section.

Another type of RNN is known as a vector to sequence model, where you use a vector of features for an image, produced by a convolutional network, as the first hidden state of the RNN, and then you output a variable length sequence that represents the description of that image. Check the image below:

And finally there is a third type of RNN called sequence to sequence models, where the input is a variable length sequence, say an English sentence, and the output is another variable length sequence, say the French translation of that English sentence.

Of course all of these might sound weird and exotic, but don't worry about it; we will try to uncover their secrets. We will not cover all three models in this post, though. There is so much to cover about RNNs in general that a single post is not enough, unless this post morphs itself into a book :)

But first, what does the simplest RNN look like?

A simple RNN with one recurrent layer

In the figure above, we have a very simple recurrent network, also called a vanilla RNN. We have an input x_t, a hidden layer called h_t and an output layer called y_t. The black and the green arrows are the connection weights. But we also see another rounded red arrow. That represents a set of weights for the so called recurrent layer. We have a vector called h that acts as a memory for the network. At every step, we take the previous memory h_t, multiply it by the recurrent weights (the red arrow), then take the current input x_t transformed by the input weights, and add the resulting vectors up. After that we apply a non linearity such as sigmoid, tanh or ReLU. The result is then stored in the memory vector h, which now becomes h_t+1. We repeat this process for every input in our input sequence. This way we can view the network in the so called unfolded way, like in the image below:

The previous network unfolded in time

Maybe it is time to give some formulas to formalise this process.

h_t = tanh(W_xh · x_t + W_hh · h_(t-1)) — formula for computing the memory state of the vanilla RNN
y_t = W_hy · h_t — formula for computing the output of the RNN

So we have three sets (matrices) of weights: W_xh, which transforms the x_t input; W_hh, which transforms the previous memory state; and W_hy, which transforms the current memory state to produce an output.
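
To make these formulas concrete, here is a minimal numpy sketch of a single step of a vanilla RNN. The sizes are toy ones (11 input neurons, 3 hidden neurons, 11 output neurons, the same sizes used in the example below), and the random weights are of course just placeholders for learned ones:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # h_t = tanh(W_xh · x_t + W_hh · h_(t-1))
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    # y_t = W_hy · h_t (in a real model a softmax usually follows)
    y_t = W_hy @ h_t
    return h_t, y_t

# Toy sizes: 11 input neurons, 3 hidden neurons, 11 output neurons
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 11))   # transforms the input x_t
W_hh = rng.normal(size=(3, 3))    # transforms the previous memory h_(t-1)
W_hy = rng.normal(size=(11, 3))   # transforms the memory into an output

x_1 = rng.normal(size=(11, 1))    # a made-up first input vector
h_0 = np.zeros((3, 1))            # no previous memory at the very first step
h_1, y_1 = rnn_step(x_1, h_0, W_xh, W_hh, W_hy)
```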

Let's look at how this network is wired in even more detail. Suppose, for the sake of simplicity, that the input layer has 11 neurons, the hidden recurrent layer only 3 neurons, and the output layer also 11 neurons. There you go:

The actual weight connections of a simple and tiny recurrent neural network

I'm not sure I need to go over the above figure; I hope it's self explanatory. Every part of the net is tagged with its corresponding term from the formulas above. Now let's look at the same network unfolded two steps in time:

The previous network unfolded in two steps in time

Here we need to do a little bit of explaining.

The first step:

x_1 is the first input and h_0 is a zero vector, because this is the first step and thus there is no previous memory. x_1 gets transformed by the W_xh matrix into a lower dimensional vector (from (11,1) to (3,1)), and then a non linearity (tanh in our examples) is applied. The resulting vector is used to initialize the recurrent memory, which is now referred to as h_1. After that, the h vector is used to produce the y_1 output by multiplying it with the W_hy matrix.

Now the second step.

We use x_2 as the next input. However, the h memory already contains a transformed version of the previous input x_1. We apply the formula for h shown above, and the result is stored back in the h vector. It's an update operation on the recurrent memory h. Now the recurrent memory contains information about both the x_1 and x_2 inputs. I've used a cool gif from this great article to showcase what happens with the recurrent memory h after we go through a few steps:

And below is how these four timesteps look in the actual feedforward process of the network:
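
If you prefer code to diagrams, those unfolded timesteps are nothing more than a loop that keeps feeding the memory back in. A small numpy sketch, with the same made-up toy sizes as before:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    return h_t, W_hy @ h_t

def rnn_forward(inputs, W_xh, W_hh, W_hy):
    h = np.zeros((W_hh.shape[0], 1))   # h_0: no memory before the first word
    outputs = []
    for x_t in inputs:                 # one iteration = one unfolded timestep
        h, y_t = rnn_step(x_t, h, W_xh, W_hh, W_hy)
        outputs.append(y_t)
    return outputs, h                  # h now carries information about the whole sequence

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 11))
W_hh = rng.normal(size=(3, 3))
W_hy = rng.normal(size=(11, 3))
sequence = [rng.normal(size=(11, 1)) for _ in range(4)]   # four made-up input vectors
outputs, h_final = rnn_forward(sequence, W_xh, W_hh, W_hy)
```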

The next thing we want to cover is word embeddings, because they are fundamental to the way neural nets deal with text.

Representing words in neural networks. Introducing one-hot encoding and word2vec embeddings.

We are not going to dig too much into the details of word2vec here, because that would be beyond the scope of this post. Let's just say a few words about what these word embeddings mean and how we are going to use them to understand recurrent neural nets. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Every word is represented as a "dense vector" of, say, size 50. We could use vectors of other sizes such as 100 or 150 or 300; there is no hard limit on that. In this model, synonymous words end up very close together in the high dimensional vector space where those vectors live. For instance the words "emperor" and "king" will have very similar vectors (i.e. their vectors will have a large dot product, a high cosine similarity) because they are near-synonymous. This is how they might look:

Of course we can't visualize vectors in 50 dimensions, so we've used PCA to map the 50-dimensional vectors down to the more familiar two dimensions, so we can compare their relative positions from the original 50-dimensional vector space.

Probably the most famous example of word2vec vector arithmetic is the king, man, woman, queen one shown below:

What it says is that if we take the vector for the word king, subtract the vector for man and then add the vector for woman, the resulting vector will be very, very close to the vector for the word queen. There are a ton of other cool examples like this, such as [madrid] - [spain] + [france] ~ [paris], where [word] is the word2vec vector for that word.
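
If you want to play with these analogies yourself, the gensim library can download pre-trained GloVe vectors for you. Here is a small sketch; the exact neighbours you get back depend on the pre-trained model, so treat the expected results as approximate:

```python
import gensim.downloader as api

# Downloads 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

# king - man + woman ~ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# madrid - spain + france ~ paris
print(glove.most_similar(positive=["madrid", "france"], negative=["spain"], topn=3))

# Near-synonyms have a high cosine similarity
print(glove.similarity("emperor", "king"))
```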

For more info on how word embeddings and principal component analysis work, check these links:

Now that we have a representation for our words, how exactly are we going to select different words at each step in the training/prediction process of the recurrent neural net? When we want to feed "emperor has no" to a trained RNN, we would like to get the word2vec vector for each of the three words in this simple sentence, and the output after these three steps will be another vector, hopefully one resembling, or very close to, the vector for the word "clothes". So we need a method to select individual vectors. We could use a pre-trained word2vec such as GloVe, which was trained on the whole Wikipedia text, and then use a hash map to index these words. The input layer would have 50 neurons (the size of the word2vec vectors in this case), and we would feed in each vector in the sequence of words, using the hash map described above to select the vectors. That's one method.

But what happens when we encounter a word that is not in the pre-trained word2vec? If you use the Wikipedia pre-trained word2vec and then want to train your RNN word prediction on, say, the Bible or some Shakespeare play, there is a very high chance that a lot of the words in the Bible have no match in the pre-trained Wikipedia vectors, or that, even if a word is present, the actual embedding for a biblical word is not going to be a very good one. It could be really bogus. If you understand how word2vec works, you understand what I'm saying here. Word vectors learn to encapsulate the meaning of a word by looking at the contexts in which that word appears, many many times, in a large corpus. A biblical word such as thessalonians appears in very different contexts, and many more times, in the Bible than in the Wikipedia corpus. As a result, the embedding for the word thessalonians in the pre-trained Wikipedia word2vec might not be as good, and might not reflect the actual meaning of that word, compared to an embedding for the same word learned directly from the Bible itself.

So using pre-trained word embeddings might work well if the corpus on which you want to train your network happens to match the corpus on which the word2vec was pre-trained. But that's pretty limiting. The best approach in cases like these is to learn the actual word embeddings from the corpus itself. You do not use any pre-trained word2vec at all. You just train the network on your corpus and the embeddings will be learned along the way, and eventually they will be as powerful and expressive as any word2vec flavour, but geared towards your specific corpus.

What we are going to do is the following: we count all the unique words in our corpus. Let's suppose that our corpus is "The Republic" by Plato, and that the text contains 10649 unique words. Each of the 10649 words will then get a unique index, from 1 all the way to 10649.
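
Building that word-to-index mapping takes only a few lines of Python. A sketch, assuming a hypothetical file name and a naive whitespace tokenization (I count from 0 here; whether you count from 0 or 1 makes no real difference):

```python
# Hypothetical file name; imagine it holds the full text of "The Republic"
text = open("the_republic.txt", encoding="utf-8").read()

# Naive tokenization: lowercase everything and split on whitespace
words = text.lower().split()

vocab = sorted(set(words))
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for word, i in word_to_index.items()}

print(len(vocab))                    # number of unique words in the corpus
print(word_to_index.get("justice"))  # the unique index assigned to a given word
```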

Now, if the size of the vector embedding is, say, 300, then the first layer, also called the embedding layer, will have 10649 neurons and will project into a second layer of 300 neurons. The input vector will be of size 10649, where every component is zero except for the one corresponding to the word we want to select, which will be 1. Let's suppose, for simplicity, that we only have 8 words in our vocabulary and an embedding of size 3. This is how it will look:

In the image above, the fifth neuron, the black one, has the value 1 while the rest are 0. Multiplying a one-hot vector like this, of size (8,1), with the weight matrix of size (3,8) is equivalent to selecting the fifth column of the weight matrix (the bolded arrows in the image). If you want an even more concrete example, check the image below.

As you can imagine, a (300,10649) weight matrix is terribly large. And during training, we will eventually learn the embeddings for all those 10649 words in the corpus.
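
Here is the same one-hot trick in numpy, using the tiny 8-word vocabulary and 3-dimensional embedding from the figure (random numbers stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(1)
W_embed = rng.normal(size=(3, 8))   # embedding matrix: one 3-d column per vocabulary word

one_hot = np.zeros((8, 1))
one_hot[4] = 1.0                    # "select" the fifth word (index 4)

# Multiplying by a one-hot vector just picks out the corresponding column
selected = W_embed @ one_hot        # shape (3, 1)
assert np.allclose(selected[:, 0], W_embed[:, 4])

# In practice nobody materializes this multiplication for a (300, 10649) matrix;
# frameworks implement the embedding layer as a plain column (or row) lookup.
```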

Why would this configuration learn word embeddings just like the word2vec algorithm? Well, if we train a network like this to predict a word given a series of previous words, that is basically the same as word2vec trying to predict (to maximize the conditional probability of) a word given a window of 3 or 5 words around it. Again, if that does not make any sense to you, please stop and go through the word2vec articles linked above, otherwise nothing in the rest of this post will make sense.

Show me a real example

I think we now have all the prerequisites for going through the inner workings of a recurrent neural network.

Let's take the classic example of a network that predicts the next word given a sequence of words. So if I feed "Emperor has no" to the network, its output should be "clothes". Let's suppose that the network is already trained. We will eventually get into how to train these nets, but for now let's understand how a trained one works. If the network was trained on a large corpus such as Wikipedia, has one-hot input vectors with tens of thousands of entries, and word embeddings of size 300, it is obvious that we cannot visualize these vectors in action, or how the learned weight matrices transform them from one very high dimensional vector space to another. So we will look at vectors in 2D space and just pretend that they live in 300 dimensions.

The architecture of the network is something like this:

The embedding layer has more than 40000 neurons in it.

The hidden layer has 300 neurons, plus the recurrent memory state of the same size (300).

And the output layer, also known as the softmax layer, has the same size as the one-hot encoding from the input layer. Since we are predicting words, the vocabulary is the same, right?
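
For reference, here is roughly what this architecture looks like as a minimal Keras sketch. The vocabulary size is a made-up round number, and a real implementation would add details like sequence padding and masking:

```python
import numpy as np
import tensorflow as tf

vocab_size = 40000    # "more than 40000 neurons" in the embedding layer
embed_size = 300      # size of the word embeddings and of the recurrent memory

model = tf.keras.Sequential([
    # Embedding layer: equivalent to a one-hot vector times a (vocab, 300) matrix
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_size),
    # Vanilla recurrent layer with a 300-dimensional memory state (tanh by default)
    tf.keras.layers.SimpleRNN(embed_size),
    # Softmax output layer: one probability per word in the vocabulary
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A batch with a single 3-word sequence of word indices, e.g. "emperor has no"
dummy = np.array([[12, 875, 3021]])
probs = model(dummy)   # shape (1, 40000): a distribution over the whole vocabulary
```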

Let's start by inputting the first word, which is emperor.

Since this is the first word, the h_0 vector is all zeros, so no contribution there. The new memory h_1 is just the tanh of the embedding vector for emperor.

In the next step we input the word has:

OK, so we do need to explain what is happening here. First we have the h_1 vector from the previous step, coloured in blue. The input vector for has, x_2, is the green one. Now we apply the recurrent formula shown earlier. Here it is again:

h_1 is transformed by the learned W_hh matrix, and we get from the blue vector to the orange one. For the sake of simplicity, let's suppose that the W_hh matrix is just a rotation by 45 degrees in the clockwise direction. In reality, for an embedding of 300 neurons, W_hh would be (300, 300). That very large matrix could represent any linear transformation, or a combination of transformations such as rotations, scalings, shears and stretches. That does not matter now. Let's just go with a simple rotation for W_hh, as shown in this matrix:

But what about W_xh? How are we going to transform the input vector? The answer is that we've already "transformed" it. Remember all the talk about one-hot encoding? In reality the input is a one-hot encoded vector for the current word, which gets multiplied by the very large embedding matrix; that amounts to selecting the embedding vector for that word. That's the W_xh·x_t part of the recurrent formula.

The next step is to add up these two vectors, which results in the light blue vector. Finally, the last step is applying the tanh activation function to each entry of that vector, which produces the red vector, which as you can see is very close to the OY axis. This red vector is now the updated memory of our network.
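
If it helps, here is the same toy update written out in numpy, with W_hh chosen to be exactly that 45-degree clockwise rotation. All the vector values are made up, picked only to mimic the picture:

```python
import numpy as np

theta = -np.pi / 4                            # minus sign = clockwise rotation by 45 degrees
W_hh = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])

h_1 = np.array([0.3, 0.8])     # made-up memory after "emperor"
x_has = np.array([0.5, -0.2])  # made-up embedding for "has", already selected
                               # via the one-hot times embedding matrix trick

rotated = W_hh @ h_1            # the blue vector rotated into the orange one
h_2 = np.tanh(rotated + x_has)  # add the input, squash with tanh: the new memory
print(h_2)
```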

For now we will ignore the output of the network. Yes, it is there, but we will look at it after we input a few words. We can choose to look at each output individually, or we can look at which word is predicted after a few input words.

Moving on to the third word, which is no:

Now the memory from the previous step is the red vector. The embedding vector for the third word, no, is the green one. We apply the same recurrent formula. Multiplying h_2 by W_hh transforms it into the blue vector. Adding the vector for no to this transformed h_2 produces the orange vector. Then we apply the tanh non linearity and get the magenta vector. That's the new memory of the network.

And now it’s time to look at the actual output of the network. Well it’s the h_3 memory vector actually.

But wait a minute, didn't you say that the memory h_t of an RNN gets multiplied by another matrix, W_hy? What happened to that? Well, that is done in the feedforward process, but keep in mind that the output layer is a softmax layer.

How softmax works and how we can interpret it geometrically

A softmax layer is by definition a probability distribution layer. The point of it is to select the word with the highest probability based on the state of the recurrent memory. So you multiply your (300, 1) recurrent memory h (as a row vector) with the (300, 40000) output matrix, and the result will be a huge vector with 40000 entries. From that you select the neuron with the highest value, and the index of that neuron will be the index of the predicted word. Remember that, just as in the embedding input layer, every word in the vocabulary has an index?

Below is the formula for the softmax activation function that is applied in the output layer. OK, I must admit it looks scary, so we will ignore it for now and come back to it later.

The simplest way to understand softmax is the following:

You have the vector (3, 2, 1). If we replace every entry by its value divided by the sum of all the values, we get this: (3/6, 2/6, 1/6) = (0.50, 0.33, 0.17). That looks like a probability distribution, right? The first entry has the highest probability of 1/2, so we would select that. It's almost like looking for the maximum value. But we use this instead because, once we model the output as a probability distribution, we can do two things: we can choose the element with the highest probability, or we can sample from the distribution. The other reason we do not use a simple maximum is that we want the whole thing to be differentiable in order for back-propagation to work. That's why the actual softmax formula uses the number e; it helps with the derivative computations when training with back-propagation. We will eventually discuss that, maybe in the next article.
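
To see the difference between the naive "divide by the sum" trick and the real softmax, here is a small numpy sketch (subtracting the maximum is just a standard numerical stability trick, it does not change the result):

```python
import numpy as np

def naive_normalize(v):
    return v / v.sum()

def softmax(v):
    e = np.exp(v - v.max())   # subtracting the max avoids overflow and changes nothing
    return e / e.sum()

v = np.array([3.0, 2.0, 1.0])
print(naive_normalize(v))   # [0.5   0.333 0.167]
print(softmax(v))           # [0.665 0.245 0.09 ] - same ordering, a sharper distribution
```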

Let's look at a contrived example to understand the multiplication between the recurrent memory and the softmax weight matrix.

Multiplying the (3,1) h memory by the (3,8) softmax output layer matrix

We see that multiplying an imaginary (3,1) recurrent memory state (as a row vector) with a (3,8) softmax output layer matrix produces a (1,8) output vector. The numbers in this vector are then fed to the softmax activation formula in order to be modeled as a probability distribution over the vocabulary of just 8 words in this example.

The resulting softmax vector will look like this:

softmax probability distribution over the words in the vocabulary

The neuron with the highest probability is the seventh one, with the value of 0.18. But what exactly does that mean? Let's look again at the previous matrix multiplication.

In general, multiplying an (n, 1) vector by an (n, m) matrix in this way is equivalent to computing the dot product between our (n, 1) input vector and every column of the (n, m) matrix. We see that the dot product between the input vector and the seventh column is the highest, therefore this one will have the highest probability.
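
And here is the whole contrived example in numpy: the vector-matrix multiplication really is just one dot product per column, followed by softmax and an argmax over the vocabulary. All the numbers are random placeholders:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(2)
h = rng.normal(size=3)             # the (3,) recurrent memory
W_hy = rng.normal(size=(3, 8))     # the (3, 8) softmax output layer matrix

logits = h @ W_hy                  # shape (8,): one raw score per word in the vocabulary

# The multiplication is literally a dot product with every column of W_hy
per_column = np.array([h @ W_hy[:, j] for j in range(8)])
assert np.allclose(logits, per_column)

probs = softmax(logits)
predicted_index = int(np.argmax(probs))   # the index of the predicted word
print(predicted_index, probs[predicted_index])
```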

This is the softmax layer for this very simple example, illustrated. The colours of the output layer represent the values of those neurons: black is very close to 1, whereas white is very close to 0.

The bolded arrows represent the seventh column of the (3,8) weight matrix.

The dot product between our (3,1) recurrent memory and this column is the highest amongst all the dot products in this layer. That's why this neuron is painted black, to represent a high value.

Now let's go back to the last image, where we were looking at the feedforward process in the RNN.

Hopefully it is clearer now. The recurrent memory h_3 was multiplied by the softmax weight matrix, and the dot product between it and the column (embedding) vector for the word "clothes" is the largest amongst all the words in the vocabulary. So that's why the predicted word after inputting the sequence "emperor has no" is "clothes".

Hope that makes sense now :)

Let’s do another similar example, but instead of the first word being “emperor”, let’s use the word “king”. Just to get the feel of it.

We've shown earlier that the words "emperor" and "king", being very similar, have very similar vectors in the 300-dimensional vector space of the learned word embeddings.

This is now h_1:

And now we input the second word, which as we know from the previous example is "has":

I'm not going to explain all the vectors in these images; they are tagged with whatever they represent anyway.

Now inputting the third word, "no":

And now let's look at how this h_3 vector relates to the rest of the word embeddings in the vocabulary, by multiplying it with the softmax weight matrix. It looks like the vector for the word "clothes" is again the closest one:

That should make sense by now. The logic is that synonymous words should produce similar internal memories for the RNN.

Expanding even further, not only words have embeddings, but sentences as well. In the previous image, if we go one step further and apply the recurrent formula using the last word of this simple sentence, namely the word clothes, we get another memory state, h_4, which will be the embedding for the whole sentence in the vector space. Below is the computation for that.

Again, I hope it's self explanatory, but we will go over it anyway. The previous state of the network was h_3, the blue vector, which we used to predict the next word in the sentence. If we transform it by the same old W_hh matrix, we get the green vector. Then we add that to the red vector, which is the embedding for the word clothes, and that produces the long magenta vector. Applying the tanh activation to it results in the orange vector. And that's the embedding for the sentence "king has no clothes". Just as embedding vectors for words contain the "meaning" of those words, a sentence vector contains all the meaning packed into that sentence.
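
In code, a sentence embedding of this kind is simply the final memory state after looping over the word vectors of the sentence. A sketch with random placeholder weights, so the printed similarity is meaningless here; only after training would similar sentences land close together:

```python
import numpy as np

def sentence_embedding(word_vectors, W_xh, W_hh):
    # Run the vanilla RNN update over a sentence and return the final
    # memory state as the sentence vector.
    h = np.zeros(W_hh.shape[0])
    for x in word_vectors:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(3)
W_xh = rng.normal(size=(300, 300))   # placeholders for learned weights
W_hh = rng.normal(size=(300, 300))

# Placeholders for the word embeddings of two 4-word sentences
sentence_a = [rng.normal(size=300) for _ in range(4)]
sentence_b = [rng.normal(size=300) for _ in range(4)]

print(cosine(sentence_embedding(sentence_a, W_xh, W_hh),
             sentence_embedding(sentence_b, W_xh, W_hh)))
```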

Here's a last example of how sentence embeddings look in a very high dimensional vector space:

Similar sentences are close together in summary-vector space.

This example was taken from the great paper that introduced sequence to sequence models to the world.

I hope by now you realize that this kind of sentence similarity cannot be achieved by algorithmic tricks, no matter how sophisticated they are. You just cannot create a sentence database where you somehow group similar sentences. You might use WordNet or other lexical databases, created manually by lots of people, to give you synonyms and/or concept similarity for words, but doing that for entire sentences is just not possible. And even WordNet only works for English and the other languages for which a WordNet database has been constructed. For a language such as Romanian, you would have to construct it manually yourself, and that's clearly not an option. Whereas you can take THE SAME neural network architecture, train it with a large corpus in a different language, and that alone will give you all the word embeddings, and the sentence embeddings as well, in that other language. You would just have to wait a few hours for the training process to end. Pretty neat, huh?

Will this stuff really work in real life? I mean, if we code a network with the architecture we've previously described, with 40,000-plus words in the embedding layer and only one hidden recurrent layer of 300 neurons that learns both word and sentence embeddings, and then we train it, will it actually give us exactly the kind of word predictions described in our contrived example? Maybe… not really, for several reasons. One is that if a single word requires, say, a 300-dimensional vector to capture its meaning, it is obvious that a whole sentence might require much more capacity than that. So in this case we might need to introduce a second hidden recurrent layer, which will do exactly that: it will learn sentence embeddings. Otherwise, if we insist on having just one layer, we might end up learning good word embeddings and very poor sentence embeddings, ones that look nothing like those shown in the earlier picture. Here's a sketch of a two layer RNN.

Recurrent neural net with two hidden layers

In general, this is what an additional layer does in a network: it learns things that are more abstract than the previous layer, just like in the case of conv-nets. The first layer learns edges from the pixels. Based on that, another hidden layer learns more abstract things such as simple shapes. And then a third layer might learn even more abstract things such as eyes and noses, etc.
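
A stacked version of the earlier Keras sketch would look something like this; the only subtlety is that the first recurrent layer has to return its full sequence of hidden states so the second layer has something to read at every timestep:

```python
import tensorflow as tf

vocab_size, embed_size, hidden_size = 40000, 300, 300

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_size),
    # First recurrent layer: emits a state at every timestep...
    tf.keras.layers.SimpleRNN(hidden_size, return_sequences=True),
    # ...so this second layer can build more abstract, sentence-level features on top
    tf.keras.layers.SimpleRNN(hidden_size),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
```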

Memories and thoughts are vectors

This is what Geoffrey Hinton once said about thoughts in our head:

“The implications of this for document processing are very important. If we convert a sentence into a vector that captures the meaning of the sentence, then Google can do much better searches; they can search based on what’s being said in a document. Also, if you can convert each sentence in a document into a vector, then you can take that sequence of vectors and [try to model] natural reasoning. And that was something that old fashioned AI could never do. If we can read every English document on the web, and turn each sentence into a thought vector, you’ve got plenty of data for training a system that can reason like people do. Now, you might not want it to reason like people do, but at least we can see what they would think. What I think is going to happen over the next few years is this ability to turn sentences into thought vectors is going to rapidly change the level at which we can understand documents. To understand it at a human level, we’re probably going to need human level resources and we have trillions of connections [in our brains], but the biggest networks we have built so far only have billions of connections. So we’re a few orders of magnitude off, but I’m sure the hardware people will fix that.”

Read more about thought vectors in these articles:

So even though not that much is known about how the brain works, there is some consensus that the memories you hold in your brain can be considered vectors, just like the word and sentence vectors shown in this post. Although in neuroscience they are not referred to as vectors but rather as "neuronal populations". A group of neurons, in which some are active and some are not, represents a pattern. And this pattern will activate some other regions of the brain, which in turn do God knows what :) They might activate regions of the temporal lobe responsible for language, and you start to "speak your mind" :)

Another great research paper is the one below, where it seems that when we want to translate a word from one language to another, our brains first encode the word into some kind of language agnostic representation (a type of word embedding, just like word2vec maybe?) and then decode it into the target language.

So it seems, although there is still a lot of research to be done, that our brains work with "thought vectors" to encode the meaning of words and sentences somehow. It has to be something like that; it remains to be seen how exactly.

Some final thoughts

This post was meant to show you, graphically, the intuition of how a vanilla RNN works and what its recurrent memory state might actually look like.

We did not go into more complicated stuff such as LSTMs, GRUs or attention mechanisms, or how RNNs learn using the back-propagation through time algorithm. We will explore all of these in future posts. But before you move on to more advanced material about recurrent nets, or even transformers, it's better to have a clear picture of the fundamentals of RNNs. I hope you've got that from this post. Otherwise, please let me know in the comments section.
