RNNs for Natural Language Processing

Matthew Kramer · Published in CodeX · Oct 11, 2021 · 5 min read

In the past decade, there has been remarkable progress in deep learning models for natural language problems. The current state of the art uses complex, massive neural networks with hundreds of millions of parameters. To make sense of those architectures, this article provides background on smaller recurrent networks as a stepping stone to the cutting edge.

Recurrent Neural Network (RNN)

What is it?

Recurrent neural networks are quite useful for handling sequential inputs, and almost all NLP tasks involve sequential inputs. A sequential input is a series of inputs whose ordering carries significance or meaning. Looking at text in any language, it is clear that changing the order of the words can drastically change the meaning of a sentence. Unlike feed-forward neural networks, recurrent neural networks maintain a hidden state and update this internal state as they process each new input in the sequence, which enables RNNs to learn from the ordering of the input.

What is it useful for?

Recurrent neural networks have many practical applications in NLP, such as entity recognition, sentiment analysis, and machine translation. In this article, the most basic form of RNN is presented as a foundation for understanding more complex architectures, such as Long Short Term Memory networks (LSTMs), Gated Recurrent Units (GRUs), attention models, and transformers.

How does it work?

Let’s consider an entity recognition task: determining which words in a sentence are names. As an example, consider the sentence “Vincent and Jules take in the place”. For each word in this sentence, the output should be whether or not the word is a name.

Since RNNs can only receive numbers or vectors as input, rather than character representations of text, each word needs to be converted to a unique vector. In this example, we will use one-hot encoded vectors. To brush up on word embeddings, check out this article:

So now, each input is a one-hot vector of 0s with a single 1 entry:
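For example, here is a minimal sketch of what those inputs (and the name/not-name labels the network should predict) could look like, assuming a toy vocabulary built only from this one sentence; a real system would use a much larger vocabulary:

```python
import numpy as np

# Toy vocabulary built from the example sentence only. A real vocabulary
# would contain thousands of words (e.g. the 10,000 assumed later).
sentence = ["Vincent", "and", "Jules", "take", "in", "the", "place"]
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

def one_hot(word, vocab_size):
    """Return a vector of 0s with a single 1 at the word's index."""
    vec = np.zeros(vocab_size)
    vec[vocab[word]] = 1.0
    return vec

inputs = [one_hot(word, len(vocab)) for word in sentence]

# Target labels for the entity recognition task: 1 if the word is a name.
labels = [1, 0, 1, 0, 0, 0, 0]  # "Vincent" and "Jules" are names
```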

Let’s take a look at how a recurrent network is structured for this use case.

Each input is processed by a small neural network, labelled as the Recurrent Unit in the diagram (the same unit, with shared weights, is applied at every position in the sequence). The recurrent unit has two inputs: the one-hot encoded vector of the input word, and a hidden state, appearing as h^t in the diagram. The hidden state is a vector that encodes the context of previous words and predictions, so that later recurrent units can use this context to learn how the ordering of words affects the meaning of the input. For example, consider the sentence “Vincent and Jules take in the place.” When the network is predicting whether “Jules” is a name, the hidden state will encode the previous two words, “Vincent” and “and”, which makes it more apparent that “Jules” is likely a name.

Each Recurrent Unit also has two outputs: a prediction, and the hidden state for the next recurrent unit.

The recurrent unit is an individual neural network and can vary in structure depending on the use case. The internals of the recurrent unit can become a bit math intensive, and it’s not too important to understand all the details. In short:

  1. The input x vector and the hidden state h vector are concatenated together.
  2. To produce the current prediction, ŷ, the concatenated vector is multiplied by a set of learned weights and added to a bias, and then put through an activation function, usually the sigmoid function.
  3. To produce the next hidden state, h, the concatenated vector is multiplied by a set of separately learned weights and added to a separate bias, and then put through an activation function, usually the tanh function.

The following diagram represents this for our first recurrent unit from the previous diagram.

The hidden state vector in this diagram has a dimensionality of 100, but this can be any value, such as 200 or even 1000. The dimensionality of the input vector depends on the vocabulary size, which in this case is assumed to be 10,000.
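As a rough sketch of the three steps above, using the dimensions from the diagram (a hidden state of size 100 and a vocabulary of 10,000); the weight names, random initialization, and helper functions here are illustrative assumptions rather than anything prescribed by the article:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10_000   # dimensionality of the one-hot input vector x
hidden_size = 100     # dimensionality of the hidden state vector h

# Learned parameters (randomly initialized here purely for illustration).
W_y = rng.normal(scale=0.01, size=(1, vocab_size + hidden_size))
b_y = np.zeros(1)
W_h = rng.normal(scale=0.01, size=(hidden_size, vocab_size + hidden_size))
b_h = np.zeros(hidden_size)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_unit(x, h):
    """One step of the simple recurrent unit described above."""
    # 1. Concatenate the input vector and the hidden state.
    concat = np.concatenate([x, h])
    # 2. Prediction: learned weights and bias, then a sigmoid activation.
    y_hat = sigmoid(W_y @ concat + b_y)
    # 3. Next hidden state: separate weights and bias, then tanh.
    h_next = np.tanh(W_h @ concat + b_h)
    return y_hat, h_next
```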

During training, the first hidden state, h⁰, can be initialized to a vector of 0s or to random numbers.
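Continuing the sketch above, a forward pass over the example sentence could then look like the following; the word indices are made up for illustration, and the untrained random weights will not produce meaningful predictions until the parameters are learned:

```python
# Hypothetical positions of the example words in the 10,000-word vocabulary.
word_ids = {"Vincent": 4213, "and": 17, "Jules": 2871,
            "take": 905, "in": 12, "the": 3, "place": 1450}

h = np.zeros(hidden_size)              # h^0 initialized to a vector of 0s
for word in ["Vincent", "and", "Jules", "take", "in", "the", "place"]:
    x = np.zeros(vocab_size)
    x[word_ids[word]] = 1.0            # one-hot encode the current word
    y_hat, h = recurrent_unit(x, h)    # prediction + hidden state for next unit
    print(f"{word:>8}: P(name) = {y_hat[0]:.2f}")
```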

What are the limitations?

This simple recurrent neural network approach has a number of drawbacks, which is why current state-of-the-art models use more advanced architectures that try to alleviate some of these constraints.

  1. A well-known issue referred to as ‘vanishing gradients’: for longer input sequences, the hidden state vector starts to lose the context of earlier words in the sequence. This is due to the nature of the training algorithm and activation functions causing gradients to approach 0 (see the sketch after this list). For a more mathematical dive into this issue, see https://youtu.be/qhXZsFVxGKo
  2. It is difficult to run in parallel. For a new prediction, each recurrent unit in the network requires that the previous recurrent unit computes the hidden state before it. During training, gradients and loss have to be back-propagated through the network in a reverse fashion to how the hidden state is propagated through the network, thus making training hard to parallelize as well.
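As a toy illustration of the vanishing-gradient effect, consider a hypothetical one-dimensional RNN, h_t = tanh(w·h_{t-1} + u·x_t). The gradient that reaches the first time step is a product of per-step derivatives, each with magnitude below 1 here, so it shrinks toward 0 as the sequence grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-dimensional RNN with a recurrent weight below 1.
w, u = 0.9, 0.5
h = 0.0
grad = 1.0  # running value of d h_t / d h_0
for t, x in enumerate(rng.normal(size=50), start=1):
    h = np.tanh(w * h + u * x)
    grad *= (1 - h**2) * w   # per-step derivative of tanh(w*h + u*x) w.r.t. h
    if t % 10 == 0:
        print(f"step {t:2d}: d h_t / d h_0 ≈ {grad:.2e}")
```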

Gated Recurrent Units (GRUs) and Long Short Term Memory networks (LSTMs) attempt to compensate for the vanishing-gradient limitation, while transformers address the second limitation by making the network highly parallelizable.
