Machine Learning Spotlight I: Investigating Recurrent Neural Networks
Recurrent Neural Networks (RNNs) quickly became the go-to neural network architecture for Natural Language Processing (NLP) tasks. In this blog post, I’ll start with a broad definition of their architecture, and then explain what makes them so popular with the NLP community. Finally, I’ll list a collection of blog posts, tutorials, research papers, and frequently asked questions to help you discover the different flavours of RNNs.
Over the last few years, recurrent neural network architectures have established themselves as the state of the art in several NLP tasks, including:
- Named Entity Recognition [Zhiheng Huang et al., Bidirectional LSTM-CRF Models for Sequence Tagging, 2015]
- Language Modeling [Stephen Merity et al., Regularizing and Optimizing LSTM Language Models, 2017]
- Machine Translation [Yonghui Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016].
This breakthrough came long after the first proposal of this kind of architecture, over 30 years ago [John Hopfield, Neural networks and physical systems with emergent collective computational abilities, 1982]. Modern architectures appeared roughly 15 years later [Sepp Hochreiter, Jürgen Schmidhuber, Long Short-Term Memory, 1997].
The main advantage of RNNs lies in their ability to deal with sequential data, thanks to their “memory”. Feedforward Artificial Neural Networks (ANNs) have no notion of time: the only input they consider is the current example they are fed. RNNs, in contrast, consider both the current input and a “context unit” built from what they have seen previously.
So the prediction made by the network at timestep T is influenced by the one it made at timestep T - 1. And when you think about it, that's pretty much what we do as humans: we use our previous experience (T - 1) to handle new and unseen things (T).
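To make the idea of a “context unit” concrete, here is a minimal NumPy sketch of a single recurrent step. The weight names (W_xh, W_hh), sizes, and random inputs are illustrative choices of mine, not taken from any of the resources below:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden: the "memory"
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One timestep: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                      # empty context before the sequence starts
sequence = rng.normal(size=(5, input_size))    # stand-in for 5 word embeddings
for x_t in sequence:
    h = rnn_step(x_t, h)                       # the state at T depends on everything up to T - 1
```

The single matrix W_hh is what carries information forward: at every step the previous state is fed back in, so the final h summarizes the whole sequence.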
Christopher Olah puts it very nicely in his blog post, Understanding LSTMs:
“As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again.” — Christopher Olah
And luckily for us, NLP is full of sequential (or temporal) data. Be it sentences, words, or characters, we always use the context to establish a more precise meaning for communication, whether it is written or oral.
Here are a few examples:
- In Machine Translation, a word will carry different meanings based on the context.
- Sentiment Analysis will detect modifiers (like “very”, “not”, and “a bit too”) to grasp the intensity, polarity, or negation of a sentiment.
- In Dialog Management, the next step of a conversation is conditioned by the previous interactions and the goal given to the system.
- In Tokenization, we can use the next and previous characters to say whether or not a new word is beginning.
It doesn’t stop there: Part of Speech Tagging, Sentence Segmentation, Language Modeling, Semantic Role Labelling, Text Summarization, Spell Checking, and a whole lot of other tasks rely on the sequential nature of the data.
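To show what one of these sequence-labelling setups looks like in practice, here is a minimal sketch of a bidirectional LSTM tagger in tf.keras. The vocabulary size, tag count, and layer sizes are placeholder values I chose for illustration, not recommendations from the resources below:

```python
import tensorflow as tf

vocab_size, num_tags = 10_000, 17  # hypothetical vocabulary and tag-set sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),                # variable-length sequences of token ids
    tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True),   # ids -> dense vectors, padding masked
    tf.keras.layers.Bidirectional(                               # read the sentence in both directions
        tf.keras.layers.LSTM(128, return_sequences=True)         # keep one hidden state per token
    ),
    tf.keras.layers.Dense(num_tags, activation="softmax"),       # a tag distribution for every token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Trained on padded token-id and tag-id arrays (e.g. with model.fit), this kind of model is a common baseline for tasks like Named Entity Recognition and Part of Speech Tagging.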
But RNNs are not perfect yet: because each timestep's computation needs the result of the previous one, they are slow to train and computationally expensive. Today, more and more researchers are using Convolutional Neural Networks (CNNs), because they offer speed and accuracy improvements in many tasks.
“An LSTM with an attention layer will yield state-of-the-art results on any task.”
Still, this phrase is not to be forgotten, and recurrent architectures will populate user-facing NLP systems and benchmark baselines for a long time.
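Since that attention layer comes up again and again in the resources below, here is a toy NumPy sketch of the core idea, dot-product attention over a sequence of hidden states. The shapes and random values are mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size = 5, 16

states = rng.normal(size=(seq_len, hidden_size))  # one encoder hidden state per timestep
query = rng.normal(size=hidden_size)              # e.g. the current decoder state

scores = states @ query                           # how relevant each timestep is to the query
weights = np.exp(scores - scores.max())
weights /= weights.sum()                          # softmax over the timesteps
context = weights @ states                        # attention-weighted summary of the whole sequence
```

Instead of relying only on the last hidden state, the model can look back at every timestep and weight them by relevance, which is what makes the combination with LSTMs so effective.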
Blogs
Introductions
- Denny Britz, Recurrent Neural Networks Tutorial Part 1: Introduction to RNNs, 2015
- Christopher Olah, Understanding LSTMs, 2015
- Rohan Kapur, Recurrent Neural Networks & LSTMs, 2017
- Maxim Kolomeychenko, Yuri Borisov, Evolution: from vanilla RNN to GRU & LSTMs, 2017
- Rafal Karczewski, Natural Language Processing in Artificial Intelligence is almost human-level accurate, 2017
- DL4J, A Beginner’s Guide to Recurrent Networks and LSTMs
Studies
- Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, 2015
- Yoav Goldberg, The unreasonable effectiveness of Character-level Language Models, 2015
- Christopher Olah, Shan Carter, Attention and Augmented Recurrent Neural Networks, 2016
- Aidan Gomez, Backpropagating an LSTM: A Numerical Example, 2016
- R2RT, Written Memories: Understanding, Deriving and Extending the LSTM, 2016
- R2RT, Non-Zero Initial States for Recurrent Neural Networks, 2016
- Leonard Blier, Attention Mechanism, 2016
- Edwin Chen, Exploring LSTMs, 2017
- Tigran Galstyan, Hrant Khachatrian, Interpreting neurons in an LSTM network, 2017
- Sebastian Ruder, Deep Learning for NLP Best Practices, 2017
- Arun Mallya, LSTM Forward and Backward Pass
Tutorials
- Denny Britz, Recurrent Neural Networks Tutorial, 2015
- Narek Hovsepyan, Hrant Khachatrian, Generating Constitution with recurrent neural networks, 2015
- Erik Hallström, How to build a Recurrent Neural Network in TensorFlow, 2016
- R2RT, Recurrent Neural Networks in Tensorflow I, 2016
- Tigran Galstyan et al., Automatic transliteration with LSTM, 2016
- Denny Britz, RNNs in Tensorflow, a Practical Guide and Undocumented Features, 2016
- TensorFlow, Recurrent Neural Networks, 2017
- Peter Roelants, How to implement a recurrent neural network
Research
Surveys
- Junyoung Chung et al., Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014
- Klaus Greff et al., LSTM: A Search Space Odyssey, 2015
- Zachary C. Lipton et al., A Critical Review of Recurrent Neural Networks for Sequence Learning, 2015
- Rafal Jozefowicz et al., An Empirical Exploration of Recurrent Network Architectures, 2015
- Andrej Karpathy et al., Visualizing and Understanding Recurrent Networks, 2015
- Wim De Mulder et al., A survey on the application of recurrent neural networks to statistical language modeling, 2015
Theses
- Felix Gers, Long Short-Term Memory in Recurrent Neural Networks, 2001
- Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks, 2008
- Tomas Mikolov, Statistical Language Models Based on Neural Networks, 2012
- Ilya Sutskever, Training Recurrent Neural Networks, 2013
- Richard Socher, Recursive Deep Learning for Natural Language Processing and Computer Vision, 2014
Papers
- John Hopfield, Neural networks and physical systems with emergent collective computational abilities, 1982
- Michael Jordan, Serial order: A parallel distributed processing approach, 1986
- Jeffrey Elman, Finding structure in time, 1990
- Paul Werbos, Backpropagation through time: what it does and how to do it, 1990
- Yoshua Bengio et al., Learning Long-Term Dependencies with Gradient Descent is Difficult, 1994
- Sepp Hochreiter, Jürgen Schmidhuber, Long Short-Term Memory, 1997
- Mike Schuster, Kuldip K. Paliwal, Bidirectional Recurrent Neural Networks, 1997
- Alex Graves et al., Multi-Dimensional Recurrent Neural Networks, 2007
- Richard Socher et al., Parsing Natural Scenes and Natural Language with Recursive Neural Networks, 2011
- Razvan Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2012
- Martin Sundermeyer et al., LSTM Neural Networks for Language Modeling, 2012
- Kyunghyun Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014
- Kyunghyun Cho et al., On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, 2014
- Ilya Sutskever et al., Sequence to Sequence Learning with Neural Networks, 2014
- David Krueger, Roland Memisevic, Regularizing RNNs by Stabilizing Activations, 2015
- Lifeng Shang et al., Neural Responding Machine for Short-Text Conversation, 2015
- Oriol Vinyals, Quoc Le, A Neural Conversational Model, 2015
- Xingxing Zhang et al., Top-down Tree Long Short-Term Memory Networks, 2015
- Kaisheng Yao et al., Attention with Intention for a Neural Network Conversation Model, 2015
- Zhiheng Huang et al., Bidirectional LSTM-CRF Models for Sequence Tagging, 2015
- James Bradbury et al., Quasi-Recurrent Neural Networks, 2016
- Sungjin Ahn et al., A Neural Knowledge Language Model, 2016
- Shalini Ghosh et al., Contextual LSTM (CLSTM) models for Large scale NLP tasks, 2016
- Yonghui Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016
- Julian Georg Zilly et al., Recurrent Highway Networks, 2016
- Kamil Rocki, Recurrent Memory Array Structures, 2016
- Junyoung Chung et al., Hierarchical Multiscale Recurrent Neural Networks, 2016
- Anjuli Kannan et al., Smart Reply: Automated Response Suggestion for Email, 2016
- Jason Weston, Dialog-based Language Learning, 2016
- Antoine Bordes et al., Learning End-to-End Goal-Oriented Dialog, 2016
- Denny Britz et al., Massive Exploration of Neural Machine Translation Architectures, 2017
- Nils Reimers, Iryna Gurevych, Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks, 2017
- Stephen Merity et al., Regularizing and Optimizing LSTM Language Models, 2017
- Tao Lei, Yu Zhang, Training RNNs as Fast as CNNs, 2017
FAQ
How are recurrent neural networks different from convolutional neural networks?
What is the difference between Recurrent Neural Networks and Recursive Neural Networks?
What is the difference between LSTM and GRU for RNNs?
What’s so great about LSTMs?
What is masking in a Recurrent Neural Network?
What is the attention mechanism introduced in RNNs?
Is LSTM Turing complete?
When should one decide to use an LSTM in a Neural Network?
Why doesn't the LSTM forget gate cause a vanishing/dying gradient?
What is the difference between states and outputs in an LSTM?
Is it possible to do online learning with LSTMs?
Originally published at Recast.AI Blog.