Machine Learning Spotlight I: Investigating Recurrent Neural Networks

SAP Conversational AI
7 min read · Oct 12, 2017

Recurrent Neural Networks (RNNs) quickly became the go-to neural network architecture for Natural Language Processing (NLP) tasks. In this blog post, I’ll start with a broad definition of their architecture, and then explain what makes them so popular with the NLP community. Finally, I’ll list a collection of blog posts, tutorials, research papers, and frequently asked questions to help you discover the different flavours of RNNs.

An RNN can be seen as a chain of copies of the same network. Credits to Christopher Olah (2015)

Over the course of the last few years, recurrent architectures for neural networks have established themselves as the state of the art in several NLP tasks.

This breakthrough came a long time after this kind of architecture was first proposed, around 35 years ago [John Hopfield, Neural networks and physical systems with emergent collective computational abilities, 1982]. Modern architectures appeared roughly 15 years later [Sepp Hochreiter, Jürgen Schmidhuber, Long Short-Term Memory, 1997].

The main advantage of RNNs lies in their ability to deal with sequential data, thanks to their “memory”. Whereas feedforward Artificial Neural Networks (ANNs) have no notion of time and only consider the current example they are being fed, RNNs consider both the current input and a “context unit” built from what they have seen previously.

So the prediction made by the network at timestep T is influenced by the one it made at timestep T-1. When you think about it, that’s pretty much what we do as humans: we use our previous experience (T-1) to handle new and unseen things (T).
Christopher Olah puts it very nicely in his blog post, Understanding LSTM Networks:

“As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again.” — Christopher Olah

An early diagram of a recurrent unit; notice the context units. Credits: Jeffrey Elman, Finding Structure in Time, 1990
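
To make this recurrence concrete, here is a minimal sketch of an Elman-style recurrent cell in NumPy. The layer sizes, weight names, and toy sequence below are illustrative choices, not details from the original post:

```python
import numpy as np

# Illustrative sizes for a tiny Elman-style recurrent cell.
input_size, hidden_size = 4, 8
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # context -> hidden
b_h = np.zeros(hidden_size)

def step(x_t, h_prev):
    """One timestep: the new hidden state mixes the current input
    with the 'context unit' h_prev carried over from timestep T-1."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a toy sequence of 5 timesteps.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = step(x_t, h)  # the state at T is influenced by the state at T-1

print(h.shape)  # (8,)
```

The same `step` function is reused at every timestep, which is exactly the “chain of copies of the same network” picture above.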

And luckily for us, NLP is full of sequential (or temporal) data. Be it sentences, words, or characters, we always use the context to establish a more precise meaning for communication, whether it is written or oral.

Here are a few examples:

  • In Machine Translation, a word will carry different meanings based on the context.
  • Sentiment Analysis will detect modifiers (like “very”, “not”, and “a bit too”) to grasp the intensity, polarity, or negation of a sentiment.
  • In Dialog Management, the next step of a conversation is conditioned by the previous interactions and the goal given to the system.
  • For Tokenization, we can use the next and previous characters to say whether or not a new word is beginning.

It doesn’t stop there: Part of Speech Tagging, Sentence Segmentation, Language Modeling, Semantic Role Labelling, Text Summarization, Spell Checking, and a whole lot of other tasks rely on the sequential nature of the data.

Google’s Neural Machine Translation uses a deep LSTM architecture with 8 encoder and 8 decoder layers, using both attention and residual connections. Credit to Quoc Le, Mike Schuster.

But RNNs are not perfect: because each timestep’s computation needs the result of the previous one, they are hard to parallelize, slow to train, and computationally expensive. Today, more and more researchers are turning to Convolutional Neural Networks (CNNs), which offer speed and accuracy improvements on many tasks.
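
To see why that sequential dependency matters in practice, here is a rough sketch (assuming PyTorch; the shapes and layer sizes are illustrative) contrasting an LSTM, which loops over timesteps internally, with a 1D convolution that processes the whole sequence in a single parallel pass:

```python
import torch
import torch.nn as nn

batch, seq_len, features = 2, 50, 16
x = torch.randn(batch, seq_len, features)

# Recurrent path: each step needs the previous hidden state, so the
# computation along the time axis is inherently sequential.
lstm = nn.LSTM(input_size=features, hidden_size=32, batch_first=True)
out_rnn, _ = lstm(x)  # internally loops over the 50 timesteps

# Convolutional path: the whole sequence is processed at once; each output
# position only looks at a fixed local window of inputs.
conv = nn.Conv1d(in_channels=features, out_channels=32, kernel_size=3, padding=1)
out_cnn = conv(x.transpose(1, 2))  # Conv1d expects (batch, channels, time)

print(out_rnn.shape, out_cnn.shape)  # [2, 50, 32] and [2, 32, 50]
```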

An LSTM with an attention layer will yield state-of-the-art results on any task.

Still, the phrase “an LSTM with an attention layer will yield state-of-the-art results on any task” should not be forgotten, and recurrent architectures will populate user-facing NLP systems and benchmark baselines for a long time to come.
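
For readers wondering what that attention mechanism looks like, here is a minimal sketch of dot-product attention over a sequence of encoder states, in NumPy. Real systems such as GNMT use learned, more elaborate scoring, so treat this purely as an illustration:

```python
import numpy as np

def attention(query, encoder_states):
    """query: (hidden,), encoder_states: (timesteps, hidden)."""
    scores = encoder_states @ query            # one score per timestep
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over timesteps
    return weights @ encoder_states, weights   # weighted sum = context vector

rng = np.random.default_rng(1)
states = rng.normal(size=(6, 8))   # 6 timesteps of a hypothetical encoder
query = rng.normal(size=8)         # current decoder state
context, weights = attention(query, states)
print(context.shape, weights.round(2))
```

The resulting weights tell the decoder which encoder timesteps to focus on when producing the next output, instead of squeezing the whole sentence into a single fixed vector.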

Blogs

Introductions

Studies

Tutorials

Research

Surveys

Theses

Papers

FAQ

How are recurrent neural networks different from convolutional neural networks?

What is the difference between Recurrent Neural Networks and Recursive Neural Networks?

What is the difference between LSTM and GRU for RNNs?

What’s so great about LSTMs?

What is masking in a Recurrent Neural Network?

What is the attention mechanism introduced in RNNs?

Is LSTM Turing complete?

When should one decide to use an LSTM in a Neural Network?

Why doesn’t the LSTM forget gate cause a vanishing/dying gradient?

What is the difference between states and outputs in an LSTM?

Is it possible to do online learning with LSTMs?

Originally published at Recast.AI Blog.

