Machine Learning Spotlight I: Investigating Recurrent Neural Networks
Recurrent Neural Networks (RNNs) quickly became the go-to neural network architecture for Natural Language Processing (NLP) tasks. In this blog post, I’ll start with a broad definition of their architecture, and then explain what makes them so popular with the NLP community. Finally, I’ll list a collection of blog posts, tutorials, research papers, and frequently asked questions to help you discover the different flavours of RNNs.
Over the last few years, recurrent neural network architectures have established themselves as the state of the art in several NLP tasks, including:
- Named Entity Recognition [Zhiheng Huang et al., Bidirectional LSTM-CRF Models for Sequence Tagging, 2015]
- Language Modeling [Stephen Merity et al., Regularizing and Optimizing LSTM Language Models, 2017]
- Machine Translation [Yonghui Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016].
This breakthrough came long after the first proposal of this kind of architecture, over 30 years ago [John Hopfield, Neural networks and physical systems with emergent collective computational abilities, 1982]. Modern architectures appeared roughly 15 years later [Sepp Hochreiter, Jürgen Schmidhuber, Long Short-Term Memory, 1997].
The main advantage of RNNs lies in their ability to deal with sequential data, thanks to their “memory”. Feedforward Artificial Neural Networks (ANNs) have no notion of time: the only input they consider is the current example they are fed. RNNs, in contrast, consider both the current input and a “context unit” built from what they have seen previously.
So the prediction made by the network at timestep T is influenced by the one it made at timestep T - 1. And when you think about it, that's pretty much what we do as humans: we use our previous experience (T - 1) to handle new and unseen things (T).
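To make the idea of a “context unit” concrete, here is a minimal NumPy sketch of a single recurrent step. The weight names (W_xh, W_hh), sizes, and random inputs are illustrative choices of mine, not taken from any of the resources below:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden: the "memory"
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One timestep: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                      # empty context before the sequence starts
sequence = rng.normal(size=(5, input_size))    # stand-in for 5 word embeddings
for x_t in sequence:
    h = rnn_step(x_t, h)                       # the state at T depends on everything up to T - 1
```

The single matrix W_hh is what carries information forward: at every step the previous state is fed back in, so the final h summarizes the whole sequence.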
Christopher Olah puts it very nicely in his blog post, Understanding LSTMs:
“As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again.” — Christopher Olah
And luckily for us, NLP is full of sequential (or temporal) data. Be it sentences, words, or characters, we always use the context to establish a more precise meaning for communication, whether it is written or oral.
Here are a few examples:
- In Machine Translation, a word will carry different meanings based on the context.
- Sentiment Analysis will detect modifiers (like “very”, “not”, and “a bit too”) to grasp the intensity, polarity, or negation of a sentiment.
- In Dialog Management, the next step of a conversation is conditioned by the previous interactions and the goal given to the system.
- In Tokenization, we can use the next and previous characters to say whether or not a new word is beginning.
It doesn’t stop there: Part of Speech Tagging, Sentence Segmentation, Language Modeling, Semantic Role Labelling, Text Summarization, Spell Checking, and a whole lot of other tasks rely on the sequential nature of the data.
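To show what one of these sequence-labelling setups looks like in practice, here is a minimal sketch of a bidirectional LSTM tagger in tf.keras. The vocabulary size, tag count, and layer sizes are placeholder values I chose for illustration, not recommendations from the resources below:

```python
import tensorflow as tf

vocab_size, num_tags = 10_000, 17  # hypothetical vocabulary and tag-set sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),                # variable-length sequences of token ids
    tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True),   # ids -> dense vectors, padding masked
    tf.keras.layers.Bidirectional(                               # read the sentence in both directions
        tf.keras.layers.LSTM(128, return_sequences=True)         # keep one hidden state per token
    ),
    tf.keras.layers.Dense(num_tags, activation="softmax"),       # a tag distribution for every token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Trained on padded token-id and tag-id arrays (e.g. with model.fit), this kind of model is a common baseline for tasks like Named Entity Recognition and Part of Speech Tagging.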
But RNNs are not perfect yet: because each timestep's computation needs the result of the previous one, they are slow to train and computationally expensive. Today, more and more researchers are using Convolutional Neural Networks (CNNs), because they offer speed and accuracy improvements in many tasks.
“An LSTM with an attention layer will yield state-of-the-art results on any task.”
Still, this phrase is not to be forgotten, and recurrent architectures will populate user-facing NLP systems and benchmark baselines for a long time.
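Since that attention layer comes up again and again in the resources below, here is a toy NumPy sketch of the core idea, dot-product attention over a sequence of hidden states. The shapes and random values are mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size = 5, 16

states = rng.normal(size=(seq_len, hidden_size))  # one encoder hidden state per timestep
query = rng.normal(size=hidden_size)              # e.g. the current decoder state

scores = states @ query                           # how relevant each timestep is to the query
weights = np.exp(scores - scores.max())
weights /= weights.sum()                          # softmax over the timesteps
context = weights @ states                        # attention-weighted summary of the whole sequence
```

Instead of relying only on the last hidden state, the model can look back at every timestep and weight them by relevance, which is what makes the combination with LSTMs so effective.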
Blogs
Introductions
- Denny Britz, Recurrent Neural Networks Tutorial Part 1: Introduction to RNNs, 2015
- Christopher Olah, Understanding LSTMs, 2015
- Rohan Kapur, Recurrent Neural Networks & LSTMs, 2017
- Maxim Kolomeychenko, Yuri Borisov, Evolution: from vanilla RNN to GRU & LSTMs, 2017
- Rafal Karczewski, Natural Language Processing in Artificial Intelligence is almost human-level accurate, 2017
- DL4J, A Beginner’s Guide to Recurrent Networks and LSTMs
Studies
- Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, 2015
- Yoav Goldberg, The unreasonable effectiveness of Character-level Language Models, 2015
- Christopher Olah, Shan Carter, Attention and Augmented Recurrent Neural Networks, 2016
- Aidan Gomez, Backpropagating an LSTM: A Numerical Example, 2016
- R2RT, Written Memories: Understanding, Deriving and Extending the LSTM, 2016
- R2RT, Non-Zero Initial States for Recurrent Neural Networks, 2016
- Leonard Blier, Attention Mechanism, 2016
- Edwin Chen, Exploring LSTMs, 2017
- Tigran Galstyan, Hrant Khachatrian, Interpreting neurons in an LSTM network, 2017
- Sebastian Ruder, Deep Learning for NLP Best Practices, 2017
- Arun Mallya, LSTM Forward and Backward Pass
Tutorials
- Denny Britz, Recurrent Neural Networks Tutorial, 2015
- Narek Hovsepyan, Hrant Khachatrian, Generating Constitution with recurrent neural networks, 2015
- Erik Hallström, How to build a Recurrent Neural Network in TensorFlow, 2016
- R2RT, Recurrent Neural Networks in Tensorflow I, 2016
- Tigran Galstyan et al., Automatic transliteration with LSTM, 2016
- Denny Britz, RNNs in Tensorflow, a Practical Guide and Undocumented Features, 2016
- TensorFlow, Recurrent Neural Networks, 2017
- Peter Roelants, How to implement a recurrent neural network
Research
Surveys
- Junyoung Chung et al., Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014
- Klaus Greff et al., LSTM: A Search Space Odyssey, 2015
- Zachary C. Lipton et al., A Critical Review of Recurrent Neural Networks for Sequence Learning, 2015
- Rafal Jozefowicz et al., An Empirical Exploration of Recurrent Network Architectures, 2015
- Andrej Karpathy et al., Visualizing and Understanding Recurrent Networks, 2015
- Wim De Mulder et al., A survey on the application of recurrent neural networks to statistical language modeling, 2015
Theses
- Felix Gers, Long Short-Term Memory in Recurrent Neural Networks, 2001
- Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks, 2008
- Tomas Mikolov, Statistical Language Models Based on Neural Networks, 2012
- Ilya Sutskever, Training Recurrent Neural Networks, 2013
- Richard Socher, Recursive Deep Learning for Natural Language Processing and Computer Vision, 2014
Papers
- John Hopfield, Neural networks and physical systems with emergent collective computational abilities, 1982
- Michael Jordan, Serial order: A parallel distributed processing approach, 1986
- Jeffrey Elman, Finding structure in time, 1990
- Paul Werbos, Backpropagation through time: what it does and how to do it, 1990
- Yoshua Bengio et al., Learning Long-Term Dependencies with Gradient Descent is Difficult, 1994
- Sepp Hochreiter, Jürgen Schmidhuber, Long Short-Term Memory, 1997
- Mike Schuster, Kuldip K. Paliwal, Bidirectional Recurrent Neural Networks, 1997
- Alex Graves et al., Multi-Dimensional Recurrent Neural Networks, 2007
- Richard Socher et al., Parsing Natural Scenes and Natural Language with Recursive Neural Networks, 2011
- Razvan Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2012
- Martin Sundermeyer et al., LSTM Neural Networks for Language Modeling, 2012
- Kyunghyun Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014
- Kyunghyun Cho et al., On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, 2014
- Ilya Sutskever et al., Sequence to Sequence Learning with Neural Networks, 2014
- David Krueger, Roland Memisevic, Regularizing RNNs by Stabilizing Activations, 2015
- Lifeng Shang et al., Neural Responding Machine for Short-Text Conversation, 2015
- Oriol Vinyals, Quoc Le, A Neural Conversational Model, 2015
- Xingxing Zhang et al., Top-down Tree Long Short-Term Memory Networks, 2015
- Kaisheng Yao et al., Attention with Intention for a Neural Network Conversation Model, 2015
- Zhiheng Huang et al., Bidirectional LSTM-CRF Models for Sequence Tagging, 2015
- James Bradbury et al., Quasi-Recurrent Neural Networks, 2016
- Sungjin Ahn et al., A Neural Knowledge Language Model, 2016
- Shalini Ghosh et al., Contextual LSTM (CLSTM) models for Large scale NLP tasks, 2016
- Yonghui Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016
- Julian Georg Zilly et al., Recurrent Highway Networks, 2016
- Kamil Rocki, Recurrent Memory Array Structures, 2016
- Junyoung Chung et al., Hierarchical Multiscale Recurrent Neural Networks, 2016
- Anjuli Kannan et al., Smart Reply: Automated Response Suggestion for Email, 2016
- Jason Weston, Dialog-based Language Learning, 2016
- Antoine Bordes et al., Learning End-to-End Goal-Oriented Dialog, 2016
- Denny Britz et al., Massive Exploration of Neural Machine Translation Architectures, 2017
- Nils Reimers, Iryna Gurevych, Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks, 2017
- Stephen Merity et al., Regularizing and Optimizing LSTM Language Models, 2017
- Tao Lei, Yu Zhang, Training RNNs as Fast as CNNs, 2017
FAQ
How are recurrent neural networks different from convolutional neural networks?
What is the difference between Recurrent Neural Networks and Recursive Neural Networks?
What is the difference between LSTM and GRU for RNNs?
What’s so great about LSTMs?
What is masking in a Recurrent Neural Network?
What is the attention mechanism introduced in RNNs?
Is LSTM Turing complete?
When should one decide to use an LSTM in a Neural Network?
Why doesn't the LSTM forget gate cause a vanishing/dying gradient?
What is the difference between states and outputs in an LSTM?
Is it possible to do online learning with LSTMs?
Originally published at Recast.AI Blog.