NLP must-reads

Musings after a crazy reading spree

Rowen Lee · Nurture.AI · Jun 8, 2018

Language is the blood of the soul into which thoughts run and out of which they grow.

‒ Oliver Wendell Holmes

Have you ever wondered how Google Translate or Siri works? How do machines understand what we say? Questions like these fall under the field of Natural Language Processing (NLP), which is one of the most exciting areas of research today. In this post I will share the common AI architectures used in NLP tasks.

Whenever AI is mentioned, what comes to mind is often some evil-looking Terminator trying to destroy the world. But if you set aside all the mess that comes from deployment and data cleaning, AI models are essentially predictive models that run on mathematics. Being mathematical models, they require mathematical inputs. Hence, it makes no sense to feed in a string of words and expect the model to understand it.

AI models mainly run on math.

There is an entire field dedicated to finding the right mathematical representation of words (but then again, how do you define “right”?), which I will discuss in the next section.

Representation of words

The use of word representations… has become a key “secret sauce” for the success of many NLP systems in recent years… ‒ Luong et al. (2013)

A common approach is to convert a word to a vector (a list of numbers); according to popular opinion, a good word representation is one that takes into account syntax, semantics and computational requirements. A distributed representation attempts to achieve this, as discussed in this paper:

A Neural Probabilistic Language Model

What is a distributed representation? A good way to gain intuition for it is to compare it with a sparse representation:

Sparse representation (left) versus distributed representation (right). Image taken from this Quora post.

Why a distributed representation? Using a sparse representation for words (such as one-hot encoding) in a corpus leads to the curse of dimensionality, i.e. the dimension of the word vector grows with the vocabulary size (as a comparison, think of how the diagram above would look if we had more patterns). A distributed representation is therefore preferable.
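
To make the contrast concrete, here is a tiny sketch (the toy vocabulary and the random embedding values are made up purely for illustration): the one-hot vector grows with the vocabulary, while the distributed representation keeps a fixed, small dimension.

```python
import numpy as np

# Toy vocabulary, made up for illustration
vocab = ["cat", "dog", "apple", "banana"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Sparse (one-hot) representation: its dimension equals the vocabulary size
def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

# Distributed representation: a fixed-size dense vector per word.
# These would normally be learned; random values here just to show the shape.
embedding_dim = 3
embeddings = np.random.randn(len(vocab), embedding_dim)

def embed(word):
    return embeddings[word_to_idx[word]]

print(one_hot("cat"))  # length 4, mostly zeros, grows with the vocabulary
print(embed("cat"))    # length 3, dense, independent of the vocabulary size
```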

Efficient Estimation of Word Representations in Vector Space

Introduces the word2vec model. It uses a two-layer neural net that converts text into vectors, either by predicting a word from the context words that surround it (continuous bag of words, CBOW) or by predicting the context words from a given word (skip-gram).

Image taken from paper
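
As a rough sketch of how this looks in practice, here is word2vec trained with the gensim library on a made-up two-sentence corpus (assuming gensim 4.x, where the embedding size parameter is called vector_size); the sg flag switches between CBOW and skip-gram.

```python
from gensim.models import Word2Vec

# Made-up toy corpus: a list of tokenised sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict a word from its context); sg=1 -> skip-gram (predict the context from a word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]             # a 50-dimensional word vector
print(model.wv.most_similar("cat"))  # nearest neighbours in the learned vector space
```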

GloVe: Global Vectors for Word Representation — Stanford NLP

A global approach. A vector representation of words that takes into account word-word co-occurrence probabilities over the entire corpus. It overcomes a weakness of word2vec models, which only consider neighbouring words and do not take the statistics of the entire corpus into account.
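
A minimal sketch of GloVe's starting point, using a made-up corpus and window size: the model begins from global co-occurrence counts like these and then fits word vectors whose dot products approximate the log counts.

```python
from collections import Counter

# Made-up corpus; GloVe starts from global word-word co-occurrence counts over the whole corpus
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

cooccur = Counter()
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooccur[(word, corpus[j])] += 1

print(cooccur[("cat", "sat")])
# GloVe then learns vectors w_i, w_j and biases b_i, b_j such that
# w_i . w_j + b_i + b_j ≈ log(X_ij), with a weighting that down-weights rare pairs.
```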

Now that we have a mathematical representation of words, what do we feed it to? Enter the RNN.

The RNN

Language is sequential, hence it is no surprise that commonly used models in NLP problems are of a similar nature. A popular sequential model is the RNN, which Andrej Karpathy called “magical” in his blog post. You can think of it as a machine that is fed one word at a time. It remembers all the words you have previously fed it in a “memory” (sometimes called the hidden vector).
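
Before the takeaways, here is what a single step of that process looks like as a numpy sketch (all sizes and values are arbitrary): the hidden vector is updated from its previous value and the current word vector.

```python
import numpy as np

# A minimal vanilla-RNN step
hidden_size, embed_size = 8, 4
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # memory-to-memory weights
W_xh = np.random.randn(hidden_size, embed_size) * 0.1   # input-to-memory weights

def rnn_step(h, x):
    # The "memory" (hidden vector) is updated from its previous value and the current word vector
    return np.tanh(W_hh @ h + W_xh @ x)

h = np.zeros(hidden_size)
for word_vector in np.random.randn(5, embed_size):  # feed five (made-up) word vectors, one at a time
    h = rnn_step(h, word_vector)
print(h)  # a summary of everything the RNN has "read" so far
```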

Here are my main takeaways from the article:

  • The main problem with Vanilla Neural Networks and Convolutional Networks is that their input and output vectors (words) must be of fixed length.
  • Recurrent nets can be used when the input and output sizes are not fixed. Think of language translation tasks, where the length of the sentences to be translated varies.
ili is a portable translating device designed to help travellers navigate a foreign country. All you have to do is speak into its microphone and it will output your speech in another language, taking as little as 0.2 seconds.
  • A limitation of the RNN is that it can be computationally expensive if it has too much “memory”, i.e. if the size of the hidden vector is too large. One attempt to solve this is the Neural Turing Machine (NTM), which has an external memory system that it can read from and write to at specific locations. The NTM is shown to outperform a vanilla LSTM in tasks involving copying and sorting data sequences. However, I don’t hear much about people using NTMs in language modelling tasks. Tell me if you don’t agree.

Language Translation Tasks

Neural Machine Translation by Jointly Learning to Align and Translate

What is Neural Machine Translation? The use of neural networks for translation tasks. A popular approach is the use of an encoder and decoder system, as discussed in this (Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation) paper. However, this system struggles when the input sequence is too long, as the encoder has to compress too much information into a single fixed-length vector.

Encoder and decoder in action, using the Red Queen’s favourite quote. Dark purple circles are outputs of the decoder. They are formed from outputs of the encoder (light purple circles) and previous decoder outputs. Arrows for only the first two words are shown for simplicity.

Main insight. In language translation tasks, there is an “alignment” between an input word and an output word. Concretely, each translated word is more related to certain words in the input text.

Novel technique introduced in this paper. Introduces a new model called RNNSearch, in which the decoder has an attention mechanism. This relieves the encoder from having to store all of the information in the source sentence.
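
A stripped-down sketch of one attention step (numpy, with made-up sizes): the paper scores each encoder output against the current decoder state with a small learned alignment network; a plain dot product is used here only to keep the idea visible.

```python
import numpy as np

def attend(decoder_state, encoder_outputs):
    scores = encoder_outputs @ decoder_state          # one alignment score per source word
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
    context = weights @ encoder_outputs               # weighted sum of the encoder outputs
    return context, weights

encoder_outputs = np.random.randn(6, 16)  # 6 source words, hidden size 16 (made up)
decoder_state = np.random.randn(16)
context, weights = attend(decoder_state, encoder_outputs)
print(weights)  # which source words the decoder "looks at" for the current output word
```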

Issues remaining. This paper does not address scenarios with unknown or rare words.

Translating between French and English. A lighter colour indicates a closer relationship between the word pair. Image taken from paper.

Neural Machine Translation with Recurrent Attention Modeling

Novel technique introduced in this paper. Improves the model introduced in the paper above by taking into account the attention history of each word.

Results. The improved model outperforms RNNSearch in English-German and Chinese-English translation tasks.

Effective approaches to attention-based neural machine translation

What is new. Introduces two attention-based models: one with a global approach, where all source words are attended to; the other with a local approach, where only a subset of the source words is considered at a time.

How does it compare to existing models? The global attention is similar to the one used in the RNNSearch model, but with some simplifications. The local attention is a mix of the soft and hard attention models discussed in this (Show, Attend and Tell: Neural Image Caption Generation with Visual Attention) paper.

Attention is All You Need

What is different this time? Typical Neural Machine Translation models use the RNN as the underlying framework of the encoder-decoder system. This approach has two main problems: (1) parallel processing is impossible due to the sequential nature of RNNs; (2) RNNs struggle to model long-term dependencies.

Key insight. Translation is more than just mapping one word to another. There are multiple relationships we need to be aware of, i.e. dependencies (relationships) (1) among input words, (2) between input and output words, and (3) among output words. Instead of learning these dependencies through the latent state of an encoder, we can do so via an attention mechanism.

A new type of attention. This paper introduces the transformer model, which replaces the RNN with multi-head attention, consisting of multiple attention layers running in parallel.
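
The core of each head is scaled dot-product attention. Below is a bare-bones numpy sketch with made-up shapes; the real model adds learned projections for the queries, keys and values, masking, and several heads in parallel.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity between queries and keys
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                  # attention-weighted sum of the values

Q = np.random.randn(5, 64)   # 5 output positions (made up)
K = np.random.randn(7, 64)   # 7 input positions
V = np.random.randn(7, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 64)
```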

Why did it receive so much attention? This is the first sequence mapping model based entirely on attention. Due to the absence of recurrent layers, it trains significantly faster and outperforms even all previously reported ensembles.

From translator to summariser. The transformer model can be tweaked to summarise Wikipedia articles (Generating Wikipedia by Summarizing Long Sequences).

Additional resources. An animation and explanation of the transformer can be found here.

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Motivation and goal. Typical Neural Machine Translation models are not scalable to large datasets, lack robustness and are difficult to deploy. This translation model aims to overcome these issues and to outperform the current state of the art in translation quality.

The GNMT. Taken from Google’s research blog.

Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

How did Google improve the previous model? GNMT, the previous state of the art introduced by Google, is difficult to scale to many languages. By simply adding a token to the input sentence that specifies the target language, the model is able to achieve zero-shot translation.

What is zero-shot translation? The ability to translate between language pairs the model has never seen before. For example, if the model is trained on Japanese⇄English and Korean⇄English examples, it can also perform Korean⇄Japanese translation at test time.
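
The mechanism itself is almost trivially simple; it amounts to something like the sketch below (the token format here is illustrative rather than the exact one used in the paper).

```python
# Prepend an artificial token naming the target language to the source sentence;
# the single multilingual model reads it and translates accordingly.
def add_target_token(sentence, target_lang):
    return f"<2{target_lang}> {sentence}"

print(add_target_token("How are you?", "ja"))  # "<2ja> How are you?" -> translate into Japanese
print(add_target_token("How are you?", "ko"))  # same model, different target language
```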

Maybe… There is a “universal language representation” discovered by the model?

Achieving Human Parity on Automatic Chinese to English News Translation (2018)

Main contributions. Published by Microsoft AI & Research, this paper: (1) introduces a new way to measure translation quality, i.e. human parity; (2) describes the techniques used to achieve state-of-the-art results on a Chinese-to-English translation task.

Potential limitation. The techniques described might not be applicable to other language pairs.

Sentiment Analysis

Learning to Generate Reviews and Discovering Sentiment

What it is. The authors first use unsupervised learning to create a representation of review texts. They then found that a single unit of the learned representation conveys sentiment fairly accurately. Hence, the trained model can be adapted into a sentiment classifier.

Similar approaches. Introducing “pretraining” steps for supervised learning problems is not new; see this paper (Semi-supervised Sequence Learning). Alternatively, one could optimise word embeddings such that they capture sentiment information, as discussed in this (Refining Word Embeddings for Sentiment Analysis) paper.

The catch. The learned representations are sensitive to the data distribution they are trained on. For example, we might not be able to replicate the experiment using different training data.

Researchers found a unit of a text representation that conveys sentiment quite accurately. Green highlights indicate a positive sentiment, red indicates negative. Image taken from paper.
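
As a loose sketch of the transfer step (the encode function below is a random stand-in for the byte-level language model trained in the paper, included only so the snippet runs): reviews are mapped to hidden-state vectors and a simple classifier is fitted on top; in the paper, a single unit of that vector turns out to do most of the work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(review, hidden_size=64):
    # Placeholder for the trained model's final hidden state (random features, illustration only)
    rng = np.random.default_rng(abs(hash(review)) % (2**32))
    return rng.standard_normal(hidden_size)

reviews = ["great movie", "terrible plot", "loved it", "waste of time"]  # made-up examples
labels = [1, 0, 1, 0]
features = np.stack([encode(r) for r in reviews])

clf = LogisticRegression().fit(features, labels)  # with real features, one unit alone carries most of the signal
print(clf.predict(features))
```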

Gradient problems

To understand a written article or a piece of audio, one has to connect present and past information, i.e. capture long-term dependencies. RNNs, trained with backpropagation, are the current state-of-the-art method for capturing these dependencies. However, this leads to vanishing and exploding gradient problems, especially for deep networks and long sequences. One solution is the LSTM, a more sophisticated variant of the RNN; however, it can become too complex to train. Therefore, most researchers stick to the RNN, but with some smart training tweaks:

  • Truncate backpropagation and add an auxiliary unsupervised loss
  • Use Recurrent Highway Networks, a variant of the RNN that incorporates highway layers, which enable the training of very deep networks
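
The first tweak fits in a few lines of PyTorch. The sketch below uses made-up sizes and a dummy loss: the hidden state is detached between chunks, so backpropagation is truncated to a fixed number of steps; gradient clipping (a common companion tweak, not from the list above) is shown alongside it.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=32, batch_first=True)
optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-3)

data = torch.randn(1, 100, 10)        # one long (random) sequence of 100 steps
hidden = None
for chunk in data.split(25, dim=1):   # backpropagate through 25 steps at a time
    output, hidden = rnn(chunk, hidden)
    loss = output.pow(2).mean()       # dummy loss, just to produce gradients
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)  # tame exploding gradients
    optimizer.step()
    hidden = hidden.detach()          # truncate: gradients will not flow past this point
```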

Maybe RNNs aren’t that good after all?

Training RNNs is difficult, as they have demanding memory and computational requirements. There has been a growing line of thought that advocates the use of CNNs or attention over RNN models:

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

This paper is a systematic evaluation of recurrent and convolutional architectures, which are commonly used in sequence modelling problems. It demonstrates that the latter outperform the former, and concludes that the default use of recurrent networks for sequence modelling tasks should be reconsidered.
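
The convolutional alternative can be sketched in a few lines of PyTorch (made-up sizes): a causal, dilated 1-D convolution in the spirit of the temporal convolutional networks evaluated in the paper. Each output position only sees the present and the past, and all time steps are computed in parallel, unlike an RNN.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation       # pad the left so the output cannot see the future
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=self.pad, dilation=dilation)

    def forward(self, x):                             # x: (batch, channels, time)
        out = self.conv(x)
        return out[:, :, :-self.pad]                  # trim the right side -> strictly causal

x = torch.randn(2, 16, 50)                            # batch of 2, 16 channels, 50 time steps
print(CausalConv1d(16, dilation=2)(x).shape)          # torch.Size([2, 16, 50])
```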

Very Deep Convolutional Networks for Text Classification
Introduces a new architecture for NLP problems that consists of many convolution and max-pooling layers. It emphasises that architecture depth contributes to better results in NLP problems.

Concluding Thoughts

All models are wrong; some models are useful.

- George E. P. Box

Human language is sophisticated.

Unlike images, which can be represented as plain numeric pixels, language is more than just squiggly markings assembled on paper or read aloud. Somewhere between the intertwining characters and words there is a subtle hint of emotion, history and culture, which mathematical models may be a little too simple to understand (or do they really understand at all?).

It seems it might take some time before an advanced chatbot (like Samantha from the movie Her) becomes a reality.

The movie “Her” is about a lonely man who falls for an OS with a female voice interface (fancy falling in love with an LSTM, anyone?).

Did I miss any good papers on NLP? Let me know in the comments below.

Rowen is a research fellow at nurture.ai. Follow her on twitter.
