Sam Havens
Nov 6, 2018 · 11 min read

Interesting things are happening very quickly in the field of Natural Language Processing. To help me process them, and to try to be of use to the community, I will try to summarize them here. The target audience for this post is machine learning researchers and practitioners with some familiarity with NLP. Most of the work I reference is from 2017–18, so I think “recent” is a fair characterization.

This article will focus on two topics: fine-tunable language models and question answering systems. There is much more going on in NLP, but these are the fields that I am able to talk about. Also, this article could honestly be broken into two already. An example of a topic that should be covered in an article with this title, but isn’t, is Multi-Task Learning.

Fine-Tunable Language Models

As others have noted, a huge development in 2018 has been the introduction of language models which can be pre-trained on a large, unlabelled corpus (for example, all of Wikipedia for a given language) and then fine-tuned for specific domains. In a sense these are similar to word2vec and GloVe, but whereas those are embeddings (essentially a single matrix), these projects are all full-blown models.

As Jeremy Howard says on the blog,

Very simple transfer learning using just a single layer of weights (known as embeddings) has been extremely popular for some years, such as the word2vec embeddings from Google. However, full neural networks in practice contain many layers, so only using transfer learning for a single layer was clearly just scratching the surface of what’s possible…

This idea has been tried before, but required millions of documents for adequate performance. We found that we could do a lot better by being smarter about how we fine-tune our language model. In particular, we found that if we carefully control how fast our model learns and update the pre-trained model so that it does not forget what it has previously learned, the model can adapt a lot better to a new dataset. One thing that we were particularly excited to find is that the model can learn well even from a limited number of examples. On one text classification dataset with two classes, we found that training our approach with only 100 labeled examples (and giving it access to about 50,000 unlabeled examples), we were able to achieve the same performance as training a model from scratch with 10,000 labeled examples.

Models of this type include:

ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.

GPT (OpenAI): We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well.

ULMFiT (fast.ai): …Our new paper, which shows how to classify documents automatically with both higher accuracy and less data requirements than previous approaches. We’ll explain in simple terms: natural language processing; text classification; transfer learning; language modeling; and how our approach brings these ideas together. If you’re already familiar with NLP and deep learning, you’ll probably want to jump over to our NLP classification page for technical links.

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

The most recent and best performing of these is BERT, which is implemented with Transformers (see “Attention Is All You Need,” “The Annotated Transformer,” and “The Illustrated Transformer” to get an understanding of attention and Transformers). GPT is also implemented with Transformers, whereas ULMFiT and ELMo use AWD-LSTMs and biLSTMs respectively.

From the BERT paper,

What’s exciting is that each of these models came out within the past year and each set new records for multiple different tasks, including question answering, text classification, paraphrasing, and entailment.

Why LSTMs and attention? LSTMs are built to carry a persistent state (memory) through a sequence, modifying it at each step by forgetting some of the memory and adding to it. That makes them useful for almost any sequential data (I’ve seen papers on everything from language, to predicting music, to predicting whether someone will be readmitted to the hospital based on their chart data). In contrast, Transformers are built around attention (hence the name of the paper, “Attention Is All You Need”). Transformers weren’t the first attention-based models, but they are clearly very good (given BERT’s improved performance over LSTM-based models like ELMo and ULMFiT). One nice property they have is that many of the operations they perform can be done in parallel (as opposed to in sequence, as with LSTMs).
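To make the parallelism point a bit more concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation in the Transformer paper. The function names and the toy two-dimensional "token" vectors are my own assumptions for illustration; a real Transformer adds learned projections, multiple heads, and masking, and does all of this as batched matrix multiplications.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    # Each query is handled independently of the others. Nothing here depends
    # on the result for the previous position, unlike an LSTM step, so on real
    # hardware this loop becomes one matrix multiplication.
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: the same vectors act as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one position attends to every other position; the rows are computed without reference to one another, which is the property that makes the architecture easy to parallelize.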

A good, intuitive explanation of LSTMs can be found in Christopher Olah’s blog post, Understanding LSTM Networks. This blog post is (I believe) the source of the ubiquitous image that is used when explaining LSTM architecture:


Question Answering

Though mentioned in the section above, question answering deserves its own section. Question Answering systems are undergoing a shift (the same shift that most branches of AI are going through) from a pipeline of engineered features to end-to-end deep neural networks.

Two researchers to keep an eye on in this field, both out of Microsoft Research, are Scott Wen-tau Yih (now at the Allen Institute for AI) and Jianfeng Gao. There are probably many others, and my omission of them reflects my limited bandwidth, not their accomplishments.

Broadly, there are two subfields in this area, determined by the source of truth for answers: structured and unstructured data. Structured data may be a knowledge base, a SQL database, or one or more tables; unstructured data is usually plain text, though in the case of visual question answering it can be images. (I’m not sure if there are visual question answering systems that accept a series of images, i.e. video, as input, but if not, I’d bet that someone is working on it. They are probably using LSTMs or some attention mechanism, due to the time-dependent nature of video, which is a theme of this article.) Approaches vary between the two subfields. The most famous benchmark using unstructured text as a source of truth is SQuAD, the Stanford Question Answering Dataset. There doesn’t seem to be as established a benchmark in structured question answering, but Spider, out of Yale, looks promising.

Question Answering with Structured Data

Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base is a reference for QA with an underlying knowledge base. A more recent work, Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access, released an implementation, KB-InfoBot, which by default uses a knowledge base of IMDB data.

In Search-based Neural Structured Learning for Sequential Question Answering, the backing data is semi-structured — tables from Wikipedia. More on this work later.

Of course, our favorite data format is SQL. And, if you have data in a SQL database, there is work being done to translate natural language questions into SQL. Spider is “a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students.” This feels like it has the potential to kill an entire class of applications:

Some examples from the Spider project
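To give a feel for the task, here is a small sketch of the kind of (question, SQL) pair a text-to-SQL system has to produce. The table, rows, question, and query below are all invented for illustration and are not taken from Spider; the SQL is written by hand, standing in for what a model would predict.

```python
import sqlite3

# Hypothetical toy database; Spider's real schemas span many tables and domains.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cars (id INTEGER PRIMARY KEY, make TEXT, year INTEGER, price REAL);
    INSERT INTO cars (make, year, price) VALUES
        ('Honda', 2015, 12000),
        ('Honda', 2018, 19000),
        ('Ford', 2017, 15000);
    """
)

# The natural language input, and a hand-written stand-in for the model output.
question = "What is the average price of Honda cars?"
predicted_sql = "SELECT AVG(price) FROM cars WHERE make = 'Honda'"

# A common way to judge a prediction is to execute it and check the result.
(average_price,) = conn.execute(predicted_sql).fetchone()
print(question, "->", average_price)
```

Running the predicted query against the database is what makes this setting attractive: the answer comes from the data itself, and the model only has to get the query right.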

Question Answering with Unstructured Data — Machine Reading Comprehension

In addition to SQuAD1.1, which is the go-to reading comprehension data set, this year saw the release of SQuAD2.0, CoQA, QuAC, MS MARCO, and DuoRC. At a high level, each entry in these data sets involves a short passage of text, such as

This is the story of a young girl and her dog. The young girl and her dog set out a trip into the woods one day. Upon entering the woods the girl and her dog found that the woods were dark and cold. The girl was a little scared and was thinking of turning back, but yet they went on. The girl's dog was acting very interested in what was in the bushes up ahead. To both the girl and the dog's surprise, there was a small brown bear resting in the bushes. The bear was not surprised and did not seem at all interested in the girl and her dog. The bear looked up at the girl and it was almost as if he was smiling at her. He then rested his head on his bear paws and went back to sleep. The girl and the dog kept walking and finally made it out of the woods. To this day the girl does not know why the bear was so friendly and to this day she has never told anyone about the meeting with the bear in the woods.

Many passages are more interesting than the above example, though. Questions are colocated with the corpus, as are acceptable answers. You can read a comparison of CoQA, SQuAD2.0, and QuAC if you are interested in the details, but roughly, the new challenges these data sets introduce fall into three categories:

  1. Unanswerable questions, that is, questions where the correct answer is “answer not available in corpus.”
  2. Multi-turn interactions (the next section of this article)
  3. Abstractive answers — questions where the correct answer can be inferred, but not directly extracted, from the corpus
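Here is a small Python sketch of the first and third categories (multi-turn interactions get their own section), using two sentences from the story above. The QAExample class, the answer_kind helper, and the questions and answers are all hypothetical, invented for illustration; none of the data sets store their examples this way, and the substring check is only a crude stand-in for the span annotations the real data sets provide.

```python
from dataclasses import dataclass
from typing import Optional

PASSAGE = (
    "To both the girl and the dog's surprise, there was a small brown bear "
    "resting in the bushes. The bear was not surprised and did not seem at "
    "all interested in the girl and her dog."
)

@dataclass
class QAExample:
    question: str
    answer: Optional[str]  # None stands for "answer not available in corpus"

def answer_kind(passage: str, example: QAExample) -> str:
    """Classify an example by how its answer relates to the passage."""
    if example.answer is None:
        return "unanswerable"
    if example.answer in passage:
        return "extractive"   # the answer is a literal span of the passage
    return "abstractive"      # the answer has to be inferred

examples = [
    QAExample("What was resting in the bushes?", "a small brown bear"),
    QAExample("What was the dog's name?", None),
    QAExample("Was the bear afraid of the girl?", "No"),
]

kinds = [answer_kind(PASSAGE, ex) for ex in examples]
for ex, kind in zip(examples, kinds):
    print(f"{kind:12} {ex.question}")
```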

From the comparison paper:

The coverage of unanswerable questions is complementary among datasets; SQuAD 2.0 covers all types of unanswerable questions present in other datasets, but focuses more on questions of extreme confusion, such as false premise questions, while QuAC primarily focuses on missing information. QuAC and CoQA dialogs simulate different types of user behavior: QuAC dialogs often switch topics while CoQA dialogs include more queries for details and cover twice as many sentences in the context as QuAC dialogs. Unfortunately, no dataset provides significant coverage of abstractive answers beyond yes/no answers, and we show that a method can achieve an extractive answer upper bound of 100 and 97.8 F1 on QuAC and CoQA , respectively.
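The F1 numbers in that quote are word-overlap scores between a predicted answer and a reference answer. Below is a simplified sketch of that kind of metric. The function name and example strings are my own, and the official evaluation scripts also normalize the text first (stripping punctuation and articles, for instance), which I skip here.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: a token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1("a small brown bear", "small brown bear")
print(round(score, 3))
```

An "extractive upper bound" is then the best score of this kind that a system could reach if it were only allowed to copy spans out of the passage.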

Some notable architectures designed specifically for question answering include FlowQA, SAN (Stochastic Answer Networks), and Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering. However, as noted above, general-purpose architectures like BERT seem to outperform them, or at least perform competitively, without any task-specific architecture.

Sequential Question Answering

For building dialogue systems that people will actually use, sequences of simple questions are much more relevant than complex, one-off questions. As the authors write in the paper introducing QuAC,

QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation.

This is the beginning of quantifying exactly what about human dialogue is so hard for computers to emulate. Compare with the CoQA data set to get a feel for the similar direction the two groups of authors have converged on:

The unique features of CoQA include 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA has a lot of challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

Here is a chart from the QuAC paper contrasting the various question answering data sets:

Side note: ROUGE and BLEU

You see references to ROUGE and BLEU frequently in NLG (natural language generation) and NMT (neural machine translation) tasks. Often those references are dismissive, as in Re-evaluating the Role of BLEU in Machine Translation Research:

We argue that the machine translation community is overly reliant on the Bleu machine translation evaluation metric. We show that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and give two significant counterexamples to Bleu’s correlation with human judgments of quality.

However, despite their shortcomings, they are certainly prevalent. This Stack Overflow answer gave a concise and memorable explanation of the two:

Bleu measures precision: how much the words (and/or n-grams) in the machine generated summaries appeared in the human reference summaries.

Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appeared in the machine generated summaries.
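The precision/recall framing is easy to see in code. This is only a sketch of the n-gram overlap idea behind both metrics, with a function name and example sentences of my own choosing; full BLEU combines several n-gram orders and adds a brevity penalty, and ROUGE comes in several variants (ROUGE-N, ROUGE-L, and so on).

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_precision_recall(candidate: str, reference: str, n: int = 1):
    """Clipped n-gram overlap, reported as (precision, recall)."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Each n-gram counts at most as many times as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-style view
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-style view
    return precision, recall

candidate = "the cat sat on the mat"
reference = "the cat was sitting on the mat"
precision, recall = overlap_precision_recall(candidate, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The overlap count is the same in both cases; the only difference is whether you divide by the length of the machine output (precision) or the length of the human reference (recall).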

Distant Supervision — Babble Labble

Honestly, just watch the video. If you don’t think this is the future, tell me what is!


If I were to bet on something for the coming year, it would be that we will see more benchmarks set by architectures using attention in place of LSTMs. This is not an original position; see The Fall of RNN/LSTM. I also expect other techniques in the style of Babble Labble. Since a bottleneck for deep learning models is labelled data, I would expect high demand for anything that sidesteps that need.

Also, the releases of all the question answering data sets (Spider, SQuAD2.0, QuAC, CoQA, MS MARCO, and DuoRC) are a great sign for the future of question answering systems, which I expect to improve substantially in 2019. As the DuoRC authors note,

Given that there are already a few datasets for RC, a natural question to ask is “Do we really need any more datasets?”. We believe that the answer to this question is yes. Each new dataset brings in new challenges and contributes towards building better QA systems. It keeps researchers on their toes and prevents research from stagnating once state-of-the-art results are achieved on one dataset.

Another dynamic that has the potential to get interesting is theory catching up with practice. It is well known that machine learning using neural networks is an empirical field. It’s rare in the modern era for engineering to be this far ahead of the mathematical underpinnings. Measuring the Intrinsic Dimension of Objective Landscapes, which came out of Uber AI Labs this April, is a good example of what that catching up looks like. (If you haven’t seen it, the blog post introducing this paper is an amazing example of communicating research to the public.)


Thanks for reading this. If you disagree, please correct me in the comments or on Twitter. If you liked this, please consider sharing this article or following me on Medium/Twitter.

If you think that this technology has the potential to be disruptive to the automotive industry, check out (I’m the CTO).

I’d like to thank Nahid Alam for the encouragement to write this, and moonagedaydream for the edits.
