A History of NLP Breakthroughs

Thomas Packer, Ph.D.
Published in TP on CAI
Oct 5, 2022 · 4 min read

This story is a rough draft. Check back later for the fully polished story.

Breakthroughs don’t happen overnight, nor do they happen in a vacuum. There are definite threads of research within the discipline and communities of natural language processing (NLP) that show how breakthroughs belong to themes, categories, and ongoing work toward bigger goals. It may come as a surprise to newcomers that the history of NLP did not start with deep learning or word embeddings. Nor should we limit NLP to language models; their success has been shown to rely on world models (even built-in ones).

Below, I will list the main NLP breakthroughs that I’m aware of, such as word embeddings and transformers, limiting myself to no more than roughly one breakthrough per year of NLP history. For each one, I will describe the basic intuitions behind the challenge, the solution, the applications, how it relates to other breakthroughs, the follow-on work that came after it, and the seminal paper associated with the breakthrough.

Photo by Ivana Cajina on Unsplash

I summarize the following breakthroughs and how they relate to each other:

N-gram Language Model

Problem

There are a number of language analysis and generation tasks that can be decomposed into a probability formula. One part of many of those formulas is the probability of a given text. How can we calculate the probability of a given piece of text?

Solution

Models that assign probabilities to sequences of words are called language models, or LMs. The simplest such model is the n-gram language model. It splits any sequence of text into pieces of n words each. The first n-1 words are the context, and the last word is assigned a probability given that context. By the chain rule of probability, multiplying together the conditional probability of each word in the sequence gives you the probability of the whole sequence.
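
As a concrete reference, here is the chain rule and the n-gram (Markov) approximation it relies on, written in my own notation rather than taken from any particular paper:

```latex
% Chain rule: the probability of a word sequence factors into conditional probabilities.
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})

% N-gram approximation: keep only the previous n-1 words as context.
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```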

Where do you get the probability of each n-gram? The fancy term is maximum likelihood estimation. That sounds as complicated as a huge parametric statistical model, but it’s actually just as simple as counting up occurrences of words, dividing counts to normalize them into probabilities, and storing them in a lookup table.
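
Here is a minimal sketch of that counting recipe for a bigram model (n = 2). The tiny corpus, function names, and lack of smoothing are my own simplifications for illustration:

```python
from collections import defaultdict

def train_bigram_lm(sentences):
    """Maximum likelihood estimation: count bigrams, then normalize into probabilities."""
    bigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for context, word in zip(tokens, tokens[1:]):
            bigram_counts[(context, word)] += 1
            context_counts[context] += 1
    # P(word | context) = count(context, word) / count(context), stored in a lookup table.
    return {bigram: count / context_counts[bigram[0]]
            for bigram, count in bigram_counts.items()}

def sequence_probability(lm, sentence):
    """Chain rule: multiply the conditional probability of each word given its context."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for context, word in zip(tokens, tokens[1:]):
        prob *= lm.get((context, word), 0.0)  # unseen bigrams get probability 0 (no smoothing)
    return prob

corpus = ["the cat sat", "the dog sat", "the cat ran"]
lm = train_bigram_lm(corpus)
print(sequence_probability(lm, "the cat sat"))  # 1.0 * (2/3) * (1/2) * 1.0 = 0.333...
```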

So simple. Imperfect? Definitely. But also surprisingly hard to beat with more complicated methods, as I found out myself during my master’s thesis research in computer science.

Papers and Pages

This chapter from Jurafsky on N-gram Language Models starts with a quote showing that reasonably fluent language generation has been around since the old days. And even back then, it generated fluent nonsense.

N-gram on Wikipedia

HMM

MEMM

CRF

TF-IDF

Papers and Pages

TF-IDF on Wikipedia.

LSA

Solution

Apply singular-value decomposition (a matrix factorization) to a word-by-document co-occurrence matrix built from a corpus; the rows of the resulting low-rank factors serve as word embeddings.
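
As a toy sketch of the idea (the five-word vocabulary and count matrix are made up, and a real implementation would typically weight the counts with TF-IDF or log-entropy first):

```python
import numpy as np

# Toy term-document count matrix: rows are words, columns are documents.
vocab = ["cat", "dog", "pet", "stock", "market"]
counts = np.array([
    [2, 1, 0, 0],  # cat
    [1, 2, 0, 0],  # dog
    [1, 1, 0, 0],  # pet
    [0, 0, 2, 1],  # stock
    [0, 0, 1, 2],  # market
], dtype=float)

# Truncated SVD: keep only the top-k singular values and vectors.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_embeddings = U[:, :k] * S[:k]  # each row is a k-dimensional word vector

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words that appear in similar documents end up close together in the latent space.
print(cosine(word_embeddings[0], word_embeddings[1]))  # cat vs. dog: high
print(cosine(word_embeddings[0], word_embeddings[3]))  # cat vs. stock: low
```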

Papers and Pages

Using Latent Semantic Analysis to Improve Access to Textual Information appears to be the first paper about LSA, published in 1988. Also available at dl.acm.org.

LSA was also patented in 1988 (US Patent 4,839,853) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called Latent Semantic Indexing (LSI).

Also see An Introduction to Latent Semantic Analysis and Latent Semantic Analysis on Wikipedia.

Neural Word Embedding

Problem

Most NLP tasks, from word sense disambiguation to text classification, treat words as a convenient and natural unit of input. But for a computer to make sense of words — and there are many of them — they must be reduced to a set of features, usually numbers. You could assign each word a unique ID, but that is a lot of IDs, and the IDs don’t tell you anything useful about the words. You could represent words by manually chosen features, but that is a lot of work and is bound to be of limited applicability, since the feature engineering has to be repeated for each application.

For neural nets and other ML models, it helps to represent words using a vector of numbers, often between 0.0 and 1.0. The easiest way to do that is one-hot encoding, where each dimension of a very long vector corresponds to one word in the vocabulary, so that the position of the lone 1.0 value indicates which word it is. But this is little better than assigning a unique ID to each word, because the representation tells you nothing meaningful about the word.
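
A quick illustration of why one-hot vectors carry no meaning, using a made-up five-word vocabulary:

```python
import numpy as np

# A tiny made-up vocabulary; real vocabularies run to tens or hundreds of thousands of words.
vocab = ["cat", "dog", "pet", "stock", "market"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Every pair of distinct words looks equally unrelated: the dot product is always 0,
# so the representation says nothing about which words are similar in meaning.
print(one_hot("cat") @ one_hot("dog"))    # 0.0
print(one_hot("cat") @ one_hot("stock"))  # 0.0
```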

Solution

LSTM

Attention

Transformer

Problem

Solution

Important Follow-on Work

Paper

Attention Is All You Need is the seminal transformer paper published at NeurIPS in 2017.

BERT

ChatGPT

Join the CAI Dialog on Slack at cai-dialog.slack.com

About TP on CAI

Other stories in TP on CAI you may like:

