A Brief History of Natural Language Processing — Part 2

Antoine Louis
Jul 7, 2020

Natural language processing (NLP) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis (Liddy, 2001). The purpose of these techniques is to achieve human-like language processing for a range of tasks or applications. Although it has gained enormous interest in recent years, research in NLP has been going on for several decades, dating back to the late 1940s. This review divides its history into two main periods: NLP before (part 1) and during (part 2) the deep learning era.

(If you missed part 1, check NLP before the Deep Learning Era.)

Part 2 — NLP during the Deep Learning Era

The big stages of NLP in the deep learning era.

Starting in the 2000s, neural networks began to be used for language modeling, the task of predicting the next word in a text given the previous words. In 2003, Bengio et al. proposed the first neural language model, which consisted of a one-hidden-layer feed-forward neural network. They were also among the first to introduce what is now referred to as a word embedding, a real-valued word feature vector in R^d. More precisely, their model took as input vector representations of the n previous words, which were looked up in a table learned jointly with the model. The vectors were fed into a hidden layer, whose output was then provided to a softmax layer that predicted the next word of the sequence. Although classic feed-forward neural networks have been progressively replaced with recurrent neural networks (Mikolov et al., 2010) and long short-term memory networks (Graves, 2013) for language modeling, they remain competitive with recurrent architectures in some settings, the latter being affected by “catastrophic forgetting” (Daniluk et al., 2017). Furthermore, the general building blocks of Bengio et al.’s network are still found in most neural language and word embedding models today.
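The pipeline described above — look up embeddings for the n previous words, pass them through a hidden layer, then a softmax over the vocabulary — can be sketched in a few lines of numpy. This is a minimal illustration, not Bengio et al.’s actual implementation; all sizes and names are arbitrary, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, n, h = 50, 8, 3, 16   # toy vocab size, embedding dim, context length, hidden units

# Learned look-up table: one d-dimensional vector per word (the "word embeddings")
C = rng.normal(0, 0.1, (V, d))
# Hidden-layer and output (softmax) layer weights
W_h = rng.normal(0, 0.1, (n * d, h))
b_h = np.zeros(h)
W_o = rng.normal(0, 0.1, (h, V))
b_o = np.zeros(V)

def next_word_probs(context_ids):
    """Probability of each vocabulary word following the n context words."""
    x = C[context_ids].reshape(-1)       # look up and concatenate n embeddings
    hidden = np.tanh(x @ W_h + b_h)      # the single hidden layer
    logits = hidden @ W_o + b_o
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

p = next_word_probs([3, 17, 42])         # a distribution over all V words
```

In a trained model, C, W_h, and W_o would be learned jointly by backpropagating the cross-entropy loss of the predicted next word.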

In 2008, Collobert and Weston applied multi-task learning, a sub-field of machine learning in which multiple learning tasks are solved at the same time, to neural networks for NLP. They used a single convolutional neural network architecture (CNN; LeCun et al., 1999) that, given a sentence, was able to output many language processing predictions such as part-of-speech tags, named entity tags, and semantic roles. The entire network was trained jointly on all the tasks using weight-sharing of the look-up tables, which enabled the different models to collaborate and share general low-level information in the word embedding matrix. As models are increasingly evaluated on multiple tasks to gauge their generalization ability, multi-task learning has gained in importance and is now used across a wide range of NLP tasks. Moreover, their paper’s impact went beyond multi-task learning: it spearheaded ideas such as pre-training word embeddings and using CNNs for text, which were only widely adopted years later.
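The weight-sharing idea can be made concrete: a single embedding table feeds several task-specific output layers, so gradients from every task update the same word vectors. The sketch below is an illustrative simplification (random weights, linear task heads, made-up tag-set sizes), not the Collobert and Weston architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 100, 8                       # toy vocabulary size and embedding dimension
n_pos, n_ner = 5, 4                 # illustrative tag-set sizes for two tasks

# A single look-up table shared by both tasks (the weight-sharing idea)
E = rng.normal(0, 0.1, (V, d))
W_pos = rng.normal(0, 0.1, (d, n_pos))   # task-specific head: part-of-speech
W_ner = rng.normal(0, 0.1, (d, n_ner))   # task-specific head: named entities

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(word_id):
    x = E[word_id]                  # the SAME embedding feeds both task heads
    return softmax(x @ W_pos), softmax(x @ W_ner)

pos_probs, ner_probs = predict(7)
```

During joint training, the loss of either task would backpropagate into E, so low-level lexical information learned for one task benefits the other.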

In 2013, Mikolov et al. introduced arguably the most popular word embedding model: Word2Vec. Although dense vector representations of words had been used as early as 2003 (Bengio et al.), the main innovation of their paper was an efficient training procedure that removed the hidden layer and approximated the objective. Together with an efficient model implementation, these simple changes enabled large-scale training of word embeddings on huge corpora of unstructured text. Later that year, they improved the Word2Vec model by employing additional strategies to enhance training speed and accuracy. While these embeddings are not conceptually different from those learned with a feed-forward neural network, training on a very large corpus enables them to capture certain relationships between words such as gender, verb tense, and country-capital relations, which initiated a lot of interest in word embeddings as well as in the origin of these linear relationships (Mimno and Thompson, 2017; Arora et al., 2018; Antoniak and Mimno, 2018; Wendlandt et al., 2018). But what made word embeddings a mainstay of current NLP was the evidence that using pre-trained embeddings as initialization improved performance across a wide range of downstream tasks. Since then, a lot of work has gone into exploring different facets of word embeddings (as indicated by the staggering number of citations of the original paper, i.e. 19,071 citations at the time of writing). Despite many more recent developments, Word2Vec is still a popular choice and widely used today.
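The “removed hidden layer” point is easiest to see in the skip-gram-with-negative-sampling objective: a word pair is scored by a plain dot product of two vectors, with no intermediate nonlinearity, and the softmax over the full vocabulary is approximated by contrasting the true pair against a few sampled negatives. A hedged numpy sketch of that loss, with random untrained vectors and made-up indices:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 1000, 50                     # toy vocabulary size and vector dimension

# Two matrices and no hidden layer: an "input" and an "output" vector per word
W_in = rng.normal(0, 0.1, (V, d))
W_out = rng.normal(0, 0.1, (V, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    """Skip-gram with negative sampling: push the true (center, context)
    pair's dot product up, and the k sampled negative pairs' down."""
    pos = sigmoid(W_in[center] @ W_out[context])         # true pair score
    neg = sigmoid(-W_in[center] @ W_out[negatives].T)    # negative pair scores
    return -np.log(pos) - np.log(neg).sum()

loss = sgns_loss(center=3, context=12, negatives=np.array([7, 42, 99]))
```

Minimizing this loss over a large corpus with stochastic gradient descent is what yields the vectors exhibiting the linear regularities mentioned above.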

The year 2013 also marked the adoption of neural network models in NLP, in particular three well-defined types of neural networks: recurrent neural networks (RNNs; Elman, 1990), convolutional neural networks (CNNs), and recursive neural networks (Socher et al., 2013). Because of their architecture, RNNs became popular for dealing with the dynamic input sequences ubiquitous in NLP. But vanilla RNNs were quickly replaced with the classic long short-term memory networks (LSTMs; Hochreiter and Schmidhuber, 1997), as the latter proved more resilient to the vanishing and exploding gradient problems. At the same time, convolutional neural networks, which were then beginning to be widely adopted by the computer vision community, started to get applied to natural language (Kalchbrenner et al., 2014; Kim, 2014). The advantage of using CNNs for text sequences is that they are more parallelizable than RNNs, as the state at every time step only depends on the local context (via the convolution operation) rather than on all past states as in RNNs. Finally, recursive neural networks were inspired by the principle that human language is inherently hierarchical: words are composed into higher-order phrases, which can themselves be recursively combined according to a set of production rules. Based on this linguistic perspective, recursive neural networks treated sentences as trees rather than as sequences. Some research (Tai et al., 2015) also extended RNNs and LSTMs to work with hierarchical structures.
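The recurrence that makes RNNs sequential, and hence hard to parallelize, is visible in a minimal Elman-style forward pass: each hidden state is a function of the current input and the previous state, so it transitively depends on every past state. A toy numpy sketch (random weights, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_h = 8, 16                   # toy input and hidden-state sizes

W_x = rng.normal(0, 0.1, (d_in, d_h))
W_h = rng.normal(0, 0.1, (d_h, d_h))
b = np.zeros(d_h)

def rnn_forward(inputs):
    """Vanilla (Elman) RNN over a sequence: each state depends on the
    current input and, through the recurrence, on ALL past states."""
    h = np.zeros(d_h)
    states = []
    for x in inputs:                # inherently sequential loop
        h = np.tanh(x @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

seq = rng.normal(0, 1, (5, d_in))   # a toy sequence of 5 input vectors
states = rnn_forward(seq)
```

A 1-D convolution, by contrast, computes each output position from a fixed local window, so all positions can be computed at once — the parallelism advantage noted above.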

In 2014, Sutskever et al. proposed sequence-to-sequence learning, a general end-to-end approach for mapping one sequence to another using a neural network. In their method, an encoder neural network processes a sentence symbol by symbol and compresses it into a vector representation. Then, a decoder neural network predicts the output sequence symbol by symbol based on the encoder state and the previously predicted symbols, which are taken as input at every step. Encoders and decoders for sequences are typically based on RNNs, but other architectures have also emerged. Recent models include deep LSTMs (Wu et al., 2016), convolutional encoders (Kalchbrenner et al., 2016; Gehring et al., 2017), the Transformer (Vaswani et al., 2017), and a combination of an LSTM and a Transformer (Chen et al., 2018). Machine translation turned out to be the perfect application for sequence-to-sequence learning. The progress was so significant that Google announced in 2016 that it was officially replacing its monolithic phrase-based machine translation models in Google Translate with a neural sequence-to-sequence model.
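The encode-then-decode loop can be sketched with simplified recurrent updates: the encoder folds the source into one vector, and the decoder greedily emits symbols, feeding each prediction back in as the next input. This is an illustrative toy with random weights and an invented EOS convention, not the Sutskever et al. model.

```python
import numpy as np

rng = np.random.default_rng(4)
V, d = 20, 16                      # toy vocabulary and hidden-state size
EOS = 0                            # illustrative end-of-sequence symbol

E = rng.normal(0, 0.1, (V, d))     # shared symbol embeddings
W_enc = rng.normal(0, 0.1, (d, d))
W_dec = rng.normal(0, 0.1, (d, d))
W_out = rng.normal(0, 0.1, (d, V))

def encode(src_ids):
    """Compress the whole source sequence into a single fixed-size vector."""
    h = np.zeros(d)
    for i in src_ids:
        h = np.tanh(E[i] @ W_enc + h)      # simplified recurrent update
    return h

def decode(h, max_len=10):
    """Greedily emit symbols, feeding each prediction back as input."""
    out, prev = [], EOS
    for _ in range(max_len):
        h = np.tanh(E[prev] @ W_dec + h)
        prev = int(np.argmax(h @ W_out))
        if prev == EOS:
            break
        out.append(prev)
    return out

translation = decode(encode([3, 7, 12]))
```

Real systems replace the greedy argmax with beam search and train both networks jointly on parallel sentence pairs.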

In 2015, Bahdanau et al. introduced the principle of attention, one of the core innovations in neural machine translation (NMT) and the key idea that enabled NMT models to outperform classic phrase-based MT systems. It alleviates the main bottleneck of sequence-to-sequence learning: the requirement to compress the entire content of the source sequence into a fixed-size vector. Indeed, attention allows the decoder to look back at the source sequence hidden states, which are then combined through a weighted average and provided as additional input to the decoder. Attention is potentially useful for any task that requires making decisions based on certain parts of the input. Beyond translation, it has been applied to constituency parsing (Vinyals et al., 2015), reading comprehension (Hermann et al., 2015), and one-shot learning (Vinyals et al., 2016). More recently, a new form of attention has appeared, called self-attention, which is at the core of the Transformer architecture. In short, it looks at the surrounding words in a sentence or paragraph to obtain more contextually sensitive word representations.
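The "weighted average of encoder states" is a short computation: score each encoder state against the current decoder state, softmax the scores into weights, and average. The sketch below uses dot-product scoring for brevity; Bahdanau et al. actually scored with a small feed-forward network (additive attention), and all tensors here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 6, 16                            # toy source length and hidden size

enc_states = rng.normal(0, 1, (T, d))   # one encoder hidden state per source word
dec_state = rng.normal(0, 1, d)         # current decoder state (the "query")

def attend(query, keys):
    """Dot-product attention: score every encoder state against the decoder
    state, softmax the scores, and return the weighted average (context)."""
    scores = keys @ query               # one scalar score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax: weights sum to 1
    context = weights @ keys            # weighted average of encoder states
    return context, weights

context, weights = attend(dec_state, enc_states)
```

The weights show which source positions the decoder is "looking at"; the context vector replaces the single fixed-size bottleneck with a fresh summary at every decoding step.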

The latest major innovation in the world of NLP is undoubtedly large pretrained language models. Although first proposed in 2015 (Dai and Le), such models have only recently been shown to yield large improvements over state-of-the-art methods across a diverse range of tasks. Pre-trained language model embeddings can be used as features in a target model (Peters et al., 2018), or a pre-trained language model can be fine-tuned on target task data (Devlin et al., 2018; Howard and Ruder, 2018; Radford et al., 2019; Yang et al., 2019), which has been shown to enable efficient learning with significantly less data. The main advantage of these pre-trained language models comes from their ability to learn word representations from large unannotated text corpora, which is particularly beneficial for low-resource languages where labelled data is scarce.