Natural Language Processing Series

Part 1: A Brief History of Natural Language Processing

Cape AI
8 min read · Oct 7, 2020


This is the first blog post in a series focusing on the wonderful world of Natural Language Processing (NLP)! In this post we present the lay of the land, describing seven important milestones achieved in the NLP field over the last 20 years. It is largely inspired by Sebastian Ruder’s talk at the 2018 Deep Learning Indaba, which took place in Stellenbosch.

A short disclaimer before we begin: this post is heavily skewed towards neural network-based advancements. Many of these milestones, however, built on influential ideas from non-neural work of the same era, which have been omitted here for brevity.

Milestones in the field of Natural Language Processing

1. Neural Language Models (2001)

It’s 2001 and the field of NLP is still quite nascent. Academics around the world are beginning to think more carefully about how language could be modelled, and neural language models are born. Language modelling is the task of determining the probability of the next word (often referred to as a token) occurring in a piece of text, given all the previous words. Traditional approaches to this problem were based on n-gram models combined with some sort of smoothing technique [1]. Bengio et al. [2] were the first to tackle it with a feed-forward neural network: a word “lookup table” maps each of the n previous words in a sequence to a dense vector, and the network uses these vectors to predict the next word, as illustrated in Figure 1. Today, these vector representations are known as word embeddings.

Figure 1: The first feed-forward neural network used for language modelling [2]
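
To make the idea concrete, here is a minimal sketch of a feed-forward neural language model in PyTorch. It is not Bengio et al.’s exact architecture; the vocabulary size, context size and layer widths are arbitrary illustrative choices. The key ingredients are the same: a lookup table maps the previous n word ids to vectors, and a hidden layer turns them into a distribution over the next word.

```python
import torch
import torch.nn as nn

# A minimal feed-forward language model in the spirit of Bengio et al. [2]:
# the previous n words are mapped to vectors via a lookup table, concatenated,
# and passed through a hidden layer to score every candidate next word.
class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, context_size=3, hidden_dim=64):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # the word "lookup table"
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        embeds = self.embeddings(context_ids).flatten(start_dim=1)
        return self.output(torch.tanh(self.hidden(embeds)))  # logits over the next word

model = FeedForwardLM(vocab_size=1000)
logits = model(torch.tensor([[4, 57, 19]]))   # three previous word ids (placeholders)
probs = torch.softmax(logits, dim=-1)         # P(next word | previous 3 words)
```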

2. Multi-task Learning (2008)

Excitement and interest grow steadily in the years following neural language models. Advances in computer hardware allow researchers to push the boundaries of language modelling, giving rise to new NLP methods. One such method is multi-task learning: training a model to solve more than one learning task while sharing a set of parameters across tasks. As a result, the model is forced to learn a representation that exploits the commonalities and differences between the tasks.

Collobert and Weston [3] were the first to apply a form of multi-task learning in the NLP domain back in 2008. They trained two convolutional models with max pooling to perform both part-of-speech tagging and named entity recognition, while sharing a common word lookup table (or word embeddings), as shown in Figure 2. Years later, their paper was highlighted by many experts as a fundamental milestone in deep learning for NLP and received the Test-of-Time Award at the 2018 International Conference on Machine Learning (ICML).

Figure 2: The first multi-task model sharing a common word lookup table [3]
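
The sketch below illustrates the core idea of parameter sharing in a toy PyTorch set-up, not Collobert and Weston’s actual convolutional model: two task-specific heads (POS tagging and NER) sit on top of a single shared embedding table, so gradients from both tasks update the same word representations. The tag-set sizes and dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

# Minimal multi-task model: one shared word lookup table, two task heads.
class SharedEmbeddingTagger(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=50, n_pos_tags=17, n_ner_tags=9):
        super().__init__()
        self.shared_embeddings = nn.Embedding(vocab_size, embed_dim)  # shared across tasks
        self.pos_head = nn.Linear(embed_dim, n_pos_tags)              # task 1: POS tagging
        self.ner_head = nn.Linear(embed_dim, n_ner_tags)              # task 2: NER

    def forward(self, word_ids):
        features = self.shared_embeddings(word_ids)   # (batch, seq_len, embed_dim)
        return self.pos_head(features), self.ner_head(features)

model = SharedEmbeddingTagger()
pos_logits, ner_logits = model(torch.tensor([[12, 7, 301, 45]]))  # placeholder word ids
# Training sums (or alternates) the two task losses, so gradients from both
# tasks flow into the shared embedding table.
```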

3. Word Embeddings (2013)

If you’ve had any exposure to NLP, the first thing you have probably come across is the idea of word embeddings (most commonly associated with word2vec). Although word embeddings had been used as far back as 2001, in 2013 Mikolov et al. [4] proposed a simple but novel method for efficiently training them on very large unlabeled corpora, which ultimately led to their wide-scale adoption.

Word embeddings represent each word as a dense vector and address many of the challenges associated with traditional sparse bag-of-words representations. Word embeddings were shown to capture many intuitive relationships between words, such as gender, verb tense and country-capital relations, as illustrated in Figure 3.

Figure 3: The intuitive relationships captured by word embeddings [4]
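
These relationships can be probed directly with vector arithmetic. The snippet below uses the gensim library with one of its publicly downloadable pre-trained vector sets; the specific model name is just an example, and any word vectors exposing gensim’s most_similar API would behave the same way.

```python
import gensim.downloader as api

# Load a small set of pre-trained word vectors (one of several models that
# gensim can download; the choice here is illustrative).
vectors = api.load("glove-wiki-gigaword-50")

# The classic analogy: vector("king") - vector("man") + vector("woman")
# lands close to vector("queen").
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Verb tense and country-capital relations can be probed the same way.
print(vectors.most_similar(positive=["walked", "swim"], negative=["walk"], topn=3))
print(vectors.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))
```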

4. Neural Networks for NLP (2013)

Looking back, 2013 was an inflection point in the NLP field, with research and development growing rapidly from then on. The advances in word embeddings sparked the wider application of neural networks in NLP. The key challenge was designing architectures that could accept input sequences of variable length, and three families of architectures emerged: recurrent neural networks (RNNs), which were soon largely replaced by long short-term memory (LSTM) networks; convolutional neural networks (CNNs); and recursive neural networks. Today, these architectures have produced exceptional results and are widely used across NLP applications.
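
As a rough illustration of how recurrent architectures cope with variable-length input, the PyTorch sketch below pads two sentences of different lengths into one batch, packs them so the LSTM skips the padding, and reads off one fixed-size vector per sentence. All sizes and token ids are placeholders.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Sentences naturally differ in length; an RNN processes them step by step.
# Shorter sequences in a batch are padded, and packing tells the LSTM to
# ignore the padded positions.
embeddings = nn.Embedding(num_embeddings=1000, embedding_dim=32)
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

sentences = [torch.tensor([4, 87, 12, 9]), torch.tensor([15, 3])]  # token ids
lengths = torch.tensor([len(s) for s in sentences])
padded = pad_sequence(sentences, batch_first=True)                 # shape (2, 4)

packed = pack_padded_sequence(embeddings(padded), lengths,
                              batch_first=True, enforce_sorted=False)
_, (final_hidden, _) = lstm(packed)   # one fixed-size vector per sentence
print(final_hidden.shape)             # torch.Size([1, 2, 64])
```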

5. Sequence-to-Sequence Models (2014)

Soon after the emergence of RNNs and CNNs for language modelling, Sutskever et al. [5] were the first to propose a general framework for mapping one sequence to another, now known as the sequence-to-sequence model. In this framework, an encoder network processes an input sequence token by token and compresses it into a vector representation (the blue layers in Figure 4). A decoder network (the red layers) then generates the output sequence token by token, conditioned on the encoder state and on every previously predicted token.

Figure 4: A visual representation of a sequence-to-sequence model used for translation [6]
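
A minimal sketch of the encoder-decoder idea is shown below. It assumes small GRU networks and a greedy decoding loop rather than the deep LSTMs and beam search of Sutskever et al.; the vocabulary sizes, start-token id and dimensions are placeholders.

```python
import torch
import torch.nn as nn

# Toy sequence-to-sequence model: the encoder compresses the source sequence
# into its final hidden state; the decoder generates the target one token at
# a time, feeding each prediction back in as the next input.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, max_len=10, start_token=1):
        _, state = self.encoder(self.src_embed(src_ids))       # encoder summary vector
        token = torch.full((src_ids.size(0), 1), start_token)
        outputs = []
        for _ in range(max_len):                                # greedy decoding
            out, state = self.decoder(self.tgt_embed(token), state)
            token = self.output(out).argmax(dim=-1)             # feed prediction back in
            outputs.append(token)
        return torch.cat(outputs, dim=1)

model = Seq2Seq()
print(model(torch.tensor([[5, 42, 7, 99]])))   # a (1, 10) tensor of predicted token ids
```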

This architecture is particularly useful for tasks such as machine translation (MT) and natural language generation (NLG). It is no surprise that, in 2016, Google announced it was in the process of replacing all of its statistical MT systems with neural MT models [7]. Additionally, since the decoder can be conditioned on an arbitrary representation, the same framework can be used for tasks like generating captions for images [8].

6. Attention Mechanisms (2015)

Although useful across a wide range of tasks, sequence-to-sequence models struggled to capture long-range dependencies between words, since the entire input had to be squeezed into a single fixed-size vector. In 2015, the concept of attention was introduced by Bahdanau et al. [9] as a way of addressing this bottleneck. In essence, attention is a mechanism that lets the model decide which parts of the input sequence to focus on at each step when routing information. Attention mechanisms have also been applied in computer vision for image captioning [10], where they additionally provide a glimpse into the inner workings of the model, as seen in Figure 5.

Figure 5: A visual representation of the attention mechanism in an image captioning model [10]

Attention is not restricted to the input sequence: it can also be used to relate each word in a piece of text to its surrounding words, commonly referred to as self-attention, in order to obtain more contextual meaning. This is at the heart of the current state-of-the-art transformer architecture, proposed by Vaswani et al. [11] in 2017, which is composed of multiple stacked self-attention layers. The transformer sparked an explosion of new language model architectures (and an inside joke among AI practitioners regarding Sesame Street Muppets), the most notable being Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT).

Figure 6: The various language model architectures based on the transformer model [12]
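
For the curious, here is a bare-bones sketch of the scaled dot-product self-attention operation at the core of the transformer, with a single head, no masking or output projection, and random toy weights:

```python
import math
import torch

# Each token builds a query, compares it with the keys of all tokens in the
# sequence, and takes a weighted average of their values.
def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    weights = torch.softmax(scores, dim=-1)   # how much each token attends to every other
    return weights @ v, weights

seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)                        # toy token representations
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
context, weights = self_attention(x, w_q, w_k, w_v)
print(context.shape, weights.shape)                      # (5, 16) and (5, 5)
```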

7. Pre-trained Language Models (2018)

Dai and Le [13] were the first to propose pre-trained language models in 2015, but the idea was only recently shown to be beneficial across a broad range of NLP tasks. More specifically, it was shown that a pre-trained language model can be fine-tuned on data from a specific target task [14, 15]. Additionally, language model embeddings can be used as features in a target model, leading to significant improvements over the then state-of-the-art [16], as shown in Figure 7.

Figure 7: Improvements over state-of-the-art models when employing language model embeddings [16]
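
In practice, fine-tuning a pre-trained model today usually means loading a published checkpoint and training a task head on top of it. The sketch below uses the Hugging Face transformers library with the publicly available bert-base-uncased checkpoint as an example; the two-sentence "dataset" and the single gradient step are purely illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a pre-trained language model and attach a fresh classification head;
# fine-tuning then updates all weights on labelled target-task data.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

batch = tokenizer(["A genuinely great film.", "A waste of two hours."],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])                # toy sentiment labels

outputs = model(**batch, labels=labels)      # forward pass returns the loss directly
outputs.loss.backward()                      # one fine-tuning gradient step (optimizer omitted)
print(outputs.logits)
```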

8. Where we are today and looking forward …

Nowadays, there is an array of initiatives aimed at open-sourcing large, state-of-the-art pre-trained models. These models can be fine-tuned to perform various NLP tasks such as sequence classification, extractive question answering, named entity recognition and text summarization (to name a few).
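
As one example of such initiatives, the Hugging Face transformers library wraps many of these open-sourced models behind simple task pipelines; the default checkpoints below are chosen by the library and downloaded on first use.

```python
from transformers import pipeline

# Off-the-shelf sentiment classification with a library-default checkpoint.
classifier = pipeline("sentiment-analysis")
print(classifier("NLP has come a long way in twenty years."))

# Off-the-shelf named entity recognition, with word pieces grouped into entities.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Cape AI is based in South Africa and the Netherlands."))
```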

NLP is advancing at an incredible pace and is giving rise to global communities dedicated to solving the world’s most important problems through language understanding.

Stay tuned to this series to learn more about the awesome world of NLP as we share more on the latest developments, code implementations and thought-provoking perspectives on NLP’s impact on the way we interact with the world. It’s an extremely exciting time for anyone to get into the world of NLP!

About the author: Shane van Heerden is a Data Scientist and instrumental NLP researcher at Cape AI. Shane holds a PhD in Artificial Intelligence from the University of Stellenbosch and has a strong interest in NLP, Operations Research and Deep Learning.

About Cape AI: Cape AI are an Artificial Intelligence company based in South Africa and the Netherlands, focusing on AI consulting as well as Research and Development in the fields of Computer Vision and Natural Language Processing.

Connect with us and connect with Shane! We love engaging with academic and business communities to share knowledge and co-create valuable AI.

References

  1. Kneser, R., & Ney, H. (1995). Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Vol. 1, pp. 181–184).
  2. Bengio, Y., Ducharme, R., & Vincent, P. (2001). A Neural Probabilistic Language Model. Proceedings of NIPS.
  3. Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. In Proceedings of the 25th International Conference on Machine Learning (pp. 160–167).
  4. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems.
  5. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
  6. See, A., Luong, M.-T., & Manning, C. D. (2016). Compression of Neural Machine Translation Models via Pruning. In Proceedings of CoNLL 2016 (pp. 291–301). doi:10.18653/v1/K16-1029.
  7. Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv Preprint ArXiv:1609.08144.
  8. Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156–3164).
  9. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR 2015.
  10. Xu, K., Courville, A., Zemel, R. S., & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of ICML 2015.
  11. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.
  12. Wang, X., & Zhang, Z. (2019). PLMpapers. Available at: https://github.com/thunlp/PLMpapers
  13. Dai, A. M., & Le, Q. V. (2015). Semi-supervised Sequence Learning. Advances in Neural Information Processing Systems (NIPS ’15). Retrieved from http://arxiv.org/abs/1511.01432
  14. Ramachandran, P., Liu, P. J., & Le, Q. V. (2017). Unsupervised Pretraining for Sequence to Sequence Learning. In Proceedings of EMNLP 2017.
  15. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1801.06146
  16. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018.
