Deep learning models have proven to be very good at extracting representations and features that can be used to predict future occurrences, find correlations in the data, understand its behavior, gain topological insights, and much more.
In the realm of natural language processing, the evolution of deep learning algorithms has been tremendous. These algorithms can process sequential data (which is usually unstructured, discrete, and huge in quantity) and predict the next element of a sequence from its context. Such sequences usually come from human language, and the tasks built on them include speech recognition, natural language understanding, and natural language generation.
Natural Language Processing or NLP is a field of Artificial Intelligence that gives machines the ability to read, understand and derive meaning from human languages.
One reason NLP is evolving so quickly is the availability of data, together with huge improvements in computing power and active communities. These allow practitioners such as myself to explore the realm of NLP and to contribute meaningful insights and findings to the NLP community, for beginners and experts alike.
This article aims to provide meaningful insights on how we started NLP research, the motive behind it, the challenges we faced along the way, where we are, and where we are going.
*The article does not intend to provide any mathematical intuition but a general idea of what NLP and its key components are.
History of NLP
The roots of NLP can be traced back (approximately) to the time of Alan Turing. His famous Bombe, an electro-mechanical machine designed during WWII, helped the British decipher German Enigma codes, saving many lives. Enigma was a device used by the German military command to encode strategic messages before and during WWII. The Bombe can be seen as a cornerstone of modern-day NLP research: the idea of carefully tailoring the best patterns or representations of the code and feeding them into a machine gave the cryptologists the upper hand in breaking the Enigma code.
Following that, in 1950, Alan Turing published an article titled “Computing Machinery and Intelligence”, which proposed a criterion called the “imitation game”, also known as the Turing test. This criterion is a method for checking whether a machine can behave indistinguishably from a human. Notes or text messages are exchanged between two parties, a computer and a human, and a third party, also a human, evaluates the exchange. If the third party cannot tell which notes came from whom, the computer is said to pass the Turing test; otherwise it fails.
Ever since, researchers have tried to build a system that could pass the Turing test, but the research did not have a dedicated field. It was only in the 1950s, at the 1956 Dartmouth workshop, that artificial intelligence was established as a field of its own, giving AI its own reputation and space to grow.
But before NLP could leverage the power of artificial neural networks, it went through two different paradigms, symbolic NLP and statistical NLP, followed by what we know as neural NLP (which is the focus of this article).
Symbolic NLP (the 1950s — early 1990s)
“The premise of symbolic NLP is well-summarized by John Searle’s Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it is confronted with”. — Excerpt from Wikipedia
It is important to understand that during this era both Symbolic NLP and machine learning algorithms relied heavily on hard-coded rules which meant carefully curating the features required to solve a particular problem.
We need to be clear that backpropagation was introduced during this period (i.e. the 1950s to the 1990s). To be more precise, it was discovered independently by several authors and rediscovered by Rumelhart, D., Hinton, G. & Williams, R. in 1986.
Backpropagation laid the foundation on which the training of modern neural networks rests. Although it lay largely dormant for about a decade, from the 2000s to the 2010s, because of a lack of data and computing power, it made a comeback in the 2010s and has been going strong ever since.
Statistical NLP (the 1990s — 2010s)
As mentioned before, most NLP systems and early ML algorithms such as the perceptron relied on hand-coded rules and features, and NLP was trying to grow independently. Before backpropagation-trained neural networks took hold, NLP leveraged the power of classical machine learning algorithms. With ML in its arsenal, NLP started to explore patterns in language corpora and their statistical importance.
Statistical learning generally included approaches such as decision trees and hidden Markov models, which make use of conditional probability.
By the 1990s companies like IBM were ahead in NLP research. Their successes came through statistical methods in the field of machine translation.
At the turn of the 2000s, there was little reason to fund most machine learning ideas because they would work on small data but fail when applied to slightly larger data (which increases the complexity). However, during the mid-2000s a sudden surge of data began, and research on NLP and machine learning started to take shape. With backpropagation and deep learning coming into the picture, more complex algorithms could be designed to extract patterns that were useful for machine translation.
Neural NLP (the 2000s — present)
The transition to neural networks came around the 1990s. During this time the idea of a multilayer perceptron or the feed-forward neural network was taking shape and was showing some promising results.
Since language has certain properties, most importantly sequence and structure, it had to be modeled differently from images (image modeling was very popular at the time). As it turned out, modeling language or sequential data is more challenging because we have to consider the context of the data and remember that sequential data is discrete. Contextual learning then became a new roadblock, because sometimes the context of a new sentence has to be derived from a paragraph of 100 words. Statistical models were not able to capture the context and the structure of a long sentence.
Statistical language models of this period were built on n-grams, where n refers to the number of consecutive words taken into consideration to capture the context. In 1995 Kneser & Ney introduced a smoothing technique for n-gram models that improved their probability estimates for rare word sequences. But even so, n-grams did not make much of a difference for long-range context.
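To make the n-gram idea concrete, here is a minimal sketch of a bigram (n = 2) model that estimates the probability of a word given the previous word by counting. The function names and the toy corpus are made up for illustration, and no smoothing is applied, so any word pair that never occurs in the corpus gets probability zero, which is the kind of gap smoothing techniques such as Kneser-Ney try to fill.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def next_word_probability(counts, prev, word):
    """Unsmoothed estimate of P(word | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# Toy corpus, purely illustrative.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)

# "the" occurs three times and is followed by "cat" twice.
print(next_word_probability(model, "the", "cat"))
```

Because the model only ever looks one word back, it cannot use anything earlier in the sentence, which illustrates why longer contexts were a problem for this family of models.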
The first neural language model, a feed-forward neural network, was proposed in 2001 by Bengio et al.
This model consists of a single input layer where each word is first transformed into a feature vector by looking it up in a table C, with the vectors kept in order of occurrence. Such tables are also called lookup tables. These vectors are then fed into a hidden layer, whose output is provided to a softmax layer.
As it turned out, feed-forward neural networks could not sustain the context because they used a fixed-length context that needed to be specified before training (Mikolov et al., 2010). Hence, in 2010, Mikolov et al. built an RNN-based language model on top of the ideas in Bengio’s Neural Language Model. RNNs share parameters across time steps and connect the inputs fed into the network at different time steps. Modified versions of RNNs such as LSTMs, introduced by Hochreiter & Schmidhuber in 1997 and applied to sequence generation by Graves in 2013, can preserve both long-term and short-term memory.
RNNs and LSTMs opened new doors to understanding languages and the patterns they contain. LSTMs remain an inspiration for modern-day language models, which are powerful enough to challenge the Turing test.
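The data flow described above (lookup table, hidden layer, softmax) can be sketched in a few lines. This is only an illustration of the forward pass under simplifying assumptions: the vocabulary, the layer sizes, and the random weights are invented, biases are left out, and nothing is trained.

```python
import math
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary
EMBED_DIM, CONTEXT, HIDDEN = 4, 2, 8         # illustrative sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

C = rand_matrix(len(VOCAB), EMBED_DIM)               # lookup table: one row per word
W_hidden = rand_matrix(HIDDEN, CONTEXT * EMBED_DIM)  # hidden layer weights
W_out = rand_matrix(len(VOCAB), HIDDEN)              # output layer weights

def next_word_distribution(context_words):
    # Look up each context word in C and concatenate the vectors in order.
    features = [x for word in context_words for x in C[VOCAB.index(word)]]
    hidden = [math.tanh(h) for h in matvec(W_hidden, features)]
    return softmax(matvec(W_out, hidden))

probs = next_word_distribution(["the", "cat"])
print(dict(zip(VOCAB, probs)))
```

Note that the context size is fixed at two words here: W_hidden has exactly CONTEXT * EMBED_DIM columns, so the network cannot accept a longer history without being rebuilt.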
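The parameter sharing of an RNN can be illustrated with a small sketch: the same two weight matrices are applied at every time step, and the hidden state carries information forward. The sizes and random weights are assumptions made for the example, this is a plain RNN cell rather than an LSTM, and no training is shown.

```python
import math
import random

random.seed(0)
INPUT_DIM, STATE_DIM = 3, 4   # illustrative sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# One set of weights, reused at every time step.
W_in = rand_matrix(STATE_DIM, INPUT_DIM)
W_rec = rand_matrix(STATE_DIM, STATE_DIM)

def rnn_step(x, h_prev):
    """New state = tanh(W_in @ x + W_rec @ h_prev)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_in[i], x))
            + sum(u * hj for u, hj in zip(W_rec[i], h_prev))
        )
        for i in range(STATE_DIM)
    ]

def run_rnn(inputs):
    h = [0.0] * STATE_DIM
    for x in inputs:          # the sequence can have any length
        h = rnn_step(x, h)
    return h

short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq * 5
print(run_rnn(short_seq))
print(run_rnn(long_seq))
```

Unlike the feed-forward model, nothing in this code depends on the length of the input, so the context does not have to be fixed before training.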
Word Embedding 2013
The idea of representing words as vectors dates back to the 1950s, in particular to the works of Zellig Harris, John Firth, and Ludwig Wittgenstein (Mandelbaum et al., 2016).
Word embeddings are vector representations of words, learned without supervision, whose relative similarities correlate with semantic similarity.
As usual, early word embeddings required hand-crafted methods, which were tedious and unreliable. Other methods include vector space models (Salton et al., 1975), which stem from the Information Retrieval (IR) community.
Bengio’s Neural Language Model captured two concepts together: word embeddings and the statistical approach to language modeling, i.e. predicting the next word given a number of previous words. Bengio’s idea was to view the NLP problem from an unsupervised learning point of view. This is done by first transforming the raw word vectors into embedding vectors before feeding them into the network. The table C that we learned about in the previous section is what is known as a word embedding.
Over time, word embeddings became a topic of research of their own. Mikolov et al. (2013) proposed removing the non-linear hidden layer to make training more efficient. The resulting model was called continuous bag-of-words (CBOW), as the order of words in the history does not influence the projection.
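A common way to measure the similarity between two word vectors is cosine similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration; they are not learned embeddings, and real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked toy vectors, NOT learned embeddings.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(royal, fruit)   # the first value is the larger one
```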
Another word embedding model was the continuous skip-gram. Rather than predicting the current word from its context, it tries to maximize the classification of a word based on another word in the same sentence.
In other words, continuous bag-of-words predicts the center word from the surrounding words, and skip-gram does the opposite.
These techniques brought great improvements in language modeling while reducing computation costs and increasing efficiency.
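The difference between the two models is easiest to see in the training examples each one is built from. The following sketch only generates those examples from a tokenized sentence; the function names and the window size are illustrative choices, and the actual training of the embeddings is not shown.

```python
def cbow_examples(tokens, window=2):
    """(context words, center word): predict the center from its surroundings."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, center))
    return examples

def skipgram_examples(tokens, window=2):
    """(center word, context word): predict each neighbour from the center."""
    return [
        (center, neighbour)
        for context, center in cbow_examples(tokens, window)
        for neighbour in context
    ]

tokens = "the quick brown fox".split()
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

With a window of 1, the word "quick" yields one continuous bag-of-words example, (["the", "brown"], "quick"), and two skip-gram examples, ("quick", "the") and ("quick", "brown").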
2013 — Neural networks for NLP
By 2013 a lot of progress had been made in the field of NLP. Faster computing chips were being developed, the reservoir of data was growing exponentially thanks to the internet, and research in the field was diversifying tremendously. Those three factors also allowed the hybridization of various techniques to form unified working algorithms that were fast, efficient, and less costly. It was also during this time that neural networks were giving shape to a lot of research areas apart from NLP.
RNNs were becoming more powerful and dynamic. Their extension, the LSTM, could remember a considerably longer context, was good at extracting patterns, and mitigated some of the issues with RNNs such as vanishing and exploding gradients.
In the parallel world of computer vision, another algorithm was being developed called the CNN (LeCun 1989). It was introduced with the motive of understanding patterns in the visible world, like images and videos, and it became the cornerstone of computer vision. The strength of a CNN lies in its ability to capture patterns and representations through an operation over grid-like data called convolution, and by 2013 it had performed exceptionally well in most computer vision tasks such as image recognition and classification.
With the eagerness to explore the language world, Kalchbrenner et al. applied CNNs to language data in 2014. CNNs turned out to be effective feature extractors, and they captured a slightly wider range of context when their receptive fields were increased.
“An advantage of convolutional neural networks is that they are more parallelizable than RNNs, as the state at every timestep only depends on the local context (via the convolution operation or grid-like operation) rather than all past states as in the RNN.” — Sebastian Ruder
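The locality that the quote refers to can be seen in a minimal one-dimensional convolution over a sequence of word vectors. The vectors and the filter weights below are toy values chosen for the example; a real model would learn many filters and add a non-linearity and pooling.

```python
def conv1d(sequence, kernel):
    """Slide `kernel` over a sequence of vectors; one output per full window."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Each output depends only on the `width` vectors inside its window.
        outputs.append(
            sum(w * x
                for kernel_row, vector in zip(kernel, window)
                for w, x in zip(kernel_row, vector))
        )
    return outputs

# Four toy 2-dimensional word vectors and one filter of width 2.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
kernel = [[1.0, -1.0], [0.5, 0.5]]

print(conv1d(sentence, kernel))   # [1.5, 0.0, 0.5]
```

Since no output needs the result of another, all windows could be computed at the same time, whereas an RNN has to finish step t before it can start step t + 1.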
When convolutions were combined with LSTM-style recurrence, training became faster and results improved (Bradbury et al., 2017).
2014 — Sequence-to-sequence models
Sequence-to-sequence models were introduced by Ilya Sutskever et al. in 2014 with the intention of modeling sequences end to end. Until then, neural language models were designed to predict the next word in a sequence given the previous words. Sutskever and colleagues wanted a framework that could take an input sequence and output another sequence (of the same or a different length). Such a framework was expected to be useful for translation between languages (from English to Hindi or vice versa). The challenge was that deep neural networks could only be applied to problems whose inputs and targets can be encoded as vectors of fixed dimensionality, whereas in translation the input and output lengths vary and are not known in advance: one language (English) may need more words to convey a message, while another (Hindi) may need fewer, more, or the same number of words to convey the same message.
Sequence-to-sequence models address this type of problem by using two LSTM networks: one that encodes the input into a vector of fixed length and another that decodes that vector into an output of the required length.
“The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector. The second LSTM is essentially a recurrent neural network language model except that it is conditioned on the input sequence. The LSTM’s ability to successfully learn on data with long-range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs”- Sequence to Sequence Learning with Neural Networks
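The encoder-decoder data flow can be sketched as follows. To keep the example short it uses plain RNN cells in place of LSTMs, random untrained weights, and invented toy vocabularies, so the decoded tokens are meaningless; the point is only that an input of any length is folded into one fixed-size vector and that the decoder stops on its own when it emits an end-of-sequence token.

```python
import math
import random

random.seed(1)
STATE_DIM = 4
SOURCE_VOCAB = ["i", "like", "tea", "very", "much"]   # toy source words
TARGET_VOCAB = ["<eos>", "a", "b", "c"]               # placeholder target tokens

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def rnn_step(W_in, W_rec, x, h):
    return [math.tanh(a + b) for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]

enc_in = rand_matrix(STATE_DIM, len(SOURCE_VOCAB))
enc_rec = rand_matrix(STATE_DIM, STATE_DIM)
dec_in = rand_matrix(STATE_DIM, len(TARGET_VOCAB))
dec_rec = rand_matrix(STATE_DIM, STATE_DIM)
dec_out = rand_matrix(len(TARGET_VOCAB), STATE_DIM)

def encode(words):
    """Read the input one step at a time; return the final fixed-size state."""
    h = [0.0] * STATE_DIM
    for word in words:
        x = one_hot(SOURCE_VOCAB.index(word), len(SOURCE_VOCAB))
        h = rnn_step(enc_in, enc_rec, x, h)
    return h

def decode(state, max_len=6):
    """Greedily emit target tokens until <eos> or max_len is reached."""
    output, prev = [], 0   # index 0 (<eos>) doubles as the start symbol here
    for _ in range(max_len):
        state = rnn_step(dec_in, dec_rec, one_hot(prev, len(TARGET_VOCAB)), state)
        scores = matvec(dec_out, state)
        prev = scores.index(max(scores))
        if TARGET_VOCAB[prev] == "<eos>":
            break
        output.append(TARGET_VOCAB[prev])
    return output

short_code = encode(["i", "like", "tea"])
long_code = encode(["i", "like", "tea", "very", "much"])
print(len(short_code), len(long_code))   # both equal STATE_DIM
print(decode(long_code))
```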
2015 — Attention
Earlier we saw how sequence-to-sequence models use encoder-decoder pairs to translate one language into another, a task known as machine translation. They work really well, but their bottleneck is that the encoder compresses the whole input into a single fixed-size embedding. This compressed representation does not capture all the contextual information if the sentence is too long. Because of that, the translated words can drift from the meaning and word order of the source, causing issues with the grammar. The performance of the model deteriorates rapidly as the length of an input sentence increases (Cho et al., 2014).
To address this problem, Bahdanau et al. introduced attention in 2015, modifying the sequence-to-sequence model by adding an attention layer. Rather than using only the final encoder state (hf) at the end of the sequence, the attention layer takes the encoder's hidden states (h1, h2, … hn) at all time steps.
In essence, this architecture does not encode the whole input sequence into a single fixed-size embedding. Rather, it encodes the input into a sequence of vectors and then assigns each of them a weight (a probability). These weights dictate how important a particular vector is, that is, how much attention that vector should be given for the translation.
Once the weights of the vectors are calculated, the vectors are combined into a weighted-sum context vector, which the decoder uses to produce the translation.
“By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach, the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly”. Bahdanau et al., 2015
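The weighting step can be sketched in a few lines. As a simplification, this example scores each encoder state with a plain dot product against the decoder state, whereas Bahdanau et al. use a small learned feed-forward network for the scoring; the states themselves are toy values.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted-sum context vector."""
    # Simplification: dot-product scoring instead of a learned alignment model.
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy encoder states h1, h2, h3 and a toy decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print(weights)   # h1 and h3 receive more weight than h2
print(context)
```

A new set of weights is computed at every decoding step, so the decoder can focus on different source positions for different output words.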
2017 — Transformers
So far we have seen the timeline of various algorithms and methods that helped us to model sequential data. Every model (be it probabilistic or neural) tried to overcome the issues of the previous one. Until 2017, one issue that remained with all the previously mentioned models was that they could not take long sequences of text into account and therefore did not perform well on them, although they were good on short sentences.
The way in which NLP models evolved led them to a specific method called the ‘attention mechanism’, which we saw in the previous section. The attention mechanism was initially designed to enhance sequence-to-sequence architectures built on RNNs or LSTMs.
In 2017 Vaswani et al. introduced a new architecture called the Transformer. Transformers do not use any RNNs or CNNs, but they do use the encoder-decoder mechanism that was introduced in sequence-to-sequence models. Transformers rely entirely on an attention mechanism to draw global dependencies between the input and the output. Transformers also turn out to be fast, i.e. they take less time to train because they can be parallelized.
So far transformers have performed extremely well and have produced language models like GPT-3, a model that is sometimes described as having come close to passing the Turing test.
We are not going to discuss the exact mechanism of how transformers work, but we will see a quick overview of how they work.
As mentioned before, transformers have an encoder and a decoder. The encoder's job is to create a continuous learned representation of the input sequence. An encoder layer contains two parts stacked on top of each other: self-attention and a feed-forward network. The self-attention network’s job is to associate each individual word with every other word in the input sequence. To achieve self-attention, the input vectors are fed into three linear layers to get query, key, and value vectors.
The dot product of the query and key vectors is calculated to get a score for each word in the input against every other word. The scores are scaled and passed through a softmax function to get a weight (a probability) for each word in the input sequence. This helps the network assign an importance to each word. Multiplying these weights with the value vectors produces, for each word, a weighted sum of the values: words that are not important are suppressed, while words that are important are highlighted. The result is then fed to the feed-forward network for a point-wise operation.
This fully connected layer gives the model the power to extract high-quality representations which are fed to the decoder network.
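The self-attention computation described above can be sketched with plain Python lists. This is a single attention head with random, untrained projection matrices and invented sizes; the division by the square root of the key dimension follows the scaled dot-product attention of Vaswani et al., and everything else a real transformer layer contains (multiple heads, residual connections, normalization) is left out.

```python
import math
import random

random.seed(0)
D_MODEL = 4   # illustrative embedding size

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    # Three linear layers produce the query, key, and value vectors.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = math.sqrt(len(K[0]))
    # Score every word against every other word, then normalize per row.
    scores = [[s / scale for s in row] for row in matmul(Q, transpose(K))]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of the value vectors.
    return matmul(weights, V), weights

X = rand_matrix(3, D_MODEL)   # three toy "word" vectors
W_q, W_k, W_v = (rand_matrix(D_MODEL, D_MODEL) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(weights)
```

Every row of the weight matrix covers all positions in the input at once, which is what lets each word draw on context from anywhere in the sequence.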
The decoder network takes both the target sequence generated so far (shifted right) and the encoder outputs as inputs. The first attention layer of the decoder works the same way as in the encoder, but operates on the target sequence. The only difference is that it is masked: while processing the current word, it is unaware of future words.
The second attention layer takes the output of the encoder and the output of the decoder's first attention layer and decides which encoder outputs to focus on. Its output is fed to the feed-forward network for the point-wise operation and then into the final linear layer with a softmax function, which outputs the final result.
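The "unaware of future words" behaviour of the decoder's first attention layer comes from a causal mask applied to the attention scores before the softmax. The sketch below shows only that masking step on a hand-written score matrix; implementations usually achieve the same effect by setting the masked scores to negative infinity.

```python
import math

def causal_softmax(scores):
    """Row i may only attend to positions 0..i; later positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[:i + 1]                  # drop the future positions
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Toy attention scores for a three-word target sequence.
scores = [
    [0.0, 5.0, 5.0],
    [1.0, 1.0, 9.0],
    [0.0, 0.0, 0.0],
]
masked = causal_softmax(scores)
for row in masked:
    print(row)
```

Even though the first two rows contain large scores for later positions, those positions end up with zero weight, so the first word attends only to itself and the second word only to the first two.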
What gives transformers the ability to remember long context is the self-attention mechanism.
One drawback of transformers is that they perform well only when fed a lot of data; for smaller datasets, architectures like RNNs or LSTMs are often sufficient.
2018 — Pretrained Models
“Recently, substantial work has shown that pre-trained models (PTMs), on the large corpus can learn universal language representations, which are beneficial for downstream NLP tasks and can avoid training a new model from scratch.” — (Xipeng Qiu et al., 2020)
A downstream task is a supervised learning task that utilizes a pre-trained model.
There are pre-trained embeddings such as skip-gram and GloVe, and there are pre-trained NLP models such as BERT, GPT-2, and ELMo. The former can capture the semantic meanings of words, but they are context-free and fail to capture higher-level concepts in context, such as polysemous disambiguation, syntactic structures, semantic roles, and anaphora. The latter focus on learning contextual word embeddings.
Most pre-trained models use the transformer architecture. Pre-trained language models can learn downstream tasks from significantly less labeled data. As language models only require unlabelled data for pre-training, they are particularly beneficial for low-resource languages where labeled data is scarce or limited.
Research in NLP is opening new doors not only in the advancement of science and technology but also in different societies and cultures. These algorithms can enable indigenous societies to get an education in their own language, breaking down barriers to education and social rights. They can also help us to explore more about our own cognition: the way we think and process information.
We take language to be a part of a system for understanding and communicating about situations. The human ability to understand and communicate about situations emerge gradually from experience and depends on domain-general principles of biological neural networks: connection-based learning, distributed representation, and context-sensitive, mutual constraint satisfaction-based processing... recent progress in this field depends on query-based attention, which extends the ability of these systems to exploit context and has contributed to remarkable breakthroughs — James L. McClelland et al., 2020
AI, in general, has taught us a lot about how our brains work and how they are able to understand context and remember things in the long run.
The evolution that we saw is a testament to human resilience and perseverance, and it may eventually lead us to human-level intelligence or to passing the Turing test.
- A Review of the Neural History of Natural Language Processing
- AI Limits: Can Deep Learning Models Like BERT Ever Understand Language
- Attention? Attention! Lilian Weng
- The illustrated transformer
- NLP ImageNet
- Rumelhart, D., Hinton, G. & Williams, R in 1986.
- Word Embeddings: A Survey
- Word Embeddings and Their Use In Sentence Classification Task
- Efficient Estimation of Word Representations in Vector Space
- Sequence to Sequence Learning with Neural Networks
- Bahdanau et al., 2015
- Attention Is All You Need
- Pre-trained Models for Natural Language Processing: A Survey