A Natural Transition from “Bag of words” to “Transformers” — A Recap of NLP’s Journey So Far

Sojan George
6 min read · Jun 20, 2020


Natural Language Processing (NLP) has made incredible strides over the last decade. However, we are still far from machines achieving human-like inference over natural language, which has often been touted as the final step toward Artificial General Intelligence (AGI).

The latest techniques used in this field, however, were not built overnight. They are the result of a natural progression of NLP technology. Many approaches were the "latest breakthrough" at some point in time, but as more data and cheaper compute power became available, these approaches proved inadequate.

This article highlights how NLP evolved over the last decade, giving a high-level view of the challenges each technique faced and how a slow and steady evolution of technology brought us to where we are today.

Information Gain — At Each Step

Unlike structured information, gaining insights from unstructured data such as text is not a straightforward task. Textual information has to be converted into numbers or matrices before machines can work with it, and a lot of information is lost in this conversion. The overall goal is to find a method that does this translation with minimal loss of information.

Today's go-to techniques in NLP leverage attention-based models and transformers. The diagram below (Figure 1) highlights the different approaches that played a significant role in arriving at the current state of NLP.

Figure 1: Evolution of Natural Language Processing

Each approach was significant because it provided a substantial information gain compared to the then-existing norm. In spite of this, each had gaps that created the need for a different approach. Every technique built on what came before it and tried to overcome the challenges of its predecessor.

From Bag-of-words to Attentions Models

NLP is the branch of AI that aims to make machines understand textual information. Naturally, the first step of this journey (Figure 2) was to convert text into numbers, matrices or vectors so that machines could make sense of the underlying data. As shown in the diagram, each step introduced some aspect that brought an information or performance gain over the previous step.

Figure 2: The Building Blocks Of Today’s NLP

· Bag-of-words: “Converted textual information to numerical form”

That is exactly what the bag-of-words method did. It converted the words/tokens in a sentence into numbers/matrices using techniques like count vectorization, TF-IDF or n-grams. Coupled with cosine similarity, this allowed the NLP community to make significant progress in gaining insights from text, as sketched below. However, the major drawback of this technique was that it gave no importance to the underlying meaning of the words or their context.
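
To make this concrete, here is a minimal sketch of the idea, assuming scikit-learn is available; the documents and variable names are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three tiny example documents (purely illustrative).
docs = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "transformers changed natural language processing",
]

# Each document becomes a sparse vector of TF-IDF weights; word order
# and meaning are ignored, only (weighted) token counts matter.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Cosine similarity between document vectors: the first two documents
# score high because they share most of their tokens, the third does not.
print(cosine_similarity(X[0], X[1]))
print(cosine_similarity(X[0], X[2]))
```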

· Word Embeddings: “Extracted meaning from the represented words”

Word embedding techniques attempted to extract the meaning behind the words used in a sentence. The method aims to represent every word in the vocabulary in an n-dimensional space such that similar words appear close to one another (see the sketch below). Word2vec and GloVe are common models used to create vector representations of words. This is similar to how computer vision techniques extract n features of a face for facial recognition and compare them to distinguish different users. This was a significant step for the NLP community, and many believed it was the inflection point for NLP because it allowed "transfer learning" from one use case to another, which was not previously possible. However, one major drawback of this method was that it ignored the positional information of the words.
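
A rough illustration, assuming gensim (4.x) is available; the toy corpus and parameters below are illustrative assumptions, not anything from the article:

```python
from gensim.models import Word2Vec

# A toy corpus of pre-tokenised sentences (real models are trained on
# far larger corpora, or loaded pre-trained).
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
]

# Every word is mapped to a dense vector in an n-dimensional space.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# Words used in similar contexts end up close together in that space.
print(model.wv.most_similar("king", topn=3))
print(model.wv["queen"].shape)  # (50,)
```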

· RNN: “Extracted positional information”

RNN-based models gave significance to positional information as well. They could also handle use cases where input and output were of different lengths, which is required for language translation. As with any recurrent neural network, each word in a sentence is predicted based not only on the current input but also on prior inputs (see the sketch below). One drawback, however, was that this technique did not do well on long sentences, because prior information gets diluted as the sentence grows longer.
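
A minimal sketch of a recurrent layer in PyTorch; the dimensions and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64  # illustrative sizes
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

# A batch containing one "sentence" of 10 word embeddings.
sentence = torch.randn(1, 10, embed_dim)

# Each position's output depends on the current input and the hidden
# state carried over from all previous positions.
outputs, last_hidden = rnn(sentence)
print(outputs.shape)      # torch.Size([1, 10, 64]): one hidden state per word
print(last_hidden.shape)  # torch.Size([1, 1, 64]): summary of the whole sequence
```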

· LSTM-Based Models: “Solved the problem of vanishing gradients”

LSTM models helped overcome the vanishing gradient problem because they had a mechanism to forget irrelevant information and carry only relevant information forward. They used gates for this purpose, which enabled each step to decide what information needed to be kept and what could be discarded (see the sketch below). While LSTMs solved the vanishing gradient problem, they still had a drawback: the technique used only prior information for prediction. In many NLP tasks (like translation), however, making a good judgement also requires information about the words that come later.
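
A sketch of the same idea with an LSTM layer in PyTorch, again with illustrative dimensions:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64  # illustrative sizes
lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

sentence = torch.randn(1, 10, embed_dim)

# The LSTM's input, forget and output gates decide what enters, stays in
# and leaves the cell state, which is what lets information survive
# across long sequences.
outputs, (hidden_state, cell_state) = lstm(sentence)
print(outputs.shape)     # torch.Size([1, 10, 64])
print(cell_state.shape)  # torch.Size([1, 1, 64]): the gated "long-term memory"
```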

· Bi-Directional LSTM: “Considered the complete sentence in order to predict”

Bi-directional RNN or LSTM models worked very much like LSTM-based models. The only difference was that they took the complete text (both past and future words) into account in order to predict the present word; these models have both a forward recurrent component and a backward recurrent component (see the sketch below). One major disadvantage of this technique was that it required a complete sequence of data to make a prediction. Humans do not work like this. For example, when translating a text into another language, a human does not need to hear the complete text before starting; after hearing a substantial part, a human can give different attention to the words already heard and translate with a certain degree of confidence.
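
A sketch of the bidirectional variant, assuming the same illustrative PyTorch setup as above:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64  # illustrative sizes
bilstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)

sentence = torch.randn(1, 10, embed_dim)

# A forward pass reads the sentence left to right, a backward pass reads
# it right to left; their hidden states are concatenated at every
# position, so each position "sees" the complete sentence.
outputs, _ = bilstm(sentence)
print(outputs.shape)  # torch.Size([1, 10, 128]): 64 forward + 64 backward
```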

· Attention-based models: “Ability to focus on relevant input via attention weights”

These models differ from bi-directional RNN/LSTM models in that they look at an input sequence and decide at each step which other parts of the sequence are important or need attention. Apart from the forward recurrent component, the backward recurrent component, the hidden state and the previous output, the model also considers the weightage of the words surrounding the current context. These weightages, called attention weights, help give relevant weight to different parts of the input (a bare-bones sketch follows below). While this method improved accuracy and behaved in a more human-like way, it faced a challenge in terms of performance: because computation was done sequentially, it was difficult to scale in practical applications. The sequential nature of the architecture prevented parallelization.
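
A bare-bones sketch of attention weights, shown here in the scaled dot-product form for brevity (the seq2seq models described above used recurrent encoders with an additive score, but the core idea of softmax-normalised weights over the input is the same); all sizes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

seq_len, dim = 10, 64  # illustrative sizes
queries = torch.randn(1, seq_len, dim)
keys = torch.randn(1, seq_len, dim)
values = torch.randn(1, seq_len, dim)

# Every position scores every other position, the scores are
# softmax-normalised into attention weights, and the output is a
# weighted sum of the values.
scores = queries @ keys.transpose(-2, -1) / dim ** 0.5  # (1, 10, 10)
attention_weights = F.softmax(scores, dim=-1)           # each row sums to 1
context = attention_weights @ values                    # (1, 10, 64)

print(attention_weights[0, 0])  # how much position 0 attends to each position
print(context.shape)
```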

· Transformers: “Parallelized the processing of sequential data for better performance”

Transformers differed from sequence-to-sequence models in that they did not use an RNN at all, but instead relied entirely on the attention mechanism. Attention allowed transformers to process all elements simultaneously by forming direct connections between individual elements. Not only did this enable parallelization, it also resulted in a higher degree of accuracy across a range of tasks. To remember the order of the input, the position of each word is embedded into its representation. Even transformers had their limitations, though: attention often dealt with fixed-length text segments, which caused context fragmentation. This limitation was addressed by a modified architecture called Transformer-XL, where hidden states obtained from previous segments are reused as input to the current segment. There are different implementations of transformers, like BERT and ALBERT, which combine current trends such as attention-based architectures and transfer learning to get superior results (see the sketch below).
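
A short sketch of using a pre-trained transformer through the Hugging Face transformers library; the specific checkpoint (bert-base-uncased) and the example sentence are illustrative assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained BERT model and its tokenizer (downloads weights on
# first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("NLP has come a long way from bag of words.",
                   return_tensors="pt")

# The whole sequence is processed in parallel; positional embeddings
# preserve word order, and self-attention connects every token to every
# other token.
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)
```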

Conclusion:

So, are we done with this journey? Absolutely not. In fact, we are far from it.

Can today's NLP solutions comprehend all available information and respond in a human-like way? Unfortunately, no.

Today's NLP solutions are limited in their domain awareness and context awareness. Solutions are often heavily driven by a single domain, whereas ideally user intent should drive the domain and context of a solution, something humans are very good at. These challenges heavily limit their usage on real-world problems. For NLP solutions to become more human-like, we need to overcome this barrier, and doing so may be the next step in this natural progression. Most of today's NLU and NLG solutions are far from ideal and work only in controlled environments. NLP's journey is certainly not over, and moving forward will require us to challenge and change many of today's assumptions.
