NLP Lineage (Need of Transformers)

Anmol Talwar
10 min read · Oct 20, 2023


In this article, we will discuss the evolution of Natural Language Processing (NLP), covering a range of NLP modelling techniques, their architectures, and their limitations.

NLP Lineage

Bag of Words (BoW)

It is a simple and foundational technique used for text analysis and feature extraction. BoW represents a document as an unordered collection of words, ignoring the order and structure of the words in the text. It is a way to convert text data into a numerical format that can be used for various NLP tasks & modelling.

Here’s a simplified example of a BoW representation for three short documents:

Document 1 : “The quick brown fox”

Document 2 : “The lazy dog”

Document 3 : “The fox jumps over the dog”

Vocabulary: [“The”, “quick”, “brown”, “fox”, “lazy”, “dog”, “jumps”, “over”]

The BoW representations of these documents might look like this:

Document 1 : [1, 1, 1, 1, 0, 0, 0, 0]

Document 2 : [1, 0, 0, 0, 1, 1, 0, 0]

Document 3 : [1, 0, 0, 1, 0, 1, 1, 1]

These numerical representations can now be used as input for modelling tasks such as document classification, sentiment analysis, or information retrieval.
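
Below is a minimal Python sketch of how these vectors can be built for the example above. It is deliberately simple and case-sensitive so that it reproduces the hand-built vectors exactly; a library implementation such as scikit-learn's CountVectorizer would normally lowercase the text and order the vocabulary alphabetically, so its output would look slightly different.

```python
# Minimal BoW sketch for the example documents above.
docs = [
    "The quick brown fox",
    "The lazy dog",
    "The fox jumps over the dog",
]

# Vocabulary taken directly from the example (case-sensitive, insertion order).
vocabulary = ["The", "quick", "brown", "fox", "lazy", "dog", "jumps", "over"]

def bow_vector(doc, vocab):
    tokens = doc.split()
    # Count how many times each vocabulary word occurs in the document.
    return [tokens.count(word) for word in vocab]

for doc in docs:
    print(bow_vector(doc, vocabulary))
# [1, 1, 1, 1, 0, 0, 0, 0]
# [1, 0, 0, 0, 1, 1, 0, 0]
# [1, 0, 0, 1, 0, 1, 1, 1]
```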

BoW Limitations

  • Loss of Word Order : Doesn’t capture the order or context of words
  • Equal Importance of Words : Treats all words as equally important. Common words (like “the,” “and,” “is”) receive the same weight as more informative or meaningful words. Important words may be overshadowed by frequent but less important ones.
  • No Semantics : Words with different meanings but identical spellings are treated the same way. For example, “bat” as a flying animal and “bat” as sports equipment would have the same representation.
  • Fixed Vocabulary : Relies on a fixed vocabulary created from the training corpus. This leads to issues with out-of-vocabulary words, i.e., words not seen in the training corpus, such as domain-specific terminology.
  • High Dimensionality : Can result in high-dimensional feature vectors, especially when the corpus is large or diverse. This can lead to computational challenges and increased memory requirements.
  • Sparsity : BoW matrices are typically sparse because most words are absent from any given document. Sparse vectors can lead to inefficiency in model computations.

Term Frequency-Inverse Document Frequency (TF-IDF)

The goal of TF-IDF is to quantify how uniquely important a word is to a given document within a collection of documents. A word’s weight increases in proportion to how often it appears in the document, but is offset by how frequently the word appears across the whole corpus.

Let’s understand it using a small corpus of three documents:

Document 1 : “The cat sat on the mat.”

Document 2 : “The quick brown fox jumps over the lazy dog.”

Document 3 : “The cat and the fox are friends.”

Now, we want to calculate the TF-IDF score for the word “cat” in Document 1.

Term Frequency (TF)

Term Frequency measures how often a word appears in a document. It is calculated as the number of times the word appears in the document divided by the total number of words in the document.

For Document 1, the term frequency of “cat” is

  • TF(“cat”, Document 1) = (Number of times “cat” appears in Document 1) / (Total number of words in Document 1)
  • TF(“cat”, Document 1) = 1 / 6 (since “cat” appears once in Document 1, and there are 6 words in Document 1)
  • TF(“cat”, Document 1) ≈ 0.1667

Inverse Document Frequency (IDF)

Inverse Document Frequency measures how unique or important a word is across the entire corpus. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the word (smoothed variants add one to the denominator to avoid division by zero; the basic form is used here).

For “cat” in the entire corpus (three documents)

  • IDF(“cat”) = log((Total number of documents in the corpus) / (Number of documents containing “cat”))
  • IDF(“cat”) = log(3 / 2) (since “cat” appears in Documents 1 and 3; using the natural logarithm)
  • IDF(“cat”) ≈ 0.4055

TF-IDF Score

The TF-IDF score for “cat” in Document 1 is obtained by multiplying the TF and IDF values

  • TF-IDF(“cat”, Document 1) = TF(“cat”, Document 1) * IDF(“cat”)
  • TF-IDF(“cat”, Document 1) ≈ 0.1667 * 0.4055
  • TF-IDF(“cat”, Document 1) ≈ 0.0676

This TF-IDF score indicates that “cat” carries some distinguishing weight in Document 1, though words that appear in only one document (for example, “sat” or “mat”) would receive higher TF-IDF scores.
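
The same calculation can be reproduced with a short Python sketch (a minimal, unsmoothed implementation matching the formulas above; production code would typically use scikit-learn's TfidfVectorizer, which applies smoothing and normalisation and therefore gives slightly different numbers):

```python
import math

docs = [
    "The cat sat on the mat.",
    "The quick brown fox jumps over the lazy dog.",
    "The cat and the fox are friends.",
]

def tokenize(text):
    # Lowercase and strip punctuation so "The" and "mat." count as plain words.
    return [w.strip(".,").lower() for w in text.split()]

corpus = [tokenize(d) for d in docs]

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    return math.log(len(corpus) / df)             # basic, unsmoothed IDF

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

print(round(tf_idf("cat", corpus[0], corpus), 4))  # ≈ 0.0676
```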

TF-IDF Limitations

Though TF-IDF overcomes some of BoW’s limitations, most notably the equal weighting of all words, it still does not capture word order and retains the limitations below:

  • No Semantics
  • Fixed Vocabulary
  • High Dimensionality
  • Sparsity

Recurrent Neural Network

A Recurrent Neural Network (RNN) is a type of neural network designed for processing sequential data, such as time series, sequences of words, or any data where the order of elements matters.

In traditional feed-forward neural networks, all inputs and outputs are independent of each other. But when we need to predict the next word of a sentence, the previous words matter, so the network must remember them. RNNs have a recurrent structure that allows them to maintain hidden states and process input data one element at a time, while also considering the influence of previous elements in the sequence.

The same weights shared across RNN time steps

RNNs have the same input and output architecture as other deep neural networks; the difference lies in how information flows from input to output. Unlike feed-forward networks, where each dense layer has its own weight matrix, an RNN reuses the same weights at every time step. At each step, it computes the current hidden state from the previous hidden state and the current input.

RNN Unfolded

Flow of Data in RNN

  • The RNN processes the input sequence one element at a time, starting with the first element and proceeding sequentially through to the last element.
  • It computes a new hidden state, which is updated using a combination of the current input (e.g., a word vector) and the previous hidden state.
  • This hidden state acts as a memory of the sequence and contains information about the elements seen so far.
  • The network can unroll over as many time steps as the problem requires, accumulating information from all previous states.
  • Once all time steps are complete, the final hidden state is used to compute the output (see the sketch after this list). The predicted output is then compared with the target output to produce the error.
  • The error is then back-propagated through the network to update the weights; in this way the RNN is trained using Backpropagation Through Time (BPTT).
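
A minimal NumPy sketch of this forward pass is shown below. The sizes and random weight matrices are arbitrary placeholders chosen purely for illustration; a real implementation would learn these weights via Backpropagation Through Time.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 3

# One set of weights, reused at every time step (weight sharing).
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output
b_h = np.zeros(hidden_size)

def rnn_forward(sequence):
    h = np.zeros(hidden_size)                      # initial hidden state
    for x_t in sequence:                           # process one element at a time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new state from input + previous state
    return W_hy @ h                                # output from the final hidden state

sequence = [rng.normal(size=input_size) for _ in range(5)]  # e.g. five word vectors
print(rnn_forward(sequence))                       # unnormalised output scores
```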

RNN Limitations

Note: In most deep learning approaches, we use word embeddings such as Word2Vec or GloVe to encode sentences. These embeddings map words to dense vectors in which words with similar meanings have similar representations, and they are pre-trained on large corpora.

These traits introduce context that helps the model process the data, addressing the No Semantics problem of TF-IDF. Their pre-training on large corpora also largely mitigates the Fixed Vocabulary problem.

  • Vanishing Gradient : While weight sharing in RNNs has many advantages, it also introduces challenges, such as the vanishing gradient problem (when backpropagating errors through many time steps the gradients can become very small), which can limit the network’s ability to capture long-range dependencies in the data.
  • Exploding Gradient : RNN also suffer from the exploding gradient problem, where gradients grow exponentially, leading to instability during training.
  • Lack of Memory : Standard RNNs have limited memory because the entire history of the sequence must be compressed into a single fixed-size hidden state. As the sequence grows longer, the network’s ability to retain important information diminishes.
  • Difficulty with Variable-Length Sequences : Training in mini-batches typically requires padding or truncating sequences to a common length, which makes working with variable-length inputs, such as sentences of different lengths, cumbersome.
  • Training Time : RNNs can be computationally expensive and time-consuming to train, especially when dealing with large datasets and deep networks.
  • Limited Parallelism : RNNs process sequences sequentially, which makes them less efficient for parallel processing.

Long-Short Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to address the vanishing gradient problem and capture long-range dependencies in sequential data. LSTMs use a more sophisticated internal structure than traditional RNNs, allowing them to maintain and control information over extended sequences.

LSTM Architecture

The central idea behind LSTMs is the use of a cell state, which serves as a conveyor belt carrying information across time steps. The cell state can be thought of as a long-term memory that can store or discard information. It is modified and updated at each time step through a combination of the forget gate (which determines what to forget), the input gate (which decides what new information to add), and the output gate (which controls how much of the cell state is exposed as the hidden state). This allows the LSTM to capture and remember important information from earlier time steps.

Cell State-LSTM

LSTMs use the cell state as a long-term memory that can store and update information over time, while the hidden state is the output at each time step, representing the current context of the sequence. In standard RNNs, by contrast, the hidden state serves as both the short-term and long-term memory, which limits their ability to capture and remember long-range dependencies effectively.
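
The gating logic described above can be sketched in a few lines of NumPy. The weights below are random placeholders used purely to show how the forget, input, and output gates combine to update the cell state and hidden state; this is not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8

def init():
    return rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))

W_f, W_i, W_o, W_c = init(), init(), init(), init()   # gate weight matrices
b_f = b_i = b_o = b_c = np.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])    # current input + previous hidden state
    f = sigmoid(W_f @ z + b_f)           # forget gate: what to discard from the cell state
    i = sigmoid(W_i @ z + b_i)           # input gate: what new information to add
    o = sigmoid(W_o @ z + b_o)           # output gate: what to expose as the hidden state
    c_tilde = np.tanh(W_c @ z + b_c)     # candidate cell content
    c = f * c_prev + i * c_tilde         # updated cell state (long-term memory)
    h = o * np.tanh(c)                   # updated hidden state (short-term output)
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in [rng.normal(size=input_size) for _ in range(5)]:
    h, c = lstm_step(x_t, h, c)
print(h)
```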

LSTM Limitations

  • Difficulty with Extremely Long Sequences : While LSTMs are designed to capture long-range dependencies, they can still struggle with sequences that are exceptionally long, and the gradients may still vanish or explode under certain conditions.
  • Limited Understanding of Context : LSTMs, like other sequence models, may not fully grasp the semantics and context of the data. They learn patterns based on the available training data but may not have an inherent understanding of the content they process.
  • Limited Parallelism : LSTMs process sequential data sequentially, one element at a time, which limits their parallelism.
  • Complexity : LSTMs are more complex than traditional RNNs, which can make them computationally expensive and require more training data to generalize effectively. This complexity can also lead to overfitting, especially when dealing with small datasets.
  • Difficulty in Interpreting Results : The inner workings of an LSTM, especially when it comes to the gate mechanisms and cell state, can be challenging to interpret and understand. This can make it difficult to diagnose and debug issues in the model.

Transformers

Transformers have gained prominence in natural language processing (NLP) and various other sequence-related tasks.

At the heart of the Transformer architecture is the self-attention mechanism. This mechanism allows the model to weigh the importance of each element in the input sequence concerning all the other elements. It calculates a weighted sum of the input elements, which captures their dependencies and relationships in the sequence.
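
A minimal NumPy sketch of single-head scaled dot-product self-attention is shown below. The random embeddings and projection matrices are illustrative assumptions; a real Transformer learns these projections and adds positional encodings, multiple heads, and feed-forward layers on top.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                     # e.g. 5 tokens, 16-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))      # token embeddings (placeholder values)

# Learned projections in a real model; random here for illustration.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)          # every token scored against every other token
weights = softmax(scores, axis=-1)           # attention weights: each row sums to 1
output = weights @ V                         # weighted sum of the value vectors

print(weights.shape, output.shape)           # (5, 5) (5, 16)
```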

Other key features include multi-head attention, encoder-decoder architecture, positional encoding, stacking layers, and pre-training with fine-tuning. Transformers have achieved state-of-the-art results in NLP and are versatile for various tasks. They are commonly used in applications like text classification, machine translation, question answering, and text generation.

To know more about Transformer Architecture, go through my blog

They offer several advantages over traditional LSTM networks such as:

  • Parallelization : They can process input sequences in parallel, making them significantly faster than LSTMs. This parallelism allows for efficient use of modern hardware and accelerators, like GPUs and TPUs, leading to faster training and inference times.
  • Capturing Long-Range Dependencies : The self-attention mechanism allows them to relate any two positions in the input sequence, making them suitable for tasks where understanding context and dependencies across the entire sequence is essential.
  • Scalability : Transformers are highly scalable. They can be effectively used with both small and large datasets and can be trained on vast amounts of data, leading to state-of-the-art performance in many NLP tasks.
  • Generalization : Transformers generalize well across domains. Pre-trained transformer models, such as BERT, GPT, and RoBERTa, can be fine-tuned on domain-specific tasks with limited data to achieve strong performance.
  • Interpretability : Transformers offer better interpretability. The attention weights in the self-attention mechanism allow for understanding which parts of the input sequence are relevant to each other, providing more transparency in the model’s decision-making.
  • Handling Long Sequences : Self-attention lets Transformers relate distant positions directly, instead of passing information step by step through a fixed-size recurrent state as in LSTMs, although the cost of attention grows quadratically with sequence length.
  • Pre-trained Models : Pre-trained transformer models are readily available for various NLP tasks. They provide a strong starting point for fine-tuning on specific tasks, significantly reducing the amount of training data and time required.
  • State-of-the-Art Performance : Transformers have achieved state-of-the-art performance in a wide range of NLP tasks, including text classification, machine translation, question answering, and text generation. They have outperformed LSTMs in many benchmarks.
  • Attention to Relevance : The self-attention mechanism allows Transformers to dynamically attend to the most relevant parts of the input sequence. This flexibility helps them focus on important information and disregard noise, which is a significant advantage in tasks like language understanding and generation.
  • Better Handling of Multiple Modalities : Transformers can naturally handle multiple modalities, such as text and images, by processing them together in a multi-modal architecture. LSTMs are less effective at handling such multi-modal data.

To learn more about the types of Transformer architectures, go through my blog
