From Rule-Based Systems to Transformers: A Journey through the Evolution of Natural Language Processing

Seyed Saeid Masoumzadeh, PhD
Jun 19, 2023


Introduction

Natural Language Processing (NLP) has evolved significantly over the years, driven by technological advancements, increased availability of large-scale data, and the development of more sophisticated algorithms and models. In this blog post, I will give you an overview of the key milestones and advancements in the evolution of NLP.

Rule-based Systems:

In the early days of NLP, rule-based systems were used to process and analyze text. These systems relied on handcrafted linguistic rules and patterns to extract information from text, but they were limited in their ability to handle complex language structures.

A simple example of a rule-based system is a sentiment classifier for movie reviews. The system applies a set of if-then-else rules, encoding human knowledge about particular kinds of words, to decide whether a review is positive, negative, or neutral. Here are some example rules (a small Python sketch follows the list):

  1. If the review contains positive words like “great,” “excellent,” or “wonderful,” then classify it as positive.
  2. If the review contains negative words like “bad,” “awful,” or “terrible,” then classify it as negative.
  3. If the review mentions both positive and negative words, consider the overall sentiment based on the frequency and intensity of the words used.
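Here is a minimal Python sketch of such a classifier, assuming the tiny word lists from the rules above (a real system would rely on much larger hand-crafted lexicons and more elaborate rules):

```python
# A toy rule-based sentiment classifier; the word lists are illustrative only.
POSITIVE = {"great", "excellent", "wonderful"}
NEGATIVE = {"bad", "awful", "terrible"}

def classify(review: str) -> str:
    words = [w.strip(".,!?") for w in review.lower().split()]
    pos = sum(w in POSITIVE for w in words)   # rule 1: count positive words
    neg = sum(w in NEGATIVE for w in words)   # rule 2: count negative words
    if pos > neg:                             # rule 3: compare frequencies
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(classify("The acting was great and the ending was wonderful!"))  # positive
print(classify("An awful, terrible waste of time."))                   # negative
```

As the rules suggest, every nuance (negation, sarcasm, intensity) has to be anticipated by hand, which is exactly why these systems struggled with complex language.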

Statistical Methods:

In the 1990s, statistical approaches gained popularity in NLP. These methods relied on large corpora of text to learn patterns and probabilities associated with different linguistic phenomena. Statistical models, such as Hidden Markov Models and n-gram models, were used for tasks like language modelling, part-of-speech tagging, and machine translation.

Let me give you an example of how we can build a text-generation model using a statistical n-gram model. Consider the following sentence: “The cat sat on the mat.” To build an n-gram model, we break the sentence down into sequences of n consecutive words. Let’s use a bigram model, where n = 2. Here are the bigrams derived from the sentence:

"The cat"
"cat sat"
"sat on"
"on the"
"the mat"
Bigram generation (Reference: Language Model Concept behind Word Suggestion Feature)

Next, we estimate the probabilities of these bigrams from their occurrences in a large training corpus. For example, if “The cat” appears 50 times, “cat sat” appears 30 times, “sat on” appears 100 times, and “the mat” appears 70 times, we can estimate each bigram’s conditional probability as its count divided by the count of its first word, e.g. P(cat | The) = count(“The cat”) / count(“The”).

After building this model, you can generate sentences by following these steps:

  1. Begin by selecting an initial word or phrase to serve as the starting point for sentence generation, for example “The fox”.
  2. Use the n-gram probabilities to determine the most likely next word given the current context.
  3. Add the chosen word to the current context and shift the context window to include the latest word.
  4. Repeat the process of selecting the next word based on the current context and extending the sentence until the desired length is reached.
  5. Establish a termination condition to stop the sentence generation process.

By repeatedly selecting the most probable next word based on the n-gram model probabilities and expanding the context, the sentence generation process continues until the termination condition is met.
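Here is a minimal Python sketch of this procedure, assuming a toy corpus and greedy next-word selection (in practice the corpus would be far larger, and sampling in proportion to the probabilities is a common alternative to always picking the most likely word):

```python
from collections import defaultdict, Counter
import random

# Toy training corpus (assumption); a real model would use a large text collection.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word (bigram counts).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word(context: str) -> str:
    """Return the most probable next word given the current word."""
    counts = bigram_counts.get(context)
    if not counts:                       # unseen context: fall back to a random word
        return random.choice(corpus)
    return counts.most_common(1)[0][0]

# Generate a sentence: start from a seed word and repeatedly extend the context.
word, sentence = "the", ["the"]
for _ in range(6):                       # termination condition: maximum length
    word = next_word(word)
    sentence.append(word)
    if word == ".":                      # or stop at an end-of-sentence marker
        break
print(" ".join(sentence))
```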

Statistical n-gram models have two significant weaknesses:

  • Lack of Contextual Understanding: N-gram models only consider a fixed number of previous words as context. They do not capture long-range dependencies or have a deep understanding of the overall context.
  • Data Sparsity: As the size of n (the number of words in an n-gram) increases, the data sparsity problem becomes more pronounced. It becomes increasingly challenging to have sufficient training data for every possible n-gram combination.

Machine Learning:

With the rise of machine learning and neural networks, NLP entered a new era. Researchers started using neural networks to tackle various NLP tasks. Feedforward neural networks, recurrent neural networks (RNNs), and convolutional neural networks (CNNs) were applied to tasks like sentiment analysis, named entity recognition, and text classification.

Let’s revisit the text-generation example, this time using machine learning techniques. I will use an LSTM (Long Short-Term Memory) network for this task, a type of recurrent neural network that processes its input sequentially.

  • We create input-output pairs, where the input sequence consists of a fixed number of previous words (which need to be encoded first) and the output is the next word that follows. For example, if our input sequence length is 5, a training example would look like this:
Input: "The cat sat on the"
Output: "mat"
  • We build an LSTM-based neural network. The model consists of an input layer, an LSTM layer, an optional dropout layer, one or more fully connected layers, and a dense output layer with softmax activation (a small Keras sketch follows below).
LSTM model architecture for text generation
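Here is a minimal sketch of such a model, assuming TensorFlow/Keras; the vocabulary size and layer widths are placeholder values:

```python
import tensorflow as tf

vocab_size = 10_000  # placeholder vocabulary size (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),               # encode word indices as vectors
    tf.keras.layers.LSTM(256),                                 # process the context sequentially
    tf.keras.layers.Dropout(0.2),                              # optional regularization
    tf.keras.layers.Dense(256, activation="relu"),             # fully connected layer
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # probability of each next word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Training then amounts to fitting the model on the encoded input sequences and their next-word targets described above.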

LSTM outperforms the statistical n-gram model due to the following aspects:

  • LSTM captures dependencies between previous and current inputs, allowing it to learn long-term patterns. In contrast, statistical models, such as n-gram models, consider only a fixed context window of preceding words and rely on statistical probabilities without explicitly modelling sequential relationships.
  • LSTM networks learn parameters from data through an optimization process, typically using gradient-based methods. They adapt the model’s internal parameters based on the training data, allowing them to capture complex patterns and generalize well to unseen examples. Statistical models, on the other hand, rely on predefined rules or probability distributions derived from training data and lack the ability to adapt their parameters.

Transformers:

The advancements in Natural Language Processing (NLP) have relied heavily on addressing the challenge of capturing long-range dependencies. While models built on recurrent and convolutional layers, sometimes augmented with attention mechanisms, made significant progress, they still struggled to capture these dependencies effectively.

The “Attention Is All You Need” paper was published in June 2017 by Vaswani et al. It introduced the Transformer architecture, which revolutionized Natural Language Processing (NLP). The paper proposed replacing recurrent and convolutional architectures entirely with a new, attention-based architecture, the Transformer, which was able to capture the long-range dependencies that previous architectures struggled with.

Transformer architecture

Transformers often use an encoder-decoder architecture for tasks like translation and summarization; however, the encoder or the decoder can also be used on its own for certain tasks. The architecture consists of the following components (a small sketch of the attention computation follows the list):

  • Positional encoding, which is employed to address the issue of word order.
  • Multi-head attention, enabling the model to simultaneously consider various dependencies.
  • Feed-forward networks, responsible for processing the outputs of attention and improving pattern recognition.
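To make the attention component more concrete, here is a small NumPy sketch of the scaled dot-product attention that multi-head attention is built from (a single head with toy dimensions; a full implementation would add the learned projection matrices and multiple heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of the values

# Toy example: 6 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(x, x, x).shape)     # (6, 8)
```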

Now let’s come back to our text-generation example:

1. Tokenization: Tokenize the input text into individual words or subword units. Let’s assume the tokenized representation is: [‘the’, ‘cat’, ‘sat’, ‘on’, ‘the’, ‘mat’].

2. Vocabulary Creation: Build a vocabulary by assigning a unique index to each unique token in the tokenized data. This creates a mapping between tokens and their corresponding numerical representations.
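For instance, a minimal Python sketch assuming simple whitespace tokenization:

```python
text = "the cat sat on the mat"
tokens = text.split()   # ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Assign a unique index to each unique token, preserving first-seen order.
vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}
token_ids = [vocab[tok] for tok in tokens]
print(vocab)       # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(token_ids)   # [0, 1, 2, 3, 0, 4]
```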

3. Positional Encoding: Generate positional encodings to provide information about the position of each word in the sequence. Positional encoding helps the transformer model understand the order of words. One common method is to use sinusoidal functions to compute the positional encodings. The formula for positional encoding is:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where `pos` is the position of the word in the sequence, `i` is the dimension index, and `d_model` is the dimensionality of the transformer model.
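A direct NumPy implementation of these two formulas might look like this (the sequence length and `d_model` are placeholder values, and `d_model` is assumed to be even):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: PE(pos, 2i) = sin(...), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

print(positional_encoding(6, 16).shape)          # one encoding vector per position: (6, 16)
```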

4. Input Data Preparation: For each input sequence, combine the tokenized words with their corresponding positional encodings. The input sequence may look like this:

Input: ['the', 'cat', 'sat', 'on', 'the', 'mat']
Positional Encoding: [PE1, PE2, PE3, PE4, PE5, PE6]

5. Output Data Preparation: For text generation, the output data would be the same as the input data, shifted by one position. For example, the output sequence would be:

Output: ['cat', 'sat', 'on', 'the', 'mat', <end>]

`<end>` can be a special token indicating the end of the sequence.
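Putting steps 4 and 5 together, here is a small sketch of how the shifted input/output pairs can be built (reusing the toy tokens above; `<end>` is the special end-of-sequence token):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# The target sequence is the input shifted left by one position,
# with <end> marking the end of the sequence.
inputs = tokens
targets = tokens[1:] + ["<end>"]

for inp, tgt in zip(inputs, targets):
    print(f"{inp:>4} -> {tgt}")   # the -> cat, cat -> sat, ..., mat -> <end>
```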

By combining the tokenized words with their respective positional encodings and shifting the output sequence accordingly, you can prepare the input and output data to feed into a transformer model for text generation. I also found a helpful video by Andrej Karpathy that provides valuable insights and guidance on writing a text-generation model with transformers.

Pretraining and Large Language Models (LLMs):

The emergence of transformers has led to a surge in the popularity of large-scale pre-trained models. Techniques such as BERT (Bidirectional Encoder Representations from Transformers) involve training models on extensive text datasets, followed by fine-tuning for specific tasks. This transfer learning approach has proven effective in enabling models to acquire a broad understanding of language and achieve remarkable performance on various downstream tasks, even when training data is scarce. Consequently, open-source communities like Hugging Face have arisen, facilitating easy access and utilization of these models for a wide range of users.
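As an illustration of how accessible these pretrained models have become, a few lines with the Hugging Face transformers library are enough to run a fine-tuned BERT-style model on a downstream task (this is a sketch: the default checkpoint is chosen and downloaded automatically by the library):

```python
from transformers import pipeline

# Loads a default pretrained sentiment-analysis model and its tokenizer.
classifier = pipeline("sentiment-analysis")
print(classifier("The movie was great, the ending was wonderful!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```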

OpenAI, an AI research laboratory, played a pivotal role in developing transformer-based language models, starting with GPT (Generative Pretrained Transformer). By feeding vast amounts of data to the model, it became capable of generating novel, human-like content. Subsequently, new generations of GPT models, such as GPT-2, GPT-3, and GPT-4, were introduced, offering safer and more useful responses while preserving their impressive generative capabilities.
