Transformers in Large Language Models

Sarath Samynathan
Published in Version 1
Jul 19, 2024

OpenAI’s GPT (Generative Pre-trained Transformer) has rapidly captured global attention, amassing over a million users thanks to its revolutionary technology. With applications across a wide range of fields, many people are curious about how GPT actually works. This blog post aims to demystify GPT and provide a clear understanding of its underlying mechanics.

What Are Large Language Models (LLMs)?

Large Language Models (LLMs) are trained on extensive amounts of text data, enabling them to generate coherent and fluent text. These models excel at various natural language processing tasks such as language translation, text summarization, and conversational agents. Their effectiveness comes from pre-training on vast corpora of text data and from the ability to be fine-tuned for specific tasks. GPT is an example of an LLM, characterized as “large” due to the billions of parameters it contains. For instance, GPT-3 has 175 billion parameters and was trained on a massive corpus of text data.

The basic premise of a language model is its ability to predict the next word or sub-word (token) based on the text it has seen so far. This is achieved by assigning probabilities to candidate tokens, with the highest-probability token typically chosen and appended to the input. This process repeats until a special <stop> token is selected.
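As a rough illustration, here is a minimal greedy decoding loop in Python. The `language_model` function, the toy vocabulary, and the token ids are hypothetical stand-ins for illustration only, not a real GPT API.

```python
import numpy as np

# A minimal, illustrative decoding loop. `language_model`, the toy vocabulary,
# and the token ids are hypothetical stand-ins, not a real GPT API.
VOCAB = ["<stop>", "the", "cat", "sat", "on", "mat"]
STOP_ID = 0

def language_model(token_ids):
    """Pretend LLM: returns one logit (unnormalised score) per vocabulary token."""
    rng = np.random.default_rng(len(token_ids))   # deterministic toy scores
    return rng.normal(size=len(VOCAB))

def softmax(logits):
    exp = np.exp(logits - logits.max())           # subtract max for numerical stability
    return exp / exp.sum()

def generate(prompt_ids, max_new_tokens=10):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(language_model(tokens))   # probability for every vocabulary token
        next_id = int(np.argmax(probs))           # greedy: pick the most likely token
        tokens.append(next_id)
        if next_id == STOP_ID:                    # the <stop> token ends generation
            break
    return tokens

print([VOCAB[i] for i in generate([1, 2])])       # e.g. ['the', 'cat', ...]
```

In practice, models often sample from the probability distribution rather than always taking the single most likely token, but the loop above captures the basic idea of next-token prediction.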

The deep learning architecture that has made this process feel so human-like is the Transformer. Let us now briefly walk through it.

The Transformer Architecture: The Building Block

The transformative power of GPT lies in its foundation: the Transformer architecture. Introduced in the 2017 paper “Attention Is All You Need,” the Transformer architecture has fundamentally changed how language models are built. In simplified form, it can be broken down into the components described below.

There are seven important components in the Transformer architecture. Let’s go through each of them and understand, in a simplified manner, what they do:

1. Inputs and Input Embeddings: Tokens entered by the user are the model’s inputs. Since models understand numbers rather than text, these inputs are converted into numerical representations called “input embeddings.” Input embeddings represent words as vectors of numbers, allowing the model to process them. These embeddings form a learned dictionary (a lookup table) that helps the model understand word meanings by placing them in a mathematical space where similar words sit near each other.

2. Positional Encoding: The order of words in a sentence is crucial to its meaning, yet traditional machine learning models do not inherently grasp this order. Positional encoding addresses this by numerically encoding each word’s position in the input sequence. This encoding, combined with the input embeddings, lets the Transformer architecture understand word order and so generate grammatically correct, meaningful output (a short code sketch of input embeddings and positional encoding appears after this list).

3. Encoder: The encoder processes the input text and generates hidden states that capture its meaning and context. The input text is first tokenized into a sequence of tokens; the encoder’s self-attention layers then produce hidden states that represent the text at different levels of abstraction. The Transformer stacks multiple encoder layers.

4. Outputs (Shifted Right): During training, the decoder learns to predict the next word by looking only at the words that precede it. To enforce this, the target output sequence is shifted one position to the right, so at each step the decoder sees only the preceding words. GPT is trained this way on extensive text data, including the Common Crawl web corpus, the BooksCorpus dataset, and English Wikipedia, which together provide a vast language dataset for learning.

5. Output Embeddings: Similar to input embeddings, the output must be converted to a numerical format called “output embeddings.” These embeddings also undergo positional encoding. A loss function measures the difference between model predictions and actual target values, adjusting the model to improve accuracy. Output embeddings are used during training to compute the loss function and during inference to generate output text.

6. Decoder: The decoder generates the output sequence one token at a time. It applies masked self-attention over the tokens produced so far, so it cannot peek at future tokens, and it attends to the encoder’s hidden states to draw on the meaning and context of the input. Like the encoder, the decoder consists of multiple stacked layers.


7. Linear Layer and Softmax: After the decoder produces its output embeddings, a linear layer projects them onto the vocabulary, giving one score (logit) for every token in the vocabulary. The softmax function then turns these scores into a probability distribution, from which the next output token can be chosen.
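To make the first two components above more concrete, here is a minimal sketch, assuming a toy vocabulary size, model dimension, and randomly initialised embedding matrix (in a real model the embeddings are learned). It combines an embedding lookup with the sinusoidal positional encoding used in the original Transformer paper.

```python
import numpy as np

# Toy sketch of components 1 and 2: turning token ids into input embeddings
# and adding sinusoidal positional encodings. The vocabulary size, model
# dimension and random embedding matrix are illustrative assumptions.
vocab_size, d_model, seq_len = 100, 16, 6

embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02   # learned in practice

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                        # odd dimensions: cosine
    return pe

token_ids = np.array([5, 23, 42, 7, 9, 1])                       # tokenised input (toy ids)
x = embedding_matrix[token_ids]                                  # input embeddings
x = x + positional_encoding(seq_len, d_model)                    # add word-order signal
print(x.shape)                                                   # (6, 16)
```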

The Concept of Attention Mechanism

Attention is all you need.

The Transformer architecture outperforms earlier approaches such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) for natural language processing. The main reason for this superior performance is the “attention mechanism” at the heart of the Transformer. The attention mechanism lets the model focus on different parts of the input sequence when producing each output token.
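Here is a minimal sketch of scaled dot-product attention, the core operation behind this mechanism. The shapes and random inputs are toy assumptions; real models compute the queries, keys, and values with learned projection matrices and run many attention heads in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # how strongly each position attends to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1: the attention weights
    return weights @ V, weights               # weighted mix of the value vectors

seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)             # queries
K = np.random.randn(seq_len, d_k)             # keys
V = np.random.randn(seq_len, d_k)             # values
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)            # (4, 8) (4, 4)
```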

  • RNNs have no attention mechanism; they simply work through the input one word at a time. Transformers, by contrast, can process the entire input sequence at once, which makes them faster and better able to capture complicated relationships between words in the input sequence.
  • LSTMs use a hidden state to remember what happened earlier in the sequence, but they can still struggle to learn dependencies that span very long sequences (the vanishing gradient problem). Transformers perform better because their attention mechanism lets them look at all the input and output words simultaneously and work out how they relate, which makes them very good at capturing long-term connections between words.

Summary

The attention mechanism gives the Transformer several advantages:

  1. Selective Focus: The model can prioritize crucial parts of the input, improving text generation accuracy.
  2. Long-Term Dependencies: It captures relationships between distant words, essential for context understanding.
  3. Parameter Efficiency: Fewer parameters are needed for long-term dependencies, making the model efficient.
  4. Handling Variable Lengths: The model adjusts its attention based on sequence length, so it can process inputs of different lengths effectively (see the masking sketch below).
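As a small illustration of the last point, here is a sketch of how a padding mask lets the same attention computation handle sequences of different lengths. The sequence lengths and scores are made-up toy values: padded positions get a score of negative infinity so that, after softmax, they receive zero attention weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

max_len, real_len = 5, 3                              # sequence padded from length 3 to 5
scores = np.random.randn(max_len, max_len)            # raw attention scores (toy values)
mask = np.arange(max_len) >= real_len                 # True where the token is padding
scores = np.where(mask[None, :], -np.inf, scores)     # block attention to padded positions
weights = softmax(scores, axis=-1)
print(np.round(weights[0], 3))                        # the last two weights are 0.0
```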

Conclusion

This blog post has provided an introduction to Large Language Models (LLMs) and the Transformer architecture powering models like GPT. LLMs, pre-trained on massive text corpora, have revolutionized natural language processing. The Transformer architecture, with its attention mechanism, enables models like GPT to generate accurate and contextually relevant text. With capabilities in text generation, summarization, and question answering, LLMs are unlocking exciting new possibilities for communication and human-machine interaction.

About the Author

Sarath Samynathan is an AI Data Scientist at Version 1.
