All about Large Language Models

Atul Verma
6 min read · Apr 19, 2024



In the ever-evolving realm of artificial intelligence, Large Language Models (LLMs) are the buzzing core, shaping the discourse and innovation worldwide.

Large Language Models (LLMs) are sophisticated artificial intelligence systems built on deep learning architectures, notably the Transformer model. They are trained on vast datasets to understand and generate human-like text. The most popular examples are OpenAI’s GPT (Generative Pre-trained Transformer) models, which excel in a wide range of natural language processing tasks, showcasing advanced capabilities in language understanding, generation, and contextual reasoning. LLMs have played a pivotal role in pushing the boundaries of AI applications, from chatbots and language translation to content creation and information retrieval.

What went wrong with RNNs?

Traditionally, people used RNNs for NLP tasks. To understand what went wrong with RNNs, we need to understand how they work.

Recurrent Neural Networks (RNNs) are a type of artificial neural network designed for sequential data processing. Unlike traditional feedforward neural networks, RNNs have connections that form a directed cycle, allowing them to maintain a memory of previous inputs. This architecture enables RNNs to capture temporal dependencies and handle sequences of varying lengths.

RNN Architecture

However, RNNs have challenges with long-term dependencies, as they may struggle to retain information over many time steps. This limitation led to the development of more advanced models, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which address the vanishing gradient problem and enhance the ability to capture long-range dependencies in sequential data.

Consider the following example, where a network predicts the next word of the sentence:

The milk is bad, my tea tastes _____

The RNN would probably predict ‘great’. This might be a good prediction in the general case; however, it becomes a poor one when the preceding words should affect the current prediction.

The milk is bad, my tea tastes ‘great’.

Here the context of ‘milk being bad’ should have influenced the prediction, but it did not. This is one of the major drawbacks of RNNs: they struggle to capture long-range dependencies and the context of other words in the sentence.

There are 3 major problems with LSTMs and RNNs:

  • Sequential computation inhibits parallelization (see the sketch after this list)
  • No explicit modeling of long and short range dependencies
  • “Distance” between positions is linear
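
To make the first point concrete, here is a minimal NumPy sketch of a vanilla RNN cell. The dimensions, weights, and inputs are made-up toy values, not anything from a real model; the point is simply that each hidden state depends on the previous one, so the time steps cannot be processed in parallel.

```python
import numpy as np

# Minimal vanilla RNN cell with toy, made-up dimensions (illustration only).
rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 8, 16, 5

W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))      # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden weights (the recurrence)
b_h = np.zeros(d_hidden)

x = rng.normal(size=(seq_len, d_in))  # a toy sequence of 5 token embeddings
h = np.zeros(d_hidden)                # initial hidden state

# Strictly sequential: step t cannot start before step t-1 has finished,
# which is why RNNs are hard to parallelize across time steps.
for t in range(seq_len):
    h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)

print(h.shape)  # (16,) -- this single vector must summarize the whole sequence
```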

For more info you can refer to the links in the references.

The Rise of Transformers

While LSTMs and GRUs provided significant improvements over vanilla RNNs, they still suffer from serious issues such as high computational complexity, large memory consumption, long training times, and very limited parallelization.

The Transformer architecture emerged as a revolutionary solution to these challenges. Introduced by Vaswani et al. in the “Attention is All You Need” paper, Transformers rely on self-attention mechanisms to capture dependencies regardless of distance in a sequence.

Self Attention

Before we dive deep into the Transformer architecture, it’s important to understand the self-attention mechanism.

How does self-attention work?

Input Representation: Each element (e.g., word or token) in the sequence is associated with an embedding vector, capturing its semantic information.

Query, Key, and Value Vectors: The embedding vectors are linearly transformed into three sets of vectors: Query (Q), Key (K), and Value (V) vectors. These transformations are learned during the training process.

Attention Scores: For each element in the sequence, the model computes attention scores by taking the dot product of the Query vector of that element with the Key vectors of all elements in the sequence. These scores represent the relevance of other elements to the current one.

Scaled Attention Scores: The raw attention scores are scaled by the square root of the dimension of the Key vectors to prevent the scores from becoming too large.

Softmax and Weighted Sum: The scaled attention scores are passed through a softmax function to obtain normalized attention weights. These weights determine how much each element contributes to the final representation. The elements are then combined (weighted sum) using these attention weights to create a context-aware representation for each element.

The self-attention mechanism allows each element in a sequence to focus on different parts of the sequence, generating context-aware representations. For each position in the input sequence, self-attention computes weighted sums of all positions, assigning higher weights to more relevant positions. This way, the model can weigh the importance of different words dynamically based on the context.
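
To make these steps concrete, here is a minimal NumPy sketch of scaled dot-product self-attention on a toy sequence. The sizes and the random projection matrices are placeholders for what a trained model would actually learn.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8          # toy sizes chosen for illustration

X = rng.normal(size=(seq_len, d_model))  # one embedding vector per token

# Learned projections (random stand-ins here) for Query, Key, and Value
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)          # scaled attention scores
weights = softmax(scores, axis=-1)       # each row sums to 1
output = weights @ V                     # context-aware representation per token

print(weights.shape, output.shape)       # (4, 4) (4, 8)
```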

Multi-Head Self Attention

Multi-Head Self-Attention is an extension of the self-attention mechanism in the Transformer architecture. Instead of relying on a single set of attention weights to capture relationships in the input sequence, multi-head attention uses multiple sets of weights in parallel.

Attention weights from multiple self-attention heads are computed concurrently on the same input sequence, and the resulting outputs are concatenated.

The concatenated outputs are linearly transformed to produce the final multi-head self-attention output. This transformation combines the information captured by different heads and ensures compatibility with the model’s architecture.

The key advantage of Multi-Head Self-Attention is its ability to capture diverse relationships and patterns in parallel. Each attention head specializes in attending to different parts of the input sequence, enhancing the model’s capacity to understand complex dependencies.
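
Roughly speaking, multi-head attention runs the single-head computation above on several smaller slices of the model dimension and concatenates the results. The toy NumPy sketch below (no masking, no batching; names and sizes are illustrative only) shows that shape bookkeeping.

```python
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Toy multi-head self-attention: single sequence, no masking, no batching."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    Q, K, V = X @ W_q, X @ W_k, X @ W_v                        # (seq_len, d_model)

    # Split each projection into `num_heads` smaller heads.
    def split(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)

    Qh, Kh, Vh = split(Q), split(K), split(V)

    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)      # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)             # per-head attention weights

    heads = weights @ Vh                                       # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return concat @ W_o                                        # final linear transformation

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (4, 8)
```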

The Transformer Architecture

Now that you have a decent understanding of the self-attention mechanism, it’s time to understand the Transformer architecture.

Transformer Architecture

Transformers primarily consist of two blocks: an encoder and a decoder.

  • Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  • Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Each of these parts can be used independently, depending on the task:

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition. Examples: BERT and RoBERTa
  • Decoder-only models: Good for generative tasks such as text generation. Examples: GPT and BLOOM
  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization. Examples: T5 and BART. (A short usage sketch of all three model types follows this list.)
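
As a quick illustration of these three families, the sketch below uses the Hugging Face transformers library (assuming it is installed via pip install transformers); the checkpoints named here are just common small examples of each model type, not recommendations.

```python
# Sketch using the Hugging Face `transformers` library (pip install transformers).
# Model names are common small checkpoints, chosen only to illustrate the three families.
from transformers import pipeline

# Encoder-only (BERT): understanding tasks, e.g. predicting a masked word.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The milk is bad, my tea tastes [MASK].")[0]["token_str"])

# Decoder-only (GPT-2): open-ended text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("The milk is bad, my tea tastes", max_new_tokens=5)[0]["generated_text"])

# Encoder-decoder (T5): sequence-to-sequence tasks such as translation.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The milk is bad.")[0]["translation_text"])
```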

For more info you can refer to the links in the references.

Generative configuration parameters for inference

Let’s now learn about some widely used configurations and terms used in Generative AI.

  • Prompt: The input to a language model. The prompt can be a question, an instruction, or any form of input that the model uses to generate text. [Refer to learnprompting.org for a detailed guide to prompting.]
  • Completion: It is the text generated by the language model in response to the provided prompt.
  • Token and Tokenization: In NLP, a token is a single unit of text, which could be a word, punctuation mark, or subword. Tokenization is the process of breaking down a piece of text into tokens. This process is essential for many NLP tasks because it allows the model to understand and process the text effectively.
  • Temperature: It controls the randomness of the generated text. A high temperature (T > 1) flattens the probability distribution, making less probable tokens more likely to be sampled, which results in more random or creative outputs. Conversely, a low temperature (0 < T < 1) sharpens the distribution, concentrating the probability mass on the most likely tokens, which results in more deterministic outputs.
  • Top-k: Sample the next token only from the k highest-probability tokens, using their renormalized probabilities as weights. For example, with k = 3 the model picks among the three most likely tokens.
  • Top-p (nucleus sampling): Sample the next token from the smallest set of top-ranked tokens whose cumulative probability reaches p. For example, with p = 0.30 the model picks among the most likely tokens whose probabilities add up to at least 0.30. (A combined sketch of these sampling controls follows this list.)
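
The toy NumPy sketch below pulls these knobs together: temperature scaling of the logits, followed by optional top-k and top-p filtering before sampling. The vocabulary and logit values are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up next-token distribution over a tiny vocabulary (illustration only).
vocab = ["great", "bad", "awful", "sour", "fine"]
logits = np.array([3.0, 2.2, 1.5, 1.0, 0.5])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    probs = softmax(logits / temperature)       # temperature scaling
    order = np.argsort(probs)[::-1]             # token indices sorted by probability

    if top_k is not None:
        order = order[:top_k]                   # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        order = order[:cutoff]                  # smallest set whose cumulative mass >= top_p

    kept = probs[order] / probs[order].sum()    # renormalize over the kept tokens
    return vocab[rng.choice(order, p=kept)]

print(sample(logits, temperature=0.5))              # sharper distribution: almost always "great"
print(sample(logits, temperature=1.5, top_k=3))     # flatter, but restricted to the top 3 tokens
print(sample(logits, temperature=1.0, top_p=0.30))  # nucleus sampling with p = 0.30
```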

That’s all for today; it would be a really long article if I tried to compile all the information about LLMs. We’ll talk more about LLM pre-training and fine-tuning in the upcoming articles. Stay tuned.

Any suggestions and feedback are welcome.

References

  • Vaswani et al., “Attention Is All You Need” (2017), https://arxiv.org/abs/1706.03762
  • learnprompting.org (a detailed guide to prompting)
