A Tutorial on LLM

Haifeng Li
15 min read · Sep 14, 2023

Generative artificial intelligence (GenAI), especially ChatGPT, has captured everyone’s attention. Transformer-based large language models (LLMs), trained on vast quantities of unlabeled data at scale, demonstrate the ability to generalize to many different tasks. To understand why LLMs are so powerful, this post takes a deep dive into how they work.

LLM Evolutionary Tree

Formally, a decoder-only language model is simply a conditional distribution p(x_i | x_1, …, x_{i−1}) over the next token x_i given the context x_1, …, x_{i−1}. Such a formulation is an example of a Markov process, which has been studied in many use cases. This simple setup also allows us to generate text token by token in an autoregressive way.
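As a minimal sketch of what autoregressive generation looks like, assuming a hypothetical next_token_distribution(tokens) function that returns the model’s distribution over the next token:

```python
import numpy as np

def generate(next_token_distribution, prompt_tokens, max_new_tokens, vocab_size):
    """Sample one token at a time from p(x_i | x_1, ..., x_{i-1})."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)            # shape: (vocab_size,)
        next_token = np.random.choice(vocab_size, p=probs)  # sample the next token
        tokens.append(int(next_token))
    return tokens
```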

Before our deep dive, I have to call out a limitation of this formulation with respect to reaching artificial general intelligence (AGI). Thinking is a non-linear process, but our communication device, the mouth, can only speak linearly. Therefore, language appears as a linear sequence of words, and it is a reasonable start to model language with a Markov process. Still, it is hard to believe that such a formulation can capture the thinking process (or AGI) completely. On the other hand, thinking and language are interrelated, and a strong enough language model may still demonstrate some sort of thinking capability, as GPT4 shows. In what follows, let’s look at the scientific innovations that make LLMs appear intelligent.

Transformer

There are many ways to model/represent the conditional distribution p(x_i | x_1, …, x_{i−1}). In LLMs, we attempt to estimate this conditional distribution with a neural network architecture called the Transformer. In fact, neural networks, especially a variety of recurrent neural networks (RNNs), had been employed in language modeling for a long time before the Transformer. RNNs process tokens sequentially, maintaining a state vector that contains a representation of the data seen prior to the current token. To process the n-th token, the model combines the state representing the sentence up to token n−1 with the information of the new token to create a new state, representing the sentence up to token n. Theoretically, the information from one token can propagate arbitrarily far down the sequence, if at every point the state continues to encode contextual information about the token. Unfortunately, the vanishing gradient problem leaves the model’s state at the end of a long sentence without precise, extractable information about preceding tokens. The dependency of token computations on the results of previous token computations also makes it hard to parallelize computation on modern GPU hardware.

These problems were addressed by the self-attention mechanism in the Transformer. The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The attention layer can access all previous states and weigh them according to a learned measure of relevance, providing relevant information about far-away tokens. Importantly, Transformers use attention without an RNN, processing all tokens simultaneously and calculating attention weights between them in successive layers. Since the attention mechanism only uses information about other tokens from lower layers, it can be computed for all tokens in parallel, which leads to improved training speed.

The input text is parsed into tokens by a byte-pair encoding tokenizer, and each token is converted into an embedding vector. Then, positional information of the token is added to the embedding. The transformer building blocks are scaled dot-product attention units. When a sentence is passed into a transformer model, attention weights are calculated between all tokens simultaneously. The attention unit produces, for every token in context, an embedding that contains information about the token itself along with a weighted combination of other relevant tokens, each weighted by its attention weight.

For each attention unit, the transformer model learns three weight matrices: the query weights WQ, the key weights WK, and the value weights WV. For each token i, the input embedding is multiplied by each of the three weight matrices to produce a query vector qi, a key vector ki, and a value vector vi. The attention weight from token i to token j is the dot product of qi and kj, scaled by the square root of the dimension of the key vectors and normalized through a softmax. The output of the attention unit for token i is the sum of the value vectors of all tokens, each weighted by the attention from token i to that token. The attention calculation for all tokens can be expressed as one large matrix calculation:
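Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

where the rows of Q, K, and V stack the query, key, and value vectors of all tokens, and d_k is the dimension of the key vectors.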

One set of (WQ, WK, WV) matrices is called an attention head, and each layer of the transformer has multiple attention heads. With multiple attention heads, the model can capture different notions of relevance between tokens. The computations for each attention head can be performed in parallel, and the outputs are concatenated and projected back to the same input dimension by a matrix WO.
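To make the mechanics concrete, here is a minimal NumPy sketch of multi-head scaled dot-product attention; the weight shapes and the absence of masking are illustrative simplifications, not any particular model’s implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """X: (seq_len, d_model). Each W_*: (d_model, d_model). Returns (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Split into heads: (n_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention per head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ Vh                                     # (n_heads, seq_len, d_head)
    # Concatenate heads and project back to the model dimension
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O
```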

In an encoder, a fully connected multilayer perceptron (MLP) follows the self-attention mechanism and further processes each output encoding individually. In the encoder-decoder setting (e.g., for translation), an additional attention mechanism is inserted in the decoder between self-attention and the MLP to draw relevant information from the encodings generated by the encoders. In a decoder-only architecture, this is not necessary. In either case, the decoder must not use the current or future output tokens to predict an output, so the output sequence must be partially masked to prevent this reverse information flow, which allows for autoregressive text generation. To generate token by token, the last decoder block is followed by a softmax layer that produces the output probabilities over the vocabulary.
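As a sketch, this causal mask can be implemented by setting the attention scores of future positions to negative infinity before the softmax, so those positions receive zero weight:

```python
import numpy as np

def apply_causal_mask(scores):
    """scores: (seq_len, seq_len) attention scores; block attention to future tokens."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    return np.where(future, -np.inf, scores)  # softmax turns -inf into zero weight
```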

Supervised Fine-Tuning

Decoder-only GPT is essentially an unsupervised (or self-supervised) pre-training algorithm that maximizes the following likelihood:
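L1(U) = Σ_i log P(u_i | u_{i−k}, …, u_{i−1}; Θ)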

where k is the size of the context window, U = {u_1, …, u_n} is the corpus of unlabeled tokens, and Θ denotes the model parameters. While the architecture is task-agnostic, GPT demonstrates that large gains on natural language inference, question answering, semantic similarity, and text classification can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.

After pre-training the model with the above objective, we can adapt the parameters to the supervised target task. Given a labeled dataset C, where each instance consists of a sequence of input tokens x^1, …, x^m along with a label y, the inputs are passed through the pre-trained model to obtain the final transformer block’s activation h_l^m, which is then fed into an added linear output layer with parameters W_y to predict y:
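P(y | x^1, …, x^m) = softmax(h_l^m W_y)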

Correspondingly, we have the following objective function:
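L2(C) = Σ_{(x, y)} log P(y | x^1, …, x^m)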

In addition, it is helpful to include language modeling as an auxiliary objective, as it improves generalization of the supervised model and accelerates convergence. That is, we optimize the following objective:
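L3(C) = L2(C) + λ · L1(C)

where λ is the weight of the auxiliary language modeling objective.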

Text classification can be fine-tuned directly as described above. Other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since the pre-trained model was trained on contiguous sequences of text, some modifications are needed to apply it to these tasks.

Textual entailment: concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between.
Similarity: there is no inherent ordering of the two sentences being compared, so the input contains both possible sentence orderings (with a delimiter in between); each ordering is processed independently to produce two sequence representations, which are added element-wise before being fed into the linear output layer.
Question Answering and Commonsense Reasoning: each sample has a context document z, a question q, and a set of possible answers {a_k}. GPT concatenates the document context and question with each possible answer, adding a delimiter token in between, to get [z; q; $; a_k]. Each of these sequences is processed independently and then normalized via a softmax layer to produce an output distribution over possible answers.
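Here is a minimal sketch of these input transformations; the delimiter string and field order are illustrative, and the actual special tokens are defined by the model’s tokenizer:

```python
DELIM = " $ "  # illustrative delimiter token between text fields

def entailment_input(premise, hypothesis):
    # [premise; $; hypothesis]
    return premise + DELIM + hypothesis

def similarity_inputs(sentence_a, sentence_b):
    # Both orderings are processed independently; their representations
    # are added element-wise before the linear output layer.
    return [sentence_a + DELIM + sentence_b, sentence_b + DELIM + sentence_a]

def qa_inputs(document, question, answers):
    # One sequence [z; q; $; a_k] per candidate answer, later normalized by a softmax.
    return [document + " " + question + DELIM + answer for answer in answers]
```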

Zero-Shot Transfer (aka Meta Learning)

While GPT shows that supervised fine-tuning works well on task-specific datasets, achieving strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Interestingly, GPT2 demonstrates that language models begin to learn multiple tasks without any explicit supervision when conditioned on a document plus questions (aka prompts).

Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model p(output|input, task). Previously, task conditioning was often implemented at an architectural level or at an algorithmic level. But language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to french, english text, french text). In particular, GPT2 is conditioned on a context of example pairs in the format english sentence = french sentence; after a final prompt of english sentence =, we sample from the model with greedy decoding and use the first generated sentence as the translation.
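For example, a sketch of how such a translation context might be assembled (the example pairs are made up):

```python
examples = [
    ("good morning", "bonjour"),
    ("thank you very much", "merci beaucoup"),
]
query = "how are you"

# Context of "english sentence = french sentence" pairs, then the final prompt to complete.
prompt = "\n".join(f"{en} = {fr}" for en, fr in examples) + f"\n{query} ="
print(prompt)
```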

Similarly, to induce summarization behavior, GPT2 adds the text TL;DR: after the article and generates 100 tokens with top-k random sampling with k = 2, which reduces repetition and encourages more abstractive summaries than greedy decoding. Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer).
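A minimal sketch of top-k sampling over a next-token distribution (illustrative, not GPT2’s actual decoding code):

```python
import numpy as np

def top_k_sample(probs, k=2):
    """Keep the k most likely tokens, renormalize their probabilities, and sample one."""
    top = np.argsort(probs)[-k:]          # indices of the k highest-probability tokens
    p = probs[top] / probs[top].sum()     # renormalize over the kept tokens
    return int(np.random.choice(top, p=p))
```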

Note that zero-shot transfer is different from the zero-shot learning described in the next section. In zero-shot transfer, “zero-shot” means that no gradient updates are performed, but it often involves providing inference-time demonstrations to the model (e.g., the translation example above), so it is not truly learning from zero examples.

I find an interesting connection between this meta-learning approach and Montague semantics, a theory of natural language semantics and of its relationship with syntax. In 1970, Montague formulated his view:

There is in my opinion no important theoretical difference between natural languages and the artificial languages of logicians; indeed I consider it possible to comprehend the syntax and semantics of both kinds of languages with a single natural and mathematically precise theory.

Philosophically, both zero-shot transfer and Montague semantics treat natural language the same as a programming language. LLMs capture the task through embedding vectors in a black-box way, and it is not clear to us how this really works. In contrast, the most important feature of Montague semantics is its adherence to the principle of compositionality: the meaning of the whole is a function of the meanings of its parts and their mode of syntactic combination. This may be an approach to improving LLMs.

In-Context Learning

GPT3 shows that scaling up language models greatly improves task-agnostic, few-shot performance. GPT3 further specializes the description to “zero-shot”, “one-shot”, or “few-shot” depending on how many demonstrations are provided at inference time: (a) “few-shot learning”, or in-context learning, where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100); (b) “one-shot learning”, where we allow only one demonstration; and (c) “zero-shot learning”, where no demonstrations are allowed and only an instruction in natural language is given to the model.

For few-shot learning, GPT3 evaluates each example in the evaluation set by randomly drawing K examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. K can be any value from 0 up to the maximum allowed by the model’s context window, which is n_ctx = 2048 tokens for all models and typically fits 10 to 100 examples. Larger values of K are usually, but not always, better.

For some tasks GPT3 also uses a natural language prompt in addition to (or for K = 0, instead of) demonstrations. On tasks that involve choosing one correct completion from several options (multiple choice), the prompt includes K examples of context plus correct completion, followed by one example of context only, and the evaluation process compares the model likelihood of each completion.
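A sketch of that likelihood comparison, assuming a hypothetical completion_log_prob(context, completion) function that returns the model’s log-probability of a completion given the context:

```python
def pick_answer(context, completions, completion_log_prob):
    """Return the completion to which the model assigns the highest log-probability."""
    scores = [completion_log_prob(context, c) for c in completions]
    return completions[scores.index(max(scores))]
```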

On tasks that involve binary classification, GPT3 gives the options more semantically meaningful names (e.g., “True” or “False” rather than 0 or 1) and then treats the task like multiple choice.

On tasks with free-form completion, GPT3 uses beam search. The evaluation process scores the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.

Model Size Matters (So Far)

The capacity of the language model is essential to the success of task-agnostic learning, and increasing it improves performance in a log-linear fashion across tasks. GPT-2 was created as a direct scale-up of GPT-1, with both its parameter count and dataset size increased by a factor of 10. Yet it can perform downstream tasks in a zero-shot transfer setting, without any parameter or architecture modification.

GPT3 uses the same model and architecture as GPT2, with the exception of using alternating dense and locally banded sparse attention patterns in the layers of the transformer.

Model Size

On TriviaQA, GPT3’s performance grows smoothly with model size, suggesting that language models continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains over zero-shot behavior.

Data Quality Matters

While less discussed, data quality matters too. Datasets for language models have rapidly expanded; for example, the CommonCrawl dataset constitutes nearly a trillion words, which is sufficient to train the largest models without ever updating on the same sequence twice. However, unfiltered or lightly filtered versions of CommonCrawl tend to have lower quality than more curated datasets.

Therefore, GPT2 created a new web scrape that emphasizes document quality by scraping all outbound links from Reddit that received at least 3 karma, which acts as a heuristic indicator of whether other users found the link interesting, educational, or just funny. The final dataset contains slightly over 8 million documents for a total of 40 GB of text after de-duplication and some heuristic-based cleaning.

Further, GPT3 took three steps to improve the average quality of its datasets: (1) filtering CommonCrawl based on similarity to a range of high-quality reference corpora, (2) fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of the held-out validation set as an accurate measure of overfitting, and (3) adding known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Similarly, GLaM develops a text quality classifier to produce a high-quality web corpus out of a larger raw corpus. This classifier is trained to distinguish between a collection of curated text (Wikipedia, books, and a few selected websites) and other webpages. GLaM uses this classifier to estimate the content quality of a webpage and then uses a Pareto distribution to sample webpages according to their score. This allows some lower-quality webpages to be included, to prevent systematic biases in the classifier.
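As a hedged sketch of what Pareto-based sampling by quality score can look like (this mirrors the filtering rule described for GPT3’s CommonCrawl processing; GLaM’s exact procedure may differ):

```python
import numpy as np

def keep_document(quality_score, alpha=9.0):
    """quality_score in [0, 1] from the quality classifier.
    A document is kept when a Pareto draw exceeds 1 - score, so high-scoring
    pages are almost always kept while some low-scoring pages still survive."""
    return bool(np.random.pareto(alpha) > 1.0 - quality_score)
```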

Data and mixture weights in GLaM training set

GLaM also sets the mixture weights based on the performance of each data component in a smaller model, while preventing small sources such as Wikipedia from being over-sampled.

Chain of Thought

As pointed out earlier, predicting the next token is not the same as the thinking process. Interestingly, some reasoning and arithmetic abilities of LLMs can be unlocked by chain-of-thought prompting. A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output. Sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting: ⟨input, chain of thought, output⟩. Why and how this works is not yet clear.
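For illustration, a few-shot chain-of-thought prompt might look like the following (the exemplar content is made up for this post):

```python
cot_prompt = (
    "Q: A cafeteria had 23 apples. It used 20 for lunch and bought 6 more. "
    "How many apples does it have now?\n"
    "A: The cafeteria started with 23 apples. It used 20, leaving 23 - 20 = 3. "
    "It bought 6 more, so 3 + 6 = 9. The answer is 9.\n"
    "\n"
    "Q: A library had 120 books. It lent out 45 and received 30 donations. "
    "How many books does it have now?\n"
    "A:"
)
```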

Reinforcement Learning from Human Feedback (RLHF)

The language modeling objective used for LLMs — predicting the next token — is different from the objective “follow the user’s instructions helpfully and safely”. Thus, we say that the language modeling objective is misaligned.

InstructGPT aligns language models with user intent on a wide range of tasks by using reinforcement learning from human feedback (RLHF). This technique uses human preferences as a reward signal to fine-tune models.

Step 1: Collect demonstration data, and train a supervised policy. Labelers provide demonstrations of the desired behavior on the input prompt distribution. Then fine-tune a pre-trained GPT3 model on this data using supervised learning.

Step 2: Collect comparison data, and train a reward model. Collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. Then train a reward model to predict the human-preferred output.
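A minimal sketch of the pairwise loss typically used for this reward model, assuming a hypothetical reward_model(prompt, response) function that returns a scalar score:

```python
import numpy as np

def pairwise_rm_loss(reward_model, prompt, preferred, rejected):
    """Train the reward model so the preferred response scores higher than the rejected one."""
    margin = reward_model(prompt, preferred) - reward_model(prompt, rejected)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(score_preferred - score_rejected)
```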

Step 3: Optimize a policy against the reward model using PPO. Use the output of the RM as a scalar reward. Fine-tune the supervised policy to optimize this reward using the PPO algorithm.

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy.

Instruction Fine-Tuning

While the supervised fine-tuning introduced with GPT-1 focuses on task-specific tuning, T5 is trained with a maximum likelihood objective (using “teacher forcing”) regardless of the task. Essentially, T5 leverages the same intuition as zero-shot transfer: NLP tasks can be described via natural language instructions, such as “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” To specify which task the model should perform, T5 adds a task-specific (text) prefix to the original input sequence before feeding it to the model. Further, FLAN explores instruction fine-tuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) fine-tuning on chain-of-thought data.

For each dataset, FLAN manually composes ten unique templates that use natural language instructions to describe the task for that dataset. While most of the ten templates describe the original task, to increase diversity FLAN also includes up to three templates per dataset that “turn the task around” (e.g., for sentiment classification, templates asking to generate a movie review). FLAN then instruction-tunes a pretrained language model on the mixture of all datasets, with examples in each dataset formatted via a randomly selected instruction template for that dataset.
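A sketch of what applying such templates might look like; the templates below are illustrative stand-ins, not FLAN’s actual templates:

```python
import random

templates = [
    "Is the sentiment of this movie review positive or negative?\n\n{review}",
    "Review: {review}\n\nDid the reviewer like the movie? Answer positive or negative.",
    "Write a {label} movie review.",  # a template that "turns the task around"
]

example = {"review": "A beautifully shot but painfully slow film.", "label": "negative"}
print(random.choice(templates).format(**example))
```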

So-called prompt engineering is essentially reverse engineering of how the training data were prepared for instruction fine-tuning and in-context learning.

Retrieval Augmented Generation (RAG)

Due to the cost and time of training, LLMs used in production often lag in terms of training data freshness. To address this issue, we may use LLMs in a Retrieval Augmented Generation (RAG) setup. In this use case, we do not want the LLM to generate text based solely on the data it was trained on, but rather want it to incorporate external data in some way. With RAG, LLMs can also answer (private) domain-specific questions. Therefore, RAG is also referred to as “open-book” question answering. LLM + RAG could be an alternative to a classic search engine. In other words, it acts as information retrieval with hallucination.

Currently, the retrieval part of RAG is often implemented as a k-nearest-neighbor (similarity) search over a vector database that contains vector embeddings of the external text data. For example, DPR formulates encoder training as a metric learning problem. However, we should note that information retrieval is generally based on relevance, which is different from similarity. I expect many more improvements in this area in the future.
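A minimal sketch of the retrieve-then-generate flow, assuming hypothetical embed(text) and llm(prompt) functions; a real system would use a vector database rather than brute-force search:

```python
import numpy as np

def retrieve(query, documents, embed, k=3):
    """Brute-force k-nearest-neighbor search by cosine similarity."""
    q = embed(query)
    scored = []
    for doc in documents:
        d = embed(doc)
        sim = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        scored.append((sim, doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]

def rag_answer(query, documents, embed, llm, k=3):
    """Stuff the top-k retrieved documents into the prompt and let the LLM answer."""
    context = "\n\n".join(retrieve(query, documents, embed, k))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm(prompt)
```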

Conclusion

LLMs are an exciting area and will experience rapid innovation. I hope this post helps you understand a bit about how they work. Beyond the excitement, we should also note that LLMs learn language in a very different way from humans — they lack access to the social and perceptual context that human language learners use to infer the relationship between utterances and speakers’ mental states. They are also trained in a way very different from the human thinking process. These could be areas in which to improve LLMs, or to invent new paradigms of learning algorithms.
