Glossary of LLM and Generative AI

Tensor
20 min read · Apr 29, 2023

--

The following glossary comes from my personal study of LLMs and Generative AI.

General Terminology

LM: Language Model, a probability distribution over sequences of words.

LLM: Large Language Model, a language model consisting of a neural network with many parameters (typically billions of weights or more), trained on large quantities of unlabeled text using self-supervised learning.

Generative AI: Generative artificial intelligence (GenAI) is a type of AI system capable of generating text, images, or other media in response to prompts. Generative AI systems use generative models such as large language models to produce data based on the training data set that was used to create them.

NLP: Natural language processing is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

Hallucination: In artificial intelligence (AI), a hallucination or artificial hallucination (also occasionally called confabulation or delusion) is a confident response by an AI that does not seem to be justified by its training data.

Emergent Abilities: an ability is emergent if it is not present in smaller models but is present in larger models. Emergent abilities are humanity’s serendipities, as Shuchao puts it.

Alignment: AI alignment research aims to steer AI systems towards humans’ intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system is competent at advancing some objectives, but not the intended ones.

Grounding: Grounding in large language modeling is the process of associating words and phrases with their corresponding real-world entities and concepts. This is important for a number of reasons, including:

  1. Ensuring that the language model generates text consistent with reality. For example, if the model is asked to generate a description of a cat, it should not generate text that contradicts our real-world knowledge of cats, such as saying that cats can fly.
  2. Improving the accuracy of the model’s responses. For example, if the model is asked about the capital of France, it should correctly answer “Paris” if it has been grounded to the real-world entity “Paris”.
  3. Making the model more informative. For example, a grounded description of a dog can go beyond physical appearance to cover lifespan, diet, and behavior.

There are a number of different approaches to grounding in large language modeling. One common approach is to use a knowledge base, such as Wikipedia, to associate words and phrases with their corresponding real-world entities and concepts.

Base model / foundational model: a pre-trained language model that serves as the starting point for building more specific models for downstream tasks. Once a foundational LLM is pre-trained, it can be fine-tuned on a specific NLP task, such as text classification, question answering, or language translation, by training it on a smaller set of task-specific data. The fine-tuning process adjusts the weights and biases of the model to better fit the particular task, while still preserving the knowledge learned during pre-training. Foundational LLMs are typically very large and complex, requiring significant computational resources to train and fine-tune. However, they have been shown to be highly effective at a wide range of NLP tasks, and their use has enabled significant advances in the field of natural language processing. Examples of foundational LLMs include BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-to-Text Transfer Transformer).

Prompt Engineering: a process of designing and constructing effective natural language prompts for use with large language models. Prompts are input patterns that are used to guide the behavior of LLMs and generate text that is relevant to a specific task or domain. Prompt engineering involves designing prompts that are well-suited to the specific task at hand, and that provide enough information for the model to generate high-quality output. Effective prompt engineering involves several key steps, including selecting appropriate keywords and phrases that are relevant to the task, specifying the desired output format or structure, and optimizing the prompt for performance and efficiency. For example, in the context of language translation, prompt engineering might involve specifying a prompt that includes the source language text and the desired target language, along with any additional context or constraints that are relevant to the translation task. In the context of question answering, prompt engineering might involve designing prompts that include relevant background information and key terms that are likely to be associated with the answer.

Zero-shot Prompting/Learning: Zero-shot learning in NLP allows a pre-trained LLM to generate responses to tasks that it hasn’t been specifically trained for. In this technique, the model is provided with an input text and a prompt that describes the expected output from the model in natural language.

One-shot Prompting/Learning: a prompting technique that allows a model to process one labeled example before attempting a task.

Few-shot Prompting/Learning, aka in-context learning: a prompting technique that allows a model to process multiple labeled examples before attempting a task.
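
As an illustration, here is what a few-shot prompt might look like for a toy sentiment-classification task (the task and the labeled examples are invented for this sketch; any such pairs would work):

```python
# A minimal few-shot prompt: the labeled examples condition the model
# on the task format before it sees the new input.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after two days.
Sentiment: Negative

Review: Setup was painless and support was friendly.
Sentiment:"""
# The model is expected to continue with "Positive".
```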

Chain-of-Thought Prompting: Chain-of-thought prompting (CoT) improves the reasoning ability of LLMs by prompting them to generate a series of intermediate steps that lead to the final answer of a multi-step problem.
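
A chain-of-thought prompt might look like the following sketch, where the worked example demonstrates intermediate arithmetic before the final answer (the problems are toy examples in the style popularized by the CoT literature):

```python
# A one-shot chain-of-thought prompt: the worked example encourages the
# model to emit its reasoning steps before the final answer.
cot_prompt = """Q: A cafeteria had 23 apples. It used 20 to make lunch and bought 6 more.
How many apples does it have?
A: The cafeteria started with 23 apples. It used 20, leaving 23 - 20 = 3.
It bought 6 more, so 3 + 6 = 9. The answer is 9.

Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A:"""
```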

Modality: A high-level data category. For example, numbers, text, images, video, and audio are five different modalities.

Context Window / Context Length: the number of tokens the model considers when predicting the next token.

Temperature: a hyperparameter that controls the randomness of the model’s output. A temperature of 0 means the model always outputs the highest-probability token (greedy decoding).
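
A minimal numpy sketch of temperature sampling (the toy logits are made up; temperature 0 falls back to the greedy argmax):

```python
import numpy as np

def sample_with_temperature(logits, temperature):
    """Sample a token id; higher temperature flattens the distribution,
    lower temperature sharpens it toward the argmax."""
    if temperature == 0:                          # greedy decoding
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())         # subtract max for numerical stability
    probs /= probs.sum()                          # softmax
    return int(np.random.choice(len(probs), p=probs))

toy_logits = [2.0, 1.0, 0.1]                      # made-up next-token scores
print(sample_with_temperature(toy_logits, 0.7))
```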

Model Structure and Model Training

Transformer: The Transformer model is a neural network architecture that is particularly effective for natural language processing tasks. The basic structure of the Transformer can be broken down into several components:

  1. Input Embeddings: The input sequence is first transformed into a vector space using an embedding layer. Each element in the sequence (e.g., word or character) is represented as a dense vector.
  2. Positional Encoding: Since the Transformer does not use recurrent connections, it needs a way to capture the order and position of the input sequence. This is achieved using positional encoding, which adds a fixed encoding to the input embeddings based on the position of the element in the sequence (see the sketch after this list).
  3. Encoder: The encoder consists of multiple layers of self-attention and feed-forward neural networks. In each layer, the input embeddings are first transformed using multi-head self-attention, which allows the model to selectively attend to different parts of the input sequence. The resulting vectors are then passed through a feed-forward neural network, which applies a non-linear transformation to each vector independently.
  4. Decoder: The decoder also consists of multiple layers of self-attention and feed-forward neural networks. In each layer, the decoder takes as input a combination of the output of the previous layer and the encoded input sequence. The decoder is trained to generate the output sequence, one element at a time, by predicting the next element in the sequence based on the previous elements and the encoded input.
  5. Output Layer: The final layer of the Transformer is an output layer that takes as input the output of the decoder and generates the final output sequence. In many cases, this is a softmax layer that generates probabilities for each element in the output vocabulary.
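
As referenced in item 2 above, here is a minimal numpy sketch of the sinusoidal positional encoding from the original Transformer paper (assuming an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos encodings: even dimensions use sin, odd use cos,
    with wavelengths forming a geometric progression up to 10000."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)  # added to the input embeddings
```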

The Transformer model is highly parallelizable, which makes it well-suited for training on large datasets using modern hardware such as GPUs or TPUs. It has been shown to achieve state-of-the-art performance on a wide range of natural language processing tasks, including machine translation, language modeling, and text classification.

Attention: the attention mechanism is a key component in the Transformer model, which is a popular deep learning architecture for natural language processing tasks, such as machine translation and language understanding. The attention mechanism allows the model to selectively focus on different parts of the input sequence when encoding or decoding it.

Self-attention: the self-attention mechanism is a key component of the Transformer architecture. In a self-attention mechanism, the input sequence is transformed into queries, keys, and values, which are then used to compute a set of attention scores for each input element. Unlike a traditional attention mechanism, where the queries and keys come from separate input sequences, in a self-attention mechanism the queries, keys, and values all come from the same input sequence.
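
A minimal numpy sketch of single-head scaled dot-product self-attention (the random projection matrices stand in for learned weights):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Q, K, V are all projections of the same input X (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # each output is a weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # (5, 8)
```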

CLM: causal language modeling, a pretraining task where the model reads the texts in order and has to predict the next word.

MLM: masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done by masking some tokens randomly, and has to predict the original text.

Encoder-only models: Encoder-only language models are neural network models that use only an encoder module to process input text and generate contextualized representations of the text. In contrast to traditional Transformer models, which use both an encoder and a decoder module to generate output text, encoder-only LLMs are designed to encode text into a fixed-length vector representation that can be used for a variety of downstream natural language processing (NLP) tasks, such as text classification, question answering, and language translation. Encoder-only LLMs are typically trained using unsupervised learning techniques, such as pre-training on large corpora of text data using self-supervised learning objectives, such as masked language modeling (MLM) or next sentence prediction. The resulting models can then be fine-tuned on task-specific supervised learning data to achieve state-of-the-art performance on a wide range of NLP tasks. Some popular examples of encoder-only LLMs include BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimized BERT approach). Encoder-only models are good at understanding content but not good at generating new content.

Encoder-Decoder models: encoder-decoder language models are neural network models that consist of two main components: an encoder and a decoder. The encoder processes input text and generates a fixed-length vector representation of the input text, while the decoder uses this representation to generate output text. Encoder-decoder LLMs are commonly used for sequence-to-sequence (seq2seq) learning problems, such as language translation and text summarization. Encoder-decoder LLMs are typically trained using supervised learning techniques, where the model is trained to minimize a loss function that measures the difference between the predicted output and the ground-truth output. During training, the input sequence is fed into the encoder, and the decoder generates the corresponding output sequence one token at a time. The loss function is calculated based on the predicted output and the ground-truth output, and the model parameters are updated using backpropagation. Some popular examples of encoder-decoder LLMs include the sequence-to-sequence model with attention (Seq2Seq) and the Transformer model.

Decoder-only models: decoder-only language models are neural network models that use only a decoder module to generate output text, without an accompanying encoder module. Unlike encoder-decoder LLMs, which generate output text based on a fixed-length vector representation of the input text, decoder-only LLMs generate output text through an autoregressive process, where the output tokens are generated one at a time based on the previously generated tokens. Decoder-only LLMs are commonly used for language generation tasks, such as text completion and text generation. In text completion, for example, the model is given a partial input sequence and is asked to generate the remaining sequence. Decoder-only LLMs are typically trained using maximum likelihood estimation (MLE), where the model is trained to maximize the likelihood of the ground-truth output given the input sequence; the sequence likelihood factorizes via the chain rule of probability, and during training the model is fed the input sequence and learns to generate the ground-truth output one token at a time (CLM), with parameters updated using backpropagation. Some popular examples of decoder-only LLMs include the GPT (Generative Pre-trained Transformer) family of models. Decoder-only models are extremely good at generating new content.

Vision Transformer: The Vision Transformer (ViT) model was proposed in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. It’s the first paper to successfully train a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures. An image is cut into a grid of 16x16-pixel patches, and each patch is treated as a token (word).

Self-supervised learning: a type of machine learning in which a model is trained using unlabeled data. Unlike supervised learning, where the model is trained on labeled data (i.e., data that is explicitly annotated with labels indicating the correct output), self-supervised learning does not require any labeled data. Instead, the model is trained to make predictions about some aspect of the input data, such as predicting the next word in a sequence or filling in a missing segment of an image. The goal of self-supervised learning is to learn a useful representation of the input data that can be used for downstream tasks. By learning to predict certain aspects of the input data, the model can extract useful features and patterns from the data that can be used for other tasks.

Embedding: a mapping from a high-dimensional space (such as a one-hot encoded vector) to a lower-dimensional space (such as a dense vector with a fixed number of dimensions). Embeddings are commonly used to represent categorical or discrete variables (such as words, users, or products) as continuous vectors that can be used as input to a neural network.
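
A minimal sketch of an embedding lookup (the toy vocabulary and random table stand in for learned parameters):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}            # toy vocabulary
d_model = 4
# In a real model this table is learned; here it is random for illustration.
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), d_model))

token_ids = [vocab[w] for w in ["the", "cat", "sat"]]
vectors = embedding_table[token_ids]              # (3, 4): one dense vector per token
```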

Pre-training: a key aspect of large language modeling, which involves training a language model on a massive amount of unlabeled text data. The purpose of pre-training is to teach the model to understand the underlying structure of language and to learn useful patterns and relationships between words and phrases.

Fine-tuning, aka Adaptation Tuning or Domain Adaptation: the process of adapting a pre-trained language model to a specific task by training it on a smaller, task-specific dataset. Fine-tuning is an approach to transfer learning in which the weights of a pre-trained model are trained on new data. Fine-tuning can be done on the entire neural network, or on only a subset of its layers, in which case the layers that are not being fine-tuned are “frozen” (not updated during the backpropagation step).

Instruction Tuning: Instruction tuning is the approach of fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language, which is highly related to supervised fine-tuning and multi-task prompted training. Instruction tuning is an emergent paradigm in NLP wherein natural language instructions are leveraged with language models to induce zero-shot performance on unseen tasks.

PEFT: Parameter-Efficient Fine-Tuning methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model’s parameters. Fine-tuning large-scale PLMs is often prohibitively costly. In this regard, PEFT methods only fine-tune a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs. Recent State-of-the-Art PEFT techniques achieve performance comparable to that of full fine-tuning.

Popular PEFT methods:

1. LoRA, Low-Rank Adaptation: An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Low-rank adaptation freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks (by up to 10,000x).
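
A minimal numpy sketch of a LoRA forward pass (the dimensions are arbitrary; following the paper, B is zero-initialized so training starts from the frozen model’s behavior, and the optional alpha/r scaling is omitted):

```python
import numpy as np

d, r = 512, 8                                  # hidden size, low rank (r << d)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))                    # pretrained weight, kept frozen
A = rng.normal(size=(r, d)) * 0.01             # trainable down-projection
B = np.zeros((d, r))                           # trainable up-projection, zero-init

def lora_forward(x):
    """h = Wx + BAx: the frozen path plus a trainable rank-r update."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=(d,))
h = lora_forward(x)
# Trainable parameters: 2*d*r = 8,192 vs d*d = 262,144 for full fine-tuning.
```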

2. Prefix Tuning: a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, which we call the prefix. Prefix-tuning draws inspiration from prompting for language models, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”.

3. Prompt Tuning: a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. This end-to-end learned approach outperforms GPT-3’s few-shot learning by a large margin.

4. AdaLoRA, Adaptive LoRA: many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter-efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, AdaLoRA adaptively allocates the parameter budget among weight matrices according to their importance scores.

Alignment Tuning: Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values, e.g., being helpful, honest, and harmless.

RLHF: reinforcement learning from human feedback is a technique that trains a “reward model” directly from human feedback and uses that model as a reward function to optimize an agent’s policy using reinforcement learning through an optimization algorithm like Proximal Policy Optimization. The reward model is trained in advance of the policy being optimized to predict whether a given output is good (high reward) or bad (low reward). RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.

Diffusion: Diffusion Models are a class of probabilistic generative models used in machine learning to simulate the dynamics of complex systems over time. Unlike traditional generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), Diffusion Models simulate the dynamics of a stochastic process that evolves over time, rather than generating a fixed sample from a learned distribution. Diffusion Models work by starting with a base distribution, typically a Gaussian distribution, and then gradually transforming the distribution over time by adding noise. This process is often referred to as “diffusing” the base distribution. To generate a sample from a Diffusion Model, the model is “reverse diffused” from a noisy sample to the base distribution, using a series of “diffusion steps”. This process involves iteratively removing noise from the sample and updating the distribution to generate a more accurate representation of the underlying data. Diffusion Models have been shown to be effective in generating high-quality samples from complex high-dimensional data, such as natural images and videos. They have also been applied to various machine learning tasks, such as image denoising and super-resolution, as well as to modeling complex systems in physics and biology.
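
A minimal numpy sketch of the forward (noising) process, assuming the common DDPM formulation in which step t can be sampled in closed form:

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Jump straight to step t of the forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]             # cumulative product up to step t
    noise = np.random.default_rng(0).normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

x0 = np.ones(4)                                   # a toy "image"
betas = np.linspace(1e-4, 0.02, 1000)             # common linear noise schedule
x_t = forward_diffuse(x0, t=500, betas=betas)     # partially noised sample
```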

GAN, Generative Adversarial Network: a type of deep neural network architecture that is used to generate new data that is similar to the training data. The architecture of GANs consists of two neural networks: a generator and a discriminator. The generator network takes a random input (usually a vector of noise) and produces a sample that resembles the training data. The discriminator network takes a sample, either real or generated, and classifies it as either real or fake. The training process of GANs is adversarial, which means that the generator and the discriminator networks are trained against each other. As the generator network gets better at generating realistic samples, the discriminator network becomes better at distinguishing between real and fake samples. This back-and-forth training process continues until the generator network produces samples that are indistinguishable from the real samples. GANs have been used in various applications such as image and video generation, text generation, and music generation. They have also been used for other tasks such as image super-resolution, image inpainting, and style transfer.

VAE, Variational Autoencoder, a type of neural network architecture used in unsupervised learning for generative modeling. The VAE model consists of two parts: an encoder and a decoder. The encoder maps the input data to a latent representation (a lower-dimensional space) that captures the most important features of the input data. The decoder then takes this latent representation and maps it back to the original input space. During training, the VAE is optimized to minimize the difference between the input data and the output of the decoder. However, in addition to this reconstruction loss, the VAE also includes a regularization term that encourages the latent representation to follow a standard normal distribution. This regularization term is what makes the VAE a “variational” autoencoder. The regularization term is achieved by introducing a Kullback-Leibler (KL) divergence penalty into the loss function. This penalty encourages the learned latent representation to have a similar distribution to a standard normal distribution. This makes the latent representation more easily sampled and allows for the generation of new data points by sampling from the learned distribution. VAEs have been used in various applications such as image and video generation, text generation, and music generation. They have also been used for other tasks such as image inpainting, denoising, and style transfer. VAEs are particularly useful for generative modeling tasks where there is no labeled data available.
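
A minimal numpy sketch of the VAE training objective, assuming a diagonal-Gaussian encoder so the KL term has a closed form:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction error plus KL(q(z|x) || N(0, I)); the KL term is
    the closed form for a diagonal Gaussian encoder."""
    recon = np.sum((x - x_recon) ** 2)                          # squared-error reconstruction
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))   # regularization term
    return recon + kl
```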

Flow-based deep generative models: a type of deep learning model used for generative modeling, which involves learning a probability distribution over a set of data, with the aim of generating new data that is similar to the original set. In flow-based models, the probability distribution is represented by a series of invertible transformations, or “flows,” applied to a simple base distribution, such as a Gaussian distribution. These transformations are typically implemented using neural networks, which allow for complex and flexible mappings between the input data and the latent space where the probability distribution is defined. The key advantage of flow-based models is that they can generate samples efficiently, as the inverse transformations can be easily computed. They are also well-suited to modeling high-dimensional and continuous data, such as images and audio.

Model Compression Techniques

Pruning: Pruning involves removing unnecessary connections or nodes from the neural network. This can be done based on various criteria such as weight magnitude, activation values, or gradients. Pruning can significantly reduce the size of the model while preserving its accuracy.
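
A minimal numpy sketch of magnitude-based pruning (the 90% sparsity target is arbitrary):

```python
import numpy as np

def magnitude_prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights; here 90% of them."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask

W = np.random.default_rng(0).normal(size=(256, 256))
W_sparse = magnitude_prune(W)                  # ~10% of weights remain nonzero
```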

Quantization: Quantization involves reducing the precision of the model’s weights and activations from floating-point numbers to fixed-point numbers with fewer bits. This can reduce the memory requirements and computational complexity of the model, but may also result in a slight decrease in accuracy.
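
A minimal numpy sketch of symmetric linear quantization of float32 weights to int8:

```python
import numpy as np

def quantize_int8(W):
    """Map floats to int8 with a single per-tensor scale factor."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale        # approximate original weights

W = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(W)                    # 4x smaller than float32
error = np.abs(W - dequantize(q, scale)).max() # small quantization error
```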

Distillation: Knowledge distillation involves training a smaller model (the student) to mimic the predictions of a larger, more complex model (the teacher). This can transfer the knowledge and accuracy of the larger model to the smaller model, allowing it to achieve similar performance with fewer parameters.
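
A minimal numpy sketch of a distillation loss on one example, assuming Hinton-style soft targets with a softening temperature T:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student against the teacher's softened
    distribution ("soft targets")."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -np.sum(p_teacher * log_p_student)

loss = distillation_loss([1.0, 0.5, -0.2], [2.0, 1.0, 0.1])
```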

Low-rank factorization: Low-rank factorization involves decomposing the weight matrices of the neural network into smaller matrices with lower rank. This can reduce the number of parameters and computations required by the model, while preserving its accuracy.
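
A minimal numpy sketch of low-rank factorization via truncated SVD (the rank r=32 is arbitrary):

```python
import numpy as np

def low_rank_approx(W, r):
    """Truncated SVD: keep the top-r singular values, storing two
    thin matrices instead of one dense one."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]      # rank-r approximation of W

W = np.random.default_rng(0).normal(size=(512, 512))
W_r = low_rank_approx(W, r=32)                 # 2*512*32 params vs 512*512
```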

Model Acronyms

GPT (Generative Pre-trained Transformer): a family of language models developed by OpenAI. The GPTs are decoder-only models trained to predict the next word in a sequence of text given the previous words.

InstructGPT: InstructGPT models are trained with RLHF (Reinforcement Learning from Human Feedback), which makes them much better at following user intentions than GPT-3 while also making them more truthful and less toxic.

ChatGPT: ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.

GLaM (Generalist Language Model): mixture of experts (MoE) model, a type of model that can be thought of as having different submodels (or experts) that are each specialized for different inputs.

LaMDA (Language Model for Dialogue Applications): a Transformer-based large language model developed by Google, trained on a large dialogue dataset, that can generate realistic conversational responses.

PaLM (Pathways Language Model): a 540-billion parameter, dense decoder-only Transformer model trained with the Pathways system, which enabled efficient training of a single model across multiple TPU v4 Pods.

FLAN (Finetuned LAnguage Net): a more generalizable language model obtained through instruction fine-tuning.

FLAN-PaLM: PaLM fine-tuned on instruction data (the FLAN recipe).

Gopher: a 280-billion-parameter Transformer language model developed by DeepMind.

Chinchilla: a Transformer language model developed by DeepMind using the same compute budget as Gopher but with 70B parameters and 4x more data.

Sparrow: a dialogue agent developed by DeepMind that is useful and reduces the risk of unsafe and inappropriate answers. The Sparrow agent is designed to talk with a user, answer questions, and search the internet using Google when it is helpful to look up evidence to inform its responses.

OPT (Open Pretrained Transformer): a language model with 175 billion parameters trained on publicly available data sets developed by Meta. Decoder-only model.

LLaMA: a foundational, 65-billion-parameter large language model developed by Meta. Decoder-only model.

BERT: Bidirectional Encoder Representations from Transformers — a type of LLM developed by Google that is pre-trained on large amounts of text data. Encoder-only model.

T5 (Text-To-Text Transfer Transformer): an encoder-decoder model developed by Google that casts every text-based NLP problem into a text-to-text format.

BLOOM (BigScience Large Open-science Open-access Multilingual Language Model): a 176-billion-parameter model developed by Hugging Face and the BigScience collaboration, trained on 46 natural languages and 13 programming languages.

Alpaca-7B: a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations, developed by Stanford.

Koala: a chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web, developed by Berkeley.

Vicuna-13B: an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.

Stable Diffusion: an open source latent text-to-image diffusion model by Stability AI and their collaborators including Runway.

LLM Augmentation and Applications

Memory: Memory is the concept of persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.

Indexes: Language models are often more powerful when combined with your own text data — this module covers best practices for doing exactly that.

Chains: Chains go beyond just a single LLM call, and are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.

Agents: Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.

LLM and Generative AI Related Companies and Tools

Hugging Face: a model repository with lots of great open-source tools to train and serve large language models, including Transformers, Accelerate, etc.

DeepSpeed: a deep learning optimization library (compatible with PyTorch) developed by Microsoft, which has been used to train a number of LLMs, such as BLOOM.

JAX: a Python library designed for high-performance ML research, developed by Google and used internally by both Google and DeepMind teams.

OpenAI: OpenAI is an AI research and deployment company whose mission is to ensure that artificial general intelligence benefits all of humanity.

Anthropic: Anthropic is an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Anthropic’s research on the interpretability of Transformers is impressive. Many people know how to train LLMs to obtain emergent abilities, but few understand how and why those abilities emerge.

Cohere: Cohere empowers every developer and enterprise to build amazing products and capture true business value with language AI.

DeepMind: AI could be one of humanity’s most useful inventions. DeepMind is a team of scientists, engineers, ethicists, and more, committed to solving intelligence to advance science and benefit humanity.

Midjourney: an independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species, known for its image generation product.

Stability AI: an image generation company whose motto is “AI by the people, for the people”; its goal is to maximize the accessibility of modern AI to inspire global creativity and innovation.

Runway: a company with video generation products, aiming to make the impossible and move creativity forward.

Character.AI: Character.ai is a neural language model chatbot web application that can generate human-like text responses and participate in contextual conversation.

AI21 Labs: AI21 Labs is a Tel Aviv-based company specializing in Natural Language Processing, which develops AI systems that can understand and generate natural language.

EleutherAI: EleutherAI is a grass-roots non-profit artificial intelligence research group. The group, considered an open-source version of OpenAI, was formed in a Discord server in July 2020 to organize a replication of GPT-3. EleutherAI is a non-profit AI research lab that focuses on interpretability and alignment of large models. EleutherAI’s mission is to empower open-source artificial intelligence research.

Adept: An AI teammate for everyone. Adept is building an entirely new way to get things done. It takes your goals, in plain language, and turns them into actions on the software you use every day.

Glean: Bring people the knowledge they need to make a difference in the world. Glean is the enterprise search and knowledge discovery solution for modern teams. Search all company apps, find what you need, and discover what you should.
