ChatGPT, GPT-4, and GPT-5: How Large Language Models Work

Uncover the GPT-3.5 and GPT-4 techniques behind ChatGPT, including in-context learning, CoT, RLHF, and multimodal pre-training

Luhui Hu
Predict
8 min read · Apr 5, 2023


Large Language Models — Tree of Life (photo courtesy by author)

I have to be honest: ChatGPT has emerged as a groundbreaking AI language model, transforming our interactions with machines like never before. Its capacity for generating human-like responses has captured the imagination of the world.

The internet is inundated with articles on ChatGPT and GPT-4. Here, we’ll take a comprehensive yet succinct look at the origins of ChatGPT, the inner workings of large language models, their training methodology, and much more.

OpenAI ChatGPT from GPT-3.5 to GPT-4 and GPT-5

Sam Altman, OpenAI’s co-founder and CEO, has led the company to achieve remarkable milestones, as outlined below:

OpenAI Key Events (credit: MiraclePlus)

ChatGPT was initially built on GPT-3.5, a cutting-edge large language model that amazed the world with its prowess in writing, coding, and tackling complex math problems, among other astonishing accomplishments.

GPT-4 is the latest model behind ChatGPT: a large-scale, multimodal model that accepts both image and text inputs and generates text outputs. Though it may not surpass human capabilities in certain real-world situations, GPT-4 demonstrates human-level performance across a range of professional and academic benchmarks, raising the bar for language model capabilities.

GPT-5 is expected to launch in the coming months, with enhanced performance anticipated in terms of scope, accuracy, and reasoning. It is expected to be a more comprehensive multimodal large-scale model, supporting text, image, video, and 3D as inputs and outputs.

The rapid advancement and capabilities of ChatGPT have not only amazed people but also sparked widespread concern: Elon Musk and other tech leaders have urged a pause on training AI systems more powerful than GPT-4.

How do Large Language Models like ChatGPT Work?

The latest large language models are almost all large-scale pre-trained foundation models built on a few key mechanisms: a transformer architecture with self-attention, pre-training via self-supervised learning, and transfer learning through fine-tuning of the pre-trained model.
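To make the first of these mechanisms concrete, here is a minimal NumPy sketch of the scaled dot-product self-attention used inside a transformer block. The single head, tiny dimensions, and random weights are illustrative assumptions, not GPT-4's actual configuration.

# A minimal sketch of causal scaled dot-product self-attention (NumPy).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])             # (seq_len, seq_len)
    # Causal mask: each token attends only to itself and earlier tokens,
    # which is how GPT-style decoder models are structured.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v                           # (seq_len, d_head)

# Toy usage: 4 tokens, model width 8, a single attention head of width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)            # (4, 8)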

ChatGPT's outstanding performance builds on several critical techniques: in-context learning, chain-of-thought prompting, Codex, InstructGPT, and reinforcement learning from human feedback (RLHF).

In-context learning

In-context learning refers to the process by which a model understands, adapts, and responds to new information based on the context provided in the input. In-context learning is an essential feature of large-scale pre-trained models, as it allows them to perform well on a wide range of tasks without explicit fine-tuning.

In-context learning in models like GPT-4 involves processing input within a context window, leveraging attention mechanisms to focus on relevant information, predicting subsequent tokens based on pre-trained knowledge and context, and continually updating its understanding to better adapt to the task at hand.

GPT-3.5, the model behind ChatGPT, is a few-shot learner (see "Language Models are Few-Shot Learners" in the resources below).
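Here is a small sketch of what few-shot in-context learning looks like in practice: the task is specified entirely by examples placed in the prompt, with no weight updates. The translation examples and the complete() stub are hypothetical placeholders for a real model call.

# Few-shot in-context learning: the "training" lives in the prompt itself.
examples = [
    ("Translate English to French: cheese", "fromage"),
    ("Translate English to French: good morning", "bonjour"),
]
query = "Translate English to French: thank you"

# Concatenate the worked examples, followed by the new query.
prompt = "\n".join(f"{q}\n{a}" for q, a in examples) + f"\n{query}\n"

def complete(prompt: str) -> str:
    # Hypothetical stand-in for a call to a large language model.
    raise NotImplementedError

# The model infers the task (English-to-French translation) from the in-context
# examples and completes the pattern for the unseen query:
# print(complete(prompt))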

Chain of thought

Chain of thought (CoT) is a prompting technique for eliciting step-by-step reasoning from language models, whereas in-context learning conditions a model on examples in the prompt so that it can perform a task without any weight updates.

CoT prompting was first proposed by Google researchers in 2022. They found that prompting models to generate a chain of thought improved answer accuracy on a range of arithmetic, commonsense, and symbolic reasoning tasks.

There are two main methods to elicit chain-of-thought reasoning: few-shot prompting and zero-shot prompting. Few-shot prompting involves providing the model with one or more examples of a question paired with a CoT. Zero-shot prompting involves simply appending the words “Let’s think step-by-step” to the prompt.

Here is an example of zero-shot CoT prompting:

Prompt: "What is the capital of France?"
Original response: "Paris"
Zero-shot-CoT response: "Let's think step by step. France is a country in Europe. The capital of France is Paris."

As you can see, the zero-shot CoT response is more detailed and provides a more logical explanation of how the model arrived at its answer.

Both methods have been shown to be effective in eliciting CoTs from models. However, few-shot prompting has been shown to be more effective, especially for complex problems.
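As a concrete illustration, here is a small sketch that builds both kinds of CoT prompts. The arithmetic question, the worked example, and the ask_model() stub are assumptions for illustration, not part of any specific API.

# Building zero-shot and few-shot chain-of-thought prompts.
def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for any LLM completion call.
    raise NotImplementedError

question = "If a train travels 60 km in 45 minutes, what is its speed in km/h?"

# Zero-shot CoT: simply append the trigger phrase.
zero_shot_cot = question + "\nLet's think step by step."

# Few-shot CoT: prepend a worked example whose answer spells out the reasoning.
few_shot_cot = (
    "Q: A car travels 30 km in 20 minutes. What is its speed in km/h?\n"
    "A: 20 minutes is one third of an hour. 30 km divided by 1/3 hour is 90 km/h. "
    "The answer is 90.\n"
    f"Q: {question}\nA:"
)

# print(ask_model(zero_shot_cot))
# print(ask_model(few_shot_cot))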

Codex and InstructGPT

Codex is a GPT-3-based LLM that translates natural language into code, while InstructGPT is a GPT-3-based LLM fine-tuned to follow natural-language instructions.

Codex is a descendant of GPT-3 that was further trained on publicly available source code, and it is the model family behind GitHub Copilot. Given a natural-language description, it can generate working code in dozens of programming languages.

InstructGPT is likewise derived from GPT-3, fine-tuned on human-written demonstrations and then refined with reinforcement learning from human feedback (see the first resource below), which makes it follow user instructions far more reliably than the base model.

It is designed to follow instructions provided in the input in natural language and to generate detailed, accurate, and helpful responses.

The capabilities pioneered by Codex and InstructGPT come together, along with multimodal inputs, in a single ChatGPT or GPT-4 model. This means the same model can generate text, write and explain code, translate languages, produce different kinds of creative content, and answer your questions in an informative way.
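As a rough illustration of how these capabilities surface through a single interface, here is a sketch that sends a combined instruction-following and code-generation request to the chat endpoint, assuming the openai Python package and its ChatCompletion API as of early 2023; the API key and prompt are placeholders.

# One chat model handling instruction following and code generation together.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": (
            "Write a Python function that reverses a string, "
            "then explain it in one sentence."
        )},
    ],
)
print(response["choices"][0]["message"]["content"])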

Reinforcement learning from human feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) combines reinforcement learning with human feedback to enhance the performance of AI agents. It trains a reward model based on human feedback, which is then used as a reward function to optimize the agent’s policy through algorithms like Proximal Policy Optimization. This approach is particularly beneficial in scenarios with sparse or noisy reward functions and has applications in natural language processing tasks, such as conversational agents and text summarization.

RLHF involves three main steps:

  1. Collecting human feedback on AI-generated outputs in the form of ratings or rankings.
  2. Training a reward model to predict human evaluations of these outputs.
  3. Optimizing the agent’s policy using the reward model to generate higher-quality outputs.

Although still in its early stages, RLHF shows great promise in improving the accuracy and reliability of language models. ChatGPT relies on RLHF to produce high-quality, relevant responses.
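Here is a minimal PyTorch sketch of steps 2 and 3: a reward model is trained on pairs of responses where humans preferred one over the other, and its scalar score then serves as the reward during policy optimization. The tiny network, random embeddings, and simplified loss are illustrative assumptions, not OpenAI's actual setup.

# Reward-model training on human preference pairs (step 2), sketched in PyTorch.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    # Scores a response embedding; real systems score (prompt, response) text.
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.score(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Toy stand-ins for embeddings of human-preferred ("chosen") and rejected responses.
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)

# Pairwise preference loss: the chosen response should score higher than the rejected one.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Step 3 (schematically): during PPO, each response sampled from the policy is
# scored by the frozen reward model, and that scalar reward drives the policy update.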

How were GPT and GPT-4 Trained?

Training GPT models, like GPT-4, involves a two-step process: pre-training and fine-tuning. The process is similar to previous versions of GPT but can be applied to larger models and datasets. Here’s an outline of the training process:

  1. Data collection and preprocessing: Gather a large text corpus from diverse sources, such as websites, books, articles, and other text documents. Preprocess the data by removing irrelevant content, tokenizing the text, and splitting it into smaller chunks or sequences. Make sure the dataset is sufficiently large and diverse to capture the nuances and structure of the language.
  2. Pre-training: Initialize a transformer-based neural network architecture with a large number of layers, attention heads, and hidden units. Pre-train the model using self-supervised learning, specifically causal (autoregressive) language modeling: the model is trained to predict the next token given the preceding tokens (see the sketch after this list). During pre-training, the model learns general language representations, grammar, syntax, and semantic patterns. The pre-training phase usually involves training the model on large-scale computing resources, such as multiple GPUs or TPUs, and can take several days or weeks to complete.
  3. Fine-tuning: After the pre-training phase, fine-tune the model on smaller, task-specific labeled datasets. Fine-tuning adapts the model to perform specific tasks, such as text summarization, translation, question-answering, or sentiment analysis. Fine-tuning can be done using supervised learning or, in some cases, reinforcement learning, depending on the task and available data. During fine-tuning, you can also experiment with different hyperparameters, such as learning rate, batch size, and the number of training epochs, to optimize the model’s performance.
  4. Evaluation and deployment: Evaluate the performance of the fine-tuned model using relevant metrics, such as accuracy, F1 score, or BLEU score. If the model’s performance is satisfactory, deploy it for real-world applications, such as chatbots, content generation, or text analysis.
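Here is a minimal PyTorch sketch of the pre-training objective in step 2, with fine-tuning noted at the end. The tiny embedding-plus-linear model and the random token batches are placeholders standing in for a full transformer and a real text corpus.

# Causal (next-token) language-model pre-training, sketched in PyTorch.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 32

# Placeholder model: an embedding followed by a linear head.
# A real GPT stacks many transformer blocks in between.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):                                    # real runs take days or weeks
    tokens = torch.randint(0, vocab_size, (8, seq_len))    # stand-in for a tokenized text batch
    inputs, targets = tokens[:, :-1], tokens[:, 1:]        # each position predicts the next token
    logits = model(inputs)                                 # (batch, seq_len - 1, vocab_size)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Fine-tuning (step 3) reuses the same loop on a smaller, task-specific dataset,
# typically with a lower learning rate and far fewer steps.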

However, training GPT models, especially large ones like GPT-4, requires significant computational resources and expertise. Access to high-quality, diverse data is also crucial for achieving good performance.

Is 10-billion Parameters the Tipping Point for Large Language Models?

Ten billion parameters appears to be a notable milestone for large language models (LLMs). It is around this scale that LLMs start to show significant improvements in their ability to understand and generate text.

Below 10 billion parameters, LLMs were still relatively limited in their capabilities. They could generate text that was grammatically correct and factually accurate, but they often struggled to understand the nuances of human language.

With 10 billion parameters and more, LLMs are able to learn much more complex patterns in language. They can understand the meaning of words and phrases in context, and they can generate text that is both grammatically correct and semantically meaningful.

This is a major turning point in the development of LLMs, and it is likely to lead to significant improvements in their performance in a variety of tasks. For example, LLMs will be able to better understand and translate languages, generate more creative and original content, and even hold conversations with humans that are indistinguishable from those between two humans.

Capabilities vs. Scale of LLMs (Source: PaLM)

However, as the relationship between capabilities and scale shown above suggests, 10 billion parameters is not a magic number. Many other factors contribute to the performance of LLMs, such as the quality of the training data and the architecture of the model.

Complete LLMs and Their Stats

ChatGPT and other LLMs are emerging very rapidly. Let's take a comprehensive look.

LLM timeline and statistics (Source: A Survey of Large Language Models)

GPT-4 has Common Sense Grounding

There’s a lot of excitement about ChatGPT and GPT-4, but I’d like to end with a fundamental theme: GPT-4 has common sense grounding like humans.

Here is an example of how GPT-4's common sense grounding can be used to generate text:

Prompt: "A dog is a mammal. Mammals have fur. What color is a dog's fur?"
GPT-4 response: "A dog's fur is usually brown, but it can also be black, white, or even red."

GPT-4's common sense grounding is its enhanced ability to reason about and understand the world using common sense knowledge. It works by leveraging pre-training, fine-tuning, attention mechanisms, context understanding, and prediction to generate more accurate and contextually appropriate responses that rely on common sense knowledge.

Microsoft Research claims GPT-4 could be considered an early form of AGI based on extensive testing. It is exciting to see what the future holds.

Related Resources

  1. OpenAI RLHF: Training language models to follow instructions with human feedback: https://arxiv.org/abs/2203.02155
  2. A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF): https://github.com/CarperAI/trlx
  3. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: https://arxiv.org/abs/2201.11903
  4. Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165
  5. Large Language Models are Zero-Shot Reasoners: https://arxiv.org/abs/2205.11916
  6. Stanford DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature: https://arxiv.org/abs/2301.11305
  7. A Survey of Large Language Models (3/31/2023): https://arxiv.org/abs/2303.18223
