In this article, I explain what happens inside the black box of ChatGPT when you ask it a question.
1. Input Text and Tokenization:
For example, suppose your question is: “The cat sits on the mat.” It is first tokenized, meaning the text is broken down into smaller units called tokens. These can be characters, subwords, or whole words. It might look like this:
- Input text: “The cat sits on the mat.”
- Tokens: [“The”, “cat”, “sits”, “on”, “the”, “mat”, “.”]
Each token is assigned a token index, since the model has a fixed vocabulary: a list of all possible tokens it can process. The tokens from the input are mapped to their corresponding indices. This vocabulary is established during training and might look like this:
| Token  | Index |
| ------ | ----- |
| "The"  | 1012  |
| "cat"  | 1200  |
| "sits" | 4589  |
| "on"   | 1234  |
| "the"  | 6789  |
| "mat"  | 2234  |
| "."    | 102   |
2. Token Embeddings:
Embeddings: Each token is converted into a multi-dimensional numerical vector (embedding). These vectors capture the meaning of the tokens, and position embeddings are added to encode where each token appears in the original question. Here’s an example of what embeddings look like:
- Token: “cat”
- Token Embedding: [0.25, -0.67, 0.34, …, 0.12] # A vector in a 768-dimensional space, for example.
- Position Embedding: [0.04, −0.21, 0.13, …, 0.30] # “cat” is the second word in the question.
- Final Embedding: Token Embedding + Position Embedding
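As a rough illustration, the lookup and the addition of position embeddings might look like the following NumPy sketch. The matrices here are random placeholders; in a real model they are learned parameters.

```python
# A minimal sketch of the embedding step. Random matrices stand in for
# the learned token and position embedding tables.
import numpy as np

vocab_size, max_len, d_model = 50_000, 512, 768

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))   # one row per vocabulary token
position_embedding = rng.normal(size=(max_len, d_model))    # one row per position

token_indices = [1012, 1200, 4589, 1234, 6789, 2234, 102]   # "The cat sits on the mat."

# Final embedding = token embedding + position embedding (per position).
x = token_embedding[token_indices] + position_embedding[np.arange(len(token_indices))]
print(x.shape)  # (7, 768): one 768-dimensional vector per input token
```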
Embeddings Training: The model already has an embedding prepared for each token, which was created during training. At the start of training, each token was assigned a random embedding. During training, the model had to predict the next token given the previous tokens. Once the model produced a prediction, the result was checked by a loss function that indicated how far off the predicted token was from the correct token provided in the dataset. With this feedback, the model gradually adjusted the embeddings stored in the embedding matrix (also called the embedding layer). Here’s an example of an embedding matrix (Dim = dimension):
| Token  | Dim 1 | Dim 2 | … | Dim D |
| ------ | ----- | ----- | - | ----- |
| "cat"  | 0.12  | -0.34 | … | 0.78  |
| "dog"  | 0.45  | -0.12 | … | 0.36  |
| "car"  | -0.23 | 0.67  | … | -0.15 |
| "run"  | 0.14  | 0.56  | … | 0.29  |
| "fast" | -0.31 | 0.43  | … | 0.22  |
3. Transformer Architecture:
Once the embeddings are created, they are fed into the transformer architecture. In ChatGPT 4, this is reported to consist of around 120 layers (OpenAI has not published the exact architecture), each comprising a self-attention mechanism and a feed-forward network; the feed-forward network is applied to each token position separately.
Self-Attention Mechanism: This allows the model to recognize the relationships between different tokens in the sentence.
- Query, Key, and Value: For each token in the sentence, three vectors are created: Query (Q), Key (K), and Value (V). These vectors are generated by multiplying the token embeddings with weight matrices.
- Query (Q): Represents the token for which we want to calculate attention. It asks, “Which other tokens should this token be related to?”
- Key (K): Represents the other tokens in the sentence that are considered as potential candidates for the attention of the query token. It asks, “Which tokens are relevant to the query token?”
- Value (V): Represents the information to be extracted from the relevant tokens (the keys). It answers, “What information should be used from the relevant tokens?”
Weight Matrices Wq, Wk, and Wv: These matrices are learned model parameters used to calculate Q, K, and V. During training, they were repeatedly multiplied with the embeddings, and their entries were corrected based on the loss, making the resulting projections increasingly useful.
- Formula: Q = X · Wq, K = X · Wk, V = X · Wv, where X represents the input matrix of the token embeddings, and Wq, Wk, and Wv are the weight matrices.
- Attention Scores: Calculated by taking the dot product of the Query vectors with the Key vectors. These scores indicate how relevant each token is in relation to all other tokens.
- Scaling: The dot product between the Query and Key vectors is divided by the square root of the key dimension, √dk, to stabilize the size of the dot product and prevent instability in calculating the attention scores.
- Softmax Function: The attention scores are then normalized through a softmax function to obtain probabilities. These probabilities indicate how much each token attends to the token currently being calculated (the query token).
- Weighted Sum: The normalized attention scores are then used to weight the value vectors. This means each value vector is multiplied by its corresponding normalized attention score. All these weighted value vectors are then summed to obtain the final output vector.
- Feed-Forward Networks: After the self-attention layer, the result is passed through a feed-forward network. This network consists of multiple layers of neurons connected together, applying nonlinear transformations to recognize more complex patterns in the text.
- Linear Transformation: This is the first step in the feed-forward network. Here, the input is transformed by matrix multiplication and the addition of a bias vector.
- Bias Vector: The bias vector allows the model to flexibly adjust its decision boundaries, improving its ability to adapt to data and make accurate predictions. It increases the flexibility of the model, enhancing its capacity to capture complex patterns.
- Activation Function: The output of the linear transformation is passed through an activation function to produce the output of the feed-forward network. GPT models typically use the GELU activation function, which is smooth and differentiable, making it suitable for modeling complex patterns in language data. A simplified code sketch of a full transformer layer (self-attention plus feed-forward network) follows below.
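Here is a simplified, single-head sketch of one such layer in NumPy. Multi-head attention, residual connections, and layer normalization are omitted, and random matrices stand in for the learned weights.

```python
# A simplified, single-head sketch of one transformer layer.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # Commonly used tanh approximation of the GELU activation.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_layer(X, Wq, Wk, Wv, W1, b1, W2, b2):
    # --- Self-attention ---
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scaled dot-product attention scores
    weights = softmax(scores)                    # each row sums to 1
    attended = weights @ V                       # weighted sum of the value vectors
    # --- Feed-forward network ---
    hidden = gelu(attended @ W1 + b1)            # linear transformation + activation
    return hidden @ W2 + b2                      # project back to the model dimension

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 768, 3072, 7
X = rng.normal(size=(n_tokens, d_model))         # embeddings for the 7 input tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

out = transformer_layer(X, Wq, Wk, Wv, W1, b1, W2, b2)
print(out.shape)  # (7, 768): one updated vector per token
```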
4. Output Generation:
- Next-Token Prediction: After the transformer architecture processes the input, the model generates the next token based on the computed outputs. This is done by applying a softmax function to the output vectors to obtain probabilities for each possible next token.
- Softmax Function: Calculates a probability distribution from a set of raw values (logits) that the model produces for every token in the vocabulary. The result of the softmax function is a probability distribution: a list of probabilities that sum to 1. Each probability in this list indicates how likely it is that the corresponding token will be the next token in the text.
| Word  | Logit | Probability |
| ----- | ----- | ----------- |
| cat   | 2.0   | 0.66        |
| food  | 1.0   | 0.24        |
| water | 0.1   | 0.10        |
- Token Sampling: After the probability distribution is calculated through the softmax function, the model must select a token. This is done through a process called “sampling.” Two commonly used methods, which ChatGPT-style models also rely on, are:
- Top-k Sampling: Selects from the k tokens with the highest probabilities.
- Top-p (Nucleus) Sampling: Selects from the smallest set of tokens whose cumulative probability is at least p.
- Loop for Generating Additional Tokens: After the first token is generated, it is appended to the end of the input, and the entire process is repeated. The model recalculates embeddings, runs self-attention and the feed-forward networks again, and generates the next token. This is repeated until the desired length of the text is reached or a stop condition is met (e.g., a sentence-ending punctuation mark). A simplified sketch of this sampling-and-generation loop is shown below.
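Here is a rough sketch of that loop in Python. The model itself is a placeholder that returns random logits, and the top-k/top-p parameters are arbitrary example values.

```python
# A sketch of next-token selection and the generation loop. The "model"
# is a placeholder; in reality the full transformer stack produces the logits.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, eos_token = 50_000, 102      # assume index 102 marks the end of a sentence

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fake_model(token_indices):
    # Placeholder for the transformer: returns one logit per vocabulary entry.
    return rng.normal(size=vocab_size)

def sample_top_k_top_p(logits, k=50, p=0.9):
    probs = softmax(logits)
    order = np.argsort(probs)[::-1][:k]                     # top-k: keep the k most likely tokens
    top_probs = probs[order]
    cutoff = np.searchsorted(np.cumsum(top_probs), p) + 1   # top-p: smallest set reaching cumulative prob p
    order, top_probs = order[:cutoff], top_probs[:cutoff]
    top_probs = top_probs / top_probs.sum()                 # renormalize, then sample
    return int(rng.choice(order, p=top_probs))

tokens = [1012, 1200, 4589, 1234, 6789, 2234, 102]   # the tokenized question
for _ in range(20):                                  # generation loop
    next_token = sample_top_k_top_p(fake_model(tokens))
    tokens.append(next_token)                        # append and repeat with the longer input
    if next_token == eos_token:                      # stop condition
        break
print(tokens)
```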
5. Post-Processing:
- Text Post-Processing: After all tokens are generated, the text may be improved through post-processing steps, which can include:
- Spell check: Correcting spelling errors.
- Grammar check: Enhancing grammatical correctness.
- Coherence check: Ensuring that the generated text is coherent and logical.
Conclusion:
ChatGPT and transformer models in general are extremely complex. My goal in this article was to explain the steps the model takes to generate text without diving too deeply into mathematical concepts. It’s important to know that models like ChatGPT 4 involve many other steps that we have skipped or that are not publicly known. The steps described here are general and apply to other text-generation models such as Mistral Large, Llama 3.1, and others.
This article is the beginning of a longer series, so subscribe to be informed about new articles.