AI Explains AI (part two)

Kevin Lynn Brown
9 min read · Oct 18, 2023

--

ChatGPT on ChatGPT

This series of articles presents introductory questions, and the high-level answers they drew, posed to various AI technologies to uncover how they work, ‘in their own words’. Some questions were quite naive at first, partly because of my initial lack of detailed technical understanding of how AI tools work under the covers (I haven’t seen the source code or talked to the engineers), and partly because I was perfectly willing to let the AI correct the premise of a question, which is enlightening in itself and reveals technical facts about how these tools actually work.

This conversation was conducted with ChatGPT 4.

Q. When generating step-by-step instructions for cooking filet mignon, how do you determine which words or sequences of words are appropriate for the response?

ChatGPT:

When generating step-by-step instructions for cooking filet mignon or any other topic, the process involves a combination of understanding the context of the request, the patterns learned during training, and the probabilistic nature of the model. Here’s a simplified breakdown of how it works:

  • Context Understanding: The model first understands the context of the request, which in this case is “cooking filet mignon.” This context helps narrow down the scope of the potential response to be related to cooking methods, ingredients, and techniques associated with filet mignon.
  • Token Prediction: The model operates by predicting one token (word or character) at a time. Given the context and the preceding tokens in the response, the model calculates the probabilities of various next tokens. For example, after the phrase “Preheat your,” the model might determine that “oven” or “grill” are probable next words based on the context of cooking filet mignon.
  • Sequence Continuation: Once a token is selected, the model continues to predict the next token, and the next, and so on, until a complete and coherent instruction is formed. The model uses its understanding of language structure, grammar, and the specific topic to ensure the sequence makes sense.
  • Pattern Recognition: The model has been trained on a vast amount of text, so it recognizes patterns associated with cooking instructions, techniques, and methods. For filet mignon, it might recognize that searing, oven-roasting, and resting are common steps in many recipes.
  • Completion: The model will continue generating the sequence until it reaches a logical conclusion or endpoint for the instructions, ensuring the steps are coherent and relevant to the context of cooking filet mignon.
  • Refinement: If there are constraints or specific details provided in the initial request (e.g., “cooking filet mignon in a cast-iron skillet”), the model will use that information to refine and tailor the instructions accordingly.

It’s important to note that while the model aims to provide accurate and relevant instructions based on its training, it’s always a good idea to cross-reference with trusted sources or experts, especially for tasks that require precision, like cooking.
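
As a toy illustration of the token-prediction step described above, here are a few lines of Python. This is not ChatGPT’s actual mechanism (a real model computes probabilities with a neural network, not a lookup table), and the candidate words and probabilities are invented:

    # A toy illustration of next-token prediction: candidate tokens
    # for one cooking context, with made-up probabilities.
    next_token_probs = {
        "Preheat your": {"oven": 0.62, "grill": 0.25, "skillet": 0.08, "pan": 0.05},
    }

    context = "Preheat your"
    candidates = next_token_probs[context]

    # Greedy choice: take the highest-probability token.
    best_token = max(candidates, key=candidates.get)
    print(context, "->", best_token)  # Preheat your -> oven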

Q. If you don’t have access to training data, how are you able to select appropriate words to start with?

ChatGPT:

The model doesn’t “select” words in the way that a human might consciously choose them. Instead, it calculates probabilities of word sequences based on the input it receives (your questions or prompts) and the patterns it learned during training. Here’s a simplified breakdown:

  • Input Understanding: When you provide an input, the model processes it to understand the context and intent behind it.
  • Probability Calculation: Based on the input and the patterns learned during training, the model calculates the probabilities of various words or tokens being the appropriate next word in the sequence. It doesn’t “recall” specific examples from the training data but uses the statistical properties learned from it.
  • Token Generation: The model generates the next token (word or punctuation) in the sequence by sampling from the calculated probabilities. This process is influenced by the patterns and structures observed in the training data.
  • Sequence Continuation: The model continues to generate tokens, each time recalculating probabilities based on the new sequence, until it forms a complete and coherent response or reaches a specified token limit.
  • Output: The generated sequence of tokens is output as the response to your input.

The model doesn’t “know” the information or “select” words in a conscious manner. It generates responses based on patterns and statistical properties learned from the training data, predicting what comes next in a sequence in a manner that is probabilistically similar to the sequences observed during training. This allows it to generate relevant and coherent responses to a wide range of inputs, even without access to the training data or specific examples from it.
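
The predict-recalculate-append loop ChatGPT describes can be sketched in a few lines of Python. The toy “model” below is a hand-written table of next-token probabilities standing in for a real network, purely for illustration:

    # A minimal sketch of the predict-append-repeat loop. A real model
    # recomputes a distribution over its whole vocabulary at every step;
    # here a tiny hand-written table stands in for that.
    toy_model = {
        "<start>": {"Preheat": 1.0},
        "Preheat": {"your": 1.0},
        "your":    {"oven": 0.7, "grill": 0.3},
        "oven":    {"to": 1.0},
        "grill":   {"to": 1.0},
        "to":      {"400F": 0.6, "450F": 0.4},
        "400F":    {"<end>": 1.0},
        "450F":    {"<end>": 1.0},
    }

    token, sequence = "<start>", []
    while token != "<end>":
        candidates = toy_model[token]
        token = max(candidates, key=candidates.get)  # greedy: most likely token
        if token != "<end>":
            sequence.append(token)

    print(" ".join(sequence))  # Preheat your oven to 400F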

Q. Where does your lexicon or dictionary come from if you don’t select words? How are you able to compose any coherent sentences without a repository of words to compose with?

ChatGPT:

The model does indeed have a “vocabulary” or a set of tokens (words, parts of words, or characters) that it recognizes and can use to generate responses, which is built during the training process. Here’s a simplified explanation:

Vocabulary Building:

  • Tokenization: During training, the text data is tokenized into smaller pieces, which can be as small as one character or as long as one word (e.g., “a” or “apple”).
  • Vocabulary Creation: From this tokenization, a vocabulary is created, which is essentially a list of all the unique tokens identified in the training data.
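
A rough Python sketch of those two steps, using plain whitespace splitting. Real systems build their vocabularies with subword algorithms such as byte-pair encoding, so treat this as the flavor of the idea rather than the actual procedure:

    # Tokenize a tiny corpus and collect the unique tokens into a
    # vocabulary that maps each token to an integer ID.
    corpus = "an apple is red . an apple pie is sweet ."
    tokens = corpus.split()

    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
    print(vocab)
    # {'.': 0, 'an': 1, 'apple': 2, 'is': 3, 'pie': 4, 'red': 5, 'sweet': 6}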

Model Training:

  • Pattern Learning: The model learns the patterns, structures, and probabilities of token sequences from the training data. For example, it learns that the token “apple” is often followed by tokens like “is” or “pie” but rarely “dog.”
  • Weight Adjustment: Through numerous iterations and adjustments, the model fine-tunes the weights in its neural network to predict the next token in a sequence as accurately as possible based on the training data.

Response Generation:

  • Input Processing: When you provide an input, the model processes it using its learned weights and the vocabulary to understand the context and predict the next token.
  • Token Prediction: The model predicts the next token by calculating the probabilities of each token in its vocabulary being the next one in the sequence.
  • Sequence Generation: The model generates a sequence of tokens (a response) by repeatedly predicting the next token and adding it to the sequence, using the newly extended sequence to predict the following token.
  • Output: The generated sequence of tokens is converted back into human-readable text, forming the model’s response.

The vocabulary and the learned weights in the neural network enable the model to generate coherent and contextually relevant responses without selecting words in a conscious or intentional manner. It’s all based on probabilities and patterns learned during training. If you have more questions or need further clarification, feel free to ask!
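
To see real subword tokenization in action, OpenAI’s open-source tiktoken library (pip install tiktoken) exposes the tokenizers used by its models; “cl100k_base” is the encoding OpenAI publishes for its GPT-3.5/GPT-4-era models:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    ids = enc.encode("I would like a cup of tea")
    print(ids)                             # a list of integer token IDs
    print([enc.decode([i]) for i in ids])  # the text piece behind each ID
    print(enc.decode(ids))                 # round-trips to the original text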

Q. Exactly how many tokens, in number, do you have to create a vocabulary with?

ChatGPT:

GPT-4, the model you’re interacting with, has a vocabulary size of 50,257 tokens. These tokens can be as short as one character or as long as one word. For instance, “a” is a token, and “apple” is also a token. The model uses this set of tokens to understand inputs and generate responses by predicting the probability of a particular token following a given sequence of tokens, based on the patterns it learned during training. If you have more questions or need further details, feel free to ask!
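
One caveat worth flagging: 50,257 is in fact the vocabulary size of the BPE tokenizer used by GPT-2 and GPT-3; the encoding OpenAI publishes for its GPT-3.5/GPT-4-era models, cl100k_base, is roughly twice as large. Both figures can be checked with the open-source tiktoken library:

    import tiktoken

    # Vocabulary sizes of OpenAI's published tokenizer encodings.
    print(tiktoken.get_encoding("gpt2").n_vocab)         # 50257 (GPT-2/GPT-3)
    print(tiktoken.get_encoding("cl100k_base").n_vocab)  # ~100k (GPT-3.5/GPT-4 era)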

Q. Precisely, in technical detail, no vague terms or generalizations, how do you calculate the probability of the next token in a sequence or phrase?

ChatGPT:

Calculating the probability of the next token in a sequence in models like GPT-4 involves a deep understanding of neural networks, specifically transformer architectures, and the application of various mathematical functions. Here’s a more technical breakdown:

1. Embedding Layer:

  • Tokenization: The input text is tokenized into smaller pieces, which can be words or subwords, resulting in a sequence of tokens.
  • Token Embedding: Each token is mapped to a high-dimensional vector using embeddings learned during training.
  • Positional Encoding: Positional encodings are added to give the model information about the position of a word within a sequence.
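
To make the embedding stage a little more concrete, here is a short sketch of the sinusoidal positional encoding from the original transformer paper, “Attention Is All You Need.” GPT-style models actually learn positional information during training, so this classic formula illustrates the concept rather than GPT-4’s exact mechanism:

    import numpy as np

    # Sinusoidal positional encoding: each position gets a vector of
    # sines and cosines at different frequencies, so the model can
    # tell positions apart and judge relative distances.
    def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
        pos = np.arange(seq_len)[:, None]   # (seq_len, 1)
        i = np.arange(d_model)[None, :]     # (1, d_model)
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
        pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
        return pe

    print(positional_encoding(seq_len=4, d_model=8).round(2))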

2. Transformer Architecture:

  • Attention Mechanism: The transformer uses a self-attention mechanism, which allows it to consider other words in the input sentence when encoding a particular word.
  • Scaled Dot-Product Attention: This involves calculating the dot product of the query and key, which is then scaled and passed through a softmax function to obtain the weights of the values.
  • Multi-Head Attention: The model uses multiple attention heads to capture different types of relationships among words in a sequence.
  • Feedforward Neural Network: Each attention output is passed through a feedforward neural network (the same one for each position), followed by layer normalization.
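
The scaled dot-product attention ChatGPT mentions is compact enough to write out in full. A minimal NumPy sketch, with tiny random matrices standing in for real learned queries, keys, and values:

    import numpy as np

    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
        weights = softmax(scores)        # attention weights, rows sum to 1
        return weights @ V               # weighted average of the values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
    print(attention(Q, K, V).shape)  # (3, 4)

The division by sqrt(d_k) keeps the dot products from growing with the vector dimension, which would otherwise push the softmax into regions where the weights saturate.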

3. Probability Calculation:

  • Softmax Layer: The output from the transformer block is passed through a softmax layer, which converts the logits (raw predictions) into probabilities for each token in the vocabulary.

The probability assigned to a token w(i) is computed from its logit z(i) with the softmax function:

P(w(i)) = exp(z(i)) / Σ j=1..V exp(z(j))

where z(i) is the output logit for the token w(i) and V is the size of the vocabulary.

  • Probability Distribution: The softmax function outputs a probability distribution over the entire vocabulary, representing the likelihood of each word being the next word in the sequence.
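
As a tiny worked example of that distribution: with a three-token vocabulary and logits z = (2.0, 1.0, 0.1), exp(z) ≈ (7.39, 2.72, 1.11), which sums to about 11.21, so the probabilities come out to roughly (0.66, 0.24, 0.10). Like any probability distribution, they sum to 1.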

4. Token Selection:

  • Sampling or Argmax: Depending on the use case, the next token can be selected by sampling from the probability distribution (useful for generating diverse text) or by selecting the token with the highest probability (argmax).
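
A minimal NumPy sketch of that choice, with an invented vocabulary and logits:

    import numpy as np

    # Greedy (argmax) vs. sampling selection over the same distribution.
    vocab = np.array(["tea", "coffee", "water", "elephant"])
    logits = np.array([2.0, 1.5, 0.5, -4.0])

    probs = np.exp(logits) / np.exp(logits).sum()  # softmax
    print(dict(zip(vocab, probs.round(3))))

    print(vocab[np.argmax(probs)])           # deterministic: always "tea"
    print(np.random.choice(vocab, p=probs))  # stochastic: usually "tea", varies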

5. Sequence Generation:

  • Recurrent Prediction: The selected token is added to the sequence, and the process (steps 2–4) is repeated to generate the next token, continuing until a stopping criterion (like an end-of-sequence token or a maximum length) is reached.

This is a high-level overview of the process, and each step involves additional complexities and nuances. If you have further questions or need more details about a specific part, feel free to ask!

Q. In layman’s terms, how do you calculate the probability of the next token in a sequence or phrase?

ChatGPT:

Sure, let’s simplify the process:

Imagine you’re trying to predict the next word in the sentence: “I would like a cup of ___.”

  • Understanding Context: First, the model looks at the words provided (“I would like a cup of”) and understands that the sentence is talking about something related to a cup.
  • Thinking of Possibilities: Then, it thinks about all possible words that might come next. It could be “tea,” “coffee,” “water,” etc. It considers all the words it knows (its vocabulary).
  • Weighing Options: The model recalls all the sentences it has seen during training and thinks, “How often have I seen each word come after ‘I would like a cup of’?” It gives a “weight” or “score” to each possible next word based on how often it has seen that word in a similar context.
  • Calculating Probabilities: It then turns these scores into probabilities. If “tea” was seen often in similar sentences, it gets a high probability. If “elephant” was rarely or never seen, it gets a low probability.
  • Making a Choice: Finally, the model picks the word with the highest probability as the most likely next word. If we’re generating creative text, it might also randomly pick one of the less likely words to add some unpredictability.
  • Continuing the Process: If we need to generate more words, the model adds the chosen word to the sentence and repeats the process to find the next word, and the next, until it finishes the thought.

In essence, the model uses its “experience” (training) of reading lots of text to make educated guesses about what word should come next in a sentence! If you have more questions or need further simplification, feel free to ask!
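
That “weighing options by experience” intuition can be mimicked with literal counting over a tiny corpus. Real models learn network weights rather than storing counts, so this is only an analogy, but it captures the idea:

    from collections import Counter

    # Count which word follows "cup of" in a tiny made-up corpus,
    # then turn the counts into probabilities.
    corpus = [
        "I would like a cup of tea",
        "she ordered a cup of coffee",
        "he asked for a cup of tea",
        "a cup of water please",
        "another cup of tea arrived",
    ]

    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - 2):
            if words[i] == "cup" and words[i + 1] == "of":
                counts[words[i + 2]] += 1

    total = sum(counts.values())
    for word, n in counts.most_common():
        print(f"P({word!r} | 'cup of') = {n}/{total} = {n/total:.2f}")
    # P('tea' | 'cup of') = 3/5 = 0.60, and so on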

Q. Wht knd f tlrnce do u hv for msspllng?

ChatGPT:

I have a reasonable tolerance for misspelling and can often understand words even when some letters are missing or incorrect. However, the accuracy of understanding can decrease as the number of misspellings or omitted letters increases. It’s always best to try to spell words as accurately as possible to ensure clear communication. If you have a specific question or need assistance with something, please feel free to ask, and I’ll do my best to help!

Next in the series: Claude on Claude

Previous in the series: Bard on Bard

Afterword

For those more technically inclined, here are a few more detailed articles on how ChatGPT works under the covers.

Semrush Team, “How Does ChatGPT Work? (Simple & Technical Explanations)”, Semrush Blog, Oct 4, 2023, https://www.semrush.com/blog/how-does-chatgpt-work/

Stephen Wolfram, “What Is ChatGPT Doing … and Why Does It Work?”, Stephen Wolfram Writings, Feb 2023, https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/

Utkarsh Ankit, “Transformer Neural Networks: A Step-by-Step Breakdown”, Built In, Jun 28, 2022, https://builtin.com/artificial-intelligence/transformer-neural-network
