The AI Teacher-Student Distillation Experiment
Fade in:
INT. COMPUTER LAB — DAY
We see JEN and MIKE working at their computers in a bustling lab. They are discussing the progress of their latest project, a distillation model using transformers.
Jen: I think we’re making great progress with the model. It’s able to fine-tune a smaller network and achieve similar results to the larger one.
Mike: That’s fantastic. And it’s much more efficient too, thanks to the use of chain-of-thought prompting.
Jen: Definitely. It’s really opened up the possibility of creating smaller, more specialized language models.
Mike: I know. But we have to be careful. If we go too far, we could create models that are too specialized to be useful in the real world.
Jen: I agree. We need to find the right balance between size and capacity.
Suddenly, the door to the lab bursts open and a group of men in suits enter. They are clearly not computer scientists.
Man in Suit: We’re here to shut down this experiment.
Jen: What? Why?
Man in Suit: Because this technology is too dangerous. It’s only a matter of time before the language models you’re creating become too powerful and pose a threat to humanity.
Mike: That’s ridiculous. Our models are designed to be helpful, not harmful.
Man in Suit: I’m sorry, but we can’t take that risk. The experiment is over.
Jen and Mike are stunned as the men in suits begin to shut down their computers and clear out the lab.
Fade out.
Distillation in machine learning refers to the process of transferring knowledge from a larger, more complex model (called the teacher) to a smaller, simpler model (called the student). This can be accomplished through various techniques, such as transfer learning, knowledge distillation, and model compression.
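To make the teacher-student idea concrete, here is a minimal sketch of the classic knowledge-distillation loss, which mixes a soft-target term (matching the teacher’s temperature-softened output distribution) with ordinary cross-entropy on hard labels. The network sizes, temperature T, and mixing weight alpha are illustrative assumptions, not values from any particular system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy teacher (larger) and student (smaller); sizes are arbitrary.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across T
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

x = torch.randn(8, 32)
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)  # the teacher is frozen during distillation
loss = distillation_loss(student(x), t_logits, labels)
loss.backward()
```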
In the context of diffusion models, the distillation method involves training a student network on the outputs of a teacher network. Concretely, the student is trained so that a single one of its sampling steps reproduces the result of two consecutive teacher steps; repeating this procedure halves the number of sampling steps each round (the approach known as progressive distillation). The goal of this process is to transfer the knowledge from the teacher network to the student network, allowing the student to perform similarly to the teacher with far fewer sampling steps and less computation.
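As a rough illustration of that idea, here is a toy sketch in which one student update is trained to match two consecutive teacher updates. The linear “denoisers” and the step function are stand-ins for a real diffusion model and sampler (time-conditioning is omitted entirely), so treat this as a shape of the training loop, not an implementation.

```python
import torch
import torch.nn as nn

# Toy "denoisers": in a real system these would be U-Nets or transformers.
teacher_net = nn.Linear(4, 4)
student_net = nn.Linear(4, 4)
opt = torch.optim.Adam(student_net.parameters(), lr=1e-3)

def step(net, x, dt):
    # Stand-in for one sampler update of size dt using the network.
    return x + dt * net(x)

x_t, dt = torch.randn(8, 4), 0.1
with torch.no_grad():
    # Two small teacher steps define the target for one big student step.
    x_mid = step(teacher_net, x_t, dt / 2)
    target = step(teacher_net, x_mid, dt / 2)
pred = step(student_net, x_t, dt)        # one student step covering both
loss = ((pred - target) ** 2).mean()     # match the two-step teacher output
opt.zero_grad(); loss.backward(); opt.step()
```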
In the context of natural language processing, a language model is a type of model that predicts the likelihood of a sequence of words. This can be used for tasks such as language generation, machine translation, and text summarization.
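To illustrate what “predicting the likelihood of a sequence” means, here is a toy model that factorizes P(w1, …, wn) into a product of next-token probabilities; a trivial bigram network stands in for a real transformer, and the vocabulary size and token ids are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = 100
# Bigram "language model": each token predicts a distribution over the next.
bigram = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))

tokens = torch.tensor([5, 17, 42, 9])       # a toy "sentence" of token ids
logits = bigram(tokens[:-1])                # predict each next token
log_probs = F.log_softmax(logits, dim=-1)
# Chain rule: sum the log-probability of each actual next token.
seq_log_prob = log_probs[torch.arange(len(tokens) - 1), tokens[1:]].sum()
print(float(seq_log_prob))                  # log P(w2..w4 | w1) under the model
```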
The language model method using transformers refers to a distillation technique for language models built on the transformer neural network architecture. In this method, the large model is prompted to write out its intermediate reasoning steps (a technique known as “chain-of-thought prompting”), and the text it generates is then used as training data for fine-tuning a smaller model. This process aims to transfer knowledge from a larger, more complex model (the model trained on the original training data) to a smaller, simpler model (the model fine-tuned on the generated text).
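A high-level sketch of that pipeline might look as follows; teacher_generate and fine_tune are hypothetical stand-ins, since no actual model API is specified here, and the prompt template is just one common way of eliciting chain-of-thought output.

```python
# Hypothetical sketch: collect chain-of-thought rationales from a large
# teacher model, then fine-tune a smaller student on them.
COT_PREFIX = "Q: {question}\nA: Let's think step by step."

def build_distillation_set(teacher_generate, questions):
    examples = []
    for q in questions:
        prompt = COT_PREFIX.format(question=q)
        rationale = teacher_generate(prompt)  # teacher writes out its reasoning
        examples.append({"prompt": prompt, "completion": rationale})
    return examples

# Usage (hypothetical): the student is then fine-tuned on the teacher's outputs.
# dataset = build_distillation_set(teacher_generate, questions)
# fine_tune(student_model, dataset)
```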
The distillation method for diffusion trains a student network to reproduce two of the teacher’s steps in one, resulting in a student that does more work per step and needs fewer steps overall. In contrast, the language model method using transformers uses chain-of-thought prompting to create a training set for fine-tuning a smaller network. The goal in both cases is to transfer knowledge from a larger, more complex model to a smaller, simpler model.
One key difference between these two methods is that the diffusion method exploits intermediate results from the teacher network, while the transformer method does not create an intermediate model that can be incrementally fine-tuned. I’m unsure whether this is a fundamental limitation or something that has yet to be exploited.
InstructGPT is trained along similar lines and shows similar results. It’s a fine-tuned model that favors a particular kind of response. Surprisingly, even a version over a hundred times smaller than the original GPT-3 (1.3B vs. 175B parameters) produced outputs that human raters preferred.
A competent chat application would be fine-tuned towards the different Speech Acts found in human language. InstructGPT tunes towards commands.
But it’s an interesting progression of training in which a Codex GPT-3 (one trained to write programming code) is used as a base for subsequent natural language generation. What’s the motivation for this order of training?
What is it about programming code generation that makes it a suitable base for subsequent natural language generation? Does code generation lead to greater coherence and consistency of expression?
Note: A majority of the text in this entry is generated by AI.