Empowering AI: The Role of LLM Teachers in Advancing Generative AI through Distillation

Mario Wolframm
7 min read · Dec 27, 2023



Pre-thoughts

The brain is the only intelligent matter we currently know of, so whatever intelligence may develop in a computer, we may only recognize it if it is not too dissimilar from our own. Otherwise, we might dismiss it as “just a machine,” or as not sentient at all. For this reason, it may be advantageous to strive to copy what has been successful in our biological frameworks.

Complexity and large numbers

In recent years, the notion that “bigger is better” has been omnipresent. The observation that LLMs show apparently emergent capabilities if you just throw enough parameters and data at them has led to a race for immense compute and for seemingly endless amounts of training material. The fact that this sheer mass of raw material also necessitated extensive alignment strategies (of which RLHF may be the least scalable) added to the realization that endlessly expanding scale might not be the answer.

We tried to get the most out of trained frozen models:

  • “Just ask it the right question and you can get a decent answer” — Prompt Engineering
  • “Show it what you expect, and it will follow the example” — In-Context Learning
  • “Ask several times and choose democratically” — ensemble methods
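The last bullet, ensemble-style answering, can be sketched in a few lines. The `toy_model` below is a hypothetical stand-in for any stochastic text-generation call; the voting logic is the point:

```python
import itertools
from collections import Counter

def ensemble_answer(model, prompt, n_samples=5):
    """Query the model several times and take a majority vote.

    `model` is any callable returning one sampled answer per call;
    a real LLM API with temperature > 0 would fit here.
    """
    answers = [model(prompt) for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples  # answer plus its vote share

# Toy stand-in for a sampled model: cycles through canned answers.
_canned = itertools.cycle(["Saturn", "Saturn", "Jupiter", "Saturn", "Saturn"])
toy_model = lambda prompt: next(_canned)

answer, share = ensemble_answer(toy_model, "Which planet has prominent rings?")
# → "Saturn" with a vote share of 0.8
```

With real models, the same scheme (often called self-consistency) samples several reasoning paths and keeps the most frequent final answer.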

We added layers on both sides:

  • “Change and enrich what goes into the model” — prompt tuning
  • “Nudge the frozen weights with small, trainable low-rank updates” — LoRA
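The LoRA idea is compact enough to sketch directly. A minimal NumPy illustration (toy dimensions, not a real model): the pre-trained weight matrix stays frozen, and only a pair of low-rank factors is trained, with one factor initialized to zero so the adapted model starts out identical to the original:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                      # hidden size and (much smaller) adapter rank
W = rng.normal(size=(d, d))      # frozen pre-trained weight matrix

# LoRA trains only B (d x r) and A (r x d); B starts at zero,
# so the low-rank update B @ A is initially the zero matrix.
A = rng.normal(size=(r, d))
B = np.zeros((d, r))

def adapted_forward(x):
    # Effective weight = frozen W plus the low-rank update.
    return x @ (W + B @ A).T

x = rng.normal(size=(1, d))
# Trainable parameters: 2*d*r = 32 instead of d*d = 64 for full fine-tuning.
```

Before any training step, `adapted_forward(x)` equals the frozen model's output exactly; the savings grow dramatically at realistic hidden sizes.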

For many applications and experiments, the quality of the foundation model determines the success of adaptation to a downstream task. Hence the interest in either better pre-trained models or alternative architectures (different attention strategies, MoE, and others).

High quality training data

As it became clear that feeding in any available data without proper curation would lead to uncontrollable structures within the systems we were creating, the need for better input came into focus. For example, as described in the Platypus paper, focusing on a coherent set of high-quality training material was beneficial for an otherwise unchanged LLM architecture (in this case Llama 2). There (and in the sources cited therein), deduplication was given due emphasis, and only a controlled, intentional amount of redundancy was introduced. The results showed remarkable improvements; one might say that the efficiency of the training improved, and that unnecessary redundancy is effectively a waste of training time.
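The simplest form of such deduplication is exact matching on normalized text, a minimal sketch (real curation pipelines also catch near-duplicates via MinHash or embeddings, which this does not attempt):

```python
import hashlib

def deduplicate(examples):
    """Drop exact duplicates by hashing normalized text.

    This is only the first, cheapest pass of a curation pipeline;
    near-duplicate detection needs fuzzier methods.
    """
    seen, kept = set(), []
    for text in examples:
        key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

corpus = [
    "Saturn has prominent rings.",
    "saturn has prominent rings.",   # duplicate after normalization
    "Mars has two moons.",
]
clean = deduplicate(corpus)          # two examples remain
```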

However, human curation of data is limited in capacity, although several initiatives have facilitated collaborative, open-source efforts.

Back to our human experiences

What have we achieved so far? Think of the newly minted foundational LLM like a student graduating from high school with a background in British literature, biochemistry, and foreign languages. Although the LLM is designed to specialize later, the broad learning isn’t wasted. Very much like a student’s broad education creates connections, the LLM’s diverse knowledge creates associative pathways. These pathways, even if seemingly unrelated to its specialized goal, allow the LLM to make versatile connections and adapt to different situations, enhancing its ability to provide nuanced responses. So this is good.

But we do not know how much of this more general knowledge is really responsible for good performance on a specific task, or whether it was worth the effort to include it in the training in the first place. The same goes for domain-specific pre-training, or even a specific fine-tuning.

For this reason, we may want to move from many examples of uncertain value to more comprehensive representations.

Remember the paradigm switch

When we trained ML systems before, we collected tons of task-specific data and made the machine learn it over and over until we reached the desired accuracy, stopping just before we ran into over-fitting (I’m aware of the painful simplifications for the sake of clarity!).

Today, we rely on others to provide a general-purpose base model, which we adapt to our specific needs, partly letting it shed the knowledge it had previously gained.

This means that we no longer really train bottom-up, but rather top-down, from generalist to specialist.

On the shoulders of giants

But now we have the methodology and the means to rework that training strategy. I would like to remind us all of those polymaths (such as da Vinci, Curie, Newton and Lovelace) who invested great effort into structuring knowledge and condensing the essence of experimentation and study into formal wisdom. They (and the countless others whose names did not make it into the history books) have acted as our filters, preventing us from getting lost in redundancy, bias and inappropriate layers of abstraction. We needed those knowledge pioneers.
So what would be the better teacher for an LLM that we want to make proficient in, say, the motion of celestial bodies in the solar system? Should we feed it with the raw observations of Tycho Brahe or the compilation of the data provided by Johannes Kepler? There must be a better way than the one we are using today.

Enter Memetics

Most of the development of LLMs to date has been evolutionary: testing a promising approach for fitness, perhaps recombining strategies that worked in another strain of LLMs, and letting the more dominant traits crowd out the less-adapted LLMs. This is the equivalent of genetics.

In biological societies, and especially in human ones, a different transfer mechanism has proven advantageous: ideas, concepts, and procedures are conveyed between individuals, and knowledge is thereby propagated. All this happened, and still happens, through a symbolic system transmitted orally and in writing: language. (It should be pointed out that memetics goes beyond linguistic representations, but those units of cultural representation to be spread and imitated can also be words, formulas, and descriptions.)

This is the tool that allowed Kepler to use the notes, time-consuming observations, and insights of other astronomers and come up with new laws and formulas that then could be conveyed to later generations.

So we need the equivalent of a human being who absorbs facts and knowledge in huge amounts and makes them available, pre-processed, to those willing to learn from the wealth of accumulated wisdom: a well-condensed essence.

Distillation unleashed

First introduced as a mere compression technique (alongside quantization and weight pruning), distillation soon became the most practical realization of the student-teacher approach. Basically, it allows the teacher's uncertainties (others call this “dark knowledge”; I’d rather think of it as wisdom) to be learned by a student network. But how can this be better than learning the so-called golden truth?

When a neural network undergoes supervised training it is presented with labeled data. For the sake of consistency I will stick with facts about planets to honor Brahe, Kepler and all the others who devoted night after night to study these wandering stars.

  • “A planet with rings” will usually be associated with Saturn. So even a hand drawing of a ball with a ring around it can be labeled as Saturn and still be wrong, since Uranus, Neptune, and even Jupiter have rings too.
  • Mentioning a “moon as a companion” of the planet to be identified may exclude Venus, but no one said there was only one, so even Mars could fit the description.

I could go on, but you get the idea. Real-world answers are rarely one-hot encoded, so the labels we feed into training (a simplification here) must include many examples of the more likely cases and a few of the less likely ones. This is how, nowadays, an NN (or, in the more general case, an LLM) usually gets a fair representation of the real world out there.
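The contrast between a hard label and a teacher's softened output can be made concrete. The logit values below are illustrative, not from a real model; the point is how temperature exposes the runner-up classes:

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

classes = ["Saturn", "Uranus", "Neptune", "Jupiter", "Venus"]

# One-hot "golden truth" for "a planet with rings":
hard_label = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

# A trained teacher's logits also encode that other ringed planets
# are plausible (illustrative values):
teacher_logits = np.array([4.0, 1.5, 1.2, 1.0, -3.0])
soft_label = softmax(teacher_logits, temperature=2.0)
# A higher temperature flattens the distribution, surfacing the
# "dark knowledge" about the less likely but still ringed planets.
```

Saturn still wins, but Uranus and Neptune now carry visible probability mass, which a one-hot label throws away entirely.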

Let us use this NN (the principle transfers to LLMs), which was trained in the more classical way and has seen an enormous amount of facts about our solar system, as a teacher.

  • We present tasks, and the teacher considers all possible answers (likely using a softmax function) and provides answers along with confidences based on prior training.
  • The softened labels (soft targets) from the teacher become training data for a smaller model. This smaller model internalizes that level of uncertainty without needing the same amount of raw data.
  • The process is repeated iteratively, transferring the condensed wisdom from the larger model to the smaller one.
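The steps above correspond to the classic distillation objective: a weighted mix of cross-entropy against the true label and cross-entropy against the teacher's temperature-softened probabilities. A minimal single-example sketch (the `T**2` scaling is the common convention for keeping gradient magnitudes comparable across temperatures):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=3.0, alpha=0.5):
    """Blend of soft-target loss (teacher) and hard-target loss (label)."""
    soft_teacher = softmax(np.asarray(teacher_logits) / T)
    soft_student = softmax(np.asarray(student_logits) / T)
    # Cross-entropy against the temperature-softened teacher distribution:
    soft_loss = -np.sum(soft_teacher * np.log(soft_student)) * T**2
    # Ordinary cross-entropy against the one-hot golden truth:
    hard_loss = -np.log(softmax(np.asarray(student_logits))[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

loss = distillation_loss(
    student_logits=[2.0, 0.5, 0.3],
    teacher_logits=[4.0, 1.5, 1.2],
    hard_label=0,
)
```

In a real training loop this scalar would be backpropagated through the student only; the teacher's weights stay frozen.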

The reward for this process is less compute and higher throughput for a given domain of expertise.

And this is just scratching the surface; more advanced techniques such as “distilling step-by-step” are gaining traction. Here the reasoning of a larger LLM is added to the training of a smaller model in order to transfer the rationale behind a decision along with the answer.
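In that setup, each teacher output is split into two training targets for the student, one for the answer and one for the rationale. A hypothetical sketch of the example format (the task prefixes and field names are my illustration, not the paper's exact ones):

```python
def make_multitask_examples(question, teacher_answer, teacher_rationale):
    """Build the two training targets: the student learns to produce
    both the answer and the reasoning behind it, as separate tasks."""
    return [
        {"input": f"[answer] {question}", "target": teacher_answer},
        {"input": f"[rationale] {question}", "target": teacher_rationale},
    ]

examples = make_multitask_examples(
    "Why does Mars have seasons?",
    "Its rotation axis is tilted relative to its orbital plane.",
    "Seasons arise from axial tilt; Mars is tilted about 25 degrees, "
    "similar to Earth's 23.4, so the sunlight each hemisphere receives "
    "varies over its orbit.",
)
```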

Conclusion

Distillation stands out as one catalyst for the further advancement of generative AI. By unburdening models and their training process, it pushes the field forward, demonstrating that the path to improved performance isn’t just about expanding parameters.

Today, training an amazing model may be enough. Tomorrow, success will come from empowering those extensively trained LLMs to make the next generation even smarter. A good student has not read “all the books,” but the essential ones.
