Paper Explained: Language Models are Super Mario

Minhajul Hoque
Dec 4, 2023



Why this Paper is Important

In the ever-evolving landscape of Artificial Intelligence, breakthroughs come in various forms: sometimes as monumental leaps, and other times as subtle yet powerful shifts in our understanding and capabilities. The paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" represents one of these moments, giving us a new understanding of how to improve large language models.

Imagine a world where AI can adapt, learn, and grow exponentially without the constraints of extensive retraining or the need for high-powered computational resources. This paper brings us a step closer to that reality. By introducing the concept of DARE (Drop And REscale), the authors have not only presented a novel technique but have essentially redefined the boundaries of what’s possible in the realm of language models.

What this Paper Presents

In the paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch," the authors unveil an approach that reshapes our understanding of language model enhancement. The paper introduces DARE (Drop And REscale), a technique that removes most of the delta parameters of Supervised Fine-Tuned (SFT) language models, enabling these models to absorb and integrate the capabilities of similar models without extensive retraining or high-end computational resources. This method streamlines the process of enhancing language models and opens up possibilities for creating more powerful, versatile AI tools in a sustainable, resource-efficient manner, marking a significant stride in AI development.

How It Works

The methodology outlined in the paper can be summarized as a four-step process: Reduction, Rescaling, Merging, and Evaluation (we always need some good old evaluation). Each of these steps plays a pivotal role in transforming and elevating the capabilities of Language Models (LMs). Let’s unpack each step to understand the ingenuity behind this approach.

Delta Parameters and Their Reduction

Delta parameters are the differences between a model’s fine-tuned parameters and its original, pre-trained parameters. These parameters are key to the model’s new abilities acquired through fine-tuning. The paper reveals that these delta parameters in both encoder- and decoder-based Language Models (LMs) are often highly redundant. Astonishingly, up to 90% of these delta parameters can be dropped without significantly impacting the model’s performance. In some cases, the performance even improves upon the removal of certain delta parameters. This finding is crucial because it suggests that a large portion of the changes made during fine-tuning may not be essential for the enhanced capabilities of the model.
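To make this concrete, here is a minimal PyTorch sketch of the drop step, assuming `base_state` and `sft_state` are the state dicts of a base model and its fine-tuned (homologous) counterpart. The function name and structure are illustrative, not the authors' reference implementation.

```python
import torch

def drop_delta_parameters(base_state, sft_state, drop_rate=0.9):
    """Zero out a random fraction (`drop_rate`) of the delta parameters,
    i.e. the element-wise differences between SFT and base weights."""
    sparse_deltas = {}
    for name, base_param in base_state.items():
        delta = sft_state[name] - base_param              # delta parameters
        keep_mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
        sparse_deltas[name] = delta * keep_mask           # ~90% of entries zeroed
    return sparse_deltas
```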

Rescaling of Remaining Parameters

After dropping a significant portion of the delta parameters, the remaining ones are rescaled by a factor of 1 / (1 − p), where p is the drop rate. This rescaling amplifies the surviving deltas to compensate for the dropped ones, and it is vital for preserving the model's enhanced capabilities despite the reduction in the number of effective delta parameters.
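Continuing the sketch above, the rescale step divides each surviving delta by 1 − p before adding it back onto the base weights; again a sketch under the same assumptions, not the paper's exact code.

```python
def dare(base_state, sft_state, drop_rate=0.9):
    """Full DARE step: drop delta parameters, rescale the survivors
    by 1 / (1 - drop_rate), and add them back to the base weights."""
    sparse_deltas = drop_delta_parameters(base_state, sft_state, drop_rate)
    return {
        name: base_state[name] + sparse_deltas[name] / (1.0 - drop_rate)
        for name in base_state
    }
```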

Merging Multiple Models

The approach to merging involves combining homologous models — models that have been fine-tuned from the same base model. This common origin is crucial as it allows for effective calculation and manipulation of delta parameters. By merging these models, the authors create a single model that inherits the capabilities of all the individual models. This process not only consolidates the strengths of each model but also creates a more versatile and powerful tool.
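In code, merging then amounts to summing several models' dropped-and-rescaled deltas on top of the shared base. A hedged sketch: the paper applies DARE in combination with existing merging methods such as Task Arithmetic, and the single uniform `weight` coefficient below is an illustrative simplification of those schemes.

```python
def merge_with_dare(base_state, sft_states, drop_rate=0.9, weight=1.0):
    """Merge homologous SFT models: base weights plus a weighted sum of
    each model's DARE-processed (dropped and rescaled) deltas."""
    merged = {name: param.clone() for name, param in base_state.items()}
    for sft_state in sft_states:
        sparse_deltas = drop_delta_parameters(base_state, sft_state, drop_rate)
        for name in merged:
            merged[name] += weight * sparse_deltas[name] / (1.0 - drop_rate)
    return merged
```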

Evaluation Metrics

The evaluation of these enhanced models is conducted using various metrics tailored to specific tasks:

  • AlpacaEval: Uses win rate, computed as how often a powerful LLM (such as ChatGPT) prefers the outputs of the target model over those of Text-Davinci-003.
  • GSM8K and MATH: Evaluated by zero-shot accuracy in solving mathematical problems.
  • HumanEval and MBPP: Adopt pass@1, the fraction of generated code samples that pass the unit tests (see the sketch after this list).
  • GLUE Benchmark: Utilizes different metrics for different tasks: Matthews correlation coefficient for CoLA; accuracy for SST-2, QNLI, and RTE; matched accuracy for MNLI; accuracy and F1 score for MRPC and QQP; and Pearson and Spearman correlation coefficients for STS-B.
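As a quick illustration of the pass@1 metric referenced in the list above, here is a minimal sketch; the argument names are hypothetical, and the inputs are parallel per-problem counts of generated and passing samples.

```python
def pass_at_1(samples_per_problem, correct_per_problem):
    """pass@1: for each problem, the fraction of generated code samples
    that pass its unit tests, averaged over all problems."""
    fractions = [c / n for c, n in zip(correct_per_problem, samples_per_problem)]
    return sum(fractions) / len(fractions)
```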
Figure 2: Illustrations of DARE and merging models with DARE. DARE can achieve comparable performance with standard SFT when it removes 90% or even 99% delta parameters. Moreover, DARE is able to tackle the parameter interference issue when merging models and yield consistent improvements. We denote the abilities of different elements for math/code-related tasks at the top.

Figure 2 provides a clear visualization of the DARE process applied to language models. Initially, Supervised Fine-Tuning (SFT) results in a model equipped with overlapping, redundant skills. The DARE approach first streamlines this by pruning the excess, much like refining a kitchen staff from several sous-chefs to a single, more capable chef. The skill that remains isn’t just left as it is; it is enhanced or ‘upskilled’, elevating its effectiveness.

The next phase involves merging. Here, we take various SFT models that have undergone this pruning and upskilling process and combine them. This fusion results in a model that’s not just efficient in one area but is equipped with a diverse array of enhanced skills — from cooking and mathematics to coding, and more. All of this is achieved without the need for additional fine-tuning, showcasing the efficiency and versatility of the DARE method.

Key Insights

Here I’ve listed some interesting points from the paper.

Parallel to LoRA Adapters

LoRA adapters are small, targeted modifications inserted into a pre-trained model to adapt it to new tasks; they represent a focused and efficient way of recalibrating a model's capabilities. The paper's findings suggest that full fine-tuning behaves similarly: although SFT can touch every parameter, the changes that actually matter are sparse, much like how LoRA adapters tweak only certain aspects of a model's functionality.


Drawing from this parallel, merging fine-tuned models becomes akin to combining the best aspects of different sets of LoRA adapters. By doing so, the resulting model inherits a wide range of capabilities, each fine-tuned for specific tasks. This approach significantly boosts the model’s versatility and problem-solving capacity, enabling it to perform a broader spectrum of tasks more efficiently.


Tolerance to Drop Rates in Relation to Model Size

An intriguing aspect of this approach is the relationship between the size of the language models and their tolerance to drop rates. Larger models demonstrate a higher tolerance, meaning they can function effectively even with higher drop rates. For instance, WizardMath-70B shows commendable performance with a 0.99 drop rate, a feat not mirrored by its smaller counterparts, WizardMath-7B and WizardMath-13B. This trend is also observed in the WizardCoder-Python series models. The hypothesis here is that larger models, due to their inherent robustness and capability, can learn a multitude of low-rank structures akin to LoRA by fine-tuning a relatively smaller set of parameters during Supervised Fine-Tuning (SFT).
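One way to probe this tolerance is a simple sweep over drop rates, reusing the `dare` sketch from earlier; the rates mirror those studied in the paper, while the `evaluate` callback (e.g. a GSM8K zero-shot harness) and all names are illustrative assumptions.

```python
def drop_rate_sweep(base_state, sft_state, evaluate, drop_rates=(0.5, 0.9, 0.99)):
    """Apply DARE at several drop rates and score each pruned model with a
    caller-supplied `evaluate` function; larger models should hold up
    even at a 0.99 drop rate, per the paper's findings."""
    return {p: evaluate(dare(base_state, sft_state, drop_rate=p)) for p in drop_rates}
```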

Figure 3: Performance of various decoder-based LMs on AlpacaEval, GSM8K, and HumanEval.

Superior Performance of Merged Models

Another significant insight is that merged models, in some instances, outperform their single-model counterparts. This is a surprising finding, and it underscores the efficacy of the merging process: by combining the strengths of multiple models, the resultant merged model not only retains the individual capabilities but, in some cases, even surpasses them.

The intuition here is that when a model is fine-tuned for a specific task, such as coding, it might lose some of its proficiency in other areas, like general language understanding. This specialization can inadvertently narrow its capabilities. However, by merging this model with others that have been fine-tuned in different domains, you effectively reintroduce and reinforce those lost skills. For instance, combining a coding-specialized model with one fine-tuned for language understanding allows the merged model to perform better in tasks that require both coding and language skills. This merging process essentially broadens the model’s expertise, making it more versatile and effective across a wider range of tasks.

The Crucial Role of the Rescale Operation in DARE

The rescale operation in DARE is particularly crucial: it keeps the expected value of the model's outputs approximately unchanged, so that even after a drastic reduction in delta parameters, the model's output quality and consistency remain intact. Evaluations such as those in Figure 7, comparing DARE against DropOnly (dropping delta parameters without rescaling) on CoLA, MRPC, and STS-B, make the importance of the rescale operation evident.
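A one-line derivation shows why rescaling by 1/(1 − p) preserves expectations: each delta parameter survives with probability 1 − p and is scaled up by 1/(1 − p), so its expected value is unchanged.

```latex
\mathbb{E}\big[\tilde{\delta}_i\big]
  = (1 - p) \cdot \frac{\delta_i}{1 - p} \;+\; p \cdot 0
  = \delta_i
```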

Figure 7: Comparing DARE and DropOnly on CoLA, MRPC, and STS-B on encoder-based LMs.

Merging for Diverse Abilities

The culmination of these insights leads to a transformative capability: merging multiple task-specific fine-tuned models into a single, more powerful model endowed with diverse abilities. This approach not only signifies a leap in model efficiency and capability but also opens up new horizons in the application of language models across various domains.

Summary

We’ve delved into the innovative DARE (Drop And REscale) process and its transformative impact on language models, as vividly illustrated in Figure 2. The journey begins with Supervised Fine-Tuning (SFT) models, which often develop redundant skills. DARE streamlines this by pruning unnecessary elements and enhancing the remaining skills, akin to transforming a sous-chef into a master chef.

The real magic unfolds in the merging phase. Here, various pruned and upskilled SFT models are combined, resulting in a model that boasts a diverse set of enhanced capabilities, ranging from cooking and mathematics to coding. This process achieves a level of efficiency and versatility in language models without necessitating additional fine-tuning.

The DARE process not only optimizes the way we enhance language models but also opens up a world of possibilities for future AI applications, pointing toward a more versatile kind of AI development.

Hope you enjoyed and learned something new from this blog! Thanks for reading.

