E25 : The Unreasonable Ineffectiveness of the Deeper Layers

Praveen Thenraj
Research Papers Summarized
6 min read · Jun 16, 2024

Pruning ‘n’ deep layers of an LLM and then healing the pruned model with QLoRA fine-tuning aims to match the performance of the unpruned LLM while reducing the compute required for fine-tuning and the memory and compute required at inference time.

Paper Name : The Unreasonable Ineffectiveness of the Deeper Layers

Paper URL : https://arxiv.org/pdf/2403.17887

Authors : Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts

Please find the annotated paper here.

Problem Statement :

  • The capabilities of language models have improved drastically in the past few years, which can largely be attributed to the increase in the size of these models.
  • An increase in model size in turn means an increase in the compute resources required to train and fine-tune these models, as well as more memory and compute at inference time.
  • Parameter-efficient fine-tuning (PEFT) techniques like LoRA reduce the number of trainable parameters during fine-tuning, but they do not reduce the memory and compute requirements at inference time.
  • All these factors make it challenging to deploy LLMs as part of real-time production use cases.

Solution :

  • Removing ‘x’ redundant layers of an LLM - layers whose output representations are very similar to those of the preceding layer and which therefore have little impact on the overall performance of the model - can reduce both the compute and the memory requirements during fine-tuning and inference, without degrading model performance considerably.
  • Pruning the model and then healing it by fine-tuning with a PEFT technique like QLoRA pushes ‘x’ (the number of layers that can be pruned) even further before a sharp transition in performance is observed.

Approach :

  • A sample of the pre-training data or of the downstream-task data is used to compute the layer representations needed for the pruning experiments.
  • In an LLM built from residual blocks, the input to layer ℓ+1 is the input to layer ℓ plus the update computed by layer ℓ: x^(ℓ+1) = x^(ℓ) + f(x^(ℓ), θ^(ℓ)), where x^(ℓ) is the input to layer ℓ and θ^(ℓ) are that layer’s parameters. In other words, each layer only adds an (ideally small) update on top of the residual stream.
  • Structured pruning is done by removing ‘n’ layers of an LLM that has ‘L’ layers in total.
  • For a chosen block size ‘n’, the angular distance between the input representation of each candidate layer ‘ℓ’ and the input representation of layer ‘ℓ+n’ is calculated on the sampled data.
  • The layer whose input representation is closest (smallest angular distance) to the input representation of the layer ‘n’ blocks deeper is marked as ‘ℓ*’.
  • The ‘n’ layers from ‘ℓ*’ up to (but not including) ‘ℓ*+n’ are dropped, the intuition being that if the representation barely changes across this block, the block does not contribute much to the model (see the code sketch after this list).
  • After pruning, the old input to layer ‘ℓ*’ is connected directly as the input to layer ‘ℓ*+n’.
  • Optionally, the mismatch introduced at this new connection is healed by fine-tuning the pruned model with QLoRA, using either the pre-training data or a sample of the downstream-task data.
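The angular distance used here is d(x^(ℓ), x^(ℓ+n)) = (1/π)·arccos( x^(ℓ)·x^(ℓ+n) / (‖x^(ℓ)‖‖x^(ℓ+n)‖) ), computed on the final-token representation and averaged over the sampled examples. Below is a minimal sketch of the block-selection step, assuming the hidden states come from a Hugging Face causal LM run with output_hidden_states=True (so hidden_states[ℓ] is the input to block ℓ); it illustrates the idea and is not the authors’ code.

```python
import math
import torch
import torch.nn.functional as F

def angular_distance(x_l: torch.Tensor, x_ln: torch.Tensor) -> torch.Tensor:
    """(1/pi) * arccos(cosine similarity) between the final-token representations
    entering layer l and layer l+n, averaged over the batch."""
    a = x_l[:, -1, :]    # final-token hidden state entering layer l
    b = x_ln[:, -1, :]   # final-token hidden state entering layer l+n
    cos = F.cosine_similarity(a, b, dim=-1).clamp(-1.0, 1.0)
    return (torch.acos(cos) / math.pi).mean()

def select_block_to_prune(hidden_states, n: int) -> int:
    """Return l*, the start of the n-layer block whose input is closest (in angular
    distance) to the input of layer l*+n; layers l*, ..., l*+n-1 are then dropped.

    hidden_states is the tuple from a forward pass with output_hidden_states=True:
    hidden_states[l] is the input to block l, hidden_states[L] is the final output.
    """
    num_layers = len(hidden_states) - 1
    # Exclude candidates that would drop the final layer, which the paper warns against.
    distances = [angular_distance(hidden_states[l], hidden_states[l + n])
                 for l in range(num_layers - n)]
    return int(torch.argmin(torch.stack(distances)))
```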

Experimental Setup :

  • LLMs evaluated - Llama-2-7B, Llama-2-13B, Llama-2-70B, Qwen-7B, Qwen-14B, Phi-2-2.7B, Mistral-7B
  • PEFT technique used - QLoRA
  • LoRA ranks used (a QLoRA configuration sketch follows this list) :
    64 - Llama-2-7B, Qwen-7B, Qwen-14B, Phi-2-2.7B
    2 - Llama-2-7B; 8 - Llama-2-70B; 4 - Mistral-7B
  • Benchmarks evaluated - QA evals: MMLU, BoolQ; next-token prediction loss: C4 (Colossal Clean Crawled Corpus)
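As a rough illustration of the healing setup, the sketch below shows a typical QLoRA configuration with the Hugging Face transformers and peft libraries (4-bit NF4 base model plus LoRA adapters). The checkpoint name, target modules, and all hyperparameters other than the rank are illustrative assumptions, not the paper’s exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the (pruned) base model with 4-bit NF4 quantization, as in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # assumed checkpoint; in practice this is the pruned model
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; only these low-rank matrices are trained during healing.
lora_config = LoraConfig(
    r=64,                         # one of the ranks listed above; illustrative choice
    lora_alpha=16,                # illustrative value
    lora_dropout=0.05,            # illustrative value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```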

Observations :

  • Experiments reveal that the performance of the pruned models (without healing) remains robust and in line with the unpruned models until approximately 45-55% of layers are dropped for the Llama family, 20% for the Qwen models, 35% for Mistral, and 25% for Phi.
Figure: Performance comparison of different LLMs as a function of the fraction of layers dropped
  • Beyond a certain fraction of dropped layers, the models that were not healed showed a sharp transition in performance, dropping towards random-guessing accuracy.
  • The same models, when healed with QLoRA fine-tuning, could be pruned further than the non-healed models: performance degradation in the healed models occurred only after a comparatively larger fraction of layers was removed.
  • When measuring the next-token prediction (cross-entropy) loss on the C4 dataset, the loss of the non-healed models showed a sharp transition towards the random-guessing value much earlier than that of the healed models.
  • Results also show that the performance of smaller LLMs tends to degrade faster than that of comparatively larger models. A possible intuition is that these smaller models are already heavily trained for their size and therefore contain fewer redundant layers.
  • In contrast, the performance of the healed models degraded gradually, and healing also allowed more layers to be pruned than in the non-healed models.
  • A heat-map study of the angular distance as a function of layer number and number of blocks removed shows that the deeper layers are more similar to one another than the shallower layers. The same study also suggests that the final layer should never be pruned, since in all cases it shows maximum dissimilarity.
Figure: Similarity of representations as a function of layer number and block size
  • Based on this study, a simple pruning heuristic was derived: remove the deepest ‘n’ layers of the LLM while always keeping the final layer (see the sketch after this list). With this heuristic, there is no need to load the model onto a GPU and compute similarities to decide which layers to prune; the deepest layers are simply removed directly.
  • Without healing, the performance degradation was more drastic when Llama-2-70B was pruned using this simple heuristic than when it was pruned using the similarity-informed method, as observed in both accuracy (MMLU, BoolQ) and loss (C4).
Figure: Evaluation of simple heuristic pruning vs. similarity-informed pruning on Llama-2-70B
  • When the same Llama-2-70B model was pruned using the simple heuristic and then healed with QLoRA, its performance degradation shifted to be in line with that of the model pruned using similarity-informed pruning and healed with QLoRA.
  • Experiments also showed that a lower LoRA rank has a regularising effect during fine-tuning and thus reduces overfitting. This is reflected in the improved MMLU accuracy, despite an increased C4 validation loss, when using lower LoRA ranks compared with higher ranks.
  • Results also show that model performance remains robust when changing the order as well as the choice of examples in the few-shot prompts used for MMLU and BoolQ evaluation, whereas in general such changes can shift a model’s measured performance.
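For the simple heuristic mentioned above, pruning can be done without any similarity computation. A minimal sketch, assuming a Hugging Face Llama-style model whose decoder blocks live in model.model.layers (the layer container name varies across architectures):

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

def prune_deepest_layers(model, n: int):
    """Heuristic pruning: drop the n layers just before the final decoder layer."""
    layers = model.model.layers                        # nn.ModuleList of decoder blocks
    total = len(layers)
    keep = list(range(total - n - 1)) + [total - 1]    # shallow layers + the final layer
    model.model.layers = nn.ModuleList([layers[i] for i in keep])
    model.config.num_hidden_layers = len(model.model.layers)
    # Note: depending on the transformers version, per-layer attributes such as
    # layer_idx (used for KV caching) may also need to be re-indexed.
    return model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
model = prune_deepest_layers(model, n=8)               # e.g. drop 8 of the 32 layers
# The pruned model would then be healed with QLoRA as in the earlier sketch.
```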

Conclusion :

  • With the emerging trend of using PEFT to fine-tune LLMs efficiently and make them accessible at lower inference cost, combining a pruning technique with PEFT as the healing step can reduce the overall compute required for fine-tuning as well as the compute and memory required at inference time.
  • This work evaluates model performance mainly on QA tasks; performance on other downstream tasks also needs to be evaluated to understand the effectiveness of the proposed approach.
  • Evaluation on other tasks would also help reveal which layers of an LLM are needed to store knowledge about a given task.
