
E22 : TIES Merging

Praveen Thenraj
Research Papers Summarized
7 min read · Apr 26, 2024


Retaining only the top-k% most influential parameter values, electing a sign for each parameter, and then averaging only the values that agree with the elected sign when merging multiple fine-tuned models improves merging performance compared to simple averaging or addition.

Paper Name : TIES-MERGING: Resolving Interference When Merging Models

Paper URL : https://arxiv.org/abs/2306.01708

Authors : Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal

Conference : NeurIPS (2023)


Problem Statement :

  • Pre-trained models generally work well, but require fine-tuning for specific downstream tasks and for domain adaptation.
  • Fine-tuning an individual model for every downstream task becomes costly in terms of training, storage, deployment, and maintenance.
  • Multi-task learning, a technique of fine-tuning one model on multiple tasks, can be a viable and efficient alternative, but it brings its own complexity, such as model training cost and the expensive compute needed to identify the right mix of training data.
  • Given the size of the models being developed today, fine-tuning a model for every downstream task, maintaining an ensemble of models for a single task, or training a single multi-task model can all be extremely costly.
  • Existing model merging techniques do not account for two sources of interference during merging: redundant parameter values drowning out influential ones, and disagreement between the signs of the parameter values being merged.

Solution :

  • Merging models fine-tuned for different downstream tasks into a single multi-task model, or merging different models fine-tuned for the same downstream task, without any further training is called model merging.
  • Rather than merging by simply averaging or adding the parameters of the different models, TIES merges them in three steps: trimming, sign election, and disjoint merging.

Approach :

  • TIES-Merging comprises three steps - Trim, Elect Sign, and Disjoint Merge.
TIES - TRIM, Elect Sign, Disjoint Merge
  • Assume there are ’n’ fine-tuned models for ’n’ different tasks. Let a model parameter be represented as ‘θ’, where θinit is the parameter of the pre-trained model and θft is the corresponding parameter of a fine-tuned model.
  • Given a parameter ‘θ’, the task vector (τ) is calculated for each fine-tuned model by subtracting the pre-trained model parameter from the corresponding fine-tuned model parameter.
τ = θft − θinit, where θft is the task-specific fine-tuned parameter and θinit is the pre-trained model parameter
  • For a given task ‘t’, if the value at position p of the task vector (p = 1, 2, …, d, where ‘d’ is the dimension of the parameter) is non-zero (τ ≠ 0), that value is regarded as influential for that task. If there is no change (τ = 0), the value is considered redundant.
  • Trim - Retain only the top-k% largest-magnitude values of each task vector and reset the remaining values to zero. In this way, the redundant values in the parameter ‘θ’ are removed.
  • Elect Sign - For each position of the parameter ‘θ’, the positive values across all trimmed task vectors are summed, and the negative values are summed separately. The sign whose total magnitude is larger is elected as the final sign for that position (equivalently, the sign of the sum of the trimmed task-vector values).
  • Disjoint Merge - Given the trimmed task vectors, for each position only the values whose sign agrees with the elected sign are averaged. This gives the merged task vector (τm).
  • Once the merged task vector is computed, the updated parameter ‘θ’ of the merged model is calculated using the formula below (a code sketch of the full procedure follows this list).
θm = θinit + λ · τm, where θm is the merged model parameter, θinit is the pre-trained model parameter, τm is the merged task vector (calculated using TIES-Merging), and λ is a scaling factor
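To make the three steps concrete, below is a minimal sketch of the procedure in PyTorch, assuming each checkpoint is available as a flattened parameter vector. The function name ties_merge, the top-k fraction k, and the scaling factor lam are illustrative choices for this sketch and are not taken from the authors’ released code.

```python
import torch

def ties_merge(theta_init, theta_fts, k=0.2, lam=1.0):
    """Minimal TIES-Merging sketch over flattened parameter vectors.

    theta_init : 1-D tensor of pre-trained parameters
    theta_fts  : list of 1-D tensors, one fine-tuned model per task
    k          : fraction of values kept per task vector (top-k% by magnitude)
    lam        : scaling factor applied to the merged task vector
    """
    # Task vectors: fine-tuned parameters minus pre-trained parameters.
    taus = [theta_ft - theta_init for theta_ft in theta_fts]

    # 1) Trim: keep only the top-k% largest-magnitude values per task vector,
    #    reset the rest (the redundant values) to zero.
    trimmed = []
    for tau in taus:
        n_keep = max(1, int(k * tau.numel()))
        threshold = tau.abs().topk(n_keep).values.min()
        trimmed.append(torch.where(tau.abs() >= threshold, tau, torch.zeros_like(tau)))
    trimmed = torch.stack(trimmed)                      # shape: (n_models, d)

    # 2) Elect Sign: for each position, pick the sign with the larger total
    #    magnitude across models (the sign of the summed trimmed values).
    elected_sign = torch.sign(trimmed.sum(dim=0))

    # 3) Disjoint Merge: average only the values whose sign agrees with the
    #    elected sign at that position.
    agrees = (torch.sign(trimmed) == elected_sign) & (trimmed != 0)
    counts = agrees.sum(dim=0).clamp(min=1)             # avoid division by zero
    tau_m = (trimmed * agrees).sum(dim=0) / counts

    # Merged model parameters: theta_m = theta_init + lam * tau_m
    return theta_init + lam * tau_m


# Toy usage: merge two "fine-tuned" parameter vectors around a zero init.
theta_init = torch.zeros(5)
theta_fts = [torch.tensor([0.9, -0.1, 0.5, 0.0, -0.7]),
             torch.tensor([1.1,  0.2, -0.6, 0.0, -0.8])]
theta_merged = ties_merge(theta_init, theta_fts, k=0.6)
```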

Experimental Setup :

  • TIES was evaluated against three baseline settings: task-specific fine-tuned models, task-specific fine-tuned models merged using other merging techniques, and task-specific models fine-tuned using PEFT techniques.
  • Text datasets used for fine-tuning - Question answering (QASC, WikiQA, and QuaRTz), Paraphrase Identification (PAWS), Sentence Completion (Story Cloze), and Coreference Resolution (Winogrande and WSC).
  • Vision datasets used for fine-tuning - Cars, DTD, EuroSAT, GTSRB, MNIST, RESISC45, SUN397, and SVHN
  • Individual NLP models fine-tuned on text datasets - T5-Base, T5-Large
  • Individual Vision models fine-tuned on vision datasets - ViT-B/32, ViT-L/14
  • Fine-tuned NLP models considered for merging - T5-Base, T5-Large
  • Fine-tuned Vision models considered for merging - ViT-B/32, ViT-L/14
  • PEFT fine-tuned model considered for merging - T0-3B
  • Other merging techniques evaluated - RegMean, Fisher Merging, Task Arithmetic

Observations :

  • The TIES method outperformed both task-specific fine-tuned individual models and merged models created using other merging techniques, across both text and vision datasets.
Comparison of PEFT vs Full fine-tuning vs Other merging techniques vs TIES merged text and vision models
  • TIES also outperformed models fine-tuned using PEFT techniques, as well as the same models merged using other merging techniques.
  • When evaluated on out-of-domain datasets that were not part of fine-tuning, T5 fine-tuned models merged using TIES outperformed the same fine-tuned models merged using other techniques such as RegMean, Fisher Merging, and Task Arithmetic.
Performance of TIES on out of domain data compared to other merging techniques
  • As the number of tasks being merged increases, the performance of models merged using TIES degrades more slowly than that of models merged using other techniques such as Task Arithmetic and simple averaging. This shows that the performance of a merged model is affected by interference between parameters; since TIES explicitly handles this interference, its performance degrades more slowly than that of other techniques.
Performance of TIES merging when combining models fine-tuned for different tasks
  • When merging multiple models fine-tuned on the same task, models merged using TIES are more robust and show better performance than the same fine-tuned models merged using other techniques such as averaging, Fisher Merging, Task Arithmetic, and even ensembling.
  • When models fine-tuned on different tasks are merged using TIES and the merged model is used as the initialisation for fine-tuning on another downstream task, the resulting fine-tuned model outperforms initialisations produced by other techniques such as Task Arithmetic and averaging, and it outperforms pre-trained-model initialisation by a large margin. This shows that TIES-merged models can be a good starting point for downstream tasks (see the sketch after this list).
  • Results show that flipping p% of the signs of the top-k% influential values in the task vector of the TIES-merged model hurts its performance: flipping 20–60% of the influential signs degrades performance drastically, whereas flipping the signs of the remaining (100−k)% redundant values has little impact.
  • This experiment shows that interference from redundant parameter values on influential parameter values significantly reduces the overall performance of the merged model, and highlights the need to avoid such interference during model merging.
  • Results also show that taking the parameter signs of a multi-task fine-tuned model (treated as oracle signs) and combining them with the values obtained from the Trim and Disjoint Merge steps of TIES produces a merged model that performs close to the multi-task fine-tuned model. This clearly shows the importance of the signs when merging models.
  • All these results show that the performance of model merging is impacted by interference of redundant parameter values with influential parameter values, and by interference due to disagreement between parameter signs.
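As an illustration of the initialisation observation above, here is a hedged sketch of using a TIES-merged model as the starting point for further fine-tuning. The toy nn.Linear model, the optimiser choice, and the reuse of the ties_merge function from the Approach section are assumptions made for illustration, not the paper’s actual setup.

```python
import torch
import torch.nn as nn

# Toy stand-in architecture; in practice this would be the same T5/ViT
# architecture that the fine-tuned checkpoints share.
model = nn.Linear(4, 2)
theta_init = nn.utils.parameters_to_vector(model.parameters()).detach()

# Pretend these are two task-specific fine-tuned checkpoints (flattened).
theta_fts = [theta_init + 0.1 * torch.randn_like(theta_init) for _ in range(2)]

# Merge with the ties_merge sketch from the Approach section above,
# then load the merged parameters back into the model as its initialisation.
theta_merged = ties_merge(theta_init, theta_fts, k=0.5)
nn.utils.vector_to_parameters(theta_merged, model.parameters())

# Fine-tune on the new downstream task as usual from this initialisation.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```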

Conclusion :

  • Given the evolution of LLMs and their sizes, techniques like model merging, which require no additional training, can be a game changer compared to conventional approaches like task-specific fine-tuning or multi-task fine-tuning.
  • Compared to other available merging techniques, TIES stands out because it considers the interference between parameters during model merging.
  • Results show that interference from redundant values drowning out influential values of a parameter, and interference from sign disagreement between parameters, play a pivotal role in the performance of the merged model.
