E13 : Noisy Embeddings Improve Instruction FineTuning (NEFTune)

Praveen Thenraj
Research Papers Summarized
4 min read · Nov 5, 2023

Adding noise to the token embeddings while instruction-tuning a pre-trained model reduces overfitting and improves the generalisation of LLMs

Paper Name : Noisy Embeddings Improve Instruction FineTuning

Paper URL : https://arxiv.org/pdf/2310.05914.pdf

Authors : Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

Please find annotated paper here

Problem Statement :

  • With evolving improvements in LLMs, research and growth in this domain is focussed mostly on scaling model sizes and improving benchmark scores
  • Whether, and to what extent, regularisation techniques are included in LLM training or fine-tuning remains an open question
  • This in turn may lead LLMs to overfit their training data and hurt their performance

Approach :

  • Introducing random noise, a well-known regularisation technique, into the fine-tuning process helps reduce overfitting
Adding noise to the token embeddings (X)
  • During fine-tuning of a pre-trained model on a dataset, noise is added to the model's token embeddings
  • The model is then fine-tuned on the dataset using these noised token embeddings
  • The noise is generated by sampling each entry from a Uniform distribution on [-1, 1] (a Gaussian variant is also tested) and scaling by a factor α/√(Ld)
  • α - tunable parameter, L - sequence length of the model, d - embedding dimension
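The scaling rule above can be sketched in a few lines of NumPy. This is a minimal illustration of the noise step, not the authors' implementation (which operates on PyTorch embedding outputs during the forward pass); the function name and defaults are my own.

```python
import numpy as np

def neftune_noise(token_embeddings, alpha=5.0, rng=None):
    """Add NEFTune-style uniform noise to token embeddings of shape (L, d).

    Each noise entry is sampled i.i.d. from Uniform[-1, 1] and the whole
    noise tensor is scaled by alpha / sqrt(L * d), so the expected norm of
    the perturbation is controlled by the tunable parameter alpha.
    """
    rng = rng or np.random.default_rng()
    L, d = token_embeddings.shape
    noise = rng.uniform(-1.0, 1.0, size=(L, d)) * (alpha / np.sqrt(L * d))
    return token_embeddings + noise
```

During fine-tuning this perturbation is applied only to the training forward passes; at inference time the embeddings are used unmodified.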

Experimental Setup :

  • NEFTune was tried on two tasks - conversational question answering and factual question answering
  • OPT-6.7B, LLaMA-1-7B, and LLaMA-2-7B were primarily fine-tuned with and without NEFTune
  • Instruction-tuning datasets considered for fine-tuning - Alpaca, Evol-Instruct, ShareGPT, Open-Platypus
  • Evaluation datasets considered
    1. AlpacaEval - a dataset of 805 instruction/output pairs
    2. OpenLLM Leaderboard datasets - ARC (multiple-choice classification), HellaSwag (commonsense NLI), MMLU, TruthfulQA (factual QA)

Observations :

  • OPT-6.7B, LLaMA-1-7B, and LLaMA-2-7B fine-tuned with NEFTune show significant improvements over the same models fine-tuned with the standard procedure across all the fine-tuning datasets - Alpaca, Evol-Instruct, Open-Platypus, and ShareGPT
  • This performance gain was measured on AlpacaEval, with an average improvement of 15.1%
Performance comparison of models tuned with/without NEFTune on AlpacaEval dataset
  • NEFTune also improves chat-tuned models: LLaMA-2-Chat-7B gains a further 10%, even though this model has already been extensively tuned with RLHF
LLaMA-2-Chat(7B) shows a 10% gain even though it has already been extensively tuned
  • No performance gains were observed on OpenLLM leaderboard datasets even after NEFTune
Performance comparison of models tuned with/without NEFTune on ARC,HellaSwag,MMLU,TruthfulQA datasets
  • This suggests that NEFTune improves conversational ability and answer quality, but does not improve capabilities such as reasoning or factual recall
  • Models fine-tuned with NEFTune on top of QLoRA still show gains on AlpacaEval, though not as large as with full fine-tuning plus NEFTune.
  • Analysis was done to understand the effect of NEFTune on overfitting, length vs diversity, impact of length in response quality, embedding similarity
  • Models fine-tuned with NEFTune showed higher training loss but lower test loss than models fine-tuned without it - a sign of reduced overfitting.
Comparison of training and testing loss of models fine-tuned with/without NEFTune
  • Models tuned with and without NEFTune were prompted with their own training instructions. NEFTune models score lower on ROUGE-L and BLEU against the ground-truth training responses. Since these metrics reward reproducing the same words in the same order, the higher scores of the standard fine-tuned models suggest they memorised and replicated the training responses.
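To make the memorisation measurement concrete, here is a self-contained ROUGE-L (F1) sketch based on the longest common subsequence, as used in this analysis. This is my own minimal implementation for illustration; the paper presumably uses a standard metrics library.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # Extend the LCS on a match, otherwise carry the best prefix score
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two whitespace-tokenised strings."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A model that reproduces a training response verbatim scores 1.0, while a paraphrase with the same meaning scores much lower - which is why low ROUGE-L against training data indicates less memorisation.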
  • NEFTune models generate longer responses, not through repetition of words but through the addition of detail.
  • Standard fine-tuned models, when prompted to generate longer responses, improve on AlpacaEval but still fall short of NEFTune models.
  • Uniform noise generally led to shorter responses than Gaussian noise, yet of comparatively better quality. This shows that generating longer responses does not by itself mean better responses.
Gaussian noise embeddings generate longer responses but are qualitatively less compared to Uniform noise embeddings
  • The similarity of each token embedding to the other tokens in the model's vocabulary was compared before and after adding noise. Fewer than 0.4% of these similarity rankings flipped, showing that NEFTune training does not change the semantic relationships between tokens.
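One simple way to quantify such flips is to check, for each token, whether its nearest cosine-similarity neighbour in the vocabulary changes after the perturbation. This sketch is my own reconstruction of that kind of check, not the authors' exact analysis code.

```python
import numpy as np

def nearest_neighbor_flips(emb, noisy):
    """Fraction of tokens whose nearest cosine neighbour changes after noise.

    emb, noisy: (vocab_size, d) embedding matrices before/after perturbation.
    """
    def nearest(M):
        X = M / np.linalg.norm(M, axis=1, keepdims=True)  # unit-normalise rows
        sim = X @ X.T                                     # pairwise cosine similarity
        np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
        return sim.argmax(axis=1)                         # index of nearest neighbour

    return float((nearest(emb) != nearest(noisy)).mean())
```

A flip rate near zero, as reported (<0.4%), means the noise is small relative to the gaps between embeddings, so the vocabulary's neighbourhood structure is preserved.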

Conclusion :

  • In an era where pre-trained LLMs are fine-tuned for task-specific use, whether and how much regularisation to include remains an open question.
  • Adding noise to the token embeddings during fine-tuning helps LLMs overcome overfitting and keeps models from memorising their training data.
  • NEFTune taps the latent knowledge of pre-trained models to generate higher-quality responses.
  • However, there is no clear evidence yet of why NEFTune helps conversational tasks but not factual or reasoning QA.
