
E24 : Does Fine-tuning LLMs On New Knowledge Encourage Hallucinations?

Praveen Thenraj
Research Papers Summarized
5 min read · Jun 9, 2024


Fine-tuning an LLM on new factual knowledge can degrade its performance by encouraging it to hallucinate

Paper Name : Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Paper URL : https://arxiv.org/pdf/2405.05904

Authors : Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, Jonathan Herzig

Please find the annotated paper here.

Problem Statement :

  • Fine-tuning LLMs on new factual knowledge that was not part of their pre-training data is a common practice for injecting unseen knowledge into LLMs.
  • Understanding the impact of such fine-tuning is important, since it is typically how LLMs are adapted to domain-specific data.

Solution :

  • Fine-tune an LLM on new factual knowledge that the model has not seen during pre-training, and evaluate the accuracy of the fine-tuned model on in-distribution and out-of-distribution (OOD) test sets (a minimal evaluation sketch follows below).
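
The evaluation itself is plain exact-match QA accuracy. Below is a minimal sketch of such an evaluation loop; the `model_generate` callable and the test-pair lists are assumptions for illustration, not the authors' PaLM 2 setup.

```python
# Minimal sketch of exact-match QA accuracy on in-distribution (ID) and
# out-of-distribution (OOD) test splits. `model_generate` is a stand-in for
# whatever fine-tuned model is being evaluated (an assumption, not the paper's API).
from typing import Callable, Iterable, Tuple

def exact_match_accuracy(model_generate: Callable[[str], str],
                         qa_pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of questions whose generated answer matches the gold answer."""
    qa_pairs = list(qa_pairs)
    correct = sum(
        model_generate(q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs)

# Usage (hypothetical data):
# id_acc  = exact_match_accuracy(model_generate, id_test_pairs)   # 12 training relations
# ood_acc = exact_match_accuracy(model_generate, ood_test_pairs)  # 7 held-out relations
```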

Approach :

  • The questions in the fine-tuning dataset are categorised into two categories: Known and Unknown.
  • The Known questions are further divided into three sub-categories: HighlyKnown, MaybeKnown and WeaklyKnown.
  • The questions are categorised using Sampling-based Categorisation of Knowledge (SliCK).
  • For each question, ‘N’ random few-shot prompts are constructed, each containing ‘k’ exemplars drawn from the training data, and used to prompt the LLM to generate responses.
  • For each prompt, the LLM generates ‘n’ responses using greedy decoding (temperature T=0) and ‘n1’ responses using sampling (temperature T=0.5).
  • The fraction of correct responses (number of correct responses / total number of responses) is defined as Pcorrect, which is used to classify a question as Known or Unknown and to further subdivide Known into HighlyKnown, MaybeKnown and WeaklyKnown (a sketch of this categorisation appears after this list).
Figure: Categorisation of training data questions
  • The training data thus created is used to fine-tune the LLM under different scenarios.
  • The chosen LLM (M) is fine-tuned on the training data (D) to obtain the fine-tuned model (Md).
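
As referenced above, the following is a minimal sketch of how a SliCK-style categorisation rule could look, assuming the thresholds described in the paper (Unknown: never correct under any decoding; HighlyKnown: greedy decoding always correct; MaybeKnown: greedy decoding sometimes correct; WeaklyKnown: correct only under sampling). The function and variable names are illustrative, not the authors' code.

```python
# Sketch of SliCK-style categorisation for a single (question, answer) pair.
#   p_greedy: fraction of correct answers under greedy decoding (T = 0)
#   p_sample: fraction of correct answers under sampling        (T = 0.5)
# Threshold logic follows the paper's definitions as summarised above;
# treat this as an illustration, not the reference implementation.
def slick_category(p_greedy: float, p_sample: float) -> str:
    if p_greedy == 0.0 and p_sample == 0.0:
        return "Unknown"        # never answered correctly under any decoding
    if p_greedy == 1.0:
        return "HighlyKnown"    # greedy decoding is always correct
    if p_greedy > 0.0:
        return "MaybeKnown"     # greedy decoding is sometimes correct
    return "WeaklyKnown"        # correct only when sampling with T > 0

# Example: a fact answered correctly in 3 of 10 greedy runs and 9 of 160
# sampled runs would be MaybeKnown.
print(slick_category(0.3, 9 / 160))  # -> MaybeKnown
```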

Experimental Setup :

  • LLM - PaLM2-M
  • Dataset - ENTITYQUESTIONS - question-answer pairs derived from (subject, relation, object) triplets covering a diverse set of relations in Wikidata
  • 12 relations are considered for training and in-distribution testing, and 7 relations for OOD testing.
  • N - 10, k - 4, n - 1 (greedy decoding), n1 - 16 (sampling)
  • Total responses per training question used for categorisation - 170:
    10 (greedy - 1 per each of the N prompts) + 160 (sampling - 16 per each of the N prompts)
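
As a quick sanity check on that response budget, assuming exactly the counts listed above (N = 10 prompts per question, 1 greedy and 16 sampled completions per prompt):

```python
# Per-question response budget used for SliCK categorisation,
# using the counts reported in this summary (assumed values).
N  = 10   # random few-shot prompts per question
n  = 1    # greedy completions per prompt (T = 0)
n1 = 16   # sampled completions per prompt (T = 0.5)

greedy_responses  = N * n    # 10
sampled_responses = N * n1   # 160
print(greedy_responses + sampled_responses)  # 170 responses per question
```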

Observations :

  • To study the performance of the LLM when fine-tuned with Unknown samples, the model was fine-tuned with training data containing X% Unknown and (100-X)% Known examples (a sketch of how such splits can be constructed appears after this list).
  • Irrespective of the % of Unknown examples in the training data, the accuracy of the fine-tuned model degraded on the test data, both with full convergence (fine-tuning for 50 epochs) and with early stopping (fine-tuning for 5–10 epochs). The degradation was smaller with early stopping than with fine-tuning to convergence. In general, degradation grows with a higher % of Unknown data and more epochs.
Figure: % Unknown in training data (D) vs accuracy
  • When the Unknown examples were removed from the training data (D) and the model was fine-tuned only on the remaining Known data (Dknown), performance under early stopping (5–10 epochs) was essentially identical to fine-tuning on the full D, indicating that Unknown examples have a neutral effect in the early-stopping scenario. When fine-tuned to convergence, however, performance on the full D degraded considerably relative to Dknown, showing that Unknown examples become harmful when trained for a longer duration.
Figure: Dknown vs accuracy
  • Further analysis shows that Unknown examples tend to be fitted more slowly than Known examples. This explains the smaller performance degradation when models are fine-tuned with early stopping even when Unknown samples are present.
Figure: Training accuracy vs type of training data (Known, Unknown)
  • Experiments show that an approximately linear relationship exists between the fraction of training examples the model fits during fine-tuning and the test accuracy.
  • When tested on OOD data (7 relations of ENTITYQUESTIONS), models fine-tuned with more Unknown examples continue to perform poorly. This shows that using Unknown data for fine-tuning can hurt the fine-tuned model's performance on OOD data as well. The effect may not be evident when the model is fine-tuned with early stopping since, as observed above, the model fits Unknown examples more slowly.
  • A further study examined the performance of the fine-tuned model when only a single data category was used as the fine-tuning data.
  • When only HighlyKnown examples are used for fine-tuning, the model performs well on the other classes of test data (MaybeKnown, WeaklyKnown, Unknown) but, interestingly, not as well as a model fine-tuned only on MaybeKnown examples, both with early stopping and at convergence.
Figure: MaybeKnown training data outperforms HighlyKnown training data on test data
  • This result is interesting and emphasises the need to include data on which the model has comparatively weaker knowledge as part of the fine-tuning training data.
  • Together, these experiments indicate that during fine-tuning the model mainly learns to align the style or format of its responses to the examples in the training data, drawing on knowledge it already acquired during pre-training, rather than learning the new knowledge introduced in the fine-tuning data.
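
The ablations above all reduce to re-sampling the fine-tuning set from the SliCK buckets. A minimal sketch of that construction is shown below; the `examples` structure and helper names are hypothetical, assuming each example already carries its SliCK category.

```python
# Sketch of building the fine-tuning variants described above:
#   * a set D with X% Unknown and (100 - X)% Known examples,
#   * a set Dknown that simply drops the Unknown examples,
#   * single-category sets (e.g. only MaybeKnown or only HighlyKnown).
# `examples` is a hypothetical list of dicts with a "category" field.
import random

def mixed_split(examples, unknown_fraction, size, seed=0):
    """Sample `size` examples with the requested share of Unknown items."""
    rng = random.Random(seed)
    unknown = [e for e in examples if e["category"] == "Unknown"]
    known = [e for e in examples if e["category"] != "Unknown"]
    n_unknown = round(size * unknown_fraction)
    split = rng.sample(unknown, n_unknown) + rng.sample(known, size - n_unknown)
    rng.shuffle(split)
    return split

def single_category_split(examples, category):
    """Keep only one SliCK category, e.g. 'MaybeKnown' or 'HighlyKnown'."""
    return [e for e in examples if e["category"] == category]

# Usage (hypothetical data):
# d_50    = mixed_split(all_examples, unknown_fraction=0.5, size=5000)
# d_known = [e for e in all_examples if e["category"] != "Unknown"]
# d_maybe = single_category_split(all_examples, "MaybeKnown")
```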

Limitations :

  • The experiments were conducted with only one LLM (PaLM 2). The same behaviour needs to be studied with other LLMs to draw firmer conclusions.
  • Though SliCK categorisation works for the short-form QA used here, applying it to long-form question answering could be challenging, for example when evaluating whether a generated answer is correct.

Conclusion :

  • While the general notion in machine learning is that a pre-trained model learns new knowledge during fine-tuning, this work calls that assumption into question.
  • This in turn emphasises the need to understand what a model already knows from pre-training before fine-tuning it on specific data.
