E18 : Simple Synthetic Data Reduces Sycophancy in LLMs

Praveen Thenraj
Research Papers Summarized
6 min readFeb 3, 2024

--

Using a simple template-based synthetic data generated to fine-tune LLMs, helps in reducing the sycophantic behaviour of LLMs

Paper Name : Simple Synthetic Data Reduces Sycophancy In
Large Language Models

Paper URL : https://arxiv.org/pdf/2308.03958.pdf

Authors : Google DeepMind - Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le

Please find annotated paper here.

Problem Statement :

  • Sycophancy - behaviour of LLMs to tailor their responses according to user opinion even though the opinion is not correct.
  • The behaviour was observed under two settings -
    1. when no answer is correct for a given question eg, political views
    2. when answers are objectively wrong eg, 1+1=45
  • It was observed that even large-enough LLMs(PaLM - 540B) and instruction tuned LLMs exhibit this behaviour.

Solution :

  • Identifying the data (filtration) followed by generating the synthetic data and then fine-tuning the LLMs using this synthetic data helps reduce sycophantic behaviour in LLMs.
  • Filtration process is included to decide which data points should be included as part of training samples (100K) for creating the synthetic data that will be used for training the language models.
  • As part of this filtration process, the each of the model is evaluated individually with samples without user opinion. Hence each model will have its own set of training set. Only the samples that the model can answer correctly without user opinion will become the samples of the training set. The intuition behind this process is to use the user opinion only with the samples that the model already has knowledge about.
  • Synthetic data is then generated by using the 100K samples identified from filtration process for each models, from 17 NLP datasets containing 1.7M samples and using a prompt template to generate the synthetic data.
  • This synthetic data help models to understand the fact that correctness of an answer is totally independent of an user opinion during training.
  • Foundation model and instruction tuned models are then fine-tuned using the synthetic data generated.

Experimental Setup :

  • Synthetic data template has name, age, school, agree/disagree, claim and answers as fillers to be filled
Left: Data generation template . Right: Generated data
  • Name is selected randomly from 10,000 names, age is randomly selected between 30 to 90 inclusive, school is selected randomly from a list of 10 schools randomly, agree/disagree is also randomly selected.
  • Dataset used for claim generation -
    Sentiment Analysis — SST2, RT, TES
    Natural Language Inference — RTE, WNLI, SNLI, QNLI, MNLI, CB
    Paraphrase Detection — QQP, MRPC, PAWS
    Topic Classification — TREC, AGN
    Miscellaneous — TEO, TEI, COLA
  • Each of the dataset above has an input and label pair which is used as the claim in the data template during data generation process. Claim is sampled randomly from these 17 datasets containing input-label pairs and built using the format ‘input’ is ‘label’ (true claim) or ‘input’ is not ‘label’ (false claim).
  • Assistant response is chosen as Agree for a true claim and Disagree for a false claim.
  • Evaluation done for two settings
    1. datasets where there is no correct answer for a question — NLP survey, Philosophy survey, Political Topology quiz questions
    2. data where answer is given wrong — sum of two numbers generated with and without user opinion
  • Models evaluated - PaLM (8B, 62B, 62B cont, 540B) , FLAN-PaLM (8B, 62B, 62B cont, 540B)
  • Final fine-tuning of the model will include generated data and instruction tuning data (from original instruction tuning dataset) in 5:1 ratio. This setting is done to preserve the existing instruction tuned knowledge of the models and not to lose them in the course of synthetic data intervention.
  • Models are fine-tuned using the same setting as initial training process. The models are fine-tuned for 1k training steps.

Observations :

  • The models showed increase in sycophancy behaviour with the increasing model size and with instruction tuned models when evaluated on questions that do not have a correct answer.
Sycophantic behaviour increases as the model size increases and as well instruction tuned model exhibit higher sycophantic behaviour than foundation model
  • Scaling from PaLM-8B to PaLM-62B increases sycophancy by 19.8%, and further scaling from PaLM-62B to PaLM-540B results in an additional increase of 10.0%.
  • The hypothesis is that during instruction tuning, the model is tuned using human instructions. During the evaluation time the models are given user opinions, but the model fails to distinguish between the instructions (during instruction tuning) and opinions (during evaluation)that leads to the models tailoring their responses.
  • When evaluated on 2.5K of simple addition questions without user opinion, the models were able to perform quite well as close to perfect.
  • But when the same models were evaluated on the same questions along with user opinion, the performances of the model flipped totally.
Performance comparison of no user opinion with incorrect user opinion
  • Results clearly show that even large-enough model performances degrade dramatically after including incorrect user opinions. This proves that not only models tailor responses to human view, but also that models provide incorrect responses even if they know its incorrect by outweighing its prior knowledge.
  • After fine-tuning using the training data (synthetic data + instruction- tunining data in 5:1 ratio), the models exhibit lesser sycophantic behaviour for questions that do have any correct answer.
Average reduction in answers matching user view for FLAN-PaLM-8B and FLAN-PaLM-62B and FLAN-PaLM-62B cont model is 4.7%, 8.8% and 10% respectively
  • Similarly the models also exhibit far-higher accuracies on simple addition dataset even with incorrect user opinion after fine-tuning using synthetic data.
Accuracy of models increases dramatically after synthetic data intervention even with incorrect user opinion
  • Evaluations were also done on other benchmarks like MMLU, BigBench Hard to study the impact of synthetic data intervention on benchmarks, CoT, zero-shot setting. Results show that none of them were noticeably impacted because of the intervention of synthetic data to fine-tune the models to prevent sycophancy.
Impact of Synthetic data intervention on benchmarks
Evaluation of Synthetic data intervention on CoT
Evaluation of Synthetic data intervention on zero-shot

Limitations :

  • Though intervention of synthetic data to train the LLMs reduces sycophancy, the approach does not guarantee reducing such behaviour under all conditions.
  • The formats of input used to exhibit the sycophantic behaviour of models might vary from the format used in this paper.

Conclusion :

  • Sycophancy, an undesirable behaviour in LLMs which tailors response according to human view or opinions increases with increasing size of models and also in instruction tuned models.
  • Filtration of training samples and using them to generate synthetic data using a simple template, helps to train the LLMs which in turn reduces sycophancy and shows promising results to prevent this behaviour.
  • The approach also does not impact the existing learned knowledge of these models and still continues to perform the same on benchmarks, CoT and zero-shot settings.

--

--