Lookback Lens: Detect and Mitigate Hallucinations in LLMs with Attention Maps
When summarizing articles or answering questions about a given passage, LLMs can hallucinate, responding with details that are inaccurate or unsubstantiated by the provided context; these are referred to as contextual hallucinations. To address this, the authors of this paper [1] propose a simple hallucination detection model whose input features are the ratio of attention weights on the context versus the newly generated tokens, computed for each attention head. They call it the Lookback Lens, a lookback ratio-based detector.
Overview
- Proposes a simple feature called the lookback ratio, computed as the ratio of attention weights on the given context versus the newly generated tokens.
- At each time step, this lookback ratio is calculated for each attention head, and a linear classifier, called the Lookback Lens, is trained to detect contextual hallucinations from the lookback ratio features, as illustrated in the figure below.
- The detector is further integrated into decoding to derive a Lookback Lens Guided Decoding strategy that can reduce contextual hallucinations.
Contextual Hallucination Detection
i) Lookback Lens
- The authors introduce the lookback ratio, a measure based on the attention distribution of a transformer model.
- Given a transformer with $L$ layers, each with $H$ heads, the model processes an input sequence of context tokens $X = \{x_1, x_2, \dots, x_N\}$ of length $N$ followed by the newly generated tokens $Y = \{y_1, y_2, \dots, y_{t-1}\}$ to generate the next token $y_t$.
- At time step $t$, for each head, we calculate the ratio of attention weights focused on the context tokens versus the newly generated tokens.
- For each head $h$ in layer $l$, we define the average attention on the context and on the new tokens:

$$A^{l}_{h,t}(\text{context}) = \frac{1}{N}\sum_{i=1}^{N}\alpha^{l}_{h,i}, \qquad A^{l}_{h,t}(\text{new}) = \frac{1}{t-1}\sum_{j=N+1}^{N+t-1}\alpha^{l}_{h,j}$$

where $\alpha^{l}_{h,i}$ and $\alpha^{l}_{h,j}$ are the softmaxed attention weights assigned to context tokens $X$ and new tokens $Y$, respectively.
- The lookback ratio for head $h$ in layer $l$ at time step $t$ is then calculated as

$$\mathrm{LR}^{l}_{h,t} = \frac{A^{l}_{h,t}(\text{context})}{A^{l}_{h,t}(\text{context}) + A^{l}_{h,t}(\text{new})}$$

- To use lookback ratios as input features for detecting hallucinations, we concatenate them across all heads and layers into a feature vector for time step $t$:

$$v_t = \left[\mathrm{LR}^{l}_{h,t}\right]_{1 \le l \le L,\ 1 \le h \le H} \in \mathbb{R}^{L \cdot H}$$

- Given a text span of interest $\{y_t, y_{t+1}, \dots, y_{t+T-1}\}$, we average the corresponding lookback ratio vectors $\{v_t, v_{t+1}, \dots, v_{t+T-1}\}$ into a single vector $\bar{v}$ and employ a logistic regression classifier $F$ to predict whether the span is factual (1) or hallucinated (0):

$$F(\bar{v}) = P(y = 1 \mid \bar{v}) = \sigma(w^\top \bar{v} + b)$$

where $\sigma$ denotes the sigmoid function, $w$ is the weight vector, and $b$ is the bias term of the classifier (a code sketch follows below).
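A minimal sketch of this pipeline, assuming a HuggingFace-style model that returns per-layer attention tensors via output_attentions=True; function and variable names are illustrative, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lookback_ratio_features(attentions, n_context):
    """attentions: tuple over layers from a forward pass with output_attentions=True,
    each entry shaped (1, H, seq_len, seq_len); the last query row corresponds to the
    token currently being generated. n_context: number of context tokens N.
    Assumes at least one token has already been generated."""
    feats = []
    for layer_attn in attentions:
        attn = layer_attn[0, :, -1, :].float().cpu().numpy()   # (H, seq_len)
        a_context = attn[:, :n_context].mean(axis=-1)          # avg attention on context X
        a_new = attn[:, n_context:].mean(axis=-1)              # avg attention on new tokens Y
        feats.append(a_context / (a_context + a_new + 1e-9))   # lookback ratio per head
    return np.concatenate(feats)                               # feature vector of length L*H

def train_lookback_lens(span_features, span_labels):
    """span_features: (num_spans, L*H) lookback-ratio vectors averaged over each span;
    span_labels: 1 = factual, 0 = hallucinated."""
    lens = LogisticRegression(max_iter=1000)
    lens.fit(span_features, span_labels)
    return lens
```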
a) Defining Spans
The Lookback Lens predicts the probability of hallucination over spans of text. The authors consider the following two ways to obtain spans for a given sequence:
- Predefined Spans: When the hallucinated and non-hallucinated span annotations are available, we directly train the classifier to differentiate between them.
- Sliding Window: Since there are no predefined spans during decoding, a sliding window setup is used that iterates over all possible spans. Specifically, the generated text is processed into fixed-size chunks, and the classifier is trained to predict a label of 0 if any hallucinated content exists within a chunk, and 1 otherwise (a labeling sketch follows this list).
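A minimal labeling sketch for the sliding-window setup, assuming non-overlapping fixed-size chunks and per-step lookback-ratio vectors computed as above (the exact windowing and names are illustrative assumptions):

```python
import numpy as np

def sliding_window_examples(step_vectors, hallucinated_positions, window=8):
    """step_vectors: list of per-step lookback-ratio vectors v_t for the generated tokens.
    hallucinated_positions: set of generated-token indices annotated as hallucinated.
    Returns (averaged feature vector, label) pairs: label 0 if the chunk contains any
    hallucinated token, 1 otherwise."""
    examples = []
    for start in range(0, len(step_vectors), window):
        chunk = range(start, min(start + window, len(step_vectors)))
        v_bar = np.mean([step_vectors[i] for i in chunk], axis=0)   # averaged features
        label = 0 if any(i in hallucinated_positions for i in chunk) else 1
        examples.append((v_bar, label))
    return examples
```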
ii) Experimental Setup
a) Data
- Training requires labeled hallucinated and non-hallucinated examples.
- Examples were generated by prompting LLaMA-2-7B-Chat to greedily decode responses for 1,000 summarization examples from the CNN/DM dataset [2] and 2,655 QA examples from Natural Questions [3], following the setup of [4].
- GPT-4o was then used to verify the truthfulness of these responses and to provide span-level annotations of hallucinated segments.
- A pilot human-annotation study on a subset of 70 summarization examples confirmed a 97% consistency rate between GPT-4o annotations and human judgments, validating the reliability of the automated annotations.
- The table below shows the dataset statistics and GPT-4o evaluation results on responses greedily decoded by LLaMA-2-7B-Chat.
- The results show that summaries generated by LLaMA-2-7B-Chat still exhibit hallucinations about half of the time, highlighting the difficulty of the summarization task.
b) Baselines
- Text-based entailment classifier: the DeBERTa-v3-base (He et al., 2021) model is fine-tuned on the same CNN/DM and NQ data as a natural language inference (NLI) task.
- Hidden-states-based classifier: classifiers are trained with the same setting as the Lookback Lens but use input features from the hidden states of LLaMA-2-7B-Chat at its 24th, 28th, and 32nd layers instead of the lookback ratio (a brief sketch follows this list).
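For comparison, a brief sketch of how the hidden-states baseline's features might be assembled, assuming hidden states returned with output_hidden_states=True (index 0 is the embedding layer); training then mirrors the Lookback Lens sketch above, and the names are illustrative:

```python
import numpy as np

def hidden_state_features(hidden_states, layers=(24, 28, 32)):
    """hidden_states: tuple of tensors, each shaped (1, seq_len, d_model).
    Returns the concatenated hidden states at the last position for the chosen layers."""
    return np.concatenate(
        [hidden_states[layer][0, -1, :].float().cpu().numpy() for layer in layers]
    )
```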
iii) Results
- The table below shows the AUROC of the classification tasks using predefined span segmentation and a sliding window (size = 8) on NQ (QA) and CNN/DM (Sum.). The source-task scores (Train/Test) are averaged over two-fold validation.
- The Lookback Lens achieves slightly better performance than the hidden-states-based classifier and significantly outperforms the NLI models (SoTA and our implementation).
- The advantage of the Lookback Lens over the hidden-states-based classifier is more pronounced in the sliding-window setting.
- The hidden-states-based classifier tends to overfit the training sets during two-fold validation and shows a substantial performance drop when transferred to out-of-domain tasks.
- The Lookback Lens, while not always fitting the training set perfectly, consistently performs better on out-of-domain tasks.
Contextual Hallucination Mitigation
i) Lookback Lens Guided Decoding
- The Lookback Lens ($F$) is incorporated into the decoding process.
- $F$ can evaluate multi-token chunks, since each chunk induces its own attention patterns across multiple decoding steps.
- Given the context and the partially generated text, we independently sample a set of $k$ candidate chunks $\{C_1, C_2, \dots, C_k\}$ at the same decoding step $t$.
- For each chunk $C_j$, the associated lookback ratios are averaged to form a feature vector $\bar{v}_j$.
- As shown in the figure below, the best candidate $C^*$ as predicted by $F$ is selected and appended to the generation.
- This process is repeated until the model generates the EOS token or reaches the maximum length (a minimal decoding sketch follows this list).
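A minimal sketch of this guided decoding loop; sample_chunk is a hypothetical helper that samples chunk_size tokens from the LLM given the current prefix and returns the token ids together with the chunk's averaged lookback-ratio vector (computed as in the detection sketch above):

```python
import numpy as np

def lookback_guided_decode(lens, sample_chunk, k=8, chunk_size=8, max_len=256, eos_id=2):
    """lens: trained Lookback Lens classifier; sample_chunk(prefix, n) is a hypothetical
    helper returning (token_ids, averaged lookback-ratio vector) for n sampled tokens."""
    generated = []
    while len(generated) < max_len:
        # sample k candidate chunks independently at the same decoding step
        candidates = [sample_chunk(generated, chunk_size) for _ in range(k)]
        # score each candidate by the predicted probability that it is factual
        scores = [lens.predict_proba(v_bar.reshape(1, -1))[0, 1] for _, v_bar in candidates]
        best_tokens, _ = candidates[int(np.argmax(scores))]
        generated.extend(best_tokens)          # commit the best chunk and continue decoding
        if eos_id in best_tokens:
            break
    return generated
```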
ii) Experimental Setup
a) Datasets
- Natural Questions [3] data is used, following the setup of [4].
- To test the Lookback Lens's effectiveness at transferring across data distributions for the same task (summarization), 1,000 examples are sampled from the test set of XSum.
- MT-bench [5], a multi-turn instruction-following benchmark covering 8 categories, is also used.
b) Baselines
- Greedy Decoding: responses are generated from the LLaMA-2-7B-Chat model with greedy decoding.
- Other classifier-guided decoding: exactly the same setting but with different classifiers, including the text-based entailment classifier and the hidden-states-based classifier.
iii) Results
- The table below shows decoding results using 8 candidates per chunk with a chunk size of 8.
- Lookback Lens Guided Decoding improves performance both when transferring across data distributions within the same task (XSum, by 9.6%) and when transferring across tasks (NQ, by 3%).
- This result is on par with using a SoTA NLI model to guide decoding, even though that model is trained on roughly 731k annotated summarization examples, about 700× more than the 1k training examples used here.
- Decoding guided by the hidden-states-based classifier or the NLI classifier (our implementation), both trained on the same data as our method, only slightly improves performance on NQ and does not help on XSum, probably due to distribution shift, highlighting the Lookback Lens's advantage in generalization.
- The decoding method boosts performance in the hallucination setting while maintaining the same performance in the original setting, showing that it reduces hallucinations without compromising overall generation quality.
Cross-model Transfer
- Because the lookback ratio is used to capture higher-level model patterns for hallucination detection, it has the potential to transfer better across models.
- A classifier trained on one model's lookback ratios could potentially be applied to another model without retraining, provided the target model's attention patterns correlate with those of the original model.
- The Lookback Lens trained on attention maps from LLaMA-2-7B-Chat can be transferred to LLaMA-2-13B-Chat without any retraining.
- The table below shows cross-model transfer results on the detection tasks.
- Although cross-model transfer yields slightly worse results than same-model transfer, the AUROC scores remain non-trivially high.
- The table below shows cross-model transfer from LLaMA-2-7B-Chat to LLaMA-2-13B-Chat using greedy decoding and classifier-guided sampling with chunk size 8.
- We observe a performance improvement similar to same-model transfer using the 13B model itself, or to applying the SoTA NLI model to the 13B decoding.
- However, in the cross-task + cross-model transfer setting, CNN/DM (7B) to NQ (13B), we do not observe significant improvements, which we attribute to the larger distribution shift (a hedged transfer sketch follows this list).
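A hedged sketch of one way such cross-model application could look: lookback-ratio features from the 13B model are projected into the 7B feature space before applying the 7B-trained lens. The linear mapping fit on paired generations is an illustrative assumption, not necessarily the paper's exact head-mapping procedure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_feature_mapping(feats_13b, feats_7b):
    """feats_13b: (n, 40*40) lookback ratios from LLaMA-2-13B-Chat on shared prompts;
    feats_7b:  (n, 32*32) lookback ratios from LLaMA-2-7B-Chat on the same prompts."""
    mapper = LinearRegression()
    mapper.fit(feats_13b, feats_7b)   # assumption: learn a linear projection between head spaces
    return mapper

def transfer_predict(lens_7b, mapper, v_bar_13b):
    v_mapped = mapper.predict(v_bar_13b.reshape(1, -1))   # project into the 7B feature space
    return lens_7b.predict_proba(v_mapped)[0, 1]          # probability the span is factual
```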
Discussion
i) Effect of Chunk Size
- As shown in the earlier results, there is a slight trend for Lookback Lens guided decoding to prefer shorter chunk sizes for NQ and longer chunk sizes for XSum.
ii) Predictive Power of Different Heads
- The table below shows detection results for detectors trained using only the top-k heads with the largest coefficient magnitudes in the original Lookback Lens trained with all heads (a selection sketch follows this list).
- The results show that the predictive power is not concentrated in only a small subset of heads.
- Using only the top-10 heads is worse than using all heads; increasing k consistently improves performance, and the top-100 heads largely recover the performance of the model that uses all heads.
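A short sketch of the head-subset analysis, assuming the full Lookback Lens is a fitted scikit-learn LogisticRegression and span_features/span_labels are the same training data used above (names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def top_k_head_detector(full_lens, span_features, span_labels, k=100):
    coef_magnitude = np.abs(full_lens.coef_[0])        # |weight| per (layer, head) feature
    top_idx = np.argsort(coef_magnitude)[::-1][:k]     # k heads with the largest coefficients
    sub_lens = LogisticRegression(max_iter=1000)
    sub_lens.fit(span_features[:, top_idx], span_labels)   # retrain on the selected heads only
    return sub_lens, top_idx
```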
iii) Reducing Number of Layers
- The table below shows cross-task transfer AUROC for detectors trained on individual groups of layers.
- As observed in the results, the predictive power is not concentrated in any subset of layers, as none of them recovers the performance of the full model that uses all layers.
- However, the middle layers (13–16, 17–20) are slightly more useful than the other layers.
iv) Qualitative Study
- The figure below shows a qualitative example from XSum illustrating how Lookback Lens guided decoding improves performance.
- Greedy decoding from LLaMA-2-7B-Chat produces a hallucination, $100m (£64m), that does not appear in the input document.
- The Lookback Lens, however, assigns low scores to the chunk candidates that contain contextual hallucinations (marked in red).
Limitations
- The performance upper bound of Lookback Lens Guided Decoding is limited by the sampling capabilities of the LLM itself.
- Although the Lookback Lens is a lightweight classifier with negligible inference time, the need to sample multiple candidate chunks from the LLM increases the total inference time.
- The Lookback Lens relies on around 1k-2k annotated examples to train the classifier.
Conclusion
- The paper introduces the Lookback Lens, a lightweight classifier designed to detect contextual hallucinations using the lookback ratio, which is computed solely from attention weights.
- The classifier not only effectively identifies contextual hallucinations but also mitigates them through Lookback Lens Guided Decoding.
- The method is transferable across tasks, and even across models after mapping their attention heads.
References:
- [1] Chuang et al. Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps.
- [2] See et al. Get to the Point: Summarization with Pointer-Generator Networks. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
- [3] Kwiatkowski et al. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics.
- [4] Liu et al. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics.
- [5] Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.