Understanding Hallucinations in Language Models: Statistical Roots and Evaluation Incentives
Language models often produce statements that sound plausible but are false. These “hallucinations” erode trust. But what if many hallucinations are not just flaws of the training data or architecture, but statistical properties intrinsic to how models are trained, reinforced by evaluation incentives that reward guessing over admitting uncertainty?
In Why Language Models Hallucinate (Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang; 2025), the authors argue that hallucinations naturally arise in standard language model training pipelines, both during pretraining and post‐training. Hallucinations are analysed through the lens of binary classification (valid vs. error outputs). Even with error‐free training data, statistical pressures make some error rate inevitable. Moreover, because many benchmarks and evaluations reward confident guesses and penalize abstentions or uncertainty, models have strong incentives to “hallucinate” rather than say “I don’t know.”
The authors’ approach is theoretical:
• They formalize the space of plausible outputs as partitioned into valid vs. error responses, and relate the generative error rate to a classifier’s misclassification error (the “Is‐It‐Valid” problem).
• They derive lower bounds on the error rates that arise in pretraining, driven by factors like the “singleton rate” (the fraction of facts that appear only once in the training data), which make some errors unavoidable; a rough sketch of these bounds follows this list.
• They then show how post‐training (e.g. fine‐tuning, reinforcement learning from human feedback) does not eliminate hallucinations if the evaluation metrics continue to reward confident but incorrect answers and punish uncertainty.
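As a rough sketch of what those results look like (my own informal paraphrase; the paper’s theorems state the exact constants and include additional calibration and class‐imbalance correction terms that I omit here):

```latex
% Informal paraphrase of the paper's two kinds of lower bounds; the exact
% constants and correction terms are in the paper itself.
% (Assumes amsmath/amssymb for \text and \gtrsim.)

% Reduction to the "Is-It-Valid" (IIV) classification problem: the
% generative error rate of the base model is lower-bounded, up to a
% constant factor c and correction terms, by the best achievable IIV
% misclassification rate.
\[
  \mathrm{err}_{\text{generative}} \;\gtrsim\; c \cdot \mathrm{err}_{\text{IIV}}
\]

% Singleton-rate bound: for facts with no learnable pattern, if sr is the
% fraction of facts whose answer appears exactly once in the training
% data, then roughly
\[
  \text{hallucination rate} \;\gtrsim\; \mathrm{sr},
  \qquad
  \mathrm{sr} = \frac{\#\{\text{facts seen exactly once in training}\}}{\#\{\text{facts in training}\}}.
\]
```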
Findings & conclusions:
• Even with perfect training data, a model minimizing the standard cross‐entropy objective will still have a nonzero error rate (i.e., hallucinations) under realistic conditions.
• Benchmarks mostly use binary grading (correct/incorrect) and rarely give credit for abstention or expressed uncertainty. This creates a mismatch: models are incentivized to guess rather than admit uncertainty (a toy comparison of these incentives follows this list).
• To reduce hallucinations, the authors recommend changing evaluation practices: explicitly incorporate uncertainty or abstention options, penalize wrong guesses, and make confidence thresholds explicit.
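To make the incentive mismatch concrete, here is a small toy sketch (my own illustration, not code or a scoring rule taken from the paper) comparing the expected score of guessing versus abstaining under plain binary grading and under a confidence‐threshold scheme in the spirit of the authors’ recommendation, where wrong answers are penalized and “I don’t know” earns zero:

```python
# Toy illustration (not from the paper): expected score of "guess" vs.
# "abstain" as a function of the model's probability p of being correct.

def binary_grading(p):
    """Typical benchmark grading: 1 point if correct, 0 if wrong or abstaining."""
    expected_guess = p * 1.0 + (1.0 - p) * 0.0
    expected_abstain = 0.0
    return expected_guess, expected_abstain

def threshold_grading(p, t=0.75):
    """Illustrative confidence-target grading (an assumed scheme): 1 point if
    correct, -t/(1-t) points if wrong, 0 for abstaining. With this penalty,
    guessing only pays off in expectation when p > t."""
    penalty = t / (1.0 - t)
    expected_guess = p * 1.0 - (1.0 - p) * penalty
    expected_abstain = 0.0
    return expected_guess, expected_abstain

for p in (0.3, 0.6, 0.9):
    bg, _ = binary_grading(p)
    tg, _ = threshold_grading(p)
    print(f"p={p:.1f}  binary: guess={bg:+.2f} vs abstain=+0.00 | "
          f"threshold(t=0.75): guess={tg:+.2f} vs abstain=+0.00")

# Under binary grading, guessing has expected score p, which is positive
# whenever p > 0, so guessing always dominates abstaining -- exactly the
# mismatch described above. Under the threshold scheme, guessing beats
# abstaining only when p > t.
```

The specific t/(1−t) penalty is just one way to encode an explicit confidence target; the broader point from the paper is that the grading rule, not the model, decides whether “I don’t know” is ever the rational answer.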
To me, this paper is interesting because it reframes hallucinations not only as a technical shortcoming but partly as a consequence of how we reward model behaviour. It suggests that unless benchmarks and evaluations change, models will keep learning to hallucinate, because guessing beats “I don’t know” under current setups. This has implications for any real‐world deployment where confidently wrong answers can cause harm – health, legal, and scientific contexts, among others.
How should we redesign benchmarks or production settings to better value uncertainty without sacrificing usefulness? Do you think users would accept more “I don’t know” responses in exchange for fewer wrong confident ones?