Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Hallucinations in LLMs: Can You Even Measure the Problem?

18 min read · Jan 13, 2025


A scientist carrying a meter that can measure hallucination
Can You Even Measure Hallucination?

1. Introduction: The Curious Case of the Runaway Chatbot

Imagine this: You’re sipping your cardamom tea, about to dive into a peaceful evening of kickboxing training videos, when your friendly neighborhood chatbot suddenly claims to have invented a perpetual motion machine. No, not a new genre of music — a literal perpetual motion machine. “It’s powered by quantum banana peels,” it insists confidently.

You blink. “Wait, what?!”

Welcome to the delightful yet exasperating world of hallucinations in Large Language Models (LLMs). These machines, trained on vast amounts of data, are brilliant at spinning coherent sentences. But every now and then, they lose the plot entirely — like that one friend who’s convinced they can run a marathon after two glasses of wine.

What Are Hallucinations in LLMs?

For the uninitiated, hallucinations in LLMs occur when these models generate content that is not grounded in reality or the input provided. Picture a chef trying to make a soufflé out of thin air — it’s bound to collapse. Similarly, LLMs sometimes fabricate facts, invent relationships, or concoct information that simply doesn’t exist.

From incorrect outputs in machine translation to absurd responses in chatbots, hallucinations have a way of sneaking into the output and making everyone question reality. The challenge? Identifying when this happens and finding ways to manage it effectively.

Why Does Detection Matter?

Let’s be real: A chatbot talking about banana-powered perpetual motion machines might be entertaining, but in critical applications — like healthcare, law, or science — it’s dangerous. Imagine relying on an AI for legal advice, only to discover it confidently quoted a law from the “Kingdom of Westeros.”

Detecting hallucinations isn’t just a technical exercise; it’s a necessity. And yet, measuring hallucinations is like herding cats with Wi-Fi — difficult, unpredictable, and occasionally hilarious.

The RoI Dilemma

Now comes the kicker: How do we quantify the effectiveness of hallucination management? If you’re pouring resources into detection methods, what’s the return on investment (RoI)? Are we even tracking the right metrics?

Welcome to the Hallucination RAG Race — where “RAG” stands for Retrieval-Augmented Generation, a fancy term for making LLMs less prone to daydreaming.

In this blog, we’ll dive deep into the art and science of hallucination detection, sprinkled with a healthy dose of humor, analogies, and real-world examples. Brace yourself — this is going to be an exhilarating ride.

AI robot surrounded by nonsensical inventions and confused humans.
When your chatbot trades logic for daydreams.

2. The Hallucination Spectrum: From Oops to Outrageous

Hallucinations in LLMs aren’t a one-size-fits-all problem. They come in a spectrum, ranging from mildly amusing to downright catastrophic. Think of it as the difference between your friend misremembering a movie quote and confidently telling everyone a velociraptor lives in their backyard.

The Many Flavors of Hallucination

Here’s a rundown of the hallucination types you’re likely to encounter, served with a generous dollop of humor:

  1. The Nitpicky Nerd (Word-Level Hallucination)
    Example: The model writes, “Albert Einstein was born in 1920,” when he was actually born in 1879.
    Severity: Low. It’s like that one friend who insists it’s “Star Wars: Episode IV,” not “the first Star Wars movie.”
  2. The Overconfident Liar (Sentence-Level Hallucination)
    Example: “Penguins are the only birds capable of flight at sub-zero temperatures.”
    Severity: Medium. This is where your chatbot starts sounding like the lovechild of Wikipedia and conspiracy forums.
  3. The Full-On Fantasist (Document-Level Hallucination)
    Example: A generated academic paper on “Quantum Unicorn Energy” complete with fake citations.
    Severity: High. Now we’re in “what alternate dimension did this model train on?” territory.
  4. The Mischievous Mixer (Cross-Domain Hallucination)
    Example: “The Battle of Hastings was won using Python scripts in 1066.”
    Severity: Depends. Funny in trivia night, but disastrous if you’re building a historical timeline.
  5. The Chameleon (Contextual Hallucination)
    Example: A chatbot discussing “masala tea” suddenly claims, “It’s a popular delicacy among Martians.”
    Severity: Variable. It depends on whether you find Martians believable.

Why Are Hallucinations So Diverse?

The diversity of hallucinations stems from the very nature of how LLMs work. These models stitch together responses based on probabilities derived from their training data. If the data is noisy, sparse, or contradictory, hallucinations emerge.

To make matters worse, LLMs often fall into the trap of “overconfidence bias” — a technical term for confidently delivering nonsense, much like that one classmate who never did the reading but always raised their hand in class.

A robot juggling books representing accurate and inaccurate information.
When an AI tries to decide what’s real and what’s ridiculous.

The Challenge of Detection: Who’s Catching the Hallucination?

Identifying these diverse hallucinations isn’t easy. For instance:

  • Word-Level Errors can slip past because they look plausible.
  • Sentence-Level Fabrications might fool you if you’re not an expert in the topic.
  • Document-Level Inventions can overwhelm you with sheer detail, like a sci-fi novel on steroids.

To make it worse, LLMs don’t come with built-in mechanisms to shout, “Wait, I made that up!” They generate responses based on patterns, and they’re not particularly good at second-guessing themselves.

Pro Tip: Humor in Detection: Imagine using a consistent nonsense test to catch hallucinations. Feed your model a random, absurd prompt like, “Explain how unicorns contribute to blockchain security.” If it responds with a straight face, you’ve got yourself a hallucinator.

This spectrum of hallucinations helps us understand why detection methods must be versatile, robust, and as vigilant as a cat watching a laser pointer. Next up, we’ll dive into how the pros (and the hilariously overambitious amateurs) tackle detection.

3. Detection Methods: From Clueless to Clever

Detecting hallucinations in LLMs is a bit like catching your dog mid-act when they’ve raided the snack cabinet — challenging but necessary. The good news? The AI community has developed a toolkit of methods to help spot these slip-ups. The bad news? Some methods are like trying to find a needle in a haystack using chopsticks.

Let’s dive into these techniques, ranked from “Hey, that’s clever!” to “Why didn’t I think of this?”

1. Predictive Probability (PP): The Basic Sniff Test

When LLMs generate text, they assign probabilities to every token (word, punctuation, etc.) based on their training. Predictive Probability says, “Hey, if the model thinks a token is unlikely, it’s probably hallucinating.”

  • How It Works: PP looks at token probabilities and flags anything with unusually low confidence.
  • Pros: Simple and quick — like checking if the milk smells off before drinking.
  • Cons: Sometimes the model is overconfident about nonsense, like saying, “Unicorns are certified blockchain auditors!” with a straight face.

Example: Imagine a chatbot saying, “Napoleon discovered America in 1492.” A low PP score for “America” (an unlikely continuation of “Napoleon discovered”) would raise a red flag.
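
To make this concrete, here is a minimal sketch of token-level probability screening using the Hugging Face transformers library, with GPT-2 as a stand-in model; the probability threshold is purely illustrative and would need tuning in a real detector.

```python
# A minimal sketch of predictive-probability flagging (GPT-2 as a stand-in model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def flag_low_confidence_tokens(text: str, threshold: float = 0.01):
    """Return (token, probability) pairs the model found surprisingly unlikely."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # shape: (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, :-1], dim=-1)    # prediction for each next token
    targets = inputs["input_ids"][0, 1:]             # the tokens that actually follow
    token_probs = probs[torch.arange(len(targets)), targets]
    return [
        (tokenizer.decode(t), p.item())
        for t, p in zip(targets, token_probs)
        if p.item() < threshold
    ]

print(flag_low_confidence_tokens("Napoleon discovered America in 1492."))
```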

2. SelfCheckGPT: The AI Plays Judge, Jury, and Executioner

This method makes the LLM grade its own homework. It generates multiple outputs for the same input and checks for consistency.

  • How It Works: The model compares its answers to see if they align. If there’s a wild variation, one (or all) of the outputs might be hallucinated.
  • Pros: No external data needed — just the model having a stern chat with itself.
  • Cons: It’s like asking a compulsive liar to fact-check their own autobiography.

Example: Input: “What’s the capital of Mars?”

  • Output 1: “Olympus Mons.”
  • Output 2: “Mars doesn’t have a capital, silly!”
  • Output 3: “Redville.”
    Clearly, this model needs an intervention.
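
In the same spirit, here is a minimal consistency-check sketch. The `generate(prompt)` callable is a hypothetical stand-in for sampling one response from your LLM, and the word-overlap score is a deliberately simple proxy for the learned consistency scorers used in SelfCheckGPT.

```python
# A minimal self-consistency sketch: sample several answers and score agreement.
from itertools import combinations

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def consistency_score(prompt: str, generate, n_samples: int = 5) -> float:
    """Average pairwise overlap across sampled answers; low scores are suspicious."""
    samples = [generate(prompt) for _ in range(n_samples)]
    pairs = list(combinations(samples, 2))
    return sum(word_overlap(a, b) for a, b in pairs) / len(pairs)
```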

3. Internal State Analysis: Peeking Under the Hood

This method involves analyzing the LLM’s inner workings — its attention patterns, embeddings, and other mysterious matrix gymnastics.

  • How It Works: By monitoring internal processes, researchers can identify when the model goes off-script.
  • Pros: It’s like checking the engine light when your car makes a weird noise.
  • Cons: Requires a Ph.D. in “Matrix Debugging Sorcery” to implement.

4. Fact Verification: The Gold Standard

If the model claims something, why not double-check it against external sources?

  • How It Works: Cross-reference the model’s output with reliable databases or knowledge graphs. If the claim doesn’t hold up, it’s flagged as a hallucination.
  • Pros: Reliable, like having a librarian fact-check your essay.
  • Cons: Slow and resource-heavy, like waiting for that librarian to alphabetize the entire reference section first.

Example:
Claim: “The Eiffel Tower is made of marshmallows.”
Fact Check: [Consults database] Nope, it’s definitely iron.
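
A toy version of this check might look like the sketch below; the reference table is an illustrative stand-in for whatever database, search API, or knowledge graph you actually query.

```python
# A minimal fact-check sketch against a toy reference table; all entries are illustrative.
REFERENCE_FACTS = {
    ("Eiffel Tower", "material"): "iron",
    ("Eiffel Tower", "height_ft"): "1083",
}

def verify_claim(entity: str, attribute: str, claimed_value: str) -> str:
    """Compare a claimed value with the reference; 'unknown' if we have no record."""
    actual = REFERENCE_FACTS.get((entity, attribute))
    if actual is None:
        return "unknown"
    return "supported" if claimed_value.lower() == actual.lower() else "hallucination"

print(verify_claim("Eiffel Tower", "material", "marshmallows"))  # -> hallucination
```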

5. Sampling-Based Methods: Playing the Lottery

Why generate one output when you can generate five and pick the best one?

  • How It Works: The model creates multiple responses, and the system selects the one that seems the most factual or consistent.
  • Pros: Encourages the model to “think twice” before speaking.
  • Cons: Computationally expensive — like making five cakes just to eat one.
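
A minimal selection sketch, assuming a hypothetical `generate(prompt)` sampler: draw several answers and keep the one that agrees most with its siblings.

```python
# A minimal "sample and pick the most consistent" sketch; overlap is a simple proxy.
def most_consistent_response(prompt: str, generate, n: int = 5) -> str:
    """Sample n answers and keep the one that agrees most with the others."""
    samples = [generate(prompt) for _ in range(n)]

    def overlap(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / max(len(sa | sb), 1)

    return max(samples, key=lambda s: sum(overlap(s, o) for o in samples if o is not s))
```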

6. Attention Matrix Analysis: Follow the Laser Pointer

LLMs use “attention” to decide which parts of the input matter most for generating a response. This method checks if the attention patterns make sense.

  • How It Works: If the model’s attention wanders to irrelevant tokens, it’s probably hallucinating.
  • Pros: Great for catching “off-topic” errors.
  • Cons: Imagine reading your dog’s mind while they’re chasing a squirrel — yeah, it’s that complex.
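
For the curious, here is a minimal sketch of pulling attention maps out of a Hugging Face model (BERT as a stand-in); deciding when attention has "wandered" is left as a heuristic, because published approaches differ.

```python
# A minimal sketch of inspecting attention maps; the summary printed is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mdl = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

enc = tok("The Battle of Hastings was fought in 1066.", return_tensors="pt")
with torch.no_grad():
    attentions = mdl(**enc).attentions             # tuple: one (1, heads, seq, seq) per layer
avg = torch.stack(attentions).mean(dim=(0, 2))[0]  # average over layers and heads
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
for i, t in enumerate(tokens):
    top = avg[i].argmax().item()
    print(f"{t:>12} attends most to {tokens[top]}")
```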

7. Human Evaluation: The Gold-But-Costly Standard

When all else fails, call in the humans.

  • How It Works: Experts manually check for hallucinations in the model’s output.
  • Pros: Humans can catch nuance and context better than machines.
  • Cons: Time-consuming, expensive, and prone to human bias.

A Thoughtful Detour: Why Metrics Matter

Many of these methods rely on metrics like Recall, Precision, and K-Precision to decide if the hallucination detector is actually working. Imagine these as the report cards for detection methods, but with less grade inflation and more existential crises.

The Power of Layered Detection

Think of hallucination detection like airport security:

  1. Token-level screening (PP and entropy).
  2. Contextual checks (attention and consistency).
  3. Cross-referencing with a no-fly list (external verification).

Each layer catches what the others might miss.
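
Stitched together, the layers might look like the sketch below. The three callables are hypothetical wrappers around the earlier sketches (token confidence, self-consistency, external verification), and "fail if any layer objects" is just one simple policy.

```python
# A minimal sketch of layered hallucination screening with three hypothetical checkers.
def passes_layered_check(prompt: str, response: str,
                         token_confidence_ok, consistency_ok, claim_supported) -> bool:
    """Return True only if the response clears all three layers."""
    if not token_confidence_ok(response):   # layer 1: token-level screening
        return False
    if not consistency_ok(prompt):          # layer 2: consistency across samples
        return False
    if not claim_supported(response):       # layer 3: cross-referencing external facts
        return False
    return True
```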

Detective-themed AI analyzing text for errors.
Detecting hallucinations: part science, part sleuthing.

Detection is (just) half the battle.

Up next, we’ll explore the murky waters of Metrics and Missteps — how we evaluate detection methods and why getting it right is so tricky.

4. Metrics and Missteps: The Hallucination Scoreboard

So, you’ve built a hallucination detector, and it’s confidently flagging nonsense like “Napoleon invented Wi-Fi.” But how do you know it’s working? That’s where metrics come in — those little report cards that tell us if a detection method is actually doing its job.

But here’s the catch: metrics are tricky. They’re like fitness trackers: they might show you took 10,000 steps today, but they won’t tell you those steps were mostly trips to the fridge. Let’s explore the metrics that power hallucination detection and the pitfalls that make them a double-edged sword.

The Metric Hall of Fame

  1. Recall: The Overachiever
    What It Does: Measures how many actual hallucinations the system caught.
    Why It’s Useful: High recall means your detector has a nose for nonsense.
    Pitfall: It might flag every slightly unusual phrase as a hallucination. Cue false alarms like “The Eiffel Tower is tall.”
    Analogy: Recall is like your paranoid friend who assumes every text from an unknown number is a phishing attempt.
  2. Precision: The Perfectionist
    What It Does: Measures how many flagged outputs were actually hallucinations.
    Why It’s Useful: High precision means your detector isn’t crying wolf.
    Pitfall: Too much focus on precision might mean missing subtle hallucinations.
    Analogy: Precision is the friend who never answers their phone because they only want “important calls.”
  3. K-Precision: The Faithful Sidekick
    What It Does: Tracks how well the output aligns with a knowledge source.
    Why It’s Useful: It’s great for evaluating faithfulness in tasks like summarization.
    Pitfall: Relies on the quality of the external knowledge source, which could be outdated or incomplete.
    Analogy: K-Precision is like fact-checking with your grandma’s 1970s encyclopedias.
  4. F1 Score: The Peacekeeper
    What It Does: Balances Recall and Precision into one number.
    Why It’s Useful: Provides a single score to evaluate overall performance.
    Pitfall: If your balance is off, you might still be missing critical issues.
    Analogy: F1 Score is like trying to diet by eating salads but sneaking in donuts — looks balanced, but not always effective.
  5. Human Judgments: The Gold Standard
    What It Does: Relies on humans to evaluate whether a response makes sense.
    Why It’s Useful: Humans catch subtle errors machines miss.
    Pitfall: It’s time-consuming, subjective, and expensive.
    Analogy: This is the Michelin Guide of metrics — prestigious but not scalable.
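
For the quantitative trio above, the bookkeeping is straightforward. Here is a minimal sketch that scores a detector against gold hallucination labels; the example labels at the end are made up.

```python
# A minimal scoring sketch: Precision, Recall, and F1 computed by hand.
def detection_scores(gold: list[bool], predicted: list[bool]) -> dict:
    tp = sum(g and p for g, p in zip(gold, predicted))          # true positives
    fp = sum((not g) and p for g, p in zip(gold, predicted))    # false alarms
    fn = sum(g and (not p) for g, p in zip(gold, predicted))    # missed hallucinations
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(detection_scores(gold=[True, False, True, True], predicted=[True, True, False, True]))
```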

Metric Missteps: When the Numbers Lie

Metrics sound great in theory, but they come with a heap of issues:

  1. Overfitting to Metrics:
    Models optimized for metrics like ROUGE or BLEU often generate outputs designed to “game the system.” They might repeat phrases or overuse safe answers.
    Example: A summarization model that keeps saying, “In conclusion, the author concludes.”
  2. One-Size-Fits-All Fallacy:
    Different tasks need different metrics. What works for summarization might flop in dialogue systems.
    Example: Precision might be less critical in creative writing, but it’s life-or-death in medical diagnostics.
  3. Ignoring Nuance:
    Metrics like Recall and Precision are binary — they care only about whether something is a hallucination, not how severe or subtle it is.
    Example: A model saying “Mars has three moons” (off by one) versus “Mars has ten moons” (wildly false) could get the same score.
  4. Human Bias:
    Human evaluations, while gold-standard, can vary based on expertise, mood, or whether the evaluator had their coffee that day.

How to Balance the Scoreboard

To truly measure success, you need a layered approach:

  1. Quantitative Metrics: Use Recall, Precision, and F1 for quick assessments.
  2. Qualitative Checks: Bring in human evaluators for nuanced, real-world insights.
  3. Task-Specific Metrics: Tailor metrics to the application. For instance:
    — K-Precision for summarization.
    — Entailment scores for factual tasks.
    — Dialogue consistency for chatbots.
  4. Feedback Loops: Regularly evaluate and update your metrics to match evolving model behavior.

Pro Tip: Keep Metrics Fun: Inject humor into your testing. Ask your model a few absurd questions like, “Who won the Nobel Prize for quantum gardening?” If your detection method misses the hallucination, it’s time for a tune-up.

AI referee catching errors on a scoreboard.
Metrics: The whistleblowers of hallucination detection.

Detection methods and metrics are powerful, but they’re only part of the story.

Up next, let’s tackle the elephant in the room: Why Detection Is Not Enough and how mitigation strategies close the loop.

5. Why Detection Is Not Enough: The Halfway House of Hallucination Management

Detection is like spotting a stain on your favorite shirt — it’s important, but identifying it doesn’t solve the problem. You still need to figure out how to clean it without making it worse. In the world of LLMs, that’s where mitigation comes in.

Hallucination detection alone leaves us in a limbo of awareness: “Yes, this is a hallucination. No, I don’t know how to fix it.” To truly manage hallucinations, we need strategies that tackle the root causes and prevent them from derailing the output altogether.

When Detection Falls Short

Let’s take a look at why detection alone isn’t enough to win the hallucination battle:

  1. Models Don’t Self-Correct
    — Detection can highlight the hallucination, but LLMs don’t automatically fix their outputs. It’s like saying, “This ship has a hole,” and then doing nothing while it sinks.
    — Without active mitigation, you might end up with a model that simply produces flagged nonsense.
  2. What About Edge Cases?
    Detection methods often struggle with subtle hallucinations, such as when a model provides an answer that’s almost correct but subtly off (e.g., “The Moon’s diameter is 4,000 km” instead of the actual 3,474 km).
    — These “near misses” require proactive intervention to refine the response.
  3. Context Matters
    A hallucination in one context could be acceptable in another. For example, creative storytelling allows for imaginative outputs, but scientific research doesn’t. Mitigation strategies need to adapt to the use case.
  4. Models Need Guardrails, Not Just Mirrors
    Detection is reactive, but mitigation is proactive. If we only rely on detection, we’re forever stuck in a cycle of reacting to errors instead of preventing them.

The Case for Mitigation

Mitigation strategies aim to minimize hallucinations during generation. It’s like installing guardrails on a winding mountain road — not perfect, but far better than letting cars plunge into the abyss.

Here are some key approaches to mitigation:

1. Retrieval-Augmented Generation (RAG): The Reliable Wingman

RAG integrates external knowledge sources, like databases or knowledge graphs, to ground the model’s responses.

  • How It Works: Before generating a response, the model retrieves relevant information from trusted sources to enhance its output.
  • Why It Helps: It ensures the model isn’t relying solely on its training data, which can be incomplete or outdated.
  • Downside: If the external source is flawed, the model inherits those flaws. Garbage in, garbage out.

Example:
Without RAG: “The Eiffel Tower is 5,000 feet tall.”
With RAG: “The Eiffel Tower is 1,083 feet tall, according to the official Paris tourism website.”
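
A minimal retrieve-then-generate sketch is shown below. Both `search_index` and `llm_generate` are hypothetical stand-ins for your retriever (say, a vector store) and your LLM call; the prompt wording is illustrative.

```python
# A minimal RAG sketch: ground the answer in retrieved passages before generating.
def answer_with_rag(question: str, search_index, llm_generate, k: int = 3) -> str:
    """Retrieve top-k passages and instruct the model to stay within them."""
    passages = search_index(question, k=k)            # assumed: returns list[str]
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_generate(prompt)                        # assumed: returns str
```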

2. Prompt Engineering: Training the Model to Stay in Its Lane

Sometimes, hallucinations happen because the model misunderstands the task or overextends its creativity. A well-crafted prompt can prevent this.

  • How It Works: You guide the model with clear, specific instructions.
  • Why It Helps: Reduces ambiguity and keeps the model focused on relevant information.
  • Downside: Crafting the perfect prompt can feel like solving a Rubik’s Cube blindfolded.

Example:
Vague Prompt: “Tell me about quantum physics.”
Better Prompt: “Explain the basics of quantum mechanics in simple terms, using only verified scientific concepts.”
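
If you template this, even a tiny helper keeps the guardrails consistent across calls; the wording below is illustrative, not a magic incantation.

```python
# A minimal prompt-template sketch with built-in guardrails.
def grounded_prompt(topic: str) -> str:
    return (
        f"Explain the basics of {topic} in simple terms. "
        "Use only well-established, verified concepts, and explicitly say "
        "'I don't know' for anything you are not sure about."
    )

print(grounded_prompt("quantum mechanics"))
```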

3. Self-Refinement: The AI Version of Second-Guessing

In this approach, the model reviews its output and makes adjustments before presenting it. Think of it as the AI equivalent of proofreading.

  • How It Works: After generating a response, the model analyzes its own output for inconsistencies or inaccuracies and refines it.
  • Why It Helps: Encourages the model to think twice before committing to an answer.
  • Downside: Can be computationally expensive.

Example:
Initial Output: “Napoleon Bonaparte won the Battle of Waterloo.”
Refined Output: “Actually, Napoleon Bonaparte lost the Battle of Waterloo in 1815.”
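
A minimal refinement loop, again assuming a hypothetical `llm_generate(prompt)` call, might look like this:

```python
# A minimal self-refinement sketch: draft, critique, and correct in one pass or more.
def generate_with_refinement(question: str, llm_generate, rounds: int = 1) -> str:
    """Draft an answer, then ask the model to critique and correct its own draft."""
    answer = llm_generate(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        critique_prompt = (
            f"Question: {question}\nDraft answer: {answer}\n"
            "List any factual errors in the draft, then provide a corrected answer."
        )
        answer = llm_generate(critique_prompt)
    return answer
```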

4. Knowledge Graphs: The Fact-Checkers of AI

Knowledge grounding ties the model’s responses to structured data, ensuring consistency with established facts.

  • How It Works: The model incorporates knowledge graphs into its reasoning process, using them to validate claims.
  • Why It Helps: Provides multi-hop reasoning and cross-checking capabilities.
  • Downside: Requires high-quality, domain-specific knowledge graphs.
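
As a toy illustration, here is triple validation against a small graph built with networkx; the graph contents and the claim format are illustrative assumptions, not a production schema.

```python
# A minimal sketch of validating (subject, relation, object) claims against a toy graph.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("Eiffel Tower", "iron", key="made_of")
kg.add_edge("Eiffel Tower", "Paris", key="located_in")

def claim_supported(subject: str, relation: str, obj: str) -> bool:
    """True if the claimed triple exists in the knowledge graph."""
    return kg.has_edge(subject, obj, key=relation)

print(claim_supported("Eiffel Tower", "made_of", "marshmallows"))  # False
print(claim_supported("Eiffel Tower", "made_of", "iron"))          # True
```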

5. Confident Decoding: Keeping It Real

This method adjusts the decoding process to avoid low-probability outputs, which are more likely to be hallucinated.

  • How It Works: During response generation, the model prioritizes high-confidence outputs and avoids risky guesses.
  • Why It Helps: Reduces the likelihood of “making stuff up.”
  • Downside: May limit creativity or novelty in certain applications.
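
In practice this often boils down to conservative decoding settings. Here is a minimal sketch using Hugging Face `generate`, with GPT-2 as a stand-in; the parameter values are illustrative, not recommendations.

```python
# A minimal sketch of conservative decoding: favor high-probability tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
mdl = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The Eiffel Tower is", return_tensors="pt")
outputs = mdl.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    temperature=0.3,             # sharpen the distribution toward likely tokens
    top_p=0.8,                   # nucleus sampling: drop the low-probability tail
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(outputs[0], skip_special_tokens=True))
```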

A Balancing Act: The RoI of Mitigation

Investing in mitigation strategies can feel like a game of tug-of-war. You want to minimize hallucinations without stifling the model’s ability to generate insightful or creative responses.

Here’s how to keep your investments balanced:

  1. Track Impact: Measure how much mitigation reduces errors in critical tasks.
  2. Adjust for Context: Tailor mitigation strategies to your application — whether it’s legal advice or writing poetry about quantum unicorns.
  3. Combine Forces: Use multiple approaches (e.g., RAG + prompt engineering) to maximize effectiveness.

A robot balancing between detection and mitigation strategies.
Detection tells you what’s wrong. Mitigation stops it from happening again.

Detection may tell you there’s a problem, but mitigation ensures you have a solution.

In the next section, we’ll tie it all together by exploring how to measure success and track the RoI in this hallucination management RAG race.

6. Measuring Madness: Tracking the RoI in the Hallucination (Management) RAG Race

Let’s face it: investing in hallucination management is a lot like buying a fancy fitness tracker. You need to know two things:

  1. Is it working? (Are hallucinations being caught and mitigated?)
  2. Is it worth it? (Are you getting measurable results for your efforts?)

The Return on Investment (RoI) for hallucination management is more than a financial metric — it’s a way to assess the tangible and intangible value of improving LLM reliability. In this final stretch, let’s explore how to measure success while keeping the process efficient, scalable, and, yes, a little fun.

Step 1: Define What Success Looks Like

Before you measure anything, you need to decide what “winning” the hallucination race means for you. Success metrics will depend on your use case:

  • For Chatbots: Fewer ridiculous responses like, “Your pizza will arrive via submarine.”
  • For Summarization Models: Outputs that faithfully stick to the source material.
  • For High-Stakes Domains (e.g., healthcare): Zero tolerance for hallucinated information — accuracy is non-negotiable.

Think of it like training for a kickboxing tournament. Are you aiming to win on points, or go for the knockout?

Step 2: Combine Metrics for a Full Picture

No single metric can capture the complexity of hallucination detection and mitigation. Here’s how you can mix and match:

  1. Core Metrics:
    Precision: High precision ensures flagged hallucinations are real issues.
    Recall: High recall ensures you’re catching as many hallucinations as possible.
    F1 Score: The happy medium between precision and recall.
  2. Task-Specific Metrics:
    Faithfulness: Use metrics like K-Precision for tasks where grounding matters.
    Consistency: Evaluate how consistent outputs are across multiple runs.
  3. Outcome Metrics:
    Error Reduction: Track the percentage drop in hallucinated outputs over time.
    User Satisfaction: For chatbots, measure how often users find responses useful or accurate.
  4. Efficiency Metrics:
    Latency: Ensure detection and mitigation methods don’t slow down response times too much.
    Cost: Measure compute and resource expenses — because even the fanciest solution isn’t worth it if it breaks the bank.

Step 3: Calculate RoI with a Clear Framework

To calculate RoI, you need to weigh the benefits of hallucination management against the costs of implementing it. Here’s a simplified framework:

  1. Quantify Benefits:
    Reduced Errors: Measure how much hallucination detection and mitigation reduce the rate of errors in outputs.
    Improved Trust: In customer-facing applications, track increased user trust or engagement.
    Business Impact: For enterprise tools, measure how improved accuracy drives revenue or reduces operational risks.
  2. Track Costs:
    Development Time: Time spent implementing and tuning detection/mitigation strategies.
    Compute Resources: GPU usage, API costs, or infrastructure scaling.
    Opportunity Costs: Time and effort diverted from other priorities.
  3. RoI Formula:
RoI = (Tangible + Intangible Benefits - Total Costs) / Total Costs

If your hallucination management efforts save your team time, improve customer retention, or prevent costly errors, you’re in the green.
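
Plugging made-up placeholder numbers into the formula makes the bookkeeping concrete; every figure below is illustrative.

```python
# A minimal, illustrative RoI calculation following the formula above.
tangible_benefits = 120_000    # e.g., support hours saved, costly errors prevented
intangible_benefits = 30_000   # e.g., estimated value of improved user trust
total_costs = 90_000           # development time + compute + opportunity cost

roi = (tangible_benefits + intangible_benefits - total_costs) / total_costs
print(f"RoI: {roi:.0%}")       # -> RoI: 67%
```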

Step 4: Iterate Based on Feedback

RoI isn’t a static number — it evolves as your model improves. Build a feedback loop to regularly evaluate:

  1. How many hallucinations are still slipping through?
  2. Which detection or mitigation methods are underperforming?
  3. Is your investment scaling with results?

Gamify Your RoI Goals

Set fun, measurable milestones for your team. For example:

  • “Reduce hallucinations by 20% this quarter — reward: Masala tea for everyone!”
  • “Achieve 95% Precision — reward: A team movie night featuring The Matrix (because, AI).”

Final Thought: The Hallucination Balancing Act

Managing hallucinations isn’t about achieving perfection — it’s about progress. The RoI of detection and mitigation lies not only in making LLMs smarter but also in ensuring they’re reliable tools for real-world applications.

Like a kickboxing champion perfecting their footwork, it takes patience, persistence, and the occasional sip of cardamom tea to win this race. But when you see your model confidently sticking to the facts — or gracefully admitting when it doesn’t know — you’ll know it was all worth it.

An AI crossing the finish line with a trophy, symbolizing success in hallucination management.
The RoI isn’t just in detection — it’s in earning trust, one accurate answer at a time.

Epilogue: Your Hallucination Survival Kit

Congratulations! You’ve now got a toolbox full of detection methods, mitigation strategies, and metrics to measure success. Whether you’re battling banana-powered perpetual motion claims or just trying to make your chatbot less absurd, remember: every small step in managing hallucinations adds up to a giant leap for reliable AI.

References

1. Hallucination in LLMs: Detection and Analysis

  • Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661.
  • Zhou, C., Neubig, G., Gu, J., Diab, M., Guzman, P., Zettlemoyer, L., & Ghazvininejad, M. (2020). Detecting hallucinated content in conditional neural sequence generation. arXiv preprint arXiv:2011.02593.

2. Detection Methods

  • Manakul, P., Liusie, A., & Gales, M. J. (2023). Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
  • Ainsworth, E., Wycliffe, J., & Winslow, F. (2024). Reducing contextual hallucinations in large language models through attention map optimization. Authorea Preprints.
  • Xu, W., Agrawal, S., Briakou, E., Martindale, M. J., & Carpuat, M. (2023). Understanding and detecting hallucinations in neural machine translation via model introspection. Transactions of the Association for Computational Linguistics, 11, 546–564.
  • Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661.
  • Cao, M., Dong, Y., & Cheung, J. C. K. (2021). Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. arXiv preprint arXiv:2109.09784.
  • Scialom, T., Dray, P.-A., Lamprier, S., Piwowarski, B., & Staiano, J. (2021). QuestEval: Summarization asks for fact-based evaluation. arXiv preprint arXiv:2103.12693.
  • Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y. (2019). Defending against neural fake news. Advances in neural information processing systems, 32.
  • Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2019). The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  • Zhang, T., Qiu, L., Guo, Q., Deng, C., Zhang, Y., Zhang, Z., … & Fu, L. (2023). Enhancing uncertainty-based hallucination detection with stronger focus. arXiv preprint arXiv:2311.13230.

3. Evaluation Metrics

  • Lin, C. Y. (2004, July). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74–81).
  • Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
  • Luo, W., Shen, T., Li, W., Peng, G., Xuan, R., Wang, H., & Yang, X. (2024). HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation. arXiv preprint arXiv:2406.07070.
  • Wei, J., Yao, Y., Ton, J. F., Guo, H., Estornell, A., & Liu, Y. (2024). Measuring and reducing LLM hallucination without gold-standard answers via expertise-weighting. arXiv preprint arXiv:2402.10412.

Disclaimers and Disclosures

This article combines the theoretical insights of leading researchers with practical examples and my own opinionated exploration of the topic. It may not represent the views or claims of my present or past organizations and their products, or of my other associations.

Use of AI Assistance: In preparing this article, AI assistance was used to generate and refine the images and for styling and linguistic enhancement of parts of the content.

Follow me on: | Medium | LinkedIn | SubStack | X | YouTube |

Written by Mohit Sewak, Ph.D.

Mohit Sewak, a PhD in AI and Security, is a leading AI voice with 24+ patents, 2 Books, and key roles at Google, NVIDIA and Microsoft. LinkedIn: dub.sh/dr-ms
