RAG Isn’t Immune to LLM Hallucination
How to measure how much of your RAG’s output is correct
I recently started to favor Graph RAGs over vector store-backed ones.
No offense to vector databases; they work fantastically in most cases. The caveat is that the text needs to mention your query terms explicitly for the retriever to pull the correct context.
We have workarounds for that, and I’ve covered a few in my previous posts.
For instance, ColBERT and multi-representation indexing are retrieval techniques worth considering when building RAG apps.
Graph RAGs suffer less from retrieval issues (I didn’t say they don’t suffer at all). Whenever retrieval requires some reasoning, GraphRAG performs extraordinarily well.
Providing relevant context addresses a key problem in LLM-based applications: hallucination. However, it does not eliminate hallucinations altogether.
When you can’t fix something, you measure it. And that’s the focus of this post. In other words, how do we evaluate RAG apps?
But before that, why do LLMs lie in the first place?
Why do LLMs hallucinate (even RAGs)?
Language models sometimes lie, all right, and sometimes they are simply inaccurate. This comes down to two main reasons.
The first is that the LLM doesn’t have enough context to answer. This is why retrieval-augmented generation (RAG) came into existence: RAG supplies the LLM with context it hasn’t seen during training.
Some models answer well within the provided context, and others don’t. For instance, Llama 3.1 8B generates fine answers when you give it context, while DistilBERT doesn’t.
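To make the idea concrete, here is a minimal sketch of what “providing context” looks like in code. It assumes an OpenAI-compatible client and a hypothetical `retrieve` function standing in for whatever retriever you use (a vector store, ColBERT, or a graph lookup); it is an illustration, not a specific setup from this post.

```python
# Minimal sketch of context-augmented generation.
# `retrieve` is a hypothetical placeholder; swap in your own retriever.
from openai import OpenAI

client = OpenAI()  # or point base_url at a local server hosting Llama 3.1 8B


def retrieve(question: str) -> list[str]:
    # Placeholder retriever: replace with a vector store, ColBERT, or graph query.
    return ["<passages your retriever considers relevant to the question>"]


def answer_with_context(question: str) -> str:
    # Stuff the retrieved passages into the prompt so the model answers
    # from the provided context instead of relying only on its training data.
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruction-tuned chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Even with this pattern, the model can still assert things the context doesn’t support, which is exactly why we need to measure hallucination rather than assume retrieval has fixed it.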