How to prevent LLM hallucinations šŸ§  ā€” a 100% reliable solution šŸŽÆ

Juraj Bezdek
5 min read Ā· Sep 24, 2023


The problem of hallucination in Large Language Models (LLMs) is well known. Many articles attempt to explain it, but the reason is actually quite simple: LLMs are trained to provide responses. During their training, they are not rewarded for withholding an answer, nor for responding ā€œI donā€™t knowā€.

Their primary objective is to generate text, and that is exactly what they do.

Part of the issue comes from the way theyā€™re trained, especially in RLHF (Reinforcement Learning from Human Feedback) fine-tuning. This training relies on the LLMā€™s internal knowledge and common knowledge. However, in most commercial applications, the real advantage comes from using information from external sources.

Providing a verified source of information as part of the prompt is widely recognized as a core principle in dealing with hallucination. This technique is commonly called RAG ā€” Retrieval-Augmented Generation.

Unfortunately, itā€™s not that simple

First of all, RAG has its own challenges:

  • How do we decide which bits of information from external sources to include?
  • How do we break down longer documents into smaller, easy-to-handle parts (ā€œchunkingā€)?

Also, RAG canā€™t help when there is simply no answer to be found. In fact, in such cases it might make things worse, because the model may produce an answer that seems really believable even though itā€™s completely made up.

RAG ā€” Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a complex topic that I wonā€™t delve into in great detail here, to keep things simple. Feel free to skip this section if you are already familiar with RAG.

Ingestion and Indexing:

  • Initially, the data must be converted from various formats such as PDF or HTML into clean, structured strings. Several packages are available to facilitate this data conversion process.
  • Next, selecting an appropriate chunking algorithm becomes crucial. The goal is to divide the entire document into smaller, manageable pieces. These chunks should strike a balance between containing enough information to be useful (to preserve context) and not being excessively long, which could overwhelm the final prompt and potentially exceed the modelā€™s context window.
  • Finally, the process moves on to indexing. Using a vector database for indexing is recommended. Many databases can automatically generate semantic embeddings, and even if they donā€™t, it is relatively straightforward to generate them yourself. Why opt for a database instead of just using FAISS? Just trust me on this: unless you are building a small weekend project, opt for a vector database, especially if you expect more than a sample of data, or if the data might change over time.
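To make these steps concrete, here is a minimal sketch of the ingestion pipeline. It assumes sentence-transformers for the embeddings and Chroma as the vector database; the stack, file name, and chunk sizes are illustrative choices, not a prescription.

```python
# Minimal ingestion-and-indexing sketch (illustrative stack: sentence-transformers
# for embeddings, Chroma as the vector database).
from sentence_transformers import SentenceTransformer
import chromadb

def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap to preserve some context."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

# 1) Clean text already extracted from PDF/HTML (the conversion itself is out of scope here).
document_text = open("my_document.txt").read()

# 2) Chunk the document into manageable pieces.
chunks = chunk_text(document_text)

# 3) Generate semantic embeddings and index them in the vector database.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks).tolist()

client = chromadb.Client()  # in-memory; use a persistent/hosted DB beyond weekend projects
collection = client.create_collection("docs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)
```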

Data Retrieval

  • A naive approach is to take the userā€™s message and find the most similar documents using semantic search. And it worksā€¦ to an extent. However, you can achieve significantly better results with approaches like HyDE (Hypothetical Document Embeddings). The trick here is to let the LLM generate hypothetical or fake answers. Instead of using the original user question, you use these generated fake answers to perform the semantic search. The underlying assumption is that the document or chunk you are looking for will likely contain a sentence similar to one of these fake answers. Due to the nature of semantic search, this approach consistently yields superior results (a sketch combining HyDE with the cutoff described below follows after this list).
  • Itā€™s important to realize that semantic search will always provide results. The only thing you can be certain of is that the top results will be more relevant to the question than the lower-ranked ones.
  • You could also consider using ReRanking AI models to select the best results.
  • Once you have obtained the results, you need to set a limit. Semantic similarity/distance values are not comparable across queries: a similarity of 0.90 can be a great match in one case and a completely irrelevant one in another. The simplest approach is to cap the number of results, for example at the Top 10. Alternatively, you might consider a cutoff approach, where you keep only the results before the score experiences a significant drop. For example (0.98, 0.97, 0.968, [cutoff] 0.82 …)
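Here is a minimal sketch of this retrieval step, combining HyDE with the score-drop cutoff. It reuses the `embedder` and `collection` from the ingestion sketch above; the model name, prompt wording, and gap threshold are assumptions.

```python
# HyDE retrieval with a score-drop cutoff (sketch; reuses `embedder` and
# `collection` from the ingestion sketch above).
from openai import OpenAI

llm = OpenAI()

def retrieve(question: str, n_results: int = 10, max_gap: float = 0.15) -> list[str]:
    # 1) HyDE: let the LLM write a hypothetical (fake) answer to the question...
    hypothetical = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Write a short, plausible answer to: {question}"}],
    ).choices[0].message.content

    # 2) ...and search with the embedding of that fake answer instead of the question.
    query_embedding = embedder.encode([hypothetical]).tolist()
    hits = collection.query(query_embeddings=query_embedding, n_results=n_results)
    documents = hits["documents"][0]
    distances = hits["distances"][0]  # lower distance = more similar

    # 3) Cutoff: keep results only until the first significant jump in distance,
    #    since raw similarity scores are not comparable across queries.
    selected = documents[:1]
    for prev, curr, doc in zip(distances, distances[1:], documents[1:]):
        if curr - prev > max_gap:
            break
        selected.append(doc)
    return selected
```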

Answer generation

Now that we have our search results, we need to incorporate them into the final prompt. The simplest approach is to include the retrieved results in the system prompt to provide broader context. The rest of the messages are simply the conversation history plus the latest user message.
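In code, this assembly can look roughly like the sketch below; the prompt wording is illustrative, not an exact template.

```python
# Build the message list for the common RAG prompt: retrieved chunks go into
# the system message, followed by the conversation history and the latest message.
def build_messages(chunks: list[str], history: list[dict], user_message: str) -> list[dict]:
    context = "\n\n".join(chunks)
    system_prompt = (
        "Answer the user's question using the context below.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        *history,
        {"role": "user", "content": user_message},
    ]
```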

Common RAG prompting technique (www.promptwatch.io)

Unfortunately, in this case, the search didnā€™t yield any relevant data to answer the question. The necessary information simply wasnā€™t available. The response from the LLM is entirely hallucinated.

So how do we solve this?

Divide & Conquer

This is one of the techniques I introduced in my previous blog post, and the hallucination problem is a great case to demonstrate how to use it.

We will split the task of answering the question into two consecutive prompts:

  1. Ask the LLM to find the relevant information in the provided text that could be used to answer the question.
  2. Take the output from the first prompt and pass it to a second prompt that generates the answer.

Prompt 1 ā€” find the relevant information related to the question (www.promptwatch.io)

As we can see, the model now responds that the information we are looking for is not part of the text we provided.

Now we can pass this information further to the next prompt:

Prompt 2 ā€” compose the answer based on the findings (www.promptwatch.io)

And suddenly we are getting a truthful answer hereā€¦
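Put together, the whole divide-and-conquer flow looks roughly like this. The prompt wording and model name are my assumptions, not the exact prompts from the screenshots above.

```python
# Divide & conquer: two chained prompts instead of one direct "answer the question" prompt.
from openai import OpenAI

llm = OpenAI()

def ask(prompt: str) -> str:
    return llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def answer_with_divide_and_conquer(question: str, context: str) -> str:
    # Prompt 1: only find (ideally quote) the relevant information; allow "nothing found".
    findings = ask(
        "Quote the sentences from the text below that are relevant to the question. "
        "If the text contains nothing relevant, reply exactly 'NO RELEVANT INFORMATION'.\n\n"
        f"Text:\n{context}\n\nQuestion: {question}"
    )

    # Prompt 2: compose the answer strictly from the findings of prompt 1.
    return ask(
        "Using only the findings below, answer the question. If the findings say "
        "'NO RELEVANT INFORMATION', say that you don't know.\n\n"
        f"Findings:\n{findings}\n\nQuestion: {question}"
    )
```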

So why does this work?

LLMs are powerful, but they have limited attention spans. In the original example, we are essentially asking the LLM to answer a question. However, just providing it with retrieved context doesnā€™t guarantee that the information required to answer the question is actually present in that context.

The trick here lies in the fact that LLMs are much less likely to hallucinate text if we instruct them to find (preferably quote) the relevant information, as opposed to asking them to provide a direct response. This strategy is effective because of the inherent nature of LLMs; they operate as next-token predictors. In their training data, there is typically an answer following a question.

By breaking the task into two separate questions:

  1. Identify which sentences, if any, are relevant to the question.
  2. Utilize these sentences to compose the answer.

We disrupt the LLMā€™s usual pattern of operation and provide it with a singular, focused task.

Conclusion

Hallucinations have been a major issue, but as you can see, they can be solved. The drawback of this approach, however, is that the response latency roughly doubles, because we need to run two prompts instead of one.
