Effective Methods Against LLM Hallucination

Hooman Sedghamiz
7 min read · Aug 20, 2023
Hallucination is the term commonly used for when an LLM fabricates information that has no grounding (image generated by prompting Bing with "An image illustrating the hallucination effect in LLMs")

Large language models (LLMs) have proven to be valuable tools, but they often lack reliability. Many instances have surfaced where LLM-generated responses included false information. Specifically, when an LLM is queried about a topic it lacks knowledge of, such as recent events occurring after its training period, it may invent facts.

Hallucination is the term commonly used to describe this phenomenon. It refers to the tendency of LLMs to generate fabricated information that can be challenging to identify, even for humans. This issue poses a significant challenge when using LLMs in real-world applications. If you are considering developing a customer-facing system based on an LLM, providing users with false information could lead to catastrophic consequences [1].

In this article, I try to summarize the best practices and steps you could take to potentially minimize the risk of hallucination in LLMs.

TLDR (Summary)

Large language models (LLMs) often generate false information, a phenomenon known as hallucination. This poses a significant challenge when using LLMs in real-world applications. To minimize the risk of hallucination, several methods can be employed:

  1. Adjusting LLM generation parameters: Parameters such as temperature let you trade creativity for precision, which helps in applications that require accuracy.
  2. Different decoding techniques: Libraries like LMQL offer advanced constraints on the length and content of generated output, allowing for more controlled generation.
  3. Self-checking: Techniques like SmartLLMChain and Self-checking involve multiple passes of the LLM output, critiquing and refining the generated ideas to improve response quality.
  4. Causal program-aided language (CPAL) chain: CPAL extends the Program-Aided Language (PAL) chain with a causal graph representation, enabling the LLM to handle complex questions involving mathematical operations.
  5. Retrieval Augmented Generation (RAG): RAG retrieves relevant documents to provide up-to-date knowledge, reducing hallucination. However, hallucination can still occur, particularly in mathematical reasoning. To minimize hallucination in RAG pipelines, prompt engineering plays a crucial role. Fine-tuning smaller LLMs for attribution evaluation can also help assess the reliability of LLM responses. Creative fine-tuning on specialized data and expert validation can further enhance attribution models.

Adjusting LLM Generation Parameters

This is probably the most straightforward approach you can take to control the output of LLMs. Most modern LLM APIs, such as OpenAI's, support multiple decoding techniques and expose parameters such as temperature that you can use to control the creativity of your LLM. If your application requires precision (healthcare, law, etc.), you can settle on lower temperature values.
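For example, with the OpenAI Python client (the v1.x interface is shown; parameter names follow the chat completions API), pinning the temperature to 0 makes the output close to deterministic:

# Low temperature -> less "creative", more deterministic output.
# Assumes the openai Python package (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "List three common side effects of aspirin."}],
    temperature=0.0,   # precision-oriented setting for domains like healthcare or law
    top_p=1.0,
)
print(response.choices[0].message.content)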

Different Decoding Techniques in LLMs (Image Source)

There are open-source libraries such as LMQL that take controlled/constrained generation to the next level and provide an intuitive syntax for applying more complex constraints.

LMQL, a framework for constrained generation (image source)

As you can see in the image above, you can place constraints on the length of the generated output as well as on the characters you want the model to generate. Additionally, for older models such as text-davinci-003, you can place constraints on the logprobs of the generated output tokens. For instance, if you want the model to answer a multiple choice question, you can limit its output to those choices only, as sketched after the figure below. It should be mentioned that, at the time of writing, the OpenAI API for newer models such as GPT-x does not yet offer such advanced capabilities.

Constrain the output to multiple choice questions (image source)
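A minimal sketch of restricting a completion to the answer options of a multiple choice question via logit_bias, using the legacy completions endpoint and the text-davinci-003 model mentioned above (now deprecated, so treat this purely as an illustration):

# Force the model to output only one of the answer letters by heavily biasing those tokens.
# Assumes openai<1.0 and tiktoken, with OPENAI_API_KEY set in the environment.
import openai
import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")

choices = [" A", " B", " C", " D"]   # leading space matches how the model tokenizes answer letters
allowed_tokens = {str(enc.encode(c)[0]): 100 for c in choices}   # +100 bias effectively forces these tokens

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Which planet is known as the Red Planet?\nA) Venus\nB) Mars\nC) Jupiter\nD) Saturn\nAnswer:",
    max_tokens=1,        # only one token: the chosen letter
    temperature=0,
    logit_bias=allowed_tokens,
)
print(response.choices[0].text)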

Below are a few other interesting frameworks that offer constrained generation; a short sketch with one of them follows the list:

  1. Guidance
  2. Guardrails
  3. Outlines
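As a small illustration of this style of constrained decoding, here is a hedged sketch with Outlines; the API shown follows its 0.x releases and may differ in other versions:

# Constrain generation to a fixed set of choices with Outlines.
# The model ("gpt2") is a small open model picked purely for illustration.
import outlines

model = outlines.models.transformers("gpt2")
generator = outlines.generate.choice(model, ["Positive", "Negative"])

print(generator("Review: The battery died after two days. Sentiment:"))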

Self-Checking (Self-critique and Self-reflection)

Generally speaking, through smart prompting and techniques such as Chain of Thought (CoT), LLMs have shown that they can perform better on more complex tasks. In this section, we focus on a few techniques that offer great potential for reducing hallucination and helping LLMs achieve a better quality of response.
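As a quick reminder, a zero-shot chain-of-thought prompt can be as simple as appending a reasoning cue to the question (toy example below):

# A toy zero-shot chain-of-thought prompt: the trailing cue nudges the model
# to reason step by step before giving a final answer.
question = (
    "A box holds 12 pens. Ali buys 3 boxes and gives away 5 pens. "
    "How many pens does he keep?"
)
cot_prompt = f"Q: {question}\nA: Let's think step by step."
print(cot_prompt)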

SmartGPT (SmartLLMChain):

A SmartLLMChain is a self-critique chain designed to assist with complex questions. Instead of performing a single pass with a language model (LLM), it follows a three-step process:

  1. Ideation: The user prompt is passed through the LLM multiple times (n times) to generate n output proposals, known as “ideas.” The value of n can be adjusted as a parameter.
  2. Critique: All the ideas generated by the LLM are evaluated to identify potential flaws. The best idea is selected based on this critique.
  3. Resolve: The LLM attempts to improve the idea selected in the previous step and presents it as the final output. For more details, see this video from the SmartGPT author. LangChain has an implementation of SmartGPT that can get you started in seconds [2]; a minimal sketch follows below.
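A minimal sketch of that LangChain implementation, assuming the langchain and langchain_experimental packages are installed (argument names may differ between releases):

# Ideation -> critique -> resolve with LangChain's experimental SmartLLMChain.
# Assumes an OPENAI_API_KEY is set in the environment.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_experimental.smart_llm import SmartLLMChain

hard_question = (
    "I have a 12 liter jug and a 6 liter jug. "
    "I want to measure exactly 6 liters. How do I do it?"
)
prompt = PromptTemplate.from_template("{question}")

chain = SmartLLMChain(
    llm=ChatOpenAI(temperature=0),  # the same model handles ideation, critique and resolution
    prompt=prompt,
    n_ideas=3,                      # number of "ideas" generated in the ideation step
    verbose=True,
)
print(chain.run(question=hard_question))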

There is a similar technique called Self-checking [3], where the LLM checks its own output through multiple rounds of reflection prompting. SummarizationCheckerChain [4] is another similar strategy you could leverage.
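A generic, framework-free illustration of the self-check idea (not the exact chains named above): draft an answer, then ask the same model to critique and revise it.

# Two-pass self-critique: draft, then critique-and-revise.
# Assumes the openai Python package (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

question = "Which planet has the most moons?"
draft = ask(question)
revised = ask(
    f"Question: {question}\nDraft answer: {draft}\n"
    "List any factual claims in the draft that might be wrong or unsupported, "
    "then provide a corrected final answer."
)
print(revised)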

Causal program-aided language (CPAL) chain

CPAL builds on top of the Program-Aided Language (PAL) chain by representing the causal structure of the prompt as a causal graph, or DAG (check it out on LangChain). In this approach, similar to agents, the LLM has access to code execution to get feedback on more complex questions that require mathematical operations.

Image Source (PAL)

For instance, CPAL represents the following prompt:

"Tim buys the same number of pets as Cindy and Boris."
"Cindy buys the same number of pets as Bill plus Bob."
"Boris buys the same number of pets as Ben plus Beth."
"Bill buys the same number of pets as Obama."
"Bob buys the same number of pets as Obama."
"Ben buys the same number of pets as Obama."
"Beth buys the same number of pets as Obama."
"If Obama buys one pet, how many pets total does everyone buy?"

As:

DAG representation using CPAL (Image source)
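A minimal sketch of running the pets question through LangChain's experimental CPAL chain; the class and constructor names follow langchain_experimental and may change between releases:

# Build the causal graph and solve the question with the experimental CPAL chain.
# Assumes langchain, langchain_experimental and an OPENAI_API_KEY are available.
from langchain.chat_models import ChatOpenAI
from langchain_experimental.cpal.base import CPALChain

llm = ChatOpenAI(temperature=0)
cpal_chain = CPALChain.from_univariate_prompt(llm=llm, verbose=True)

question = (
    "Tim buys the same number of pets as Cindy and Boris. "
    "Cindy buys the same number of pets as Bill plus Bob. "
    "Boris buys the same number of pets as Ben plus Beth. "
    "Bill buys the same number of pets as Obama. "
    "Bob buys the same number of pets as Obama. "
    "Ben buys the same number of pets as Obama. "
    "Beth buys the same number of pets as Obama. "
    "If Obama buys one pet, how many pets total does everyone buy?"
)
print(cpal_chain.run(question))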

Retrieval Augmented Generation (RAG)

As of August 2023, unless you have been living under a rock, chances are you have heard of RAG. RAG first retrieves potentially relevant documents to provide additional, up-to-date knowledge to the LLM, so that the LLM's response is less prone to hallucination [5].

Image Source
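To make the pattern concrete, here is a bare-bones retrieval step using sentence-transformers; the model name and documents are illustrative, and the generation step with a citation-aware prompt is sketched in the Prompt Engineering section below.

# Embed a small document collection and pull the passages most similar to the query.
# Any embedding model or vector store can take the place of the ones shown here.
from sentence_transformers import SentenceTransformer, util

documents = [
    "Aspirin was first synthesized at Bayer in 1897.",
    "Paracetamol became widely available in the 1950s.",
    "Leverkusen lies on the river Rhine.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

query = "When was aspirin first synthesized?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Keep the top-2 most similar passages as context for the LLM.
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
retrieved_docs = [documents[hit["corpus_id"]] for hit in hits]
print(retrieved_docs)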

However, even with RAG, LLMs can still hallucinate, as shown in the image below from a study recently published by researchers at Ohio State University [6]:

Image Source

As you can see above, the language model can still hallucinate, especially when it comes to mathematical reasoning that involves symbols.

Besides the techniques mentioned above, you can still take the following steps to minimize hallucination in your RAG pipeline.

Prompt Engineering

Solid prompt engineering in RAG still goes a long way; below is a good prompt to get started with [7]:

Answer the question truthfully based solely on the given documents. Cite the documents using document[number] notation. If multiple documents contain the answer, cite those documents like "as stated in document[number], document[number], etc." If the documents do not contain the answer to the question, say that answering is not possible given the available information.
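One way to wrap retrieved passages in this attribution-style prompt; retrieved_docs is assumed to come from your retriever (see the sketch in the RAG section above) and the question is made up for illustration:

# Number the retrieved passages so the model can cite them with document[number] notation.
retrieved_docs = [
    "Aspirin was first synthesized at Bayer in 1897.",
    "Paracetamol became widely available in the 1950s.",
]

numbered_docs = "\n".join(
    f"document[{i + 1}]: {doc}" for i, doc in enumerate(retrieved_docs)
)

prompt = (
    "Answer the question truthfully based solely on the given documents. "
    "Cite the documents using document[number] notation. If the documents do not "
    "contain the answer to the question, say that answering is not possible "
    "given the available information.\n\n"
    f"{numbered_docs}\n\n"
    "Question: When was aspirin first synthesized?"
)
print(prompt)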

Fine-Tune Smaller Models to Detect Hallucination

Fine-tuning a smaller LLM to semantically check whether an LLM response is backed by the retrieved sources is known as attribution evaluation [6]. Fine-tuning such a model can be more time-consuming, but it seems to work decently (it would not eliminate hallucination altogether, however). For instance, you could leverage beir_fever [8] or AttrScore [9] to fine-tune a smaller model that scores the attribution of the LLM response. Also check out this repository for some implementation details.
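As a rough, hedged proxy for what such an attribution model does, you can run an off-the-shelf NLI model and check whether the source entails the generated answer; the model below is just an illustrative stand-in for a properly fine-tuned attribution model.

# Rough proxy for attribution evaluation: check whether the retrieved source entails
# the generated answer with an off-the-shelf NLI model. A model fine-tuned on
# attribution data (e.g. FEVER/AttrScore-style) would replace it.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "facebook/bart-large-mnli"   # illustrative stand-in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

source = "Aspirin was first synthesized at Bayer in 1897."
answer = "Aspirin was first synthesized in 1897."

inputs = tokenizer(source, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# id2label for this model: contradiction / neutral / entailment.
scores = {model.config.id2label[i]: round(p, 3) for i, p in enumerate(probs.tolist())}
print(scores)  # a high "entailment" score suggests the answer is supported by the source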

Additionally, you could get creative and fine-tune models that are more specialized for your own data by:

  1. Fine-tuning a Llama-2-like model on your private/enterprise data
  2. Using the fine-tuned model to generate instructions, similar to the study here
  3. Asking the experts in your company/organization to validate the generated instructions
  4. Finally, fine-tuning a specialized attribution model

Conclusion

In conclusion, hallucination remains a significant challenge when utilizing large language models (LLMs) in real-world applications. The generation of false information by LLMs can have catastrophic consequences, especially in customer-facing systems. However, there are effective methods to minimize hallucination and enhance the reliability of LLM responses.

Adjusting LLM generation parameters, employing different decoding techniques, and implementing self-checking mechanisms can provide greater control over the output and improve response quality. Techniques like SmartLLMChain and CPAL chain offer promising approaches to reduce hallucination by refining generated ideas and incorporating causal structures.

Retrieval Augmented Generation (RAG) can enhance LLM responses by retrieving relevant documents, but caution is needed as hallucination can still occur, particularly in mathematical reasoning. Prompt engineering and fine-tuning smaller LLMs for attribution evaluation can further enhance the reliability of LLM responses.

Overall, by implementing these methods and considering the limitations of LLMs, developers can mitigate the risk of hallucination and ensure more accurate and reliable information generation. Continued research and development in this field will be crucial to improving the performance of LLMs and maximizing their potential in real-world applications.


Hooman Sedghamiz

Hooman Sedghamiz is currently director of AI & ML at Bayer. His expertise ranges from NLP, AI and ML to algorithm development for healthcare applications.