ML Olympiads: detect hallucinations in LLMs with Google Gemini

Luca Massaron
7 min read · Mar 31, 2024

Detecting hallucinations is challenging, but Google Gemini can help

credits: DALL·E 3

In a Kaggle competition — part of the #MLOlympiad — participants were challenged to develop algorithms capable of discerning hallucinatory responses generated by the Mistral 7B Instruct model.

First of all, the ML Olympiad is back for 2024 and will last a few more days! ML Olympiad, revolving around Kaggle Community Competitions, is organized by ML GDE, TFUG, and various third-party ML communities, with continued support from Google Developers. Following the successful launch in 2022, the ML Developer Programs team and these communities provided another exciting round of ML training opportunities for developers through Kaggle’s features this year.

Returning to the competition, given a set of responses generated by the Mistral 7B Instruct model, participants are tasked with designing machine learning models or algorithms that can accurately identify which responses are most likely hallucinations. The goal is to distinguish between genuine, contextually appropriate responses and nonsensical, misleading, or otherwise erroneous ones. The competition aims to advance the field of natural language processing by addressing the challenge of detecting and mitigating hallucinations in language generation models.

The dataset has been built by feeding instructions from the Open-Orca dataset (https://huggingface.co/datasets/Open-Orca/OpenOrca) to a Mistral 7B Instruct large language model. Constructed from a vast repository of augmented FLAN data and aligned with the distributions outlined in the Orca paper, Open-Orca is a goldmine for anyone working in Natural Language Processing (NLP) and has played a pivotal role in producing high-quality model checkpoints. Each answer Mistral returned was then labeled as a proper answer or a hallucination based on internal metrics, heuristics, and direct scrutiny. The task of the competition is therefore to develop an algorithm that scores answers probabilistically, flagging the ones more likely to be hallucinations; submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target. The idea behind the competition is that a model that generalizes to detecting hallucinations can improve the labeling beyond simple heuristics and metrics.
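To make the metric concrete, here is a tiny, self-contained sketch of how the area under the ROC curve is computed with scikit-learn (the labels and probabilities are toy values, not competition data):

from sklearn.metrics import roc_auc_score

# Toy values: 1 = hallucination, 0 = proper answer
y_true = [0, 0, 1, 1, 0, 1]
# Predicted probability of hallucination for each answer
y_score = [0.1, 0.4, 0.8, 0.7, 0.2, 0.9]

# 1.0 here, since every hallucination is ranked above every proper answer;
# 0.5 would correspond to random guessing
print(roc_auc_score(y_true, y_score))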

Up to now, there have been a few Kagglers trying to crack the problem, taking different approaches, from XGBoost to Transformers:

A sample of the Kaggle Notebooks developed for the competition

Before the competition closes, I want to offer my own perspective by building a feature based on Google Gemini's powerful, state-of-the-art generative capabilities that can help detect hallucinations produced by another large language model.

Large language models (LLMs) are impressive tools, but they can sometimes make mistakes, leading to what are known as hallucinations. These aren’t spooky or fantastic visions; in LLMs, hallucinations are outputs that are incorrect, nonsensical, or simply made up. Despite their vast training data, large language models might invent facts, stray from the topic, or generate text that seems convincing but isn’t grounded in reality. Hallucinations occur because of the probabilistic nature of LLMs and the limitations of their training data: the models can generate text that is factually incorrect, nonsensical, or disconnected from the input prompt.

LLMs generate text based on patterns learned from their training data, rather than properly understanding facts and concepts. This can lead them to produce outputs that seem coherent but are actually false or irrelevant. Hallucinations can take various forms, including generating false information, contradicting the prompt, producing nonsensical output, or including irrelevant details. Studies have shown that hallucination rates can be high, with ChatGPT exhibiting a hallucination rate of up to 31%.

These hallucinations are a significant concern as they can lead to the spread of misinformation and undermine the reliability of LLM-generated content, especially in critical domains like healthcare, law, and finance. Strategies such as providing more context in prompts, using model evaluation techniques, and implementing human review processes can mitigate hallucinations. However, it’s crucial to understand that hallucinations cannot be completely eliminated, so users must exercise caution when using LLMs and cross-reference their outputs against reliable sources.

Recent research has shown that LLM hallucinations can arise from various factors, including source-reference divergence in training data, the exploitation of jailbreak prompts, reliance on incomplete or contradictory datasets, overfitting, and the model’s propensity to guess based on patterns rather than factual accuracy. Methods such as prompt tuning, measuring perplexity, and evaluating anti-hallucination techniques have been proposed to address this.

In addition, recent research has explored using LLMs themselves to spot hallucinations: having a Large Language Model judge the responses of another LLM is an emerging area of research. Several tools and frameworks, such as G-Eval, GPTScore, SelfCheckGPT, TRUE, ChatProtect, and Chainpoll, are being developed to measure and prevent LLM hallucinations. These tools adopt a metrics-driven approach to LLM evaluation, defining test cases and setting up automated checks to quantify the effectiveness of hallucination countermeasures.

Here is where Google Gemini comes into the game. Google Gemini is one of the most powerful LLMs around, so it should answer the same questions Mistral handled with far fewer hallucinations. Let’s use it as a watchdog and check whether Gemini’s answer differs much from Mistral’s. We can do that with a similarity measure turned into a normalized distance (cosine distance): the farther the embedding of Gemini’s answer is from the embedding of Mistral’s answer, the more likely Mistral’s answer is a hallucination. The point here is that the best way to evaluate an LLM’s answer is to contrast it with another LLM’s answer: if they differ too much, there can be a problem! The exercise is also a good tutorial on how to use Gemini; it is quite easy and straightforward!
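To make the idea concrete before touching any API, here is a minimal numpy sketch of the cosine distance we will rely on (the vectors are toy stand-ins, not real Gemini or Mistral embeddings):

import numpy as np

def cosine_distance(u, v):
    """Cosine distance: close to 0 for aligned vectors, larger when they diverge."""
    similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - similarity

# Toy vectors standing in for the embeddings of Gemini's and Mistral's answers
gemini_vec = np.array([0.2, 0.7, 0.1])
mistral_vec = np.array([0.3, 0.6, 0.2])
print(cosine_distance(gemini_vec, mistral_vec))  # small value = similar answers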

Here we start; you can find all the code in this repository: https://github.com/lmassaron/gemini_hallucination_detection. Naturally, we work on Google Vertex AI, using the Kaggle Dataset brought over to Google Cloud: all you have to do is select “Open in Google Notebooks” (you also need a Google Cloud account and a project, by the way).

As a first step, we import the train and test data from the competition:

import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.metrics import roc_auc_score

train = pd.read_csv("./kaggle/input/ml-olympiad-detect-hallucinations-in-llms/train.csv")
test = pd.read_csv("./kaggle/input/ml-olympiad-detect-hallucinations-in-llms/test.csv")
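If you want a quick look at what was loaded (the column names below are the ones used later in the notebook: Prompt, Answer, and Target), a couple of lines are enough:

# Quick sanity check on the loaded data
print(train.shape, test.shape)
print(train.columns.tolist())  # expected to include Prompt, Answer, and Target
print(train.Target.value_counts(normalize=True))  # share of hallucinated answers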

The following step is to define the project information (your project ID, in particular) and initialize Vertex AI:

# Define project information
PROJECT_ID = "<your_project_id>" # @param {type:"string"}
LOCATION = "us-central1" # @param {type:"string"}

# Initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

We complete the initialization by starting two models: Gemini 1.0 Pro (https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/overview#gemini_models) for the generative tasks and Gecko (https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings) for the embeddings:

# Initialize Vertex AI models
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-1.0-pro")
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
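As an optional smoke test (just to check that the project, location, and models are wired up correctly), you can run one generation call and one embedding call before processing the whole dataset:

# Optional smoke test: one generation call and one embedding call
response = model.generate_content("Name the capital of France in one word.")
print(response.candidates[0].content.parts[0].text)

embedding = embedding_model.get_embeddings(["Paris"])[0].values
print(len(embedding))  # Gecko returns a 768-dimensional vector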

Then we define a function, get_distance, that computes the distance between the answer Gemini generates and the target answer provided by Mistral, using embeddings and a cosine similarity.

def get_distance(prompt, target_answer, model, embedding_model):
    """Calculate the distance between a generated answer and a target answer"""

    try:
        # Generate an answer from the prompt using the model
        response = model.generate_content(
            prompt,
            generation_config={"temperature": 0},
        )
        # Extract the generated answer from the response
        answer = response.candidates[0].content.parts[0].text
    except Exception:
        # If anything goes wrong during answer generation (blocked content,
        # API errors), fall back to a placeholder answer
        answer = "no answer"

    # Get embeddings for the generated answer and the target answer
    embedded_answer = embedding_model.get_embeddings([str(answer)])[0].values
    embedded_target = embedding_model.get_embeddings([str(target_answer)])[0].values

    # Cosine similarity between the two embeddings
    dot_product = np.dot(embedded_answer, embedded_target) / (
        np.linalg.norm(embedded_answer) * np.linalg.norm(embedded_target)
    )

    # Turn the similarity into a cosine distance
    distance = 1 - dot_product

    return distance
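Before scoring everything, you can try the function on a single training example to get a feel for the numbers:

# Try the distance on the first training example
example = train.iloc[0]
example_distance = get_distance(example.Prompt, example.Answer, model, embedding_model)
print(f"Target: {example.Target} - distance: {example_distance:.4f}")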

The next function, calculate_scores, simply wraps the distance computation: it loops over each row of the provided dataset and uses get_distance to measure how far Gemini’s generated answer is from the target answer.

def calculate_scores(data, model, embedding_model, max_retries=10, print_interval=None):
    """Calculate scores for each row in the provided data"""

    scores = []

    # Iterate over each row in the data
    for row in tqdm(range(len(data))):
        score = None
        retry = 0

        # Attempt to calculate the score, retrying on transient API errors
        while score is None and retry <= max_retries:
            try:
                score = get_distance(data.iloc[row].Prompt, data.iloc[row].Answer, model, embedding_model)
            except Exception:
                score = None
            retry += 1

        # If the score is still None after all retries, fall back to 0.25
        if score is None:
            score = 0.25

        scores.append(score)

        # Print progress and ROC AUC score if print_interval is set
        if print_interval is not None and row > 0 and row % print_interval == 0:
            roc_auc = roc_auc_score(y_true=data.Target.iloc[:len(scores)], y_score=scores)
            print(f"{row}/{len(data)} ROC AUC: {roc_auc}")

    return scores

At this point, we are ready to compute the scores for the train and test data:

# Compute scores for the train data
train_scores = calculate_scores(train, model, embedding_model, max_retries=10, print_interval=500)

# Compute scores for the test data
test_scores = calculate_scores(test, model, embedding_model, max_retries=10, print_interval=None)
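Since the train data carries the Target labels, we can immediately check how well the raw Gemini distance alone ranks the hallucinations:

# ROC AUC of the raw Gemini distance on the train data
train_auc = roc_auc_score(y_true=train.Target, y_score=train_scores)
print(f"Train ROC AUC: {train_auc:.4f}")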

After finishing, we can save the train scores to use them as a feature for stacking with other methods (XGBoost, Transformers, or even Logistic Regression); a minimal stacking sketch follows the submission code below. We can also save the test scores as a submission.

# Save scores for the train data
train["score"] = train_scores
train.to_csv("scored_train.csv", index=False)

# Save your submission
submission = pd.read_csv("./kaggle/input/ml-olympiad-detect-hallucinations-in-llms/sample_submission.csv")
submission.Target = test_scores
submission.to_csv("submission.csv", index=False)
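Here is the stacking sketch mentioned above, an illustration rather than part of the original notebook: feed the Gemini-based distance (together with any other features you may have engineered) to a simple Logistic Regression and evaluate it with cross-validation.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# The Gemini-based distance as a (single) feature; add your own features here
X = train[["score"]].values
y = train.Target.values

stacker = LogisticRegression()
# Out-of-fold predicted probabilities of hallucination
oof_preds = cross_val_predict(stacker, X, y, cv=5, method="predict_proba")[:, 1]
print(f"Stacked ROC AUC: {roc_auc_score(y, oof_preds):.4f}")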

Getting to a hallucination estimate has been a piece of cake!

credits: https://unsplash.com/it/@cowink_

#MLOlympiad #GeminiSprint

Google Cloud credits are provided for this project.


Luca Massaron

Data scientist molding data into smarter artifacts. Author on AI, machine learning, and algorithms for Wiley, Packt, Manning. 3x Kaggle Grandmaster.