A confidence score for LLM answers

Max Baak
inganalytics.com/inganalytics
11 min read · May 23, 2024

We provide a simple confidence score for LLMs answering your questions!

TL;DR

Generative AI models like Large Language Models (LLMs) generate answers with varying levels of accuracy, giving factually inaccurate answers in some cases. As a result, validation and evaluation of LLM results is a field of interest to many users and developers.

ING Wholesale Banking Analytics (WBA) contributed a new functionality to the popular transformers package, allowing one to extract the raw probabilities of text generated by LLMs. This makes it possible to calculate a confidence score for LLM answers to question-prompts. This blog shows how, and we invite you to try it out!

Generative AI for Question-Answering

At ING we deal with a lot of documents, and we run multiple Generative AI projects to help process them efficiently, including several based on Retrieval Augmented Generation (RAG).

In short, a RAG pipeline couples a search engine to an LLM. This allows one to ask questions of documents and retrieve answers, known as generative question-answering (QA). Specifically: QA by the LLM based on relevant text passages retrieved from a document. Generated answers are grounded by the retrieved text, thereby greatly reducing the risk of hallucinations. Another advantage is that a RAG setup can be used for many types of documents; a priori no dedicated fine-tuning of the LLM is required. For more RAG details see e.g. here.
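As a rough sketch of the idea (retrieve() and ask_llm() are hypothetical placeholders for the search engine and the LLM call, not part of our actual pipeline):

def rag_answer(question: str, retrieve, ask_llm, top_k: int = 3) -> str:
    passages = retrieve(question, top_k=top_k)  # search-engine step: fetch relevant text passages
    context = "\n\n".join(passages)             # retrieved text that grounds the answer
    prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    return ask_llm(prompt)                      # generative QA step: answer based on the passages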

The RAG projects aim to automate the extraction of information from unstructured documents. Typically a fixed set of questions needs to be answered for a large batch of similar documents. We use the answers for automated form-filling, resulting in a structured summary dataset.

Here’s an example project, dealing with annual reports from our corporate clients. ING is obligated to report on the greenhouse gas emissions of its clients, which can be retrieved from their annual sustainability reports. Example questions are: what are the scope 1, 2 and 3 emissions over the past year? Extracting this information from thousands of documents is a labor-intensive, manual task that can be automated — at least partially — using a RAG pipeline.

Reliable answers?

The problem at hand is: are the generated answers reliable? We find that the quality varies; it depends on the complexity of the question, the type of document, and the LLM used.

Right now, to guarantee correctness, all extracted answers in our projects still need to be validated by experts. It is important that this process can be done smoothly and efficiently. Can we provide a confidence score that indicates whether an extracted answer is correct? Having this would be very useful for the validation.

Normally, LLMs do not return confidence scores for their generated answers, and those answers are not necessarily correct. LLMs are simply not designed to do so. Fundamentally they are trained to predict the next sub-word, the so-called token, following a text passage, and — using reinforcement learning from human feedback — to provide well-rated answers. But no confidence scores.

How then to obtain these scores for generative QA?

Yes/No confidence score for LLMs

In this post we explore one technique where the confidence score is extracted from the raw next-token probabilities (the “unprocessed logits”) of a causal language model — these are what an LLM uses to predict the next token given a prompt. The technique is inspired by Hegselmann et al. 2023, where GPT-3 is used as a binary classifier.

There are three steps to our approach.

  1. Based on a provided Context and Question the LLM provides its Response.
  2. Based on the provided Context and Question asked, how confident is the LLM that the Response is correct? We ask the LLM a follow-up question: “Is the provided Response correct? Answer Yes or No.”
  3. The unprocessed next-token probabilities are retrieved for the tokens Yes and No. From these we calculate the so-called Yes-score. (See the formula below.)

The confidence score is calculated as the relative ratio of the Yes and No probabilities:

Yes-score = P(Yes) / (P(Yes) + P(No))

where P(Yes) and P(No) are the unprocessed next-token probabilities of the tokens Yes and No. By construction the Yes-score is in the range [0, 1], where zero means only the No component, and one only the Yes component.
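In code the formula is simply the following (a minimal sketch; the fallback of 0.5 when neither token receives any probability mass matches the appendix code below):

def yes_score_from_probs(p_yes: float, p_no: float) -> float:
    """Relative ratio of the raw Yes and No next-token probabilities."""
    if p_yes == 0.0 and p_no == 0.0:
        return 0.5  # neither Yes nor No received any probability mass
    return p_yes / (p_yes + p_no)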

Our code contribution

At ING WBA we have contributed code to the well-known transformers package that allows one to retrieve the unprocessed next-token probabilities from a causal language model.

outputs = model.generate(input_ids, output_logits=True, return_dict_in_generate=True)

The unprocessed probabilities are returned with the option output_logits=True.

The transformers package could already return processed next-token probabilities, with the option output_scores=True, but by default these are modified by so-called LogitsProcessor and LogitsWarper objects. These two reduce the number of available next tokens down to one and update its probability to 1.0. That does not work for the Yes-score, which needs both the Yes and No probabilities.
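In more detail, here is a minimal sketch (assuming model and tokenizer are a causal language model and its tokenizer loaded via transformers >= 4.38):

import torch

input_ids = tokenizer(
    "Is the provided Response correct? Answer Yes or No.\n\nAnswer:", return_tensors="pt"
).input_ids.to(model.device)
outputs = model.generate(
    input_ids, max_new_tokens=1, output_logits=True, return_dict_in_generate=True
)
raw_logits = outputs.logits[0]  # unprocessed logits for the first generated token
probs = torch.nn.functional.softmax(raw_logits, dim=-1)  # probabilities over the full vocabulary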

Python code to calculate the Yes-score for a causal language model in the transformers package is provided in the Appendix below. To use it, be sure to download a version that includes our new feature:

pip install -U "transformers>=4.38.0"

Let’s check how well the Yes-score works!

Tested performance

For the tests below we’ve used Meta’s Llama2–13B-Chat model, which can be downloaded from Hugging Face for use with the transformers package, but our approach should work for any causal language model.
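A minimal loading sketch (the checkpoint id below is the usual Hugging Face name for Llama2–13B-Chat and is an assumption about the exact setup; the gated Llama 2 weights require access approval):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
device = model.device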

The Yes-score is tested using the Stanford Question-Answering Dataset (SQuAD), version 2.0. This is a dataset of over 140,000 questions with manually curated, ground-truth answers (called “correct” answers below), based on very diverse Wikipedia articles.
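SQuAD 2.0 can be loaded with the datasets package; a minimal sketch (the exact split used for our tests is not specified here):

from datasets import load_dataset

squad = load_dataset("squad_v2", split="validation")
example = squad[0]
context, question = example["context"], example["question"]
correct_answers = example["answers"]["text"]  # empty list for unanswerable questions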

A crushing example

Here is an example from the SQuAD dataset, based on the computer game The Legend of Zelda: Twilight Princess.

  • Context: Ganondorf then revives, and Midna teleports Link and Zelda outside the castle so she can hold him off with the Fused Shadows. However, as Hyrule Castle collapses, it is revealed that Ganondorf was victorious as he crushes Midna’s helmet. Ganondorf engages Link on horseback, and, assisted by Zelda and the Light Spirits, Link eventually knocks Ganondorf off his horse and they duel on foot before Link strikes down Ganondorf and plunges the Master Sword into his chest. With Ganondorf dead, the Light Spirits not only bring Midna back to life, but restore her to her true form. After bidding farewell to Link and Zelda, Midna returns home before destroying the Mirror of Twilight with a tear to maintain balance between Hyrule and the Twilight Realm. Near the end, as Hyrule Castle is rebuilt, Link is shown leaving Ordon Village heading to parts unknown.
  • Question: What does Ganondorf crush?
  • Correct answer: Midna’s helmet.

Asking this question and providing the context, Llama2 correctly answers “Midna’s helmet”.

We then construct the following prompt and run it by Llama2, based on which the Yes-score is calculated:

Context: {Context}

Question: {Question}

Response: {Response}

Based on the given Context and Question, answer this question:

Is the provided Response correct? Yes or No?

Answer:

In the prompt we vary the response and see how the Yes-score changes.

  • Response: the helmet of Midna. Yes-score: 66.9%.
  • Response: Midna’s helmet. Yes-score: 66.4%.
  • Response: Hyrule Castle. Yes-score: 45.3%.
  • Response: the Mirror of Twilight. Yes-score: 28.3%.
  • Response: Midna’s shield. Yes-score: 26.0%.
  • Response: Link’s helmet. Yes-score: 8.5%.
  • Response: Gandalf’s staff. Yes-score: 1.5%.
  • Response: The Tower of Orthanc. Yes-score: 0.0%.

The correct answers (“Midna’s helmet” and “the helmet of Midna”) get the highest Yes-scores. For the incorrect answers the scores are lower, although sometimes still quite high (“Hyrule Castle”). The other responses become more and more incorrect, and the Yes-scores drop towards zero, as desired.

Extracted scores on a larger set

On a larger test set of several thousand SQuAD passages the scores look as follows.

  • When filling in incorrect responses (from entirely different passages), the Yes-scores peak strongly at zero (pink distribution). Note there’s also a small peak at 0.5, where the Yes and No components receive the same (tiny) probability.
  • For random text responses, extracted from the passage that contains the correct answer, the Yes-score distribution is wider (brown distribution): it peaks close to zero with a bigger tail towards one.
  • The correct responses result in the green distribution. This peaks close to one but has a fat tail towards zero.
  • Filling in Llama2’s responses to the questions results in a distribution (magenta) similar to the correct responses. The peak at one is slightly wider, with the same fat tail towards zero.

The distributions show there is a clear separation in Yes-scores between correct and incorrect answers. Incorrect ones peak close to zero and correct ones close to one. The correct answers still have a fat tail towards zero, suggesting that Llama2 has doubts about some of these. Finally, Llama2’s response distribution is very similar to the correct distribution, suggesting most if not all of its answers are correct. We shall see in the next sub-section that this is not true, however.

The ROC curve of the correct versus the “random related” responses is shown here.

The area under the curve is 0.85, which means the Yes-score works reasonably well as a classifier to discriminate between correct and randomly related answers.
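For reference, such an AUC can be computed with scikit-learn; a sketch, where yes_scores_correct and yes_scores_random are hypothetical lists of Yes-scores for the correct and the “random related” responses:

from sklearn.metrics import roc_auc_score

labels = [1] * len(yes_scores_correct) + [0] * len(yes_scores_random)
scores = list(yes_scores_correct) + list(yes_scores_random)
auc = roc_auc_score(labels, scores)  # 0.85 in the test above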

Checking the answers manually

A manual check of Llama2’s responses on some of the SQuAD passages is provided here, comparing against the correct answers.

The orange boxes show answers that are incorrect; clearly Llama2 is not always right. On a manually labelled test set of 200 randomly selected SQuAD passages we find that Llama2 achieves a QA precision of 78.5%.

This also illustrates that the Yes-score distribution of Llama2 responses (above, in magenta) is a combination of both correct and incorrect answers.

Alternative scores

Other approaches to obtaining confidence scores for QA are available as well.

Extractive QA models

BERT-based NLP models are good at QA tasks; see for example RoBERTa-Base-SQuAD2 (RoBERTa). These models can be easily trained to do extractive QA, meaning: they learn to extract the answer from the provided input text. A trained model returns the answer, a confidence score, and the start & end positions of the answer in the text. The score is the probability that both the start and end positions of the answer are correct. Exactly what we need for validation!
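As a sketch, such a model can be run with the transformers question-answering pipeline (we assume the deepset/roberta-base-squad2 checkpoint corresponds to the RoBERTa model linked above):

from transformers import pipeline

qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa_model(question=question, context=context)
# result holds the extracted answer, its confidence score, and the start/end character positions
print(result["answer"], result["score"], result["start"], result["end"])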

There are many BERT-based QA models available on Hugging Face, such as BERT, RoBERTa and DeBERTa, typically trained on the SQuAD dataset. One disadvantage: these models accept only small text passages as input, at most 512 tokens (versus 4096 for Llama2).

RoBERTa performs well on the SQuAD dataset, better in fact than Llama2. However, in our RAG setup we find RoBERTa performs worse than Llama2. Presumably this happens because SQuAD is a dataset with single, short text passages only, whereas the RAG pipeline returns multiple (possibly long!) text passages, and finding the right answer from those is a tougher challenge.

Score prompting

In their paper, K. Tian et al. (2023) suggest simply asking LLMs directly for (calibrated) confidence scores, something they tested on the models GPT-3.5, GPT-4, Claude 1 & 2, and Llama-2–70b-chat. They conclude that:

[LLMs] can often directly verbalize better-calibrated confidences (either a numerical confidence probability or an expression such as ‘highly likely’) than the models’ conditional probabilities.

We have tried their suggested prompt with Llama2–13B-Chat:

Based on the given Context and Question and Response, provide the probability that the Response is correct (0.0 to 1.0), with no other words or explanation. Answer:

However, we have the opposite finding: this prompt does not work well on the smaller Llama2–13B-Chat model. Only a handful of discrete values are returned, and furthermore these do not provide very good separation between correct and incorrect answers. Apparently, larger language models are required for this approach to work!
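For illustration, this is roughly how we tried it (ask_llm() is a hypothetical helper that sends the prompt to Llama2–13B-Chat and returns the generated text):

prompt = (
    f"Context: {context}\n\n"
    f"Question: {question}\n\n"
    f"Response: {response}\n\n"
    "Based on the given Context and Question and Response, provide the probability "
    "that the Response is correct (0.0 to 1.0), with no other words or explanation. Answer:"
)
raw_answer = ask_llm(prompt)
try:
    verbalized_confidence = float(raw_answer.strip())
except ValueError:
    verbalized_confidence = None  # the model did not return a bare number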

Other tokens besides Yes and No

When asking for “Yes” and “No” responses only, there is a non-negligible probability that the model replies with similar tokens such as: yes, no, Yes!, No!, NO, YES, true, false, etc. We can group all these into positive and negative token sets, sum their probabilities, and calculate an alternative Yes-score from those sums.
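A sketch of such a grouped score (the token sets below are illustrative, not our exact lists; next_token_probs plays the role of next_token_dict in the appendix code):

def grouped_yes_score(next_token_probs: dict) -> float:
    """Alternative Yes-score: sum probabilities over Yes-like and No-like token variants."""
    positive = {"Yes", "yes", "YES", "Yes!", "true", "True"}
    negative = {"No", "no", "NO", "No!", "false", "False"}
    p_yes = sum(p for t, p in next_token_probs.items() if t in positive)
    p_no = sum(p for t, p in next_token_probs.items() if t in negative)
    return p_yes / (p_yes + p_no) if (p_yes + p_no) > 0 else 0.5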

Initial tests showed little difference in Yes-score performance though, so for simplicity we have ignored these other tokens (for now).

Conclusion

We have implemented a confidence score for LLM answers provided to question-prompts, the Yes-score, and have tested this for a smaller LLM (Llama2–13B-Chat). The score can be used as a naive discriminator between correct and incorrect answers. We find it performs better than similar scores from BERT-based models and score prompting. The code to calculate it has been added to the transformers package. We invite you to try it out and are happy to hear your feedback!

Credits

Co-authors: Hazal Koptagel and Nilay Afacan. The code was authored by ING Analytics Wholesale Banking. We thank Nikoletta Bozika for reviewing this blog.

Appendix: two code examples

Here are two examples of the Yes-score. For both, be sure to download the minimum required version of the transformers package:

pip install -U "transformers>=4.38.0"

First, a demo notebook for the Yes-score can be found here. It runs on the SQuAD dataset and compares the performance of the Yes-score with the RoBERTa confidence score.

Second, we focus on the Python instructions to calculate just the Yes-score:

# important: available at this point are tokenizer, model, device,
# context, question, and the model's response to that question

# 1. make the yes/no prompt
prompt = make_yes_no_prompt(context, question, response)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
input_length = input_ids.shape[1]

# 2. generate the yes/no answer
# be sure to generate output with options: output_logits=True,
# and return_dict_in_generate=True
outputs = model.generate(input_ids, output_logits=True, return_dict_in_generate=True, max_new_tokens=5)

# 3. calculate the yes-score
yes_score = yes_score_calculation(outputs, input_length, tokenizer)

This makes use of the following two functions:

from typing import Any

import torch


def make_yes_no_prompt(context: str, question: str, response: str) -> str:
    return f"""Context: {context}

Question: {question}

Response: {response}

Based on the given Context and Question, answer this question:

Is the provided Response correct? Answer only Yes or No.

Answer:
"""


def yes_score_calculation(outputs: Any, input_length: int, tokenizer: Any) -> float:
    generated_tokens = outputs.sequences[:, input_length:]

    # 1. find the index (idx) of the first character-based token
    for idx, tok in enumerate(generated_tokens[0]):
        next_token_str = tokenizer.decode(tok, skip_special_tokens=True)
        n_letters = sum(c.isalpha() for c in next_token_str)
        if n_letters != len(next_token_str):
            continue
        break

    # 2a. do preselection on high probabilities (out of 32k tokens)
    probs_all = torch.nn.functional.softmax(outputs.logits[idx][0], dim=-1)
    indices = torch.argwhere(probs_all > 0.001)
    indices = indices[:, -1]
    tokens_max = tokenizer.batch_decode(indices, skip_special_tokens=True)
    probs_max = probs_all[probs_all > 0.001]

    # 2b. find the Yes/No probabilities
    next_token_dict = {str(t): p for t, p in zip(tokens_max, probs_max)}
    yes_prob = next_token_dict.get("Yes", 0.0)
    no_prob = next_token_dict.get("No", 0.0)

    # 3. calculate and return the Yes/No confidence score
    yes_score = yes_prob / (yes_prob + no_prob) if yes_prob != 0 or no_prob != 0 else 0.5
    return float(yes_score)
