Intuit Presents Innovative Approach to Quantifying LLM Uncertainty at EACL 2024

Xiang Gao
Intuit Engineering
Published Apr 29, 2024

This blog is co-authored by Xiang Gao, staff research scientist; Jiaxin Zhang, staff research scientist; Lalla Mouatadid, staff research scientist; Kamalika Das, manager, AI Research Program; and Kumar Sricharan, VP and chief architect for AI at Intuit.

Large language models (LLMs) have become increasingly popular in a wide range of applications. They are also notoriously prone to hallucinations, in which the model confidently produces information that is in reality false, logically incoherent, or irrelevant. Reducing hallucinations is a high priority for LLM developers, for obvious reasons.

The variety of approaches aimed at addressing this problem has tended to fall short in one way or another. In response, Intuit’s AI Research Program team has developed a novel method, sampling with perturbation for uncertainty quantification (SPUQ), that produces a more accurate measure of model uncertainty in LLMs and helps reduce the potential for hallucinations.

Following is a summary of SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models, the research paper our team presented last month at the 2024 conference of the European Chapter of the Association for Computational Linguistics (EACL).

For others tackling the challenge of reducing hallucinations in today’s AI/generative AI era, we hope our team’s findings will be a thought-provoking and practical contribution to the body of research in this space. Stay tuned to learn more here about our plan to open source SPUQ in the coming weeks to benefit the broader community of researchers and LLM developers.

Drawbacks of previous approaches

Because generative AI produces new content, there often is no single “correct” response to a given inquiry. The possibility of many — or even infinite — valid outputs based on a given input increases the odds of a wrong answer. Data scientists call this aleatoric uncertainty, which ultimately derives from the size of the range of valid answers from which the model has to choose: the more valid possibilities, the higher the aleatoric uncertainty attached to any single one of them.
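As a rough illustration of this idea (our own toy example, not drawn from the paper), aleatoric uncertainty can be pictured as the entropy of the distribution over valid answers: the flatter and wider that distribution, the higher the entropy.

```python
import math

def entropy(probs):
    """Shannon entropy of a distribution over possible answers.

    A flat distribution over many valid answers gives high entropy (high
    aleatoric uncertainty); a peaked distribution gives low entropy.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# A question with one dominant answer: low aleatoric uncertainty.
print(entropy([0.9, 0.05, 0.05]))         # ~0.39 nats

# An open-ended prompt with many equally valid answers: high uncertainty.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.39 nats
```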

By contrast, epistemic uncertainty stems from limitations related to the model itself. If it hasn’t been trained on the appropriate information, it won’t be able to deliver a particularly accurate answer.

Both types of uncertainty are important for LLMs. Previous approaches have mostly focused on quantifying aleatoric uncertainty. Unfortunately, many popular LLMs don’t provide access to the data necessary to determine the range of potential answers they could generate, making it impossible to measure aleatoric uncertainty directly. Some methods circumvent this shortcoming by sampling multiple outputs and measuring how much they differ from one another. When an LLM makes a confidently incorrect prediction, however, resampling tends to yield similar results. This tendency skews confidence scores derived in this way.
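To make that baseline concrete, here is a minimal sketch of such a sampling-based confidence score. The `sample_fn` argument is a hypothetical stand-in for whatever chat-completion call you use, and agreement is measured by exact-match voting, a limitation discussed below.

```python
from collections import Counter

def sampling_confidence(prompt: str, sample_fn, k: int = 5) -> tuple[str, float]:
    """Sample k completions for one prompt and score agreement among them.

    `sample_fn` should take a prompt string and return one sampled completion.
    """
    outputs = [sample_fn(prompt) for _ in range(k)]
    answer, count = Counter(outputs).most_common(1)[0]
    # Fraction of samples that agree with the majority answer. If the model is
    # confidently wrong, all k samples may still agree, so the score stays
    # high; that is the failure mode described above.
    return answer, count / k
```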

The nature of epistemic uncertainty makes it harder to quantify. You simply don’t know what you don’t know. Because you can always learn more, however, you can reduce epistemic uncertainty.

A dual approach: sampling with perturbation for uncertainty quantification (SPUQ)

The AI Research team saw an opportunity to further the existing work on uncertainty quantification (UQ) in LLMs by addressing both categories of uncertainty. This novel method augments and combines existing UQ approaches and adapts them specifically for LLMs.

To address aleatoric uncertainty, SPUQ enhances its sampling methodology with an aggregation module. Typical sampling methods look for exact matches among outputs, which isn’t generally suitable for tasks like text generation, where a range of equally accurate answers aren’t necessarily identical. To address this shortcoming, we looked at the similarity between outputs and, where it’s possible to obtain the predicted token distribution, the uncertainty within each output.
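The following sketch shows one way such a similarity-based aggregation could look. It is an illustration of the idea described above, not the exact formulation used in the paper, and `SequenceMatcher` simply stands in for whatever text-similarity metric (embedding cosine similarity, ROUGE, etc.) you prefer.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in similarity metric between two generated texts."""
    return SequenceMatcher(None, a, b).ratio()

def aggregate_confidence(original_output: str, other_outputs: list[str]) -> float:
    """Score agreement by similarity rather than exact match.

    Near-identical paraphrases count as agreement even though they are not
    exact string matches, which makes this suitable for free-form generation.
    """
    sims = [similarity(original_output, o) for o in other_outputs]
    return sum(sims) / len(sims)

# Example: paraphrased answers still yield a high confidence score.
alternatives = ["Paris is the capital of France.",
                "The capital of France is Paris."]
print(aggregate_confidence("Paris is the capital of France.", alternatives))
```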

Figure: Uncertainty quantification techniques: one-pass (Lin et al., 2022; Kadavath et al., 2022; Chen et al., 1998), sampling-based (Si et al., 2022; Wang et al., 2022), and our SPUQ method. SPUQ addresses both epistemic (via perturbation) and aleatoric (via sampling) uncertainties. Aggregation yields the total uncertainty, distinguishing SPUQ from traditional methods focused mainly on aleatoric uncertainty.

To address epistemic uncertainty, SPUQ uses a perturbation module that varies input prompts to gauge the LLM’s sensitivity to such changes (see the sketch after this list). These changes include:

  • Paraphrasing the prompt in different ways.
  • Randomly peppering the prompt with dummy tokens such as superfluous spaces or punctuation.
  • Replacing the system messages that govern the tone of a response with empty or semantically similar messages.
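Below is a simplified sketch of what one such perturbation step could look like, assuming prompts are chat-style message lists with "system" and "user" roles. The `paraphrase` helper is a placeholder; in practice it would call an LLM to reword the prompt.

```python
import random

def paraphrase(text: str) -> str:
    """Placeholder: in practice, ask an LLM to reword the prompt.
    Returns the text unchanged here so the sketch stays self-contained."""
    return text

def add_dummy_tokens(text: str, n: int = 3) -> str:
    """Insert harmless noise (extra spaces or punctuation) at random positions."""
    chars = list(text)
    for _ in range(n):
        pos = random.randrange(len(chars) + 1)
        chars.insert(pos, random.choice([" ", ".", ","]))
    return "".join(chars)

def perturb_prompt(messages: list[dict]) -> list[dict]:
    """Return one randomly perturbed copy of a chat-style prompt
    (a list of {"role": ..., "content": ...} messages)."""
    perturbed = [dict(m) for m in messages]
    kind = random.choice(["paraphrase", "dummy", "system"])
    for m in perturbed:
        if m["role"] == "user" and kind == "paraphrase":
            m["content"] = paraphrase(m["content"])
        elif m["role"] == "user" and kind == "dummy":
            m["content"] = add_dummy_tokens(m["content"])
        elif m["role"] == "system" and kind == "system":
            m["content"] = ""  # or swap in a semantically similar system message
    return perturbed
```

Comparing the model’s outputs across several such perturbed prompts reveals how sensitive it is to inputs it was never explicitly trained to handle, which is the epistemic signal SPUQ is after.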

In extensive experiments, this method reduced expected calibration error (ECE) by 50% on average, a promising step toward improving the usefulness of LLMs.
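For readers unfamiliar with the metric, here is a small reference implementation of ECE: predictions are bucketed by confidence, and the gap between average confidence and accuracy is averaged across buckets, weighted by bucket size. This is the standard definition, not code from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: matching 0/1 outcomes."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Toy example: two confident correct answers and two low-confidence wrong ones.
print(expected_calibration_error([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # ~0.15
```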

Next steps for SPUQ development and refinement

SPUQ’s success in reducing ECE demonstrates its potential to make LLMs more useful across a range of tasks by increasing the reliability of their outputs. The ability to improve the accuracy of LLM-generated responses could help developers fine-tune their models more effectively, improving public confidence in the results of these systems and increasing their suitability for a wider range of applications.

Before that can happen, however, this approach will need to be developed and refined further. Intuit’s AI Research Program team’s initial experiments involved datasets that allowed for a relatively easy assessment of accuracy and used relatively simple prompts. More research will be required to ensure applicability across a diverse range of tasks and prompt structures.

Stay tuned here to learn more about our plans to open source this method, so the broader community of researchers and LLM developers can benefit from our findings. For now, you can take a deeper dive into the details of this research in the full paper, SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models, on the Cornell University arXiv site.

_________________________________________________________________

Intuit’s AI Research Program is an intrapreneurial function within the company that pushes the boundaries of AI. We develop and incubate AI-driven technology breakthroughs to solve our customers’ most important financial problems.

We’re a diverse team of research scientists, data scientists, and engineers with extensive expertise in AI, including natural language processing, generative AI, robust and explainable AI, symbolic AI, machine learning, and optimization.

To connect with us about open roles, partnerships, or collaborations, contact ai-research@intuit.com
