Best of N: Generating High-Quality Grounded Answers with Multiple Drafts

Anush Mattapalli
Google Cloud - Community
7 min read · Jan 8, 2025

RAG, or retrieval augmented generation, is a technique that boosts the abilities of language models by letting them access external knowledge sources when generating text. LLMs depend on the information they were trained on, which may now be outdated. RAG empowers the model to pull in relevant information from the web or other databases and condition its answer on those retrieved facts, which allows for more targeted and contextually relevant responses. Other approaches to improving answer quality exist, such as reinforcement learning from human feedback (RLHF) applied during fine-tuning, where the model learns to avoid generating inaccurate or irrelevant information through human preference feedback.

While generating a single response from the provided facts is often reasonable, it can still sometimes lead to inaccurate and unhelpful responses.

Let’s take an example. Say we have the query,

What’s the weather in Seattle today?

Often, when we have a live query (a question where the answer changes based on the time it is asked), the retrieved documents may not be 100% relevant. Of the three documents retrieved for the Seattle weather query, two contained conflicting information about the weather, and this confusion showed up in the generated answer. The real temperature in Seattle that day was a high of 68°F and a low of 56°F, with a 52% chance of rain showers (a figure the answer didn't even mention):

Today in Seattle, the weather is sunny and a high of 71°F. There is a slight chance of rain before 1pm today, with the temperatures dipping to a low 58°F.

Instead of relying solely on a single generated response, we propose an alternative “Best of N” approach. This method involves generating multiple responses, evaluating each for helpfulness and groundedness, and then selecting the best one. This approach allows for a more diverse and higher-quality output without the need for extensive model fine-tuning.

What Are Groundedness and Helpfulness?

Groundedness, in the context of language models, refers to the degree to which a response is based on a corpus. A grounded response avoids making claims or assumptions that are unsupported or contradict established facts. It’s about ensuring that the AI’s output is anchored in reality and doesn’t veer into speculation or misinformation. It’s particularly crucial when dealing with topics where accuracy and reliability are paramount, such as scientific or medical information.

Helpfulness, on the other hand, focuses on whether the response effectively fulfills the user's needs. A helpful response addresses the core intent of the instruction and provides accurate details, all while being concise. It should directly answer the question asked or fulfill the task requested, offering factually correct and relevant information, and it should be clear and straightforward, avoiding unnecessary detail or jargon. Essentially, a helpful response is both timely and informative, effectively addressing the user's needs.

Improve Answer Quality using Best of N

As previously mentioned, Best of N generates multiple candidate responses for an answer. These responses are sampled at a high temperature (0.9 in our experiments), which lets us choose from a diverse range of possible responses.
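To make the sampling step concrete, here is a minimal sketch assuming the Vertex AI Python SDK; the project ID, location, and bare-query prompt are illustrative placeholders, and in a real RAG pipeline you would include the retrieved documents in the prompt.

```python
# Minimal Best of N sampling sketch, assuming the Vertex AI Python SDK
# (google-cloud-aiplatform). Project, location, and prompt are placeholders.
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.0-pro-002")

def generate_candidates(prompt: str, n: int = 10) -> list[str]:
    """Generate one temperature-0 baseline plus n high-temperature drafts."""
    # The baseline draft, decoded at temperature 0.
    drafts = [
        model.generate_content(
            prompt, generation_config=GenerationConfig(temperature=0.0)
        ).text
    ]
    # n additional drafts sampled at temperature 0.9 for diversity.
    for _ in range(n):
        response = model.generate_content(
            prompt, generation_config=GenerationConfig(temperature=0.9)
        )
        drafts.append(response.text)
    return drafts

candidates = generate_candidates("What's the weather in Seattle today?")
```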

Let’s apply this to the Seattle weather query. Setting the temperature to 0.9 and generating two candidates, here are the resulting generations:

In Seattle, the high will be 68°F with a 50% chance of rain showers.

In Seattle today, the high will be 68°F and the low will be 56°F with a 52% chance of rain showers in the morning, becoming mostly cloudy in the afternoon.

It’s clear that the second generated response provides a comprehensive and grounded answer that is truly helpful for the user.

While it’s easy for a human to quickly judge the quality of the responses (especially in our toy example), a systematic approach is necessary when dealing with numerous, extensive responses and complex factual information.

Gold Labels

In order to quantify a response's groundedness and helpfulness, gold labels are required. Gold labels serve as the ground truth, providing a benchmark against which a response's groundedness and helpfulness can be measured. In this context, we employed two oracle critic models (highly reliable evaluators of groundedness and helpfulness) to determine the gold labels.

Baseline Results

In our experiments, we generate drafts using two generation models: Gemini-1.0-pro-002 and OpenBookQA. We use two different models to show that this method is model agnostic.

Our test set contains 1,254 queries, and for each we generated a response at temperature 0 to serve as a baseline. On this test set, Gemini-1.0-pro-002 achieved an average helpfulness score of 0.92 with an average groundedness of 0.58, while OpenBookQA scored 0.80 and 0.82, respectively.

This graph displays the baseline groundedness and helpfulness scores for both the Gemini 1.0 and OpenBookQA models.

Benchmarking

Now, we implement Best of N on our evaluation set. We generated 11 candidates for each of the 1254 queries (1 at temperature 0, the baseline response, and 10 at temperature 0.9). This corresponds to an N of 10 (more information on selecting N is given below).

Although large LLM raters such as our oracles are very good at judging helpfulness and groundedness, they're not practical in many use cases; for us, these raters are far too slow to run every time we would like to score a response. That's where Check Grounding comes in. Check Grounding now provides groundedness and helpfulness scores for your generations, using smaller models to compute them. With these more efficient proxies for the oracle scores, we can quickly judge many candidates side by side and select the best one just as effectively.
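Here is a minimal sketch of scoring a single candidate with the Check Grounding API over REST. The endpoint and the answerCandidate, facts, and supportScore fields follow the public Discovery Engine API; the helpfulness field name below is an assumption based on this post, so check the current API reference before relying on it.

```python
# Sketch of a proxy scorer built on the Check Grounding REST API.
# Assumes application-default credentials with access to the project.
import google.auth
import google.auth.transport.requests
import requests

PROJECT_ID = "your-project-id"  # placeholder
URL = (
    "https://discoveryengine.googleapis.com/v1/"
    f"projects/{PROJECT_ID}/locations/global/"
    "groundingConfigs/default_grounding_config:check"
)

def score_candidate(answer: str, facts: list[str]) -> dict[str, float]:
    """Return proxy groundedness and helpfulness scores for one candidate."""
    credentials, _ = google.auth.default()
    credentials.refresh(google.auth.transport.requests.Request())
    body = {
        "answerCandidate": answer,
        "facts": [{"factText": fact} for fact in facts],
    }
    response = requests.post(
        URL,
        json=body,
        headers={"Authorization": f"Bearer {credentials.token}"},
    )
    response.raise_for_status()
    result = response.json()
    return {
        "groundedness": result.get("supportScore", 0.0),
        # Hypothetical field name; consult the API docs for the actual one.
        "helpfulness": result.get("helpfulnessScore", 0.0),
    }
```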

From the generated candidates, we then select the best one using each candidate's proxy groundedness and helpfulness scores, and compare it to the temperature-0 response. Below, you will find a chart that depicts the gains you can expect from this method compared to the baseline temperature-0 response.

This graph displays the improvements of the Best of N method compared to the baseline results for both the Gemini 1.0 and OpenBookQA models. We see significant gains across the board.

Using Check Grounding to provide groundedness and helpfulness scores has proven to be an effective method for improving the overall quality of a generated output. Additionally, this method is model agnostic and involves no trade-off: the gains seen in one category do not come at the cost of the other.

Gemini 1.0 demonstrates a remarkable improvement in groundedness, climbing 21 points from a baseline of 0.58 to 0.79. Helpfulness also sees a significant increase, rising from 0.92 to 0.945. These positive trends are mirrored in the evaluation on OpenBookQA, where groundedness and helpfulness gain 5.5 and 2.5 points, respectively.

How to Choose N

The value of N, the number of candidate responses to generate, is a key parameter in Best of N. Increasing N leads to improved outcomes, but it also results in higher costs, since every additional candidate is another generation call.

There are a few factors to consider when choosing N:

  • The desired quality of the response: If you need a very high-quality response, you will need to generate more candidates.
  • The cost of generating responses: The cost of generating responses is proportional to the number of candidates.

In general, we recommend starting with a small value of N and then increasing it if you are not satisfied with the quality of the response. We’ve seen significant improvement in answer quality using just two candidates.
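In terms of the earlier sampling sketch, starting small is just a matter of the n parameter (a placeholder argument from that sketch, not a library API):

```python
# Start with a small candidate pool; grow it only if quality falls short.
candidates = generate_candidates("What's the weather in Seattle today?", n=2)
```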

How to Weigh Helpfulness and Groundedness Scores

When using Best of N, you can weigh the helpfulness and groundedness scores returned by Check Grounding, which lets you control the trade-off between groundedness and helpfulness in your answers. To weigh the scores, simply take a weighted average of the two.

If you desire a response that is more grounded in factual information, assign greater weight to the groundedness score. Conversely, if you seek a response that is more practically useful, prioritize the helpfulness score.
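Putting the pieces together, here is a minimal sketch of weighted selection; score_candidate is the hypothetical proxy scorer sketched earlier, and the weights are parameters you can tune.

```python
# Weighted Best of N selection using the proxy scorer sketched above.
def select_best(
    candidates: list[str],
    facts: list[str],
    w_grounded: float = 0.5,
    w_helpful: float = 0.5,
) -> str:
    """Pick the candidate with the highest weighted average proxy score."""

    def weighted(candidate: str) -> float:
        scores = score_candidate(candidate, facts)
        return (
            w_grounded * scores["groundedness"]
            + w_helpful * scores["helpfulness"]
        )

    return max(candidates, key=weighted)

# Placeholder retrieved snippets for the running Seattle example.
facts = ["Seattle today: high 68°F, low 56°F, 52% chance of rain showers."]

# Equal weighting, as in the table below:
best = select_best(candidates, facts)
# Or lean toward helpfulness, as recommended later in this post:
best = select_best(candidates, facts, w_grounded=0.25, w_helpful=0.75)
```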

With equal weighting (0.5 × groundedness + 0.5 × helpfulness):

A table containing the three previously generated examples and the resulting groundedness, helpfulness, and overall weighted scores.

In this table, the first generation in red provides a detailed weather description, while the second one is concise and lacks additional information. Although the first sacrifices some groundedness, it offers a more helpful response.

Note: The lack of grounding in generation #1 was exaggerated for illustrative purposes. In practice, responses can be slightly more helpful without significantly compromising groundedness.

In general, we recommend starting with a weight of 0.25 for groundedness and 0.75 for helpfulness. Our results have shown a skew towards groundedness improvement, so leaning the weights towards helpfulness helps improve scores in both metrics more evenly. However, you can adjust the weights to suit your specific needs.

Conclusion

We’re excited about the potential of Best of N to improve the quality of grounded generations. If you’re interested in learning more, please visit our website to contact us for a demo.

I want to thank Jane Day, Hui Wan and Long Le for their contributions to this article.
