Unveiling the Significance of Response Relevance and Completeness in LLMs

UpTrain
8 min read · Nov 7, 2023

We’ve all marveled at how Large Language Models (LLMs) can tackle questions with uncanny human-like precision.

But what if you asked about fixing a leaky faucet and got a history lesson on plumbing instead? It's amusing, but not exactly what you need. You don't want LLMs to answer diplomatically or dodge the actual point the way politicians do at press conferences, right?


That’s where the importance of Response Relevance comes in. In this blog, we’ll explore why it’s vital for LLMs to not just understand, but also respond in a way that truly hits the mark.

How do we define response relevancy and completeness?

Response relevance in Large Language Models (LLMs) refers to the extent to which the generated response aligns with the query or prompt provided by the user. It assesses how well the response addresses the specific question or request made by the user.

In the case of the example at the beginning, a relevant response from an LLM would provide instructions or advice on fixing a leaky faucet. An irrelevant response, on the other hand, might provide information unrelated to faucet repair, such as historical facts about plumbing, as ours did!

Achieving high response relevance is crucial for ensuring that LLMs provide accurate and useful information to users. It ensures that the model’s responses are directly related to the user’s query, enhancing the overall user experience and usefulness of the AI system.

Response Relevancy & Completeness

Response completeness in Large Language Models (LLMs), on the other hand, refers to how fully the generated response answers the query or prompt provided by the user. It assesses whether the response contains all the information needed to address that specific request.

Continuing with the example, a complete response from an LLM would provide A-to-Z instructions on fixing the leaky faucet. An incomplete response might skip a few steps in between, steps that can be essential to the repair!

Therefore, achieving high response completeness is also crucial for ensuring that LLMs provide reliable, usable information. A complete response leaves no gaps unaddressed, which ultimately enhances the overall user experience.
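To make these two definitions concrete, here is a minimal, purely illustrative sketch (not UpTrain's actual scoring method) that approximates relevance as lexical overlap with the query and completeness as coverage of a known set of reference steps:

# Toy illustration only -- not UpTrain's scoring. Relevance is approximated as
# the share of query terms echoed in the response; completeness as the share
# of reference repair steps the response mentions.

def term_overlap(query: str, response: str) -> float:
    """Crude relevance proxy: fraction of query terms found in the response."""
    query_terms = set(query.lower().split())
    response_terms = set(response.lower().split())
    return len(query_terms & response_terms) / max(len(query_terms), 1)

def step_coverage(reference_steps: list[str], response: str) -> float:
    """Crude completeness proxy: fraction of reference steps the response covers."""
    text = response.lower()
    covered = sum(1 for step in reference_steps if step.lower() in text)
    return covered / max(len(reference_steps), 1)

query = "how do I fix a leaky faucet"
response = "Turn off the water supply, remove the handle, and replace the worn washer."
steps = ["turn off the water supply", "remove the handle", "replace the worn washer"]

print(term_overlap(query, response))   # near 0.0 -- relevant, yet little lexical overlap
print(step_coverage(steps, response))  # 1.0 -- every reference step is covered

Notice that a perfectly relevant answer can still score near zero on lexical overlap, which is exactly why frameworks like UpTrain grade responses with an LLM rather than with string matching.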

Why is a verbose response not ideal?

Have you ever padded an essay, homework assignment, or exam answer to double its required length, just to get past that cheeky minimum word limit? That is being verbose. By definition, verbosity, or verboseness, is speech or writing that uses more words than necessary. It comes bundled with irrelevant information, repeated words and phrases, restatements of the same lines, and more. The last thing you want in an AI-generated response is verbosity! Yet nearly all current LLMs generate verbose responses, especially ChatGPT. I may or may not have used ChatGPT in this blog, but this very quality of verbosity can often reveal whether a response is human-generated or AI-generated! A verbose response from your LLM could earn it a bad reputation. Look at some of these instances:

  1. A community forum expressing dissatisfaction over ChatGPT's verbosity
  2. A Y Combinator news thread discussing how to identify AI-generated responses
  3. Andrew Chen's (a16z) LinkedIn post on ChatGPT's standard responses

Why ChatGPT? Well, because it is currently the most famous LLM, but many others share the same issue, whether it be Bard, Llama, Bing AI, or others.

A verbose inception on ChatGPT

There are many ways to rein in such response quality; UpTrain's Evals focus on it directly by evaluating LLMs and their responses.

Example — Customer Support Bot

Sample Graphic on Customer Support Bot

There is an important difference one must note for a Customer Support Bot: it can be a simple chatbot or a full-fledged Conversational AI.

1. Chatbot -

Imagine a chatbot like a friendly digital helper that chats with you online. It’s a computer program that talks to you in a way that feels like a conversation. These guys are good at answering questions, giving info, or helping out with stuff, all without needing a real person to step in.

2. Conversational AI -

Now, think of conversational AI as the big brother of chatbots. It’s like a bunch of smart technologies that make machines talk and understand human language. So, it’s not just chatbots but also things like virtual assistants and voice-controlled systems. The goal? Make machines sound and act more like humans when we chat with them.

Conversational AI Customer Support Bots

Let us understand response relevancy and completeness in a more real-world setting, and nothing illustrates them better than a Conversational AI Customer Support Bot.

These bots have numerous advantages over human support at the backend: lower costs, easy interaction, and 24x7 availability. With the rise of AI, such chatbots are now equipped to answer almost any generic customer query they encounter.

However, here are some significant shortcomings with respect to Response Relevancy & Response Completeness:

  1. Response Relevancy — Knowledge base shortcomings: Chatbots rely on the pre-programmed information and data they’ve been trained with. If a chatbot’s knowledge base does not include the specific issue it’s confronted with, it may not be able to provide an accurate or relevant response. Complex issues require expertise and a deep understanding of the subject matter, which often goes beyond chatbot capabilities.
  2. Response Completeness — Limited generation: Almost all LLM-powered chatbots generate responses confined to what they can do rather than what the user wants. This is precisely why they don't feel human, even though they are programmed to behave as humans do. This lack of response completeness leaves the user with an unsettling experience, as the query is never fully resolved.

What are consumers looking for, and why?

Credits — Zendesk CX Trends 2023, Trend 1

The surge in investments towards customer service bots has not escaped the notice of consumers, who are increasingly comfortable with their presence. Among those who regularly engage with these bots, 72% have observed significant improvements in service quality, leading to a decrease in dissatisfaction levels.

Delving deeper, these consumers affirm that bots excel in handling simple inquiries, provide quicker responses than human agents, and reliably furnish accurate and useful information.

As expected, as consumers enjoy more interactive experiences with bots, their expectations have escalated. 69% of individuals seeking support now turn to bots for a broader spectrum of questions. However, notably, a substantial 78% still find the need to escalate their queries to a human agent. This trend underscores the growing trust in bot-driven customer support, which holds immense potential for businesses, but also poses potential risks.

% of consumers who frequently interact with support. Credits — Zendesk CX Trends 2023, Trend 1

Balancing Relevancy vs Completeness

Relevance and completeness can be a tradeoff for LLMs. To get a relevant response, we expect the LLM to avoid verbosity and stay on point. But if you write a prompt aimed at a complete response, the LLM often turns verbose and pads the answer with irrelevant information. Part of this information can even be hallucinated; more about this in Dealing with Hallucinations in LLMs: A Deep Dive.

To strike a good balance between relevancy and completeness, you need to craft input queries or prompts carefully so that they elicit specific, accurate, and contextually relevant outputs from the model. There are two different kinds of prompts to distinguish:

System & General Prompts

1. System Prompt -

In tasks involving conversational agents or chatbots, system prompts set the instructions and context the model must follow before any user message arrives. For example, in a dialogue system, a system prompt could be:

System: “You are a weather assistant. Answer questions about today’s forecast concisely and accurately.”

When the user then asks, “What’s the weather like today?”, the model generates its response within the constraints of this prompt. System prompts play a crucial role in controlling the behavior of the model and ensuring that it provides relevant and contextually appropriate responses.

Additionally, system prompts are used to define the role and behavior of the conversational agent. For example, in a customer support chatbot, the system prompt might specify that the agent should provide helpful and accurate information to user queries.

2. General Prompt -

General prompts refer to open-ended queries or instructions that do not specify a particular topic or context. These prompts are designed to elicit a response from the model without dictating what the response should be about.

Unlike system prompts, which give explicit instructions or set a particular context, general prompts are more flexible and allow the model to generate a response based on its training data and learned patterns.

User: “Tell me something interesting.”

This prompt does not provide any specific topic or context for the response. Instead, it encourages the model to generate information or facts that it deems interesting.

General prompts are commonly used when exploring the creative or generative capabilities of an LLM. They allow for a wide range of responses and can showcase the model’s ability to generate text in a more open-ended manner.

However, it’s worth noting that relying on general prompts alone may result in a wide variety of responses, which may not always be relevant or coherent. It’s important to carefully consider the context and purpose of the interaction when using general prompts.
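To make the distinction concrete, here is a minimal sketch in the role-based message format popularized by chat-completion APIs (the exact prompt wording below is invented for this example). With the system prompt in place, a well-behaved support bot should politely decline the off-topic weather question rather than answer it:

# Hypothetical prompts for illustration; the role-based structure follows the
# common chat-completion message format.

system_chat = [
    {"role": "system", "content": "You are a customer support agent for a plumbing "
                                  "company. Answer concisely, and only about plumbing."},
    {"role": "user", "content": "What's the weather like today?"},
]

general_chat = [
    # No system prompt: the model falls back on its training data and learned patterns.
    {"role": "user", "content": "Tell me something interesting."},
]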

UpTrain’s Guide to Evaluation

UpTrain is an open-source tool to evaluate your LLM applications by quantifying their performance on aspects such as correctness, tonality, hallucination, fluency, etc. We will now look at how to evaluate LLMs with respect to relevancy and completeness.

UpTrain Eval — Response relevancy and completeness

UpTrain provides a simple and easy way to perform evaluations on your data. You can pass any of these Evals to the evaluate method, and it will automatically run the evaluation and return the scored results.

These evals require a combination of the following columns to be present in your data:

  • question: The question you want to ask
  • context: The context relevant to the question
  • response: The response to the question
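For illustration, a single record could look like this (the values below are made up for this example):

data = [
    {
        "question": "How do I fix a leaky faucet?",
        "context": "Repair guide: turn off the water supply, remove the handle, "
                   "replace the worn washer, then reassemble the faucet.",
        "response": "Turn off the water supply, remove the handle, and replace the worn washer.",
    }
]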

Log and evaluate your data

Now that we have our data, we can log it and evaluate it using UpTrain. We use the evaluate method to do this. This method takes the following arguments:

from uptrain.framework import EvalLLM, Evals
import json

# The evaluator client needs an OpenAI key, since the checks are LLM-graded
client = EvalLLM(openai_api_key="sk-...")

results = client.evaluate(
    data=data,
    checks=[Evals.RESPONSE_RELEVANCE, Evals.RESPONSE_COMPLETENESS, Evals.RESPONSE_COMPLETENESS_WRT_CONTEXT]
)

print(json.dumps(results, indent=3))
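Each entry in results echoes the input row along with a score (typically between 0 and 1) and an explanation for every check that was run; the exact field names may vary across UpTrain versions.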

UpTrain supports a variety of dimensions such as language quality, retrieved-context quality, hallucinations, tonality, response completeness, etc., and also lets you define custom evaluations via simple Python functions. Try out UpTrain for free here or book a call with us to understand which evaluations make the most sense for your application.

References

  1. Best Monitoring Practices for LLMs — https://whylabs.ai/blog/posts/best-practices-monitoring-large-language-models-in-nlp
  2. Why do Customer Chatbots go well with AI — https://www.zendesk.com/in/blog/5-benefits-using-ai-bots-customer-service/
  3. Zendesk CX Trends report 2023 — https://cxtrends.zendesk.com/trends
  4. Chatbots vs Conversational AI — https://forethought.ai/blog/chatbots-vs-conversational-ai/

Originally published at https://uptrain.ai.
