Data Science at Microsoft

Lessons learned in the practice of data science at Microsoft.

Beyond thumbs up and thumbs down: A human-centered approach to evaluation design for LLM products

--

By Shima GhassemPour, Anna Pershukova, Esra Gokgoz, and Irina Nikulina

This image was created by article co-author Irina Nikulina — crafted with creativity, no AI.

Despite the hype around Generative AI models, businesses often overlook the importance of a carefully designed evaluation process. Generative AI operates in a fundamentally non-deterministic manner, meaning its outputs can vary even when given the same input. Whether it’s chatbots, virtual assistants, or content generation tools, the need for evaluation and user feedback is critical to assess, maintain, and improve system performance and ensure long-term user adoption. A carefully designed evaluation strategy allows one to de-risk Generative AI systems operating in real-life environments.

Collection of ground truth data — feedback from experts on what ideal or expected results would look like in a diverse set of cases — plays a key role in validating outputs, ensuring accuracy, and addressing biases. Recent studies emphasize that LLM-based evaluations, while scalable, cannot fully replace ground truth data. Over-reliance on synthetic or model-generated data, without grounding in real data, risks “model collapse,” where performance deteriorates over time. Collecting feedback from users is essential, but many user feedback systems rely on simplistic mechanisms like thumbs up and thumbs down, which often fail to capture the nuances required for meaningful improvements.

To enhance model performance, we need a thoughtful, user-driven approach to evaluation that identifies specific issues and informs improvements through a robust feedback loop. A holistic approach considers not just model performance but also whether the product or system achieves its intended outcomes in its context of use. For example, a model may generate output that is technically accurate, yet its suggestions may not be useful to the people or processes they support. Improving workflows or process efficiency may happen at different stages and require different metrics.

This article aims to highlight a few fundamental principles we’ve learned through our experiences in evaluation design. While it may not present an exhaustive list of evaluation methods, it seeks to bring critical considerations to light and offer a starting point for designing more robust and meaningful evaluation frameworks. In the absence of ground truth (GT) data, we often rely on user feedback to validate outputs and improve system performance. To make this process meaningful, we emphasize the need to move beyond binary feedback, such as thumbs up or down, to more structured and detailed input, particularly from experts, to better inform refinements and evaluation strategies.

Evaluating these layers — from technical accuracy to practical usability and efficiency — requires ongoing measurement and adjustments, forming a comprehensive evaluation strategy for AI products and systems.

In this article, we explore why evaluation design matters, challenges with feedback-based evaluation, and what you need to consider when iterating on your Large Language Model (LLM) solution.

Why does evaluation design matter?

Evaluation refers to the process of assessing how well the model-generated outputs align with user needs and expectations. A well-designed evaluation process helps to:

  • Ensure the realization of intended business value by aligning system outputs and performance with real-world expectations.
  • Build trust with end users and stakeholders by ensuring solution reliability in diverse scenarios.
  • Identify areas for performance improvement by pinpointing systematic gaps in model performance.
  • Improve user satisfaction by allowing end users to provide feedback and refining responses accordingly.
  • Adapt to real-world use cases and ensure stable system performance over time: iterative evaluation helps the solution remain relevant under changing and unexpected real-world conditions and adjust promptly.

When designing an evaluation strategy, it’s important to consider various methods to account for this variety of goals and needs.

A feedback loop enables informed decisions when iterating on improvements.

Challenges with feedback-based evaluation

A common approach in deployed LLM systems is to rely on user feedback mechanisms like thumbs up or thumbs down buttons. Binary mechanisms like these rarely yield feedback granular enough to truly improve system and model performance, and they may not give users a clear way to express their reactions to system outputs. Beyond the limitations of binary feedback, broader challenges such as variation in human judgment and bias in feedback persist even with more advanced feedback collection methods. This section highlights these challenges in detail:

  • Lack of granularity in feedback: Binary feedback often fails to capture why a response was unsatisfactory — whether it lacked accuracy, completeness, or the right tone. Additionally, such feedback lacks nuance, as users may dislike an answer for stylistic reasons unrelated to performance. This highlights the need for more granular feedback options and open-ended responses. Furthermore, certain tasks, such as summarization or creative writing, inherently rely more on subjective interpretation than objective correctness. Evaluation strategies must therefore accommodate both subjective criteria (e.g., coherence, tone, and usefulness) and more objective measures (e.g., factual accuracy), ensuring flexibility for different use cases where correctness may not be well defined.
  • Accounting for variation in human judgment: Human feedback collected without appropriate context can introduce variability that is difficult to interpret. Not everyone who provides feedback is equally expert or shares consistent expectations. Without qualitative context or a structured approach to interpreting and using this data, it can simply become noise, making it challenging to derive meaningful insights. Careful design of evaluation frameworks is necessary to filter and interpret user feedback effectively.
  • Bias in feedback: Emotions, prior experiences, and context may influence feedback, leading to skewed data. For example, feedback from someone with 10 years of experience and feedback from someone new to the job may vary significantly, influencing evaluation outcomes. Evaluation strategies should account for such variation by weighting feedback from different user groups appropriately (see the sketch just after this list) and by targeting specific user groups for feedback collection based on the use case. Additionally, inputs should be normalized and contextualized to reduce bias and ensure fairness across evaluations.
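
To make the weighting idea above concrete, here is a minimal sketch of reliability-weighted aggregation in Python. The group names and weights are purely illustrative assumptions; in practice they should be derived from calibration tasks or agreement with expert-labeled examples.

```python
# Hypothetical evaluator groups and reliability weights (illustrative only).
GROUP_WEIGHTS = {"domain_expert": 1.0, "experienced_user": 0.7, "new_user": 0.4}

def weighted_feedback_score(ratings):
    """Aggregate (group, rating) pairs into a single weighted score.

    `ratings` is assumed to be a list of (group_name, numeric_rating) tuples
    collected through the product's feedback mechanism.
    """
    total_weight = sum(GROUP_WEIGHTS.get(group, 0.5) for group, _ in ratings)
    if not total_weight:
        return None
    weighted_sum = sum(GROUP_WEIGHTS.get(group, 0.5) * rating for group, rating in ratings)
    return weighted_sum / total_weight

# Example: a domain expert's rating of 4 counts for more than a newcomer's 2.
print(weighted_feedback_score([("domain_expert", 4), ("new_user", 2)]))  # ~3.43
```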

To address these challenges, evaluation processes should integrate both subjective user perspectives and more objective measures, balancing automated metrics with qualitative insights to build a more robust evaluation framework.

What needs to be considered in evaluation design?

When deploying an LLM solution, an effective evaluation design goes beyond just collecting user feedback. This process involves, among other elements, defining clear objectives, establishing success criteria for different stakeholders, balancing subjective and objective measures, creating performance benchmarks, and considering ethical and compliance risks — as well as privacy concerns. Included below are some key aspects to consider.

1. Establishing business and user context and defining clear success metrics

To develop an effective evaluation strategy, it is crucial to first understand the context in which the product is to perform. This includes clearly identifying the different user roles and their related objectives. Who are the users of the system? What are their specific goals and needs? What do they value in the system? By gaining a comprehensive understanding of these elements, we can then define the success metrics that are most meaningful for evaluating the system’s performance.

Partnering with design researchers to collect insights, as well as analyzing existing data, helps build an early understanding that can inform decisions around defining appropriate evaluation strategies. This understanding guides the establishment of clear metrics that reflect the specific objectives of stakeholders and ensures that success is measured appropriately across different contexts. For example, metrics such as accuracy, groundedness, relevance, and fluency can be defined and refined based on these objectives. By doing so, we can ensure a more granular and targeted evaluation of system outputs that is tailored to stakeholders' needs.
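
As a concrete illustration, success metrics can be captured in a lightweight rubric that both human reviewers and automated evaluators score against. The following Python sketch is hypothetical; the metric names, scales, and descriptions should come from your own stakeholder objectives.

```python
# A hypothetical rubric definition: metric names, scales, and descriptions are
# illustrative and should be replaced with ones derived from stakeholder objectives.
EVALUATION_RUBRIC = {
    "accuracy": {
        "scale": (1, 5),
        "description": "Factual claims in the response are correct.",
    },
    "groundedness": {
        "scale": (1, 5),
        "description": "The response is supported by the provided source material.",
    },
    "relevance": {
        "scale": (1, 5),
        "description": "The response addresses the user's actual question or task.",
    },
    "fluency": {
        "scale": (1, 5),
        "description": "The response is clear, coherent, and easy to read.",
    },
}

def validate_score(metric: str, score: int) -> bool:
    """Check that a reviewer's score for a metric falls within the rubric's scale."""
    low, high = EVALUATION_RUBRIC[metric]["scale"]
    return low <= score <= high
```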

2. Automated and human-in-the-loop evaluation

With human-in-the-loop evaluation, experts assess model outputs against specific quality metrics. This allows us to integrate human knowledge and domain understanding into the evaluation process. Human experts can provide more specific and granular insights, especially in complex or sensitive scenarios that require specialized knowledge.

Automating evaluation by using LLMs as judges to rate generated content against selected criteria has become a common way to scale and accelerate evaluation efforts. However, relying solely on LLM-based evaluation comes with significant risks: it inherits multiple biases, such as sensitivity to text length, position, or format, that may influence the judgment more than factual correctness or relevance.

Combining automated evaluation metrics (e.g., BLEU, ROUGE, or perplexity scores) with human feedback provides a holistic view on system performance. Periodic human-in-the-loop testing ensures that the model meets quality standards before deployment.
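
As a minimal sketch of combining the two, the snippet below scores each output with an automated metric (here ROUGE-L, assuming the `rouge-score` package is installed) and routes low scorers plus a random sample to expert review. The example schema and thresholds are assumptions, not recommendations.

```python
import random
from rouge_score import rouge_scorer  # assumes the `rouge-score` package is available

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def evaluate_batch(examples, human_review_rate=0.1, low_score_threshold=0.3):
    """Score outputs automatically and flag a subset for human-in-the-loop review.

    `examples` is assumed to be a list of dicts with `id`, `reference`, and
    `output` keys; the structure is illustrative, not a prescribed schema.
    """
    results = []
    for ex in examples:
        rouge_l = scorer.score(ex["reference"], ex["output"])["rougeL"].fmeasure
        results.append({
            "id": ex["id"],
            "rouge_l": rouge_l,
            # Low automated scores and a random sample both go to expert reviewers.
            "needs_human_review": rouge_l < low_score_threshold
                                  or random.random() < human_review_rate,
        })
    return results
```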

3. Explicit and implicit feedback

Explicit feedback is intentional feedback, often given in response to direct questions or through structured forms of input. It helps in understanding a problem more deeply because it can ask more specific questions. Explicit feedback can come in different forms, e.g., ratings, surveys, or direct comments. In this context, explicit feedback also includes mechanisms integrated into the main task flow, such as thumbs up or thumbs down buttons, which are quick and easy for users to provide but may lack depth. When designed effectively, explicit feedback methods allow for structured inputs that capture user intent and preferences.

Implicit feedback can be inferred from user interactions with the system. It relies heavily on interpretation of user behavior. Examples of implicit feedback include return use of the system, frequency and depth of interactions, success rates for specific tasks, or other usage patterns that may indicate user preferences. Collecting implicit feedback requires careful planning, such as designing the user experience to allow for editing, dismissing, or regenerating model outputs.
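
For example, implicit signals can be logged as simple events and rolled up into rates that complement explicit ratings. The event names below are hypothetical; real products will define their own telemetry schema.

```python
from collections import Counter

# Hypothetical event names; real products will define their own telemetry schema.
IMPLICIT_SIGNALS = {"accepted", "edited", "dismissed", "regenerated"}

def summarize_implicit_feedback(events):
    """Roll up per-interaction outcomes into simple implicit-feedback rates.

    `events` is assumed to be an iterable of (session_id, signal) tuples
    recorded by the product's UI instrumentation.
    """
    counts = Counter(signal for _, signal in events if signal in IMPLICIT_SIGNALS)
    total = sum(counts.values()) or 1  # avoid division by zero on an empty log
    return {signal: counts[signal] / total for signal in sorted(IMPLICIT_SIGNALS)}

# A rising dismissal or regeneration rate can flag degrading output quality
# even when explicit ratings are sparse.
print(summarize_implicit_feedback([
    ("s1", "accepted"), ("s2", "edited"), ("s3", "regenerated"), ("s4", "accepted"),
]))
```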

4. Designing relevant and useful feedback collection mechanisms

Instead of defaulting to binary ratings, work to understand users and their existing workflows, and create meaningful ways for them to provide sufficiently granular feedback. This requires balancing the level of detail that is ideal from a data collection perspective against the effort required of users and how they might be incentivized to provide that feedback.

In going beyond the binary, think about meaningful categories for user feedback that reflect likely responses to system outputs. These categories should be determined based on the process or workflow within which the system is used, and on the profiles and existing mental models of the users who will provide feedback. Familiar examples from social media platforms include categories such as “it is not relevant” or “it is inappropriate.”

You may also consider adding an open-ended opportunity for users to make suggestions or elaborate on specific issues. This does, of course, require thought and effort about how this unstructured data is to be interpreted and processed. And as with any product, you should design an approach that also solicits qualitative feedback from users at regular intervals. All of this can provide context that structured feedback might miss. Balancing structured feedback with free-text responses provides a holistic evaluation framework, capturing both quantitative metrics and qualitative insights and ensuring the flexibility required to evaluate dynamic systems.
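
One way to operationalize this balance is a feedback record that pairs a rating and issue categories with an optional free-text comment. The sketch below (Python 3.9+) uses hypothetical category names; real categories should come from research into users’ workflows and mental models.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative category names only; real categories should come from research
# into the workflows and mental models of the users providing feedback.
FEEDBACK_CATEGORIES = {"not_relevant", "inaccurate", "incomplete", "wrong_tone", "inappropriate"}

@dataclass
class FeedbackRecord:
    """One structured feedback submission: a rating, optional issue categories,
    and an optional free-text comment for context the categories miss."""
    response_id: str
    rating: int  # e.g., a 1-5 scale rather than a binary thumbs up/down
    categories: list[str] = field(default_factory=list)
    comment: Optional[str] = None

    def __post_init__(self):
        unknown = set(self.categories) - FEEDBACK_CATEGORIES
        if unknown:
            raise ValueError(f"Unknown feedback categories: {unknown}")

# Example submission combining a structured category with an open-ended comment.
record = FeedbackRecord(
    response_id="resp-123",
    rating=2,
    categories=["incomplete"],
    comment="The summary skipped the budget section entirely.",
)
```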

5. Offline and online evaluation

A common approach in LLM development is to evaluate on holdout data that the model has never seen before (either historical or collected specifically for development purposes). Because this approach lets teams iterate on model improvements without exposing pre-production system versions to end users, it is useful to treat offline evaluation as a tool for minimizing risks before online testing.

Evaluating LLM performance with real users and iterating based on their input is a critical way to test model stability in real-life situations, track system health over time, and estimate business and user impact. Examples of techniques include:

  • A/B testing to compare new model versions or evaluation designs and see which delivers better outcomes.
  • Gradual rollout, in which a new model version is released to a small portion of users while performance metrics are closely monitored.
  • Shadow mode release, which allows evaluation of the model on real scenarios without exposing its outputs to users (a minimal sketch appears at the end of this section).

Additionally, online evaluation allows for the accounting of short-term results along with longer-term outcomes, such as learning and adaptation over time.
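
As one illustration, shadow mode can be implemented by always returning the production model’s output while logging the candidate’s output for later comparison. The sketch below is hypothetical: `production_model` and `candidate_model` stand in for whatever callables your serving stack provides.

```python
import logging

logger = logging.getLogger("shadow_eval")

def handle_request(prompt, production_model, candidate_model):
    """Serve the production output; log the candidate's output for offline comparison.

    `production_model` and `candidate_model` are assumed to be callables that
    map a prompt string to a response string.
    """
    production_output = production_model(prompt)
    try:
        candidate_output = candidate_model(prompt)
        # Both outputs are logged so human or automated evaluators can compare them later.
        logger.info(
            "shadow comparison | prompt=%r | production=%r | candidate=%r",
            prompt, production_output, candidate_output,
        )
    except Exception:
        # The candidate must never affect the user-facing path.
        logger.exception("Candidate model failed in shadow mode")
    return production_output
```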

6. Subjectivity and variability

Evaluating LLMs involves inherent subjectivity and variability due to differences in user expectations, evaluators’ backgrounds, and evolving contexts. Addressing these factors is critical to ensure fair and reliable evaluations. Key considerations include:

  • Diverse user perceptions: Users may have varying expectations and standards for what constitutes a “good” response. Factors such as domain expertise, language proficiency, and cultural context heavily influence judgments. Identifying these different variables through research and including a representative and diverse cross-section of evaluators can help mitigate individual biases and provide a balanced assessment.
  • Metric selection and prioritization: Different metrics emphasize distinct aspects of performance, such as accuracy, relevance, fluency, or coherence. A well-rounded evaluation framework should incorporate a combination of quantitative and qualitative metrics to account for these variations and provide a holistic view of performance.
  • Adapting to contextual changes: As language usage, cultural norms, and user needs evolve, LLM responses must remain relevant. Continuous evaluation ensures that the model adapts to shifting contexts over time, maintaining its effectiveness.

7. Privacy considerations

The least obtrusive or most granular methods of data collection might conflict with user privacy regulations. We can address privacy concerns by adopting anonymization techniques and obtaining explicit user consent. Balancing granular data collection with ethical considerations is key to maintaining trust and legal compliance.
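
As a small illustration, obvious identifiers can be redacted from free-text feedback before storage. The regular expressions below are a minimal, assumption-laden sketch; production systems should rely on a dedicated PII-detection service and follow their own compliance guidance.

```python
import re

# Minimal, illustrative redaction of obvious identifiers; production systems
# should use a dedicated PII-detection service rather than regexes alone.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_feedback(text: str) -> str:
    """Replace e-mail addresses and phone-number-like strings before storage."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)

print(redact_feedback("Contact me at jane.doe@example.com or +1 425-555-0100."))
```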

8. Testing with users and experts early and often

After defining the evaluation strategy and process, it’s important to test your thinking with a smaller group of users and iterate on the approach before rolling it out to a wider audience. User testing ensures the feedback interactions are understood as intended and are easy to follow. It can also reveal gaps or issues that may have been overlooked during planning. For example, users might interpret feedback categories or questions differently than anticipated, or they may reject AI outputs for reasons not considered in the evaluation plan. The insights from user testing help refine the process, making it more effective at capturing actionable feedback.

Conclusion

Deploying an LLM or Generative AI solution is not a one-and-done process. Relying solely on thumbs up or thumbs down feedback limits your ability to understand why a model succeeds or fails. A thoughtful evaluation design — one that combines metrics, qualitative user feedback, and iterative testing — is critical for improving model performance.

To build truly adoptable, user-centric AI solutions, we must look beyond simplistic feedback mechanisms, combine user insights with automated evaluations, and iterate continuously to address gaps and align with dynamic real-world demands. By designing better evaluation processes, we can ensure our AI systems are not just intelligent but also impactful and reliable for users.

Further reading

  • Towards Explainable Evaluation Metrics for Natural Language Generation: link
  • Evaluating the success of consumer generative AI products: link
  • Ranking Large Language Models without Ground Truth: link
  • A Survey of Human-in-the-loop for Machine Learning: link
  • Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks: link
